Opened 6 years ago

Closed 2 weeks ago

#9896 closed bug (not reproducible)

DVD file copy stalls at ca. 1 GB, pegs CPUs

Reported by: ttcoder Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: File Systems/UDF Version: R1/Development
Keywords: Cc: korli
Blocked By: Blocking: #9181
Has a Patch: no Platform: All

Description

I've tried 4 times now: it does not matter if I use Tracker or Terminal & cp, can't copy big files from DVD. If I reboot to Windows the very same DVD works 100% so that would rule out a bad sector/faulty media I guess.

Symptoms:

  • after copying ca. 1 GB (roughly 50% of the 2 GB file), the DVD drive stops, its light switches off
  • CPUs peg up, at ca. 60% usage. (if I try to immediately run another test, I eventually end up with 100% CPU usage).
  • can't Ctrl-C the /bin/cp process
  • can't kill it in ProcessController either.
  • can't attach-for-debugging to it (after running Debugger), though that might just be me being a newbie with the new debugger.

I basically 'have' to reboot at this point, though apart from pegged CPU the system seems stable (I'm writing this in W+ without having rebooted yet).

The syslog includes a couple statements made with echo > /dev/dprint for clarity.

Attachments (4)

syslog_STALL_with_some_dprintf_echo.txt (114.6 KB) - added by ttcoder 6 years ago.
No error messages shown after 'cp' is launched, a few at mounting time though.
syslog_first UDF trace.txt (512.0 KB) - added by ttcoder 6 years ago.
(first and last of ca. 4 syslogs)
syslog_last UDF trace.txt (331.0 KB) - added by ttcoder 6 years ago.
(if needed, I can try a more "clean" capture of the 4 successive syslogs)
block_for_9896.raw (1.0 MB) - added by ttcoder 6 years ago.

Download all attachments as: .zip

Change History (22)

Changed 6 years ago by ttcoder

No error messages shown after 'cp' is launched, a few at mounting time though.

comment:1 Changed 6 years ago by anevilyak

Assuming you're able to drop into KDL at this point, it would be interesting to see the backtrace of that particular thread. Not being able to attach to it would imply it's likely blocked in the kernel somewhere.

comment:2 Changed 6 years ago by ttcoder

Seems to be object_cache_alloc; not sure if I have to do something other than 'bt' when the prompt comes up, or if I shoul drop into KDL multiple times to see if the backtrace changes over time?

(great that this stuff is written to the syslog BTW, saves me doing some photo grabbing :-)

KERN: Keyboard Requested Halt.
KERN: Welcome to Kernel Debugging Land...
KERN: Thread 765 "cp" running on CPU 0
KERN: kdebug> kdebug> kdebug> kdebug> btstack trace for thread 765 "cp"
KERN:     kernel stack: 0x81b06000 to 0x81b0a000
KERN:       user stack: 0x624d4000 to 0x634d4000
KERN: frame               caller     <image>:function + offset
KERN:  0 81b08fb4 (+  48) 80095b0c   <kernel_x86> invoke_command_trampoline([34m0x81b0904c[0m) + 0x1c
KERN:  1 81b08fe4 (+  12) 801257e2   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
KERN:  2 81b08ff0 (+  48) 80094a86   <kernel_x86> debug_call_with_fault_handler + 0x5e
KERN:  3 81b09020 (+  64) 80095d65   <kernel_x86> invoke_debugger_command + 0xb9
KERN:  4 81b09060 (+  64) 80095b91   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: [34m0xcce38030[0m, int32: [34m0[0m, [34m0x0[0m [31m"<NULL>"[0m) + 0x79
KERN:  5 81b090a0 (+  64) 80095ed0   <kernel_x86> invoke_debugger_command_pipe + 0x9c
KERN:  6 81b090e0 (+  48) 80097b6c   <kernel_x86> ExpressionParser<[32m0x81b09190[0m>::_ParseCommandPipe([34m0x81b0918c[0m) + 0x234
KERN:  7 81b09110 (+  64) 80096d8c   <kernel_x86> ExpressionParser<[32m0x81b09190[0m>::EvaluateCommand([34m0x801a1f40[0m [36m"bt"[0m, [34m0x81b0918c[0m) + 0x2d0
KERN:  8 81b09150 (+ 224) 800991b8   <kernel_x86> evaluate_debug_command + 0x80
KERN:  9 81b09230 (+  64) 80093312   <kernel_x86> kernel_debugger_loop([34m0x80171b7e[0m [36m"Keyboard Requested Halt."[0m, [34m0x0[0m [31m"<NULL>"[0m, [34m0x0[0m [31m"<NULL>"[0m, int32: [34m0[0m) + 0x32a
KERN: 10 81b09270 (+  48) 80093567   <kernel_x86> kernel_debugger_internal([34m0x80171b7e[0m [36m"Keyboard Requested Halt."[0m, [34m0x0[0m [31m"<NULL>"[0m, [34m0x0[0m [31m"<NULL>"[0m, int32: [34m0[0m) + 0x53
KERN: 11 81b092a0 (+  48) 80094e48   <kernel_x86> kernel_debugger + 0x28
KERN: 12 81b092d0 (+  48) 80094986   <kernel_x86> debug_emergency_key_pressed + 0x2a
KERN: 13 81b09300 (+  64) d175fee3   </boot/system/add-ons/kernel/bus_managers/ps2> keyboard_handle_int(ps2_dev*: [34m0xd1765d3c[0m) + 0x11f
KERN: 14 81b09340 (+  64) d175f914   </boot/system/add-ons/kernel/bus_managers/ps2> ps2_dev_handle_int + 0x218
KERN: 15 81b09380 (+  48) d175efca   </boot/system/add-ons/kernel/bus_managers/ps2> ps2_interrupt(NULL) + 0xee
KERN: 16 81b093b0 (+  48) 800597aa   <kernel_x86> int_io_interrupt_handler + 0x5e
KERN: 17 81b093e0 (+  64) 801342ed   <kernel_x86> x86_hardware_interrupt + 0x6d
KERN: 18 81b09420 (+  12) 80128290   <kernel_x86> int_bottom + 0x30
KERN: kernel iframe at 0x81b0942c (end = 0x81b0947c)
KERN:  eax 0xdfdfe880    ebx 0xdfdfe880     ecx 0x60        edx 0x60
KERN:  esi 0x180         edi 0x82807dc0     ebp 0x81b094a4  esp 0x81b09460
KERN:  eip 0x801000bf eflags 0x210246  
KERN:  vector: 0x21, error code: 0x0
KERN: 19 81b0942c (+ 120) 801000bf   <kernel_x86> object_cache_alloc + 0x32b
KERN: 20 81b094a4 (+  48) 800f8aa3   <kernel_x86> block_alloc(uint32: [34m0x180[0m ([34m384[0m), uint32: [34m0x0[0m ([34m0[0m), uint32: [34m0x0[0m ([34m0[0m)) + 0x73
KERN: 21 81b094d4 (+  48) 800f8de4   <kernel_x86> realloc + 0x38
KERN: 22 81b09504 (+  48) 8004e858   <kernel_x86> FileMap<[32m0xdf02bea0[0m>::_MakeSpace(uint32: [34m0x9[0m ([34m9[0m)) + 0x88
KERN: 23 81b09534 (+  64) 8004e8d6   <kernel_x86> FileMap<[32m0xdf02bea0[0m>::_Add(file_io_vec*: [34m0x81b095b4[0m, uint32: [34m0x8[0m ([34m8[0m), [34m0x81b095ac[0m) + 0x36
KERN: 24 81b09574 (+ 192) 8004eccb   <kernel_x86> FileMap<[32m0xdf02bea0[0m>::_Cache(int64: [34m1073676288[0m, int64: [34m65536[0m) + 0xf7
KERN: 25 81b09634 (+  80) 8004eed8   <kernel_x86> FileMap<[32m0xdf02bea0[0m>::Translate(int64: [34m1073676288[0m, uint32: [34m0x10000[0m ([34m65536[0m), file_io_vec*: [34m0x81b09764[0m, [34m0x81b09760[0m, uint32: [34m0x800[0m ([34m2048[0m)) + 0x100
KERN: 26 81b09684 (+  48) 8004f49a   <kernel_x86> file_map_translate + 0x2e
KERN: 27 81b096b4 (+  64) 81afd6b6   </boot/system/add-ons/kernel/file_systems/udf> iterative_io_get_vecs_hook([34m0xdf00cd80[0m, IORequest*: [34m0x81b098dc[0m, int64: [34m1073676288[0m, uint32: [34m0x10000[0m ([34m65536[0m), file_io_vec*: [34m0x81b09764[0m, [34m0x81b09760[0m) + 0x3e
KERN: 28 81b096f4 (+ 240) 800e8b06   <kernel_x86> do_iterative_fd_io_iterate([34m0xd26578a8[0m, IORequest*: [34m0x81b098dc[0m, [34m0x81b09813[0m) + 0x8a
KERN: 29 81b097e4 (+  64) 800e96cb   <kernel_x86> do_iterative_fd_io + 0x16b
KERN: 30 81b09824 (+  64) 81afdee9   </boot/system/add-ons/kernel/file_systems/udf> udf_io(fs_volume*: [34m0xdf02ac00[0m, fs_vnode*: [34m0xd23ff590[0m, NULL, IORequest*: [34m0x81b098dc[0m) + 0x69
KERN: 31 81b09864 (+  64) 800e9264   <kernel_x86> vfs_vnode_io + 0x2c
KERN: 32 81b098a4 (+ 224) 800deda5   <kernel_x86> vfs_read_pages + 0x69
KERN: 33 81b09984 (+ 784) 8004c446   <kernel_x86> read_into_cache(file_cache_ref*: [34m0xdf02a780[0m, NULL, int64: [34m1073676288[0m, int32: [34m0[0m, uint32: [34m0x17eb000[0m, uint32: [34m0x10000[0m ([34m65536[0m), [34mtrue[0m, vm_page_reservation*: [34m0x81b09d58[0m, uint32: [34m0x0[0m ([34m0[0m)) + 0x1ee
KERN: 34 81b09c94 (+ 208) 8004d949   <kernel_x86> cache_io([34m0xdf02a780[0m, NULL, int64: [34m1073737728[0m, uint32: [34m0x17fa000[0m, [34m0x81b09f2c[0m, [34mfalse[0m) + 0x701
KERN: 35 81b09d64 (+  96) 8004e38e   <kernel_x86> file_cache_read + 0x36
KERN: 36 81b09dc4 (+ 128) 81af8064   </boot/system/add-ons/kernel/file_systems/udf> Icb<[32m0xdf00cd80[0m>::Read(int64: [34m1073676288[0m, [34m0x17eb000[0m, [34m0x81b09f2c[0m, NULL) + 0x88
KERN: 37 81b09e44 (+  64) 81afde77   </boot/system/add-ons/kernel/file_systems/udf> udf_read(fs_volume*: [34m0xdf02ac00[0m, fs_vnode*: [34m0xd23ff590[0m, [34m0xd22cc340[0m, int64: [34m1073676288[0m, [34m0x17eb000[0m, [34m0x81b09f2c[0m) + 0x33
KERN: 38 81b09e84 (+  64) 800e0ddb   <kernel_x86> file_read(file_descriptor*: [34m0xdf02a540[0m, int64: [34m1073676288[0m, [34m0x17eb000[0m, [34m0x81b09f2c[0m) + 0x67
KERN: 39 81b09ec4 (+  80) 800ccbad   <kernel_x86> common_user_io(int32: [34m3[0m, int64: [34m-1[0m, [34m0x17eb000[0m, uint32: [34m0x10000[0m ([34m65536[0m), [34mfalse[0m) + 0x181
KERN: 40 81b09f14 (+  48) 800cd070   <kernel_x86> _user_read + 0x28
KERN: 41 81b09f44 (+ 100) 80128480   <kernel_x86> handle_syscall + 0xcd
KERN: user iframe at 0x81b09fa8 (end = 0x81b0a000)
KERN:  eax 0x8a          ebx 0x10212cc      ecx 0x634d2a04  edx 0x61738114
KERN:  esi 0x10000       edi 0x0            ebp 0x634d2a40  esp 0x81b09fdc
KERN:  eip 0x61738114 eflags 0x3202    user esp 0x634d2a04
KERN:  vector: 0x63, error code: 0x0
KERN: 42 81b09fa8 (+   0) 61738114   <commpage> commpage_syscall + 0x04
KERN: 43 634d2a40 (+ 384) 0030cd6b   <_APP_> main (nearest) + 0x1737
KERN: 44 634d2bc0 (+ 480) 0030eb7b   <_APP_> src_info_init (nearest) + 0x145f
KERN: 45 634d2da0 (+  80) 0030f431   <_APP_> copy + 0x89
KERN: 46 634d2df0 (+ 224) 0030b2c9   <_APP_> usage (nearest) + 0xbd5
KERN: 47 634d2ed0 (+ 240) 0030c037   <_APP_> main + 0xa03
KERN: 48 634d2fc0 (+  48) 0030a5cf   <_APP_> _start + 0x5b
KERN: 49 634d2ff0 (+  48) 00c745ae   </boot/system/runtime_loader@0x00c65000> <unknown> + 0xf5ae
KERN: 50 634d3020 (+   0) 61738250   <commpage> commpage_thread_exit + 0x00
KERN: kdebug> kdebug> continue KERN: Last message repeated 2 times.

comment:3 Changed 6 years ago by anevilyak

Component: - GeneralFile Systems/UDF

That could be a problem in either the object cache or in the UDF add-on itself. Assigning to the latter for now though, as similarly-sized copies work fine for other media/filesystems.

comment:4 Changed 6 years ago by bonefish

The infinite loop can be at any level in the stack trace. So entering KDL multiple times to do a stack trace would indeed help to possibly rule out the functions in the upper stack. Fortunately the kernel debugger has an even more helpful tool for this case: breakpoint support. The "caller" column shows the return addresses, so setting a breakpoint at such an address can conclusively determine whether the callee returns to the caller.

The only caveat of the method is that breakpoints cannot be restricted to certain teams or threads, so setting a breakpoint in a commonly used function (like in the memory allocation code) will likely not be very helpful, since it may repeatedly be triggered by other threads. The four UDF functions we see in the stack trace should be pretty good locations, though. The file map related kernel code might be OK, too. It is also used by BFS, but it shouldn't be hit very often (possibly only for the syslog writes), as long as you avoid doing stuff with files (e.g. in Tracker).

The strategy as such is rather simple:

  1. Enter KDL and get a stack trace of the misbehaving thread.
  2. Pick one of the caller addresses, set a breakpoint via breakpoint <address>, and continue.
  3. If the breakpoint is hit (KDL will be entered with a respective message):
    • If it is in the wrong thread, just continue. If the breakpoint keeps being hit by the wrong threads, clear the breakpoint (breakpoint <address> clear -- use the command history to avoid address typos) and try 2. again with a different address.
    • If it is in the right thread, clear the breakpoint and try 2. again with an address lower in the stack trace.
    If the breakpoint is not hit, clear the breakpoint and try 2. again with an address higher in the stack trace.

The goal of the exercise is to find the topmost caller that is not being returned to or at least narrow it down (i.e. find a lower boundary).

comment:5 Changed 6 years ago by ttcoder

I'm returning with what I think is a solid pin-point on the "stuck" function, though you guys need to interpret it ("caller" means the breakpoint pertains to the line below which it is written, off-by-one, or to the same line?).

It didn't take long before I narrowed things down to this block:

KERN: 24 81b09574 (+ 192) 8004eccb   <kernel_x86> FileMap<[32m0xdf02bea0[0m>::_Cache(int64: [34m1073676288[0m, int64: [34m65536[0m) + 0xf7
KERN: 25 81b09634 (+  80) 8004eed8   <kernel_x86> FileMap<[32m0xdf02bea0[0m>::Translate(int64: [34m1073676288[0m, uint32: [34m0x10000[0m ([34m65536[0m), file_io_vec*: [34m0x81b09764[0m, [34m0x81b09760[0m, uint32: [34m0x800[0m ([34m2048[0m)) + 0x100
KERN: 26 81b09684 (+  48) 8004f49a   <kernel_x86> file_map_translate + 0x2e

So I did..

  • breakpoint 0x8004eccb

Which was hit all the time (Desktop did not even have time to redraw after a "continue", I was dropped right back to KDL).

Then I cleared it and did...

  • breakpoint 0x8004eed8

Which was hit every 4 seconds or so, by a different thread (page_writer), but never ever by the "cp" thread.

comment:6 Changed 6 years ago by bonefish

Thanks, that pinpoints it indeed. The function name matches the caller address on the same line, so FileMap::_Cache() doesn't return. It only has a single loop that repeatedly calls vfs_get_file_map() (which just calls the file system hook) and adds the retrieved vectors until it hits the end of the file or the maximum cache size. So either the retrieved vectors aren't handled correctly or UDF's hook doesn't work correctly -- returning B_OK together with no vectors or an empty one. I suppose an assert could be added to catch the latter case at least.

comment:7 Changed 6 years ago by anevilyak

Cc: korli added

From looking at http://cgit.haiku-os.org/haiku/tree/src/add-ons/kernel/file_systems/udf/Icb.cpp#n498 , the empty case looks to be a distinct possibility. Perhaps korli can shed some light, as he most recently touched the UDF code.

comment:8 Changed 6 years ago by korli

An empty extent at the end of the file could definitely trigger such a behavior. FileMap::_Cache() should btw checks whether the offset has been advanced in the loop. It would interesting to see the log generated by activating it here.

Changed 6 years ago by ttcoder

Attachment: syslog_first UDF trace.txt added

(first and last of ca. 4 syslogs)

Changed 6 years ago by ttcoder

Attachment: syslog_last UDF trace.txt added

(if needed, I can try a more "clean" capture of the 4 successive syslogs)

comment:9 Changed 6 years ago by korli

It looks like a descriptor chain is occurring in the very same disk block, which we don't seem to handle correctly. ttcoder, is it possible to dump this dvd and provide it online for testing purposes?

comment:10 Changed 6 years ago by ttcoder

At 30 KB/s upload rate it's gonna take forever :-b... Better mail/Colissimo it to you some day.. I suppose it wouldn't do much good to send just a part of it, extracted with a clever combination of /bin/head and /bin/tail.. ?

No big deal on fixing this bug as far as I'm concerned; though that particular DVD was burnt in Windows 7 with a recent version of Nero so I suppose it could occur again for other people in the future with modern (strike that... with "recent") gear, so I'll try to help.

comment:11 Changed 6 years ago by korli

Slow upload rate :) Anyway something like "dd if=/dev/../../dvd.raw of=block.raw bs=2K skip=524553" would normally do if the block size is 2K...

Changed 6 years ago by ttcoder

Attachment: block_for_9896.raw added

comment:12 Changed 6 years ago by ttcoder

You bet it is slow, but I've been on dial-up for so long I'm still grateful it's that "fast"!

Uploaded the first MB from within windows (webpositive consistently stalls/stops after a few seconds, dang).

comment:13 Changed 4 years ago by ttcoder

Could be a dupe of previous/older ticket #9181

comment:14 Changed 4 years ago by pulkomandy

Blocking: 9181 added

comment:15 Changed 4 years ago by waddlesplash

Milestone: R1Unscheduled

Moving UDF tickets out of R1

comment:16 Changed 2 weeks ago by cocobean

@ttcoder - Please retest at hrev53043+ or R1B1.

comment:17 Changed 2 weeks ago by ttcoder

I don't think I own that DVD any more -- suggest filing as not reproducible

comment:18 Changed 2 weeks ago by waddlesplash

Resolution: not reproducible
Status: newclosed
Note: See TracTickets for help on using tickets.