Opened 15 years ago
Closed 15 years ago
#5489 closed bug (fixed)
PANIC: vm_page_fault: unhandled page fault in kernel at 0x70616d6d, ip 0x800d2723
Reported by: | aldeck | Owned by: | bonefish |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
PANIC: vm_page_fault: unhandled page fault in kernel at 0x70616d6d, ip 0x800d2723
While building HaikuLauncher with j4 on a quad core.
Haiku hrev35621, gcc2/4 hybdrid
Attached the serial dump of the kdebug session (following anevilyak's and mmlr's instructions over irc).
Attachments (2)
Change History (11)
by , 15 years ago
Attachment: | kdebug_session.txt added |
---|
follow-up: 5 comment:1 by , 15 years ago
comment:2 by , 15 years ago
This is the faulting line in object_cache_alloc(): http://dev.haiku-os.org/browser/haiku/trunk/src/system/kernel/slab/Slab.cpp?rev=35621#L666. The source->free field contained the pointer 0x70616d6d, which would not even be a kernel address. The source pointer (esi: 0xd206ffe4) looks OK (28 bytes before the end of the page -- the slab structure being part of the slab itself), so I'd assume that one is OK. The source->free pointer can't have been written there the regular way, since that would have dereferenced it as well, so someone probably overwrote the slab structure. 0x70616d6d translates to 'mmap', which could be part of a string. It would have been interesting to see a bit more of the surrounding memory.
The object cache is for vm_page_mappings objects. I'd consider it rather unlikely that a user of those would overwrite them (they just consist of two list links and two pointers). Since that is 447 KB into a slab area, either someone got a page mapping object address from invalid/uninitialized memory and messes with it or the user of a neighboring slab has written beyond its allocation (somewhat more likely backwards). So the caches the neighboring slabs belong to would have been interesting, too. Candidates are only small object caches with 4 KB slabs, which reduces the list to the following:
0x81803220 block cache: 32 32 12288 0 329 381 c0000000 0x81803330 block cache: 48 48 4096 0 61 84 80000000 0x81803420 block cache: 64 64 4096 0 5 63 c0000000 0x81803510 block cache: 80 80 4096 0 39 50 80000000 0x818037e0 block cache: 128 128 4096 0 4 31 c0000000 0x818038d0 block cache: 160 160 4096 0 14 25 80000000 0x81803ba0 block cache: 256 256 4096 0 6 15 c0000000 0x81807c00 swapblock 148 622592 152 0 4104 0 0x81807b00 mbufs 256 131072 0 419 480 0
The "block cache:..." caches are used only for slab internal allocations (since USE_SLAB_ALLOCATOR_FOR_MALLOC doesn't seem to be enabled), which should only be for the object caches objects themselves, the slab structures for hashed object caches, the hash tables for those, and the object depot objects and magazines. Nothing where I see potential overwriting happen, at least not with such a value. The "swapblock" objects are apparently unused at that point. No idea how the "mbufs" are used in the FreeBSD networking code and whether there could be any problem there.
To sum my ramblings up: Most likely a problem of slab structure corruption by overwritten memory, probably by a slab user or the slab implementation itself (the latter less likely IMO). Definitely doesn't sound good.
In case anyone runs into the same situation. The memory around the slab structure would be of interest (e.g. "db ($esi - 32) 128") as would neighboring slabs ("slab_meta_chunk $esi").
comment:3 by , 15 years ago
comment:4 by , 15 years ago
Regarding mutex_lock.cap: The bug in the "info" command is fixed in hrev35657. The double lock of the mutex is just a coincidence. The underlying problem is the page fault in object_cache_alloc(). If faults at 0x70616000, which could also be a part of a string (or the end of one and start of another): "\0`ap". Grepping through our sources turned nothing up that would match, but it might come from somewhere else.
I still think overwritten memory is the most likely cause. Axel's infinite loop problem might have the same cause, so I would first let him examine it, since he can easily reproduce it.
comment:5 by , 15 years ago
Replying to axeld:
FWIW my system has been very unstable after the latest magazine cleanups. Disabling the magazines cured the problem.
Come to think of it, that actually strongly hints towards a problem in the object depot code. Any overwriting of memory that happens outside that code, would still happen, if object depots are disabled. So disabling them shouldn't really avoid any problem. An alternative explanation to hrev35601 triggering the bug because objects are now returned to the slab would be that hrev35601 actually introduced the bug in the first place. :-) That explanation has the advantage of also explaining why disabling the object depot disables the bug. I guess it doesn't harm to review the object depot code once more.
follow-up: 7 comment:6 by , 15 years ago
Component: | System/Kernel → Network & Internet/Stack |
---|
Following Axel's hint to reproduce the bug (copy several GB of files over network), I tried a scp -r localhost:... ...
with my mp3 collection and reliably run into something like the following very quickly:
PANIC: vm_page_fault: unhandled page fault in kernel space at 0x4, ip 0x800c6e28 Welcome to Kernel Debugging Land... Thread 466 "sshd" running on CPU 2 kdebug> bt stack trace for thread 466 "sshd" kernel stack: 0x822a1000 to 0x822a5000 user stack: 0x7efef000 to 0x7ffef000 frame caller <image>:function + offset 0 822a49e8 (+ 32) 800724f5 <kernel_x86> invoke_command_trampoline(void*: 0x822a4a68) + 0x0015 1 822a4a08 (+ 12) 800e74e0 <kernel_x86>:arch_debug_call_with_fault_handler + 0x001b 2 822a4a14 (+ 48) 800703a2 <kernel_x86>:debug_call_with_fault_handler + 0x0051 3 822a4a44 (+ 64) 80072872 <kernel_x86>:invoke_debugger_command + 0x00bb 4 822a4a84 (+ 48) 8007298f <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x8013e9a2, int32: 0, char*: NULL) + 0x0083 5 822a4ab4 (+ 32) 80072a57 <kernel_x86>:invoke_debugger_command_pipe + 0x008b 6 822a4ad4 (+ 128) 800767e2 <kernel_x86> ExpressionParser<0x822a4ba4>::_ParseCommandPipe(int&: 0x822a4ba0) + 0x0aae 7 822a4b54 (+ 48) 80078fab <kernel_x86> ExpressionParser<0x822a4ba4>::EvaluateCommand(char const*: 0x8013e9a0 "bt", int&: 0x822a4ba0) + 0x06df 8 822a4b84 (+ 192) 80079124 <kernel_x86>:evaluate_debug_command + 0x0084 9 822a4c44 (+ 64) 8007106e <kernel_x86> kernel_debugger_loop(char const*: 0x2 "<???>", char const*: 0x8013296b "PANIC: ", char*: 0x822a4cb4, int32:c 10 822a4c84 (+ 48) 80071232 <kernel_x86> kernel_debugger_internal(char const*: 0x2 "<???>", char const*: 0x2 "<???>", char*: 0x822a4cd4, int32: -214c 11 822a4cb4 (+ 32) 80071415 <kernel_x86>:panic + 0x0023 12 822a4cd4 (+ 64) 800cc2f1 <kernel_x86>:vm_page_fault + 0x00f7 13 822a4d14 (+ 80) 800e2c70 <kernel_x86> page_fault_exception(iframe*: 0x822a4d70) + 0x0171 14 822a4d64 (+ 12) 800e788d <kernel_x86>:int_bottom + 0x003d kernel iframe at 0x822a4d70 (end = 0x822a4dc0) eax 0x82fdd018 ebx 0x82fdd018 ecx 0x0 edx 0x828c001c esi 0x828ca7a8 edi 0x828ca780 ebp 0x822a4dc0 esp 0x822a4da4 eip 0x800c6e28 eflags 0x210202 vector: 0xe, error code: 0x2 15 822a4d70 (+ 80) 800c6e28 <kernel_x86>:list_remove_link + 0x000b 16 822a4dc0 (+ 20) 800c6f2f <kernel_x86>:list_remove_head_item + 0x0019 17 822a4dd4 (+ 48) 8101a78c </boot/system/add-ons/kernel/network/stack> free_buffer(net_buffer*: 0x828ca780) + 0x0024 18 822a4e04 (+ 128) 8101c0ee </boot/system/add-ons/kernel/network/stack> socket_send(net_socket*: 0xce554000, msghdr*: NULL, void const*: 0x183eca1e,0 19 822a4e84 (+ 48) 81023447 </boot/system/add-ons/kernel/network/stack> stack_interface_send(net_socket*: 0xce554000, void const*: 0x183e4a1e, uint3d 20 822a4eb4 (+ 32) 800a6546 <kernel_x86> socket_write(file_descriptor*: 0xd095f320, int64: 86363000, void const*: 0x183e4a1e, unsigned long*: 0x822af 21 822a4ed4 (+ 80) 800a0e9f <kernel_x86> common_user_io(int32: 406735390, int64: 4295012218, void*: 0xd0376800, uint32: 0x822a4fa8, true) + 0x00ff 22 822a4f24 (+ 32) 800a0f3b <kernel_x86>:_user_write + 0x001c 23 822a4f44 (+ 100) 800e7ad1 <kernel_x86>:handle_syscall + 0x00be user iframe at 0x822a4fa8 (end = 0x822a5000) eax 0x85 ebx 0x4b758c ecx 0x7ffeec20 edx 0xffff0114 esi 0x0 edi 0x183e4a1e ebp 0x7ffeec5c esp 0x822a4fdc eip 0xffff0114 eflags 0x200203 user esp 0x7ffeec20 vector: 0x63, error code: 0x0 24 822a4fa8 (+ 0) ffff0114 <commpage>:commpage_syscall + 0x0004 25 7ffeec5c (+ 48) 0022d78d <sshd>:roaming_write + 0x0029 26 7ffeec8c (+ 48) 0023d92f <sshd>:packet_write_poll + 0x0067 27 7ffeecbc (+ 80) 00217732 <sshd>:dump_config (nearest) + 0x1222 28 7ffeed0c (+ 80) 002180ae <sshd>:server_loop2 + 0x00fa 29 7ffeed5c (+ 48) 0022044b <sshd>:session_setup_x11fwd (nearest) + 0x028f 30 7ffeed8c (+ 48) 0021bf1a <sshd>:do_authenticated + 0x0062 31 7ffeedbc (+ 432) 00211482 <sshd>:main + 0x18b6 32 7ffeef6c (+ 48) 0020db3f <sshd>:_start + 0x005b 33 7ffeef9c (+ 64) 00105367 </boot/system/runtime_loader@0x00100000>:unknown + 0x5367 34 7ffeefdc (+ 0) 7ffeefec 5290:sshd_main_stack@0x7efef000 + 0xffffec kdebug> call 16 -1 thread 466, sshd 822a4dc0 800c6f2f <kernel_x86>:list_remove_head_item(0x828ca7a8) kdebug> dw 0x828ca7a8 [0x828ca7a8] .�...�.......... 82fdd018 82fdd018 00000000 80f00000 kdebug> dw 0x82fdd018 [0x82fdd018] .........�...�.. 00000000 828c001c 82fdd000 82fdd000 kdebug> traced 0 -2 -1 #0x82fdd000 1878673. [ 466] 140281514: object cache alloc: cache: 0x828a6dc0, flags: 0x0 -> object: 0x82fdd000 1878688. [ 466] 140281522: object cache free: cache: 0x828a6dc0, object: 0x82fdd000 printed 2 entries within range 1878673 to 1879633 (961 of 1879633 total, 1879516 ever)
So obviously net buffer data nodes are used after they have been freed, which would definitely explain the memory overwriting problems. I haven't checked, but I doubt that disabling the object depots would have any effect (other than speeding up the crash). So there might be different issues.
comment:7 by , 15 years ago
Replying to bonefish:
I haven't checked, but I doubt that disabling the object depots would have any effect (other than speeding up the crash).
Looks like I'm wrong. Disabling the object depots seems to make it impossible to reproduce.
comment:8 by , 15 years ago
Component: | Network & Internet/Stack → System/Kernel |
---|---|
Owner: | changed from | to
Status: | new → in-progress |
PANIC: vm_page_fault: unhandled page fault in kernel space at 0x0, ip 0x800c6f0f Welcome to Kernel Debugging Land... Thread 254 "sshd" running on CPU 3 kdebug> bt stack trace for thread 254 "sshd" kernel stack: 0x82286000 to 0x8228a000 user stack: 0x7efef000 to 0x7ffef000 frame caller <image>:function + offset 0 8228982c (+ 32) 800724f5 <kernel_x86> invoke_command_trampoline(void*: 0x822898ac) + 0x0015 1 8228984c (+ 12) 800e74e0 <kernel_x86>:arch_debug_call_with_fault_handler + 0x001b 2 82289858 (+ 48) 800703a2 <kernel_x86>:debug_call_with_fault_handler + 0x0051 3 82289888 (+ 64) 80072872 <kernel_x86>:invoke_debugger_command + 0x00bb 4 822898c8 (+ 48) 8007298f <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x8013e5a2, int32: 0, char*: NULL) + 0x0083 5 822898f8 (+ 32) 80072a57 <kernel_x86>:invoke_debugger_command_pipe + 0x008b 6 82289918 (+ 128) 800767e2 <kernel_x86> ExpressionParser<0x822899e8>::_ParseCommandPipe(int&: 0x822899e4) + 0x0aae 7 82289998 (+ 48) 80078fab <kernel_x86> ExpressionParser<0x822899e8>::EvaluateCommand(char const*: 0x8013e5a0 "bt", int&: 0x822899e4) + 0x06df 8 822899c8 (+ 192) 80079124 <kernel_x86>:evaluate_debug_command + 0x0084 9 82289a88 (+ 64) 8007106e <kernel_x86> kernel_debugger_loop(char const*: 0x3 "<???>", char const*: 0x8013296b "PANIC: ", char*: 0x82289af8, int32:c 10 82289ac8 (+ 48) 80071232 <kernel_x86> kernel_debugger_internal(char const*: 0x3 "<???>", char const*: 0x82289b44 "t.�.�l..", char*: 0x82289b18, ic 11 82289af8 (+ 32) 80071415 <kernel_x86>:panic + 0x0023 12 82289b18 (+ 64) 800cc2f1 <kernel_x86>:vm_page_fault + 0x00f7 13 82289b58 (+ 80) 800e2c70 <kernel_x86> page_fault_exception(iframe*: 0x82289bb4) + 0x0171 14 82289ba8 (+ 12) 800e788d <kernel_x86>:int_bottom + 0x003d kernel iframe at 0x82289bb4 (end = 0x82289c04) eax 0xd4828818 ebx 0x81026cfc ecx 0xd4828818 edx 0x0 esi 0xd4828818 edi 0xd4828818 ebp 0x82289c04 esp 0x82289be8 eip 0x800c6f0f eflags 0x210286 vector: 0xe, error code: 0x2 15 82289bb4 (+ 80) 800c6f0f <kernel_x86>:list_insert_item_before + 0x0031 16 82289c04 (+ 64) 8101a29b </boot/system/add-ons/kernel/network/stack> prepend_size(net_buffer*: 0x828dc280, uint32: 0x20 (32), void**: 0x82289cac)d 17 82289c44 (+ 160) 816ad32c </boot/system/add-ons/kernel/network/protocols/tcp> add_tcp_header(net_address_module_info*: 0x810317a0, tcp_segment_heaa 18 82289ce4 (+ 112) 816ae2e7 </boot/system/add-ons/kernel/network/protocols/tcp> TCPEndpoint<0xcfd5a000>::_SendQueued(false, uint32: 0x3fd8 (16344)) 1 19 82289d54 (+ 32) 816ae478 </boot/system/add-ons/kernel/network/protocols/tcp> TCPEndpoint<0xcfd5a000>::_SendQueued(false) + 0x0028 20 82289d74 (+ 112) 816b0441 </boot/system/add-ons/kernel/network/protocols/tcp> TCPEndpoint<0xcfd5a000>::SendData(net_buffer*: 0x828d4a00) + 0x02f3 21 82289de4 (+ 32) 816accd3 </boot/system/add-ons/kernel/network/protocols/tcp> tcp_send_data(net_protocol*: 0xcfd5a000, net_buffer*: 0x828d4a00) + e 22 82289e04 (+ 128) 8101c0cf </boot/system/add-ons/kernel/network/stack> socket_send(net_socket*: 0xcfd59000, msghdr*: NULL, void const*: 0x183ccf91,1 23 82289e84 (+ 48) 81023447 </boot/system/add-ons/kernel/network/stack> stack_interface_send(net_socket*: 0xcfd59000, void const*: 0x183c4f91, uint3d 24 82289eb4 (+ 32) 800a6546 <kernel_x86> socket_write(file_descriptor*: 0x83058fa0, int64: 216645467, void const*: 0x183c4f91, unsigned long*: 0x822f 25 82289ed4 (+ 80) 800a0e9f <kernel_x86> common_user_io(int32: 406605713, int64: 4295010439, void*: 0xd1b5f000, uint32: 0x82289fa8, true) + 0x00ff 26 82289f24 (+ 32) 800a0f3b <kernel_x86>:_user_write + 0x001c 27 82289f44 (+ 100) 800e7ad1 <kernel_x86>:handle_syscall + 0x00be user iframe at 0x82289fa8 (end = 0x8228a000) eax 0x85 ebx 0x4b758c ecx 0x7ffeec20 edx 0xffff0114 esi 0x0 edi 0x183c4f91 ebp 0x7ffeec5c esp 0x82289fdc eip 0xffff0114 eflags 0x200203 user esp 0x7ffeec20 vector: 0x63, error code: 0x0 28 82289fa8 (+ 0) ffff0114 <commpage>:commpage_syscall + 0x0004 29 7ffeec5c (+ 48) 0022d78d <sshd>:roaming_write + 0x0029 30 7ffeec8c (+ 48) 0023d92f <sshd>:packet_write_poll + 0x0067 31 7ffeecbc (+ 80) 00217732 <sshd>:dump_config (nearest) + 0x1222 32 7ffeed0c (+ 80) 002180ae <sshd>:server_loop2 + 0x00fa 33 7ffeed5c (+ 48) 0022044b <sshd>:session_setup_x11fwd (nearest) + 0x028f 34 7ffeed8c (+ 48) 0021bf1a <sshd>:do_authenticated + 0x0062 35 7ffeedbc (+ 432) 00211482 <sshd>:main + 0x18b6 36 7ffeef6c (+ 48) 0020db3f <sshd>:_start + 0x005b 37 7ffeef9c (+ 64) 00105367 </boot/system/runtime_loader@0x00100000>:unknown + 0x5367 38 7ffeefdc (+ 0) 7ffeefec 3506:sshd_main_stack@0x7efef000 + 0xffffec kdebug> call 15 -3 thread 254, sshd 82289bb4 800c6f0f <kernel_x86>:list_insert_item_before(0x828dc2a8, 0xd4828818, 0xd4828818) kdebug> dw 0x828dc2a8 [0x828dc2a8] ...�...�.......� d4828818 d4828818 00000000 d4828800 kdebug> traced 0 -10 -1 #0xd4828800 2718494. [ 105] 69867173: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800 2718502. [ 105] 69867184: object cache free: cache: 0x828a4780, object: 0xd4828800 2719468. [ 103] 69869292: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800 2719722. [ 105] 69869745: object cache free: cache: 0x828a4780, object: 0xd4828800 2719724. [ 105] 69869746: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800 2719915. [ 105] 69870016: object cache free: cache: 0x828a4780, object: 0xd4828800 2722597. [ 254] 69875083: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800 2722598. [ 254] 69875084: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800 2722599. [ 254] 69875085: object cache free: cache: 0x828a4780, object: 0xd4828800 2722600. [ 254] 69875085: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800 printed 10 entries within range 2718494 to 2722744 (4251 of 2722744 total, 2722548 ever)
Ah, sorry, should have checked that earlier. The net stack isn't to blame after all. The object cache can apparently give out objects twice when depots are enabled. I guess that would be a safe time to start betting body parts on a bug in the object depot implementation. ;-)
FWIW my system has been very unstable after the latest magazine cleanups. Disabling the magazines cured the problem.
Since the previous net_buffer problems are gone, too, I suspect there is still something fishy with the slab implementation (off by one error somewhere, maybe), that shows itself differently now that the magazines have been limited, and are freed when needed.
At least the net_buffer implementation seems to work just fine. I currently can easily reproduce (but only on my main system, not in emulation) a different slab problem that causes an endless loop in object_cache_alloc() -- object_cache_reserve_internal() obviously returns B_OK even though the partial and empty lists are empty. I'll try to look into this by Monday, if no one beats me to it.