Opened 9 years ago

Closed 9 years ago

#5489 closed bug (fixed)

PANIC: vm_page_fault: unhandled page fault in kernel at 0x70616d6d, ip 0x800d2723

Reported by: aldeck Owned by: bonefish
Priority: normal Milestone: R1
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

PANIC: vm_page_fault: unhandled page fault in kernel at 0x70616d6d, ip 0x800d2723

While building HaikuLauncher with j4 on a quad core.

Haiku hrev35621, gcc2/4 hybdrid
Attached the serial dump of the kdebug session (following anevilyak's and mmlr's instructions over irc).

Attachments (2)

kdebug_session.txt (15.8 KB) - added by aldeck 9 years ago.
mutex_lock.cap (14.8 KB) - added by mmadia 9 years ago.
serial output for comment:3

Download all attachments as: .zip

Change History (11)

Changed 9 years ago by aldeck

Attachment: kdebug_session.txt added

comment:1 Changed 9 years ago by axeld

FWIW my system has been very unstable after the latest magazine cleanups. Disabling the magazines cured the problem.

Since the previous net_buffer problems are gone, too, I suspect there is still something fishy with the slab implementation (off by one error somewhere, maybe), that shows itself differently now that the magazines have been limited, and are freed when needed.

At least the net_buffer implementation seems to work just fine. I currently can easily reproduce (but only on my main system, not in emulation) a different slab problem that causes an endless loop in object_cache_alloc() -- object_cache_reserve_internal() obviously returns B_OK even though the partial and empty lists are empty. I'll try to look into this by Monday, if no one beats me to it.

comment:2 Changed 9 years ago by bonefish

This is the faulting line in object_cache_alloc(): http://dev.haiku-os.org/browser/haiku/trunk/src/system/kernel/slab/Slab.cpp?rev=35621#L666. The source->free field contained the pointer 0x70616d6d, which would not even be a kernel address. The source pointer (esi: 0xd206ffe4) looks OK (28 bytes before the end of the page -- the slab structure being part of the slab itself), so I'd assume that one is OK. The source->free pointer can't have been written there the regular way, since that would have dereferenced it as well, so someone probably overwrote the slab structure. 0x70616d6d translates to 'mmap', which could be part of a string. It would have been interesting to see a bit more of the surrounding memory.

The object cache is for vm_page_mappings objects. I'd consider it rather unlikely that a user of those would overwrite them (they just consist of two list links and two pointers). Since that is 447 KB into a slab area, either someone got a page mapping object address from invalid/uninitialized memory and messes with it or the user of a neighboring slab has written beyond its allocation (somewhat more likely backwards). So the caches the neighboring slabs belong to would have been interesting, too. Candidates are only small object caches with 4 KB slabs, which reduces the list to the following:

0x81803220        block cache: 32       32    12288      0      329      381 c0000000
0x81803330        block cache: 48       48     4096      0       61       84 80000000
0x81803420        block cache: 64       64     4096      0        5       63 c0000000
0x81803510        block cache: 80       80     4096      0       39       50 80000000
0x818037e0       block cache: 128      128     4096      0        4       31 c0000000
0x818038d0       block cache: 160      160     4096      0       14       25 80000000
0x81803ba0       block cache: 256      256     4096      0        6       15 c0000000
0x81807c00              swapblock      148   622592    152        0     4104        0
0x81807b00                  mbufs      256   131072      0      419      480        0

The "block cache:..." caches are used only for slab internal allocations (since USE_SLAB_ALLOCATOR_FOR_MALLOC doesn't seem to be enabled), which should only be for the object caches objects themselves, the slab structures for hashed object caches, the hash tables for those, and the object depot objects and magazines. Nothing where I see potential overwriting happen, at least not with such a value. The "swapblock" objects are apparently unused at that point. No idea how the "mbufs" are used in the FreeBSD networking code and whether there could be any problem there.

To sum my ramblings up: Most likely a problem of slab structure corruption by overwritten memory, probably by a slab user or the slab implementation itself (the latter less likely IMO). Definitely doesn't sound good.

In case anyone runs into the same situation. The memory around the slab structure would be of interest (e.g. "db ($esi - 32) 128") as would neighboring slabs ("slab_meta_chunk $esi").

comment:3 Changed 9 years ago by mmadia

While using WebPostive hrev241 in hrev35650-24, PANIC: _mutex_lock(): double lock of 0x81008e28 by thread 629 , with thread 629 belonging to WebPositive. It was suggested to attach the serial debug output to this ticket.

Changed 9 years ago by mmadia

Attachment: mutex_lock.cap added

serial output for comment:3

comment:4 Changed 9 years ago by bonefish

Regarding mutex_lock.cap: The bug in the "info" command is fixed in hrev35657. The double lock of the mutex is just a coincidence. The underlying problem is the page fault in object_cache_alloc(). If faults at 0x70616000, which could also be a part of a string (or the end of one and start of another): "\0`ap". Grepping through our sources turned nothing up that would match, but it might come from somewhere else.

I still think overwritten memory is the most likely cause. Axel's infinite loop problem might have the same cause, so I would first let him examine it, since he can easily reproduce it.

comment:5 in reply to:  1 Changed 9 years ago by bonefish

Replying to axeld:

FWIW my system has been very unstable after the latest magazine cleanups. Disabling the magazines cured the problem.

Come to think of it, that actually strongly hints towards a problem in the object depot code. Any overwriting of memory that happens outside that code, would still happen, if object depots are disabled. So disabling them shouldn't really avoid any problem. An alternative explanation to hrev35601 triggering the bug because objects are now returned to the slab would be that hrev35601 actually introduced the bug in the first place. :-) That explanation has the advantage of also explaining why disabling the object depot disables the bug. I guess it doesn't harm to review the object depot code once more.

comment:6 Changed 9 years ago by bonefish

Component: System/KernelNetwork & Internet/Stack

Following Axel's hint to reproduce the bug (copy several GB of files over network), I tried a scp -r localhost:... ... with my mp3 collection and reliably run into something like the following very quickly:

PANIC: vm_page_fault: unhandled page fault in kernel space at 0x4, ip 0x800c6e28

Welcome to Kernel Debugging Land...
Thread 466 "sshd" running on CPU 2
kdebug> bt
stack trace for thread 466 "sshd"
    kernel stack: 0x822a1000 to 0x822a5000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 822a49e8 (+  32) 800724f5   <kernel_x86> invoke_command_trampoline(void*: 0x822a4a68) + 0x0015
 1 822a4a08 (+  12) 800e74e0   <kernel_x86>:arch_debug_call_with_fault_handler + 0x001b
 2 822a4a14 (+  48) 800703a2   <kernel_x86>:debug_call_with_fault_handler + 0x0051
 3 822a4a44 (+  64) 80072872   <kernel_x86>:invoke_debugger_command + 0x00bb
 4 822a4a84 (+  48) 8007298f   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x8013e9a2, int32: 0, char*: NULL) + 0x0083
 5 822a4ab4 (+  32) 80072a57   <kernel_x86>:invoke_debugger_command_pipe + 0x008b
 6 822a4ad4 (+ 128) 800767e2   <kernel_x86> ExpressionParser<0x822a4ba4>::_ParseCommandPipe(int&: 0x822a4ba0) + 0x0aae
 7 822a4b54 (+  48) 80078fab   <kernel_x86> ExpressionParser<0x822a4ba4>::EvaluateCommand(char const*: 0x8013e9a0 "bt", int&: 0x822a4ba0) + 0x06df
 8 822a4b84 (+ 192) 80079124   <kernel_x86>:evaluate_debug_command + 0x0084
 9 822a4c44 (+  64) 8007106e   <kernel_x86> kernel_debugger_loop(char const*: 0x2 "<???>", char const*: 0x8013296b "PANIC: ", char*: 0x822a4cb4, int32:c
10 822a4c84 (+  48) 80071232   <kernel_x86> kernel_debugger_internal(char const*: 0x2 "<???>", char const*: 0x2 "<???>", char*: 0x822a4cd4, int32: -214c
11 822a4cb4 (+  32) 80071415   <kernel_x86>:panic + 0x0023
12 822a4cd4 (+  64) 800cc2f1   <kernel_x86>:vm_page_fault + 0x00f7
13 822a4d14 (+  80) 800e2c70   <kernel_x86> page_fault_exception(iframe*: 0x822a4d70) + 0x0171
14 822a4d64 (+  12) 800e788d   <kernel_x86>:int_bottom + 0x003d
kernel iframe at 0x822a4d70 (end = 0x822a4dc0)
 eax 0x82fdd018     ebx 0x82fdd018      ecx 0x0          edx 0x828c001c
 esi 0x828ca7a8     edi 0x828ca780      ebp 0x822a4dc0   esp 0x822a4da4
 eip 0x800c6e28  eflags 0x210202
 vector: 0xe, error code: 0x2
15 822a4d70 (+  80) 800c6e28   <kernel_x86>:list_remove_link + 0x000b
16 822a4dc0 (+  20) 800c6f2f   <kernel_x86>:list_remove_head_item + 0x0019
17 822a4dd4 (+  48) 8101a78c   </boot/system/add-ons/kernel/network/stack> free_buffer(net_buffer*: 0x828ca780) + 0x0024
18 822a4e04 (+ 128) 8101c0ee   </boot/system/add-ons/kernel/network/stack> socket_send(net_socket*: 0xce554000, msghdr*: NULL, void const*: 0x183eca1e,0
19 822a4e84 (+  48) 81023447   </boot/system/add-ons/kernel/network/stack> stack_interface_send(net_socket*: 0xce554000, void const*: 0x183e4a1e, uint3d
20 822a4eb4 (+  32) 800a6546   <kernel_x86> socket_write(file_descriptor*: 0xd095f320, int64: 86363000, void const*: 0x183e4a1e, unsigned long*: 0x822af
21 822a4ed4 (+  80) 800a0e9f   <kernel_x86> common_user_io(int32: 406735390, int64: 4295012218, void*: 0xd0376800, uint32: 0x822a4fa8, true) + 0x00ff
22 822a4f24 (+  32) 800a0f3b   <kernel_x86>:_user_write + 0x001c
23 822a4f44 (+ 100) 800e7ad1   <kernel_x86>:handle_syscall + 0x00be
user iframe at 0x822a4fa8 (end = 0x822a5000)
 eax 0x85           ebx 0x4b758c        ecx 0x7ffeec20   edx 0xffff0114
 esi 0x0            edi 0x183e4a1e      ebp 0x7ffeec5c   esp 0x822a4fdc
 eip 0xffff0114  eflags 0x200203   user esp 0x7ffeec20
 vector: 0x63, error code: 0x0
24 822a4fa8 (+   0) ffff0114   <commpage>:commpage_syscall + 0x0004
25 7ffeec5c (+  48) 0022d78d   <sshd>:roaming_write + 0x0029
26 7ffeec8c (+  48) 0023d92f   <sshd>:packet_write_poll + 0x0067
27 7ffeecbc (+  80) 00217732   <sshd>:dump_config (nearest) + 0x1222
28 7ffeed0c (+  80) 002180ae   <sshd>:server_loop2 + 0x00fa
29 7ffeed5c (+  48) 0022044b   <sshd>:session_setup_x11fwd (nearest) + 0x028f
30 7ffeed8c (+  48) 0021bf1a   <sshd>:do_authenticated + 0x0062
31 7ffeedbc (+ 432) 00211482   <sshd>:main + 0x18b6
32 7ffeef6c (+  48) 0020db3f   <sshd>:_start + 0x005b
33 7ffeef9c (+  64) 00105367   </boot/system/runtime_loader@0x00100000>:unknown + 0x5367
34 7ffeefdc (+   0) 7ffeefec   5290:sshd_main_stack@0x7efef000 + 0xffffec                                                                               
kdebug> call 16 -1
thread 466, sshd
822a4dc0 800c6f2f   <kernel_x86>:list_remove_head_item(0x828ca7a8)
kdebug> dw 0x828ca7a8
[0x828ca7a8]  .�...�..........   82fdd018 82fdd018 00000000 80f00000
kdebug> dw 0x82fdd018
[0x82fdd018]  .........�...�..   00000000 828c001c 82fdd000 82fdd000                                                                                    
kdebug> traced 0 -2 -1 #0x82fdd000
1878673. [   466]  140281514: object cache alloc: cache: 0x828a6dc0, flags: 0x0 -> object: 0x82fdd000
1878688. [   466]  140281522: object cache free: cache: 0x828a6dc0, object: 0x82fdd000
printed 2 entries within range 1878673 to 1879633 (961 of 1879633 total, 1879516 ever)

So obviously net buffer data nodes are used after they have been freed, which would definitely explain the memory overwriting problems. I haven't checked, but I doubt that disabling the object depots would have any effect (other than speeding up the crash). So there might be different issues.

comment:7 in reply to:  6 Changed 9 years ago by bonefish

Replying to bonefish:

I haven't checked, but I doubt that disabling the object depots would have any effect (other than speeding up the crash).

Looks like I'm wrong. Disabling the object depots seems to make it impossible to reproduce.

comment:8 Changed 9 years ago by bonefish

Component: Network & Internet/StackSystem/Kernel
Owner: changed from axeld to bonefish
Status: newin-progress
PANIC: vm_page_fault: unhandled page fault in kernel space at 0x0, ip 0x800c6f0f

Welcome to Kernel Debugging Land...
Thread 254 "sshd" running on CPU 3
kdebug> bt
stack trace for thread 254 "sshd"
    kernel stack: 0x82286000 to 0x8228a000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 8228982c (+  32) 800724f5   <kernel_x86> invoke_command_trampoline(void*: 0x822898ac) + 0x0015
 1 8228984c (+  12) 800e74e0   <kernel_x86>:arch_debug_call_with_fault_handler + 0x001b
 2 82289858 (+  48) 800703a2   <kernel_x86>:debug_call_with_fault_handler + 0x0051
 3 82289888 (+  64) 80072872   <kernel_x86>:invoke_debugger_command + 0x00bb
 4 822898c8 (+  48) 8007298f   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x8013e5a2, int32: 0, char*: NULL) + 0x0083
 5 822898f8 (+  32) 80072a57   <kernel_x86>:invoke_debugger_command_pipe + 0x008b
 6 82289918 (+ 128) 800767e2   <kernel_x86> ExpressionParser<0x822899e8>::_ParseCommandPipe(int&: 0x822899e4) + 0x0aae
 7 82289998 (+  48) 80078fab   <kernel_x86> ExpressionParser<0x822899e8>::EvaluateCommand(char const*: 0x8013e5a0 "bt", int&: 0x822899e4) + 0x06df
 8 822899c8 (+ 192) 80079124   <kernel_x86>:evaluate_debug_command + 0x0084
 9 82289a88 (+  64) 8007106e   <kernel_x86> kernel_debugger_loop(char const*: 0x3 "<???>", char const*: 0x8013296b "PANIC: ", char*: 0x82289af8, int32:c
10 82289ac8 (+  48) 80071232   <kernel_x86> kernel_debugger_internal(char const*: 0x3 "<???>", char const*: 0x82289b44 "t.�.�l..", char*: 0x82289b18, ic
11 82289af8 (+  32) 80071415   <kernel_x86>:panic + 0x0023
12 82289b18 (+  64) 800cc2f1   <kernel_x86>:vm_page_fault + 0x00f7
13 82289b58 (+  80) 800e2c70   <kernel_x86> page_fault_exception(iframe*: 0x82289bb4) + 0x0171
14 82289ba8 (+  12) 800e788d   <kernel_x86>:int_bottom + 0x003d
kernel iframe at 0x82289bb4 (end = 0x82289c04)
 eax 0xd4828818     ebx 0x81026cfc      ecx 0xd4828818   edx 0x0
 esi 0xd4828818     edi 0xd4828818      ebp 0x82289c04   esp 0x82289be8
 eip 0x800c6f0f  eflags 0x210286
 vector: 0xe, error code: 0x2
15 82289bb4 (+  80) 800c6f0f   <kernel_x86>:list_insert_item_before + 0x0031
16 82289c04 (+  64) 8101a29b   </boot/system/add-ons/kernel/network/stack> prepend_size(net_buffer*: 0x828dc280, uint32: 0x20 (32), void**: 0x82289cac)d
17 82289c44 (+ 160) 816ad32c   </boot/system/add-ons/kernel/network/protocols/tcp> add_tcp_header(net_address_module_info*: 0x810317a0, tcp_segment_heaa
18 82289ce4 (+ 112) 816ae2e7   </boot/system/add-ons/kernel/network/protocols/tcp> TCPEndpoint<0xcfd5a000>::_SendQueued(false, uint32: 0x3fd8 (16344)) 1
19 82289d54 (+  32) 816ae478   </boot/system/add-ons/kernel/network/protocols/tcp> TCPEndpoint<0xcfd5a000>::_SendQueued(false) + 0x0028
20 82289d74 (+ 112) 816b0441   </boot/system/add-ons/kernel/network/protocols/tcp> TCPEndpoint<0xcfd5a000>::SendData(net_buffer*: 0x828d4a00) + 0x02f3
21 82289de4 (+  32) 816accd3   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_send_data(net_protocol*: 0xcfd5a000, net_buffer*: 0x828d4a00) + e
22 82289e04 (+ 128) 8101c0cf   </boot/system/add-ons/kernel/network/stack> socket_send(net_socket*: 0xcfd59000, msghdr*: NULL, void const*: 0x183ccf91,1
23 82289e84 (+  48) 81023447   </boot/system/add-ons/kernel/network/stack> stack_interface_send(net_socket*: 0xcfd59000, void const*: 0x183c4f91, uint3d
24 82289eb4 (+  32) 800a6546   <kernel_x86> socket_write(file_descriptor*: 0x83058fa0, int64: 216645467, void const*: 0x183c4f91, unsigned long*: 0x822f
25 82289ed4 (+  80) 800a0e9f   <kernel_x86> common_user_io(int32: 406605713, int64: 4295010439, void*: 0xd1b5f000, uint32: 0x82289fa8, true) + 0x00ff
26 82289f24 (+  32) 800a0f3b   <kernel_x86>:_user_write + 0x001c
27 82289f44 (+ 100) 800e7ad1   <kernel_x86>:handle_syscall + 0x00be
user iframe at 0x82289fa8 (end = 0x8228a000)
 eax 0x85           ebx 0x4b758c        ecx 0x7ffeec20   edx 0xffff0114
 esi 0x0            edi 0x183c4f91      ebp 0x7ffeec5c   esp 0x82289fdc
 eip 0xffff0114  eflags 0x200203   user esp 0x7ffeec20
 vector: 0x63, error code: 0x0
28 82289fa8 (+   0) ffff0114   <commpage>:commpage_syscall + 0x0004
29 7ffeec5c (+  48) 0022d78d   <sshd>:roaming_write + 0x0029
30 7ffeec8c (+  48) 0023d92f   <sshd>:packet_write_poll + 0x0067
31 7ffeecbc (+  80) 00217732   <sshd>:dump_config (nearest) + 0x1222
32 7ffeed0c (+  80) 002180ae   <sshd>:server_loop2 + 0x00fa
33 7ffeed5c (+  48) 0022044b   <sshd>:session_setup_x11fwd (nearest) + 0x028f
34 7ffeed8c (+  48) 0021bf1a   <sshd>:do_authenticated + 0x0062
35 7ffeedbc (+ 432) 00211482   <sshd>:main + 0x18b6
36 7ffeef6c (+  48) 0020db3f   <sshd>:_start + 0x005b
37 7ffeef9c (+  64) 00105367   </boot/system/runtime_loader@0x00100000>:unknown + 0x5367
38 7ffeefdc (+   0) 7ffeefec   3506:sshd_main_stack@0x7efef000 + 0xffffec
kdebug> call 15 -3
thread 254, sshd
82289bb4 800c6f0f   <kernel_x86>:list_insert_item_before(0x828dc2a8, 0xd4828818, 0xd4828818)
kdebug> dw 0x828dc2a8
[0x828dc2a8]  ...�...�.......�   d4828818 d4828818 00000000 d4828800
kdebug> traced 0 -10 -1 #0xd4828800
2718494. [   105]   69867173: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800
2718502. [   105]   69867184: object cache free: cache: 0x828a4780, object: 0xd4828800
2719468. [   103]   69869292: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800
2719722. [   105]   69869745: object cache free: cache: 0x828a4780, object: 0xd4828800
2719724. [   105]   69869746: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800
2719915. [   105]   69870016: object cache free: cache: 0x828a4780, object: 0xd4828800
2722597. [   254]   69875083: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800
2722598. [   254]   69875084: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800
2722599. [   254]   69875085: object cache free: cache: 0x828a4780, object: 0xd4828800
2722600. [   254]   69875085: object cache alloc: cache: 0x828a4780, flags: 0x0 -> object: 0xd4828800
printed 10 entries within range 2718494 to 2722744 (4251 of 2722744 total, 2722548 ever)

Ah, sorry, should have checked that earlier. The net stack isn't to blame after all. The object cache can apparently give out objects twice when depots are enabled. I guess that would be a safe time to start betting body parts on a bug in the object depot implementation. ;-)

comment:9 Changed 9 years ago by bonefish

Resolution: fixed
Status: in-progressclosed

Fixed in hrev35752.

Note: See TracTickets for help on using tickets.