Opened 2 months ago

Last modified 45 hours ago

#19252 reopened bug

[ramfs] PANIC: vm_page_fault: unhandled page fault in kernel space

Reported by: bipolar
Owned by: nobody
Priority: normal
Milestone: R1/beta6
Component: File Systems/RAMFS
Version: R1/beta5
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description

While using haikuporter to build the mozc recipe (with OUTPUT_DIRECTORY pointing to a RAMFS mount) on beta5+125, 64-bit, bare metal, I got the attached KDL (the syslog didn't retain the info across the reboot).

The build had almost ended, but haikuporter was unable to unmount the system volume from the chroot, so I did it manually, removed a boot/system dir that remained on the chroot, tried to run hp -F mozc, and that is where I got the KDL.

Attachments (1)

kdl-ramfs-vm_page_fault.jpg (1.2 MB) - added by bipolar 2 months ago.


Change History (9)

by bipolar, 2 months ago

Attachment: kdl-ramfs-vm_page_fault.jpg added

comment:1 by waddlesplash, 2 months ago

Looks like a NULL dereference after OOM.

comment:2 by bipolar, 2 months ago

FWIW, I have 8 GB of RAM on that machine, and while building the recipe, RAM usage topped out at 3.2 GB. The largest I've seen that particular work dir get during a build was about 1 GB. Even doubling that... there should still be a couple of GBs left. But I have no idea how accurate those numbers are (or if there are other "hidden" uses of RAM).

Last edited 2 months ago by bipolar

comment:3 by waddlesplash, 2 months ago

Hmm, then maybe not.

Did "hp -F mozc" on a RAMFS twice now, didn't get this KDL so far. I looked at the codepath here and it acquires a write lock, as it should, and does NULL checks in the relevant places. So I'm not sure how this could happen.

I have a few refactors I'll push, though.

comment:4 by waddlesplash, 2 days ago

Milestone: Unscheduled → R1/beta6
Resolution: fixed
Status: new → closed

Should be fixed in hrev58540. If it's not, hopefully the new assertions added in that commit will catch the real problem.

comment:5 by waddlesplash, 2 days ago

Resolution: fixed
Status: closed → reopened

Hmm, or maybe not. I can actually reproduce this reliably now, if I try to delete the haikuports directory in ramfs after the build finishes, even if it finishes successfully; only it appears to be a use-after-free.

comment:6 by waddlesplash, 45 hours ago

It might be a race condition of some kind: I rebuilt ramfs in debug mode and now it doesn't seem to happen. Or at least, it didn't the one time I ran a full mozc build and then rm -rf'd again.

A smaller reproducer (if possible) would make this much easier to debug...

comment:7 by waddlesplash, 45 hours ago

For reference, the stack trace of the problem I was seeing is this:

PANIC: Unexpected exception "General Protection Exception" occurred in kernel mode! Error code: 0x0

Welcome to Kernel Debugging Land...
Thread 40876 "rm" running on CPU 2
stack trace for thread 40876 "rm"
    kernel stack: 0xffffffff82013000 to 0xffffffff82018000
      user stack: 0x00007f958ac84000 to 0x00007f958bc84000
frame                       caller             <image>:function + offset
 0 ffffffff82016f80 (+  32) ffffffff80154c50   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x1a
 1 ffffffff82016fd0 (+  80) ffffffff800b8858   <kernel_x86_64> debug_call_with_fault_handler + 0x78
 2 ffffffff82017030 (+  96) ffffffff800b9f44   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf4
 3 ffffffff82017080 (+  80) ffffffff800ba2de   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
 4 ffffffff82017170 (+ 240) ffffffff800ba677   <kernel_x86_64> panic + 0xb7
 5 ffffffff820174f8 (+ 904) ffffffff8015652c   <kernel_x86_64> int_bottom + 0x80
kernel iframe at 0xffffffff820174f8 (end = 0xffffffff820175c0)
 rax 0xdeadbeefdeadbeef    rbx 0xffffffff82017970    rcx 0x3
 rdx 0xffffffff820175d8    rsi 0x0                   rdi 0xffffffffa017b780
 rbp 0xffffffff82017820     r8 0xffffffff820176d8     r9 0xffffffff8a771760
 r10 0xffffffff820175d8    r11 0x0                   r12 0xffffffffa34ddba0
 r13 0xffffffffa25e1f50    r14 0xffffffff820175d0    r15 0xffffffff8286ad40
 rip 0xffffffff8a7547ed    rsp 0xffffffff820175c8 rflags 0x10246
 vector: 0xd, error code: 0x0
 6 ffffffff82017820 (+ 808) ffffffff8a7547ed   </boot/system/add-ons/kernel/file_systems/ramfs> Attribute::GetKey(unsigned char*, unsigned long*) + 0x1d
 7 ffffffff82017850 (+  48) ffffffff8012a001   <kernel_x86_64> AVLTreeBase::Remove(void const*) + 0x41
 8 ffffffff82017ab0 (+ 608) ffffffff8a755164   </boot/system/add-ons/kernel/file_systems/ramfs> AttributeIndexImpl::Removed[clone .localalias] (Attribute*) + 0xe4
 9 ffffffff82017bf0 (+ 320) ffffffff8a7679f3   </boot/system/add-ons/kernel/file_systems/ramfs> Volume::NodeAttributeRemoved(long, Attribute*) + 0x53
10 ffffffff82017c30 (+  64) ffffffff8a760ca4   </boot/system/add-ons/kernel/file_systems/ramfs> Node::RemoveAttribute[clone .localalias] (Attribute*) + 0xd4
11 ffffffff82017c50 (+  32) ffffffff8a760e1b   </boot/system/add-ons/kernel/file_systems/ramfs> Node::~Node[clone .localalias] () + 0x3b
12 ffffffff82017c70 (+  32) ffffffff8a758f79   </boot/system/add-ons/kernel/file_systems/ramfs> File::~File[clone .localalias] () + 0x39
13 ffffffff82017c90 (+  32) ffffffff8a75a989   </boot/system/add-ons/kernel/file_systems/ramfs> ramfs_remove_vnode(fs_volume*, fs_vnode*, bool) + 0x39
14 ffffffff82017cd0 (+  64) ffffffff80102734   <kernel_x86_64> free_vnode(vnode*, bool) + 0xb4
15 ffffffff82017d20 (+  80) ffffffff8010424b   <kernel_x86_64> dec_vnode_ref_count[clone .isra.0] (vnode*, bool, bool) + 0x33b
16 ffffffff82017d40 (+  32) ffffffff8010bf37   <kernel_x86_64> put_vnode + 0x97
17 ffffffff82017d90 (+  80) ffffffff8a75c92b   </boot/system/add-ons/kernel/file_systems/ramfs> ramfs_unlink(fs_volume*, fs_vnode*, char const*) + 0x1bb
18 ffffffff82017ed0 (+ 320) ffffffff8010b160   <kernel_x86_64> common_unlink(int, char*, bool) + 0x60
19 ffffffff82017f20 (+  80) ffffffff8011211c   <kernel_x86_64> _user_unlink + 0x7c
20 ffffffff82017f30 (+  16) ffffffff8015682f   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff82017f30 (end = 0xffffffff82017ff8)
 rax 0x7f                  rbx 0x10a4d4e30cd0        rcx 0x172a8a44d5c
 rdx 0x0                   rsi 0x10a4d4e30dc0        rdi 0x7
 rbp 0x7f958bc83650         r8 0x3                    r9 0x7f958bc8372c
 r10 0x61ff7278            r11 0x246                 r12 0x0
 r13 0x10a4d4b14300        r14 0x7f958bc83800        r15 0x10a4d4e30cd0
 rip 0x172a8a44d5c         rsp 0x7f958bc83638     rflags 0x246
 vector: 0x63, error code: 0x0
21 00007f958bc83650 (+   0) 00000172a8a44d5c   </boot/system/lib/libroot.so> _kern_unlink + 0x0c
22 00007f958bc83700 (+ 176) 0000012909c64b2d   </boot/system/bin/rm> usage (nearest) + 0x44d
23 00007f958bc837e0 (+ 224) 0000012909c65276   </boot/system/bin/rm> rm + 0x106
24 00007f958bc838d0 (+ 240) 0000012909c64457   </boot/system/bin/rm> main + 0x3e7
25 00007f958bc83900 (+  48) 0000012909c646cf   </boot/system/bin/rm> _start + 0x3f
26 00007f958bc83930 (+  48) 000000622d621e05   </boot/system/runtime_loader> runtime_loader + 0x115
27 0000000000000000 (+   0) 00007ffd83c02258   2124931:commpage@0x00007ffd83c02000 + 0x258
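
A note on reading the kernel iframe above: the exception vector is 0xd (general protection) rather than a page fault, and rax holds 0xdeadbeefdeadbeef. On x86_64, an access through a non-canonical address raises a general protection exception, and a 0xdeadbeef fill is the sort of pattern often written over freed memory, so this would be consistent with a pointer loaded from a freed object, though the trace alone does not show which register the faulting instruction used. The minimal sketch below checks canonicality under the assumption of 48-bit virtual addresses; IsCanonical48 is a hypothetical helper for illustration, not a kernel function.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the Haiku kernel. Assumes 48-bit
// virtual addresses (4-level paging): an address is canonical only if
// bits 47..63 are all zeros or all ones.
constexpr bool IsCanonical48(uint64_t address)
{
    return (address >> 47) == 0 || (address >> 47) == 0x1FFFF;
}

// Register values copied from the kernel iframe in the trace.
static_assert(!IsCanonical48(0xdeadbeefdeadbeefULL), "rax is non-canonical");
static_assert(IsCanonical48(0xffffffff8a7547edULL), "rip is canonical");
static_assert(IsCanonical48(0xffffffff820175c8ULL), "rsp is canonical");
```

The checks run at compile time, so a build that succeeds confirms that rax is the only one of the three values that cannot be dereferenced without a general protection exception.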

comment:8 by waddlesplash, 45 hours ago

I did have an AssertWriteLocked() in Remove, and we check there if the attribute's index == this. I am not sure which attribute is the one GetKey is being invoked on: this one, or some other in the tree? I also added an ASSERT(fIndex == NULL); inside ~Attribute but this also did not fire.
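
To make the open question concrete, here is a simplified model of the invariants described above. All names (ToyAttribute, ToyIndex, CompareByKey) are hypothetical stand-ins, std::set replaces the kernel's AVL tree, and none of this is the actual ramfs code. It only illustrates that removing one element from an ordered tree calls the key comparison on other elements too, so the check on the attribute being removed and the destructor assertion can both pass while a different, stale entry is still reachable from the tree.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>

class ToyIndex;

// Hypothetical stand-in for an attribute that remembers its index.
class ToyAttribute {
public:
    explicit ToyAttribute(int key) : fKey(key), fIndex(nullptr) {}

    ~ToyAttribute()
    {
        // Same idea as ASSERT(fIndex == NULL): an attribute must have
        // been removed from its index before it is destroyed.
        assert(fIndex == nullptr);
    }

    int Key() const { return fKey; }
    ToyIndex* Index() const { return fIndex; }

private:
    friend class ToyIndex;
    int fKey;
    ToyIndex* fIndex;
};

// Orders by key, then by address so that equal keys stay distinct.
struct CompareByKey {
    bool operator()(const ToyAttribute* a, const ToyAttribute* b) const
    {
        // Both operands are dereferenced: any lookup or removal reads
        // the keys of elements already stored in the tree.
        if (a->Key() != b->Key())
            return a->Key() < b->Key();
        return std::less<const ToyAttribute*>()(a, b);
    }
};

// Hypothetical stand-in for an attribute index backed by an ordered tree.
class ToyIndex {
public:
    void Added(ToyAttribute* attribute)
    {
        assert(attribute->fIndex == nullptr);
        fTree.insert(attribute);
        attribute->fIndex = this;
    }

    void Removed(ToyAttribute* attribute)
    {
        // Only validates the attribute being removed, not the other
        // entries the tree visits while searching for it.
        assert(attribute->fIndex == this);
        fTree.erase(attribute);
        attribute->fIndex = nullptr;
    }

    size_t Count() const { return fTree.size(); }

private:
    std::set<ToyAttribute*, CompareByKey> fTree;
};
```

In this model, the assertions only catch an attribute that still believes it is indexed. An entry left in the tree whose fIndex was already cleared (or whose memory was reused) would pass unnoticed until a later removal compared against it.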
