Opened 11 years ago

Closed 11 years ago

#2471 closed bug (fixed)

PANIC: page fault but interrupts were disabled. Touching address 0x00000008 from eip 0x80031f85 while building Haiku

Reported by: anevilyak Owned by: bonefish
Priority: critical Milestone: R1/alpha1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description (last modified by mmlr)

While doing a jam of Haiku within Haiku, I hit this panic, backtrace as follows:

<snip>kernel debugger stuff
<kernel>:panic + 0x0029
<kernel>:page_fault_exception + 0x0060
<kernel>:int_bottom + 0x002a (nearest)
iframe at 0x9b466da4 (end = 0x9b466dfc)
eax 0x9ccf6000 ebx 0x0 ecx 0x800dc5dc edx 0x5cb5
esi 0xb10fcd80 edi 0x0 ebp 0x9b466e1c esp 0x9b466dd8
eip 0x80031f85 eflags 0x10086
vector: 0xe, error code: 0x0
<kernel>:mutex_unlock + 0x0045
<kernel>:vm_soft_fault__FUlbT1 + 0x0e28
<kernel>:vm_page_fault + 0x002e
<kernel>:page_fault_exception + 0x00b1
<kernel>:int_bottom_user + 0x005a (nearest)
iframe at 0x9b466fa8 (end = 0x9b467000)
eax 0x18036018 ebx 0x341928 ecx 0x0 edx 0xa
esi 0x180343d0 edi 0x18036018 ebp 0x7ffeeb0c esp 0x9b466fdc
eip 0x2c7c45 eflags 0x10206 user esp 0x7ffeeae4
vector: 0xe, error code: 0x6
<libroot.so>:__Q28BPrivate10superblockiiPQ28BPrivate9hoardHeap + 0x0131
<libroot.so>:makeSuperblock__Q28BPrivate10superblockiPQ28BPrivate11processHeap + 0x02f9
<libroot.so>:malloc__Q28BPrivate10threadHeapUl + 0x0abf
<libroot.so>:malloc + 0x0021
<_APP_>:xmalloc + 0x0021
<_APP_>:strvec_create + 0x0022
<_APP_>:merge_temporary_env + 0x005f (nearest)
<_APP_>:merge_temporary_env + 0x0272 (nearest)
<_APP_>:maybe_make_export_env + 0x00ed
<_APP_>:execute_command_internal + 0x28ed (nearest)
<_APP_>:execute_command_internal + 0x04be
<_APP_>:execute_command_internal + 0x139e (nearest)
<_APP_>:execute_command_internal + 0x160c (nearest)
<_APP_>:execute_command_internal + 0x0740
<_APP_>:parse_and_execute + 0x03ea
<_APP_>:disable_priv_mode + 0x02db (nearest)
<_APP_>:main + 0x06da
<_APP_>:start + 0x005b
785284:runtime_loader_seg0ro@0x00100000 + 0x8ea
785283:sh_main_stack@0x7efef000 + 0xffffec

This was in thread 'sh' by the way.

Attachments (2)

sc.txt (1.7 KB ) - added by anevilyak 11 years ago.
Attachment containing stack trace with readable newlines.
objdumps (69.1 KB ) - added by anevilyak 11 years ago.
objdumps of vm_soft_fault, fault_find_page and fault_get_page


Change History (17)

by anevilyak, 11 years ago

Attachment: sc.txt added

Attachment containing stack trace with readable newlines.

comment:1 by anevilyak, 11 years ago

Hmm...I have no idea why it ate all my newlines there :/ Made an attachment containing the trace.

comment:2 by mmlr, 11 years ago

Description: modified (diff)

Fixed formatting in the description. Any reason why this is an "admin" feature and not available to normal "developer" accounts?

comment:3 by anevilyak, 11 years ago

For reference, I tried a few more builds and managed to panic the same way every time...is there any extra information I can gather from the kernel debugger that'd be helpful?

in reply to: comment:3 — comment:4 by bonefish, 11 years ago

Milestone: R1 → R1/alpha1
Priority: normal → critical

Replying to anevilyak:

For reference, I tried a few more builds and managed to panic the same way every time...

Cool, we tried a few times, but never could reproduce it -- or rather, ran into other problems instead.

is there any extra information I can gather from the kernel debugger that'd be helpful?

Yep, in vm.cpp make fault_get_page() and fault_find_page() non-inline (and maybe non-static), so that they aren't inlined into vm_soft_fault(). The next time you reproduce it, the offset within the function where the kernel page fault occurs would be important to know (it should be in one of these three functions -- never mind the rest of the stack trace). Please add an objdump of that function to the ticket (best taken on the build platform). That should give us the exact location in the source where things go wrong; what to investigate in the kernel debugger largely depends on it. You can do some general looking around, though: use the "call" command to get the vm_soft_fault() parameters and thus the user fault address and the kind of fault (read/write). The address gets you the area in question, and via the area you can use "cache_tree" and "cache" to have a look at the caches that might be involved -- something may or may not look suspicious.

Normally mutex_lock()/mutex_unlock() come in pairs. That the object pointer becomes NULL in between is certainly weird. Maybe it already helps to know the exact location where that happens; a keen-eyed look at the source might then turn the bug up.

by anevilyak, 11 years ago

Attachment: objdumps added

objdumps of vm_soft_fault, fault_find_page and fault_get_page

comment:5 by anevilyak, 11 years ago

Objdumps added (let me know if those aren't what you wanted please). Occasionally the build does error out early on a general OS error in mimeset for some reason, but this succeeds upon restarting jam again. In any case, I won't be able to try building again until I'm home in a few hours, sorry for the slow turnaround time. Will let you know what happens.

in reply to: comment:5 — comment:6 by bonefish, 11 years ago

Replying to anevilyak:

Objdumps added (let me know if those aren't what you wanted please).

Yep, those are fine. We just need the updated page fault address, too, from the same kernel build the objdumps were made from.

Occasionally the build does error out early on a general OS error in mimeset for some reason,

We've tracked this one down. It's an app server race condition that is probably also responsible for other issues (weird window sizes, black GUI parts). Will commit a fix soon.

but this succeeds upon restarting jam again. In any case, I won't be able to try building again until I'm home in a few hours, sorry for the slow turnaround time. Will let you know what happens.

Cool, thanks!

comment:7 by anevilyak, 11 years ago

Hi,

Now I get:

9db5ae04 (+120) 800321c5 <kernel>:_mutex_unlock + 0x0045
9db5ae7c (+160) 80097a76 <kernel>:vm_soft_fault__FUlbT1 + 0x0296

call 25184 12 -3 yields:
9db5ae7c 80097a76 <kernel>:vm_soft_fault__FUlbT1(0x18031078, 0x800e7c01, 0xc001 (49153))

Doing cache_tree on 0x18031078 yields *READ/WRITE FAULT*

In any case, I'm leaving the box in the kernel debugger, please let me know what else I can gather that'd be any use here.

comment:8 by bonefish, 11 years ago

Sorry, I should have been more explicit: With the userland address you have, you can get the area the address resides in via area 0x18031078. The area should have a VM cache, which is listed by the "area" command. This one can be fed to "cache_tree" or "cache".

Anyway, the new crash address says that we are in vm_soft_fault() after fault_get_page() returned B_OK and a valid page, but a NULL pageSource. This shouldn't happen, and looking through the sources I haven't spotted anything that might cause it. Can you please add a check at the end of fault_get_page(): if the cache to be returned is NULL, panic() (and include the page in the panic() message). The same for fault_find_page() -- though in case of *_restart == true a NULL cache is fine. If one of the panic()s is triggered, the output of "page" for the page (if it is non-NULL) would be interesting.
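The requested check for fault_get_page() might look roughly like the sketch below. Everything here (the placeholder vm_page type, the function name, the fprintf standing in for the kernel's panic()) is a hypothetical illustration, not the actual vm.cpp code:

```cpp
#include <cstdio>

// Placeholder standing in for Haiku's vm_page structure; illustrative only.
struct vm_page { };

// Hypothetical shape of the requested check: fault_get_page() is about to
// return B_OK, so the cache it hands back must be non-NULL. Returns false
// (after reporting the page) when the invariant is violated; in the kernel
// this report would be a panic() with the page pointer in the message.
static bool validate_returned_cache(const void* cache, const vm_page* page)
{
    if (cache == nullptr) {
        std::fprintf(stderr, "fault_get_page(): NULL cache, page: %p\n",
            static_cast<const void*>(page));
        return false;  // kernel code would panic() here instead
    }
    return true;
}
```

The point of including the page pointer in the message is that it can then be fed to the kdb "page" command, as requested above.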

comment:9 by anevilyak, 11 years ago

area 0x18031078 actually returns several pages' worth of areas; is this normal? Note that in all cases, the area name is 'heap'. Will make the requested changes and rebuild in any case.

comment:10 by anevilyak, 11 years ago

Also note, the base is always 0x18000000, but the size varies, as does the owner.

in reply to: comment:9 — comment:11 by bonefish, 11 years ago

Replying to anevilyak:

area 0x18031078 actually returns several pages' worth of areas; is this normal? Note that in all cases, the area name is 'heap'. Will make the requested changes and rebuild in any case.

Yep, the area of the team in which the crash happens is the interesting one.

comment:12 by anevilyak, 11 years ago

The panic happens in fault_find_page, which gets called by fault_get_page, page address is 0x9b109ed4, dump is:

PAGE: 0x9b109ed4
queue_next, prev: 0x80082d87, 0x9b109f44
physical_number: 9f2004e0
cache: 0x9f200720
cache_offset: 51
cache_next: 0x00000000
type: 1
state: busy
wired_count: 0
usage_count: 2
busy_writing: 0
area_mappings:

comment:13 by anevilyak, 11 years ago

Also, just to be sure I did it correctly: the panic check I had in fault_find_page was if (!(*_restart) && cache == NULL), right before the values are written into the passed-in pointers.
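The quoted condition can be captured as a small predicate for illustration; this is a hypothetical sketch mirroring the check described, not the actual vm.cpp code:

```cpp
// Mirrors the check described for fault_find_page(): a NULL cache is only an
// error when the caller is not being asked to restart the lookup (per
// comment:8, a NULL cache is fine when *_restart == true).
static bool find_page_check_fails(bool restart, const void* cache)
{
    return !restart && cache == nullptr;
}
```

So the panic fires only on the combination "not restarting, yet no cache", which matches the exception bonefish described.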

comment:14 by bonefish, 11 years ago

Owner: changed from axeld to bonefish
Status: new → assigned

Problem understood. Will look into fixing it tomorrow.

comment:15 by bonefish, 11 years ago

Resolution: fixed
Status: assigned → closed

Fixed in hrev26248.
