Opened 16 years ago
Closed 16 years ago
#2471 closed bug (fixed)
PANIC: page fault but interrupts were disabled. Touching address 0x00000008 from eip 0x80031f85 while building Haiku
| Reported by: | anevilyak | Owned by: | bonefish |
| --- | --- | --- | --- |
| Priority: | critical | Milestone: | R1/alpha1 |
| Component: | System/Kernel | Version: | R1/pre-alpha1 |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Platform: | All | | |
Description (last modified by )
While doing a jam of Haiku within Haiku, I hit this panic, backtrace as follows:
```
<snip>kernel debugger stuff
<kernel>:panic + 0x0029
<kernel>:page_fault_exception + 0x0060
<kernel>:int_bottom + 0x002a (nearest)
iframe at 0x9b466da4 (end = 0x9b466dfc)
 eax 0x9ccf6000  ebx 0x0         ecx 0x800dc5dc  edx 0x5cb5
 esi 0xb10fcd80  edi 0x0         ebp 0x9b466e1c  esp 0x9b466dd8
 eip 0x80031f85  eflags 0x10086
 vector: 0xe, error code: 0x0
<kernel>:mutex_unlock + 0x0045
<kernel>:vm_soft_fault__FUlbT1 + 0x0e28
<kernel>:vm_page_fault + 0x002e
<kernel>:page_fault_exception + 0x00b1
<kernel>:int_bottom_user + 0x005a (nearest)
iframe at 0x9b466fa8 (end = 0x9b467000)
 eax 0x18036018  ebx 0x341928    ecx 0x0         edx 0xa
 esi 0x180343d0  edi 0x18036018  ebp 0x7ffeeb0c  esp 0x9b466fdc
 eip 0x2c7c45    eflags 0x10206  user esp 0x7ffeeae4
 vector: 0xe, error code: 0x6
<libroot.so>:__Q28BPrivate10superblockiiPQ28BPrivate9hoardHeap + 0x0131
<libroot.so>:makeSuperblock__Q28BPrivate10superblockiPQ28BPrivate11processHeap + 0x02f9
<libroot.so>:malloc__Q28BPrivate10threadHeapUl + 0x0abf
<libroot.so>:malloc + 0x0021
<_APP_>:xmalloc + 0x0021
<_APP_>:strvec_create + 0x0022
<_APP_>:merge_temporary_env + 0x005f (nearest)
<_APP_>:merge_temporary_env + 0x0272 (nearest)
<_APP_>:maybe_make_export_env + 0x00ed
<_APP_>:execute_command_internal + 0x28ed (nearest)
<_APP_>:execute_command_internal + 0x04be
<_APP_>:execute_command_internal + 0x139e (nearest)
<_APP_>:execute_command_internal + 0x160c (nearest)
<_APP_>:execute_command_internal + 0x0740
<_APP_>:parse_and_execute + 0x03ea
<_APP_>:disable_priv_mode + 0x02db (nearest)
<_APP_>:main + 0x06da
<_APP_>:start + 0x005b
785284:runtime_loader_seg0ro@0x00100000 + 0x8ea
785283:sh_main_stack@0x7efef000 + 0xffffec
```
This was in thread 'sh' by the way.
Attachments (2)
Change History (17)
by , 16 years ago
comment:1 by , 16 years ago
Hmm...I have no idea why it ate all my newlines there :/ Made an attachment containing the trace.
comment:2 by , 16 years ago
| Description: | modified (diff) |
| --- | --- |
Fixed formatting in the description. Any reason why this is an "admin" feature and not available to normal "developer" accounts?
follow-up: 4 comment:3 by , 16 years ago
For reference, I tried a few more builds and managed to panic the same way every time...is there any extra information I can gather from the kernel debugger that'd be helpful?
comment:4 by , 16 years ago
| Milestone: | R1 → R1/alpha1 |
| --- | --- |
| Priority: | normal → critical |
Replying to anevilyak:

> For reference, I tried a few more builds and managed to panic the same way every time...

Cool; we tried a few times ourselves but never could reproduce it -- or rather, we ran into other problems instead.
> is there any extra information I can gather from the kernel debugger that'd be helpful?
Yep: in vm.cpp, make fault_get_page() and fault_find_page() non-inline (and maybe non-static), so that they aren't inlined into vm_soft_fault(). The next time you reproduce the crash, the important thing to know is the offset within the function where the kernel page fault occurs (it should be in one of those three functions -- don't mind the rest of the stack trace). Please also add an objdump of that function to the ticket (best taken on the build platform); that should give us the exact location in the source where things go wrong. What to investigate in the kernel debugger pretty much depends on that. You can do some general looking around, though: use the "call" command to get the vm_soft_fault() parameters, and thus the user fault address and the kind of fault (read/write). The address gets you the area in question; via the area you can use "cache_tree" and "cache" to have a look at the caches that might be involved -- something may or may not look suspicious.
Normally mutex_lock()/mutex_unlock() come in pairs, so it is certainly weird that the object pointer becomes NULL in between. Maybe knowing the exact location where that happens and taking a keen-eyed look at the source is already enough to turn the bug up.
by , 16 years ago
objdumps of vm_soft_fault, fault_find_page and fault_get_page
follow-up: 6 comment:5 by , 16 years ago
Objdumps added (please let me know if those aren't what you wanted). Occasionally the build does error out early with a general OS error in mimeset for some reason, but it succeeds upon restarting jam. In any case, I won't be able to try building again until I'm home in a few hours -- sorry for the slow turnaround time. Will let you know what happens.
comment:6 by , 16 years ago
Replying to anevilyak:
> Objdumps added (let me know if those aren't what you wanted please).
Yep, those are fine. We just need the updated page fault address, too, from the same kernel build the objdumps were made from.
> Occasionally the build does error out early on a general OS error in mimeset for some reason,
We've tracked this one down: it's an app_server race condition that is probably also responsible for other issues (weird window sizes, black GUI parts). A fix will be committed soon.
> but this succeeds upon restarting jam again. In any case, I won't be able to try building again until I'm home in a few hours, sorry for the slow turnaround time. Will let you know what happens.
Cool, thanks!
comment:7 by , 16 years ago
Hi,
Now I get:
```
9db5ae04 (+120) 800321c5 <kernel>:_mutex_unlock + 0x0045
9db5ae7c (+160) 80097a76 <kernel>:vm_soft_fault__FUlbT1 + 0x0296
```

"call 25184 12 -3" yields:

```
9db5ae7c 80097a76 <kernel>:vm_soft_fault__FUlbT1(0x18031078, 0x800e7c01, 0xc001 (49153))
```
Doing cache_tree on 0x18031078 yields *READ/WRITE FAULT*
In any case, I'm leaving the box in the kernel debugger, please let me know what else I can gather that'd be any use here.
comment:8 by , 16 years ago
Sorry, I should have been more explicit: with the userland address you have, you can get the area the address resides in via "area 0x18031078". The area should have a VM cache, which is listed by the "area" command; that cache can then be fed to "cache_tree" or "cache".
Anyway, the new crash address says that we are in vm_soft_fault() itself: fault_get_page() returned B_OK and a valid page, but a NULL pageSource. This shouldn't happen, and looking through the sources I haven't spotted anything that might cause it. Can you please add a check at the end of fault_get_page(): if the cache about to be returned is NULL, panic() (and include the page in the panic() message)? The same for fault_find_page() -- though there, in the case of *_restart == true, a NULL cache is fine. If one of the panic()s is triggered, the output of "page" for the page (if it is non-NULL) would be interesting.
follow-up: 11 comment:9 by , 16 years ago
area 0x18031078 actually returns several pages worth of areas -- is this normal? Note that in all cases the area name is 'heap'. I'll make the requested changes and rebuild in any case.
comment:10 by , 16 years ago
Also note, the base is always 0x18000000, but the size varies, as does the owner.
comment:11 by , 16 years ago
Replying to anevilyak:
> area 0x18031078 actually returns several pages worth of areas, is this normal? Note that for all cases, the area name is 'heap'. Will make the changes asked and rebuild in any case.
Yep, the area of the team in which the crash happens is the interesting one.
comment:12 by , 16 years ago
The panic happens in fault_find_page(), which gets called by fault_get_page(). The page address is 0x9b109ed4; the dump is:

```
PAGE: 0x9b109ed4
queue_next, prev: 0x80082d87, 0x9b109f44
physical_number:  9f2004e0
cache:            0x9f200720
cache_offset:     51
cache_next:       0x00000000
type:             1
state:            busy
wired_count:      0
usage_count:      2
busy_writing:     0
area_mappings:
```
comment:13 by , 16 years ago
Also, just to be sure I did it correctly: the panic check I added in fault_find_page was `if (!(*_restart) && cache == NULL)`, placed right before the values are written into the passed-in pointers.
comment:14 by , 16 years ago
| Owner: | changed from | to |
| --- | --- | --- |
| Status: | new → assigned | |
Problem understood. Will look into fixing it tomorrow.
Attachment containing stack trace with readable newlines.