Opened 14 years ago

Closed 14 years ago

#5138 closed bug (fixed)

PANIC: remove page 0x82868278 from cache 0xd19e7680: page still has mappings!

Reported by: mmadia Owned by: bonefish
Priority: normal Milestone: R1
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #5216
Platform: All

Description (last modified by mmadia)

hrev34679 x86gcc2hybrid. My previous installation was hrev34464. This KDL has cropped up several times since installing this revision. I don't know of an exact way to reproduce it at will. It is not limited to a particular application.

PANIC: remove page 0x82868278 from cache 0xd19e7680: page still has mappings!

Welcome to Kernel Debugging Land...
Thread 3129 "ShowImage" running on CPU 1
kdebug> bt
stack trace for thread 3129 "ShowImage"
    kernel stack: 0xd2d08000 to 0xd2d0c000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 d2d0b7a8 (+  48) 8006bd84   <kernel_x86>:debug_builtin_commands_init (nearest) + 0x01e0
 1 d2d0b7d8 (+  12) 800e7d13   <kernel_x86>:arch_debug_call_with_fault_handler + 0x001b
 2 d2d0b7e4 (+  48) 8006af50   <kernel_x86>:debug_call_with_fault_handler + 0x0060
 3 d2d0b814 (+  64) 8006bfdd   <kernel_x86>:invoke_debugger_command + 0x00b9
 4 d2d0b854 (+  64) 8006be09   <kernel_x86>:debug_builtin_commands_init (nearest) + 0x0265
 5 d2d0b894 (+  64) 8006c148   <kernel_x86>:invoke_debugger_command_pipe + 0x009c
 6 d2d0b8d4 (+  48) 8006db44   <kernel_x86> ExpressionParser<0xd2d0b984>::_ParseCommandPipe(0xd2d0b980) + 0x0234
 7 d2d0b904 (+  64) 8006cf7e   <kernel_x86> ExpressionParser<0xd2d0b984>::EvaluateCommand(0x80133ce0 "bt", 0xd2d0b980) + 0x02ba
 8 d2d0b944 (+ 224) 8006ef58   <kernel_x86>:evaluate_debug_command + 0x0080
 9 d2d0ba24 (+  64) 80069ad6   <kernel_x86>:kgets (nearest) + 0x02b2
10 d2d0ba64 (+  48) 80069d2a   <kernel_x86>:kgets (nearest) + 0x0506
11 d2d0ba94 (+  48) 8006b11b   <kernel_x86>:kernel_debugger + 0x0023
12 d2d0bac4 (+ 192) 8006b0ed   <kernel_x86>:panic + 0x0029
13 d2d0bb84 (+  96) 800da88b   <kernel_x86> VMCache<0xd19e7680>::Delete(0xd19e7680) + 0x0077
14 d2d0bbe4 (+  64) 800dad4c   <kernel_x86> VMCache<0xd19e7680>::Unlock(0xd1a20000) + 0x0124
15 d2d0bc24 (+  48) 800dae53   <kernel_x86> VMCache<0xd19e7680>::ReleaseRef(0xd1a22f88) + 0x002b
16 d2d0bc54 (+  48) 800cc2bc   <kernel_x86>:vm_clone_area (nearest) + 0x048c
17 d2d0bc84 (+  48) 800ce9db   <kernel_x86>:vm_delete_areas + 0x0057
18 d2d0bcb4 (+  48) 800d64db   <kernel_x86> VMAddressSpace<0xd1a22f78>::RemoveAndPut(0x0) + 0x0033
19 d2d0bce4 (+ 112) 8005b2ee   <kernel_x86>:team_delete_team + 0x0272
20 d2d0bd54 (+ 368) 8005ef57   <kernel_x86>:thread_exit + 0x0473
21 d2d0bec4 (+  64) 80053149   <kernel_x86>:handle_signals + 0x03d1
22 d2d0bf04 (+  64) 8005f52e   <kernel_x86>:thread_at_kernel_exit + 0x008e
23 d2d0bf44 (+ 100) 800e83cb   <kernel_x86>:trap99 (nearest) + 0x01ab
user iframe at 0xd2d0bfa8 (end = 0xd2d0c000)
 eax 0x0            ebx 0x775a20        ecx 0x7ffeef40   edx 0x216
 esi 0x77a460       edi 0x7ffef540      ebp 0x7ffeef6c   esp 0xd2d0bfdc
 eip 0xffff0114  eflags 0x207      user esp 0x7ffeef40
 vector: 0x63, error code: 0x0
24 d2d0bfa8 (+   0) ffff0114   <commpage>:commpage_syscall + 0x0004
25 7ffeef6c (+   0) 1806f8f0   
26 180cc978 (+   0) 180cab20   
27 180cac90 (+ 296) 180cab20   
28 180cadb8 (+   0) 180cab20   
29 180cab68 (+   0) 180cab20   
kdebug> 

Are there any other KDL commands that can be used? I'm willing to run a build with tracing or other debugging as well.

Attachments (4)

kdl2.bt (9.7 KB ) - added by mmadia 14 years ago.
Another KDL with bt, aspace, cache_info, info, and a bit more
kdl2.areas (140.4 KB ) - added by mmadia 14 years ago.
kdl2.caches.zip (156.7 KB ) - added by mmadia 14 years ago.
1.32MB uncompressed
minicom.cap (152.0 KB ) - added by mmadia 14 years ago.
hooray! kdl with minimal debug tracing


Change History (31)

comment:1 by mmadia, 14 years ago

Owner: changed from axeld to bonefish

comment:2 by mmadia, 14 years ago

To note, I did a

on_exit sync
exit

which immediately caused this KDL

PANIC: to be freed page 0x82868278 has mappings
Welcome to Kernel Debugging Land...
Thread 3129 "ShowImage" running on CPU 1
kdebug> cont

At that point, Haiku seemed happy to keep working.

by mmadia, 14 years ago

Attachment: kdl2.bt added

Another KDL with bt, aspace, cache_info, info, and a bit more

by mmadia, 14 years ago

Attachment: kdl2.areas added

by mmadia, 14 years ago

Attachment: kdl2.caches.zip added

1.32MB uncompressed

comment:3 by mmadia, 14 years ago

A second KDL occurred, this time while working with Pe, Tracker, & Terminal. IIRC, one of the Pe windows disappeared upon exiting KDL. The attached files "kdl2.*" all relate to this crash, which does not mention VMAddressSpace.

If it helps, this machine's serial debug output is accessible via ssh.

comment:4 by mmadia, 14 years ago

Description: modified (diff)

comment:5 by bonefish, 14 years ago

Tracking this bug down will require a bit of work with kernel tracing. What is currently available might not even suffice, in which case we'll need to add some more, but let's first try with what we have (>= hrev34751). Please change the following in build/user_config_headers/tracing_config.h (copy from build/config_headers/):

  • ENABLE_TRACING to 1.
  • MAX_TRACE_SIZE to (100 * 1024 * 1024) or more, unless you have less than 256 MB of RAM; in that case try 10 or 20 MB and see whether that suffices.
  • SYSCALL_TRACING to 1.
  • TEAM_TRACING to 1.
  • VM_CACHE_TRACING to 2.
  • VM_CACHE_TRACING_STACK_TRACE to 10.
  • VM_PAGE_FAULT_TRACING to 1.

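For reference, the modified section of build/user_config_headers/tracing_config.h would then look roughly like this (a sketch only; pick MAX_TRACE_SIZE according to your RAM as noted above):

  #define ENABLE_TRACING			1
  #define MAX_TRACE_SIZE			(100 * 1024 * 1024)
  #define SYSCALL_TRACING			1
  #define TEAM_TRACING			1
  #define VM_CACHE_TRACING			2
  #define VM_CACHE_TRACING_STACK_TRACE	10
  #define VM_PAGE_FAULT_TRACING		1
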
When the panic occurs, there is quite a bit of interesting basic information to retrieve:

  • As always the stack trace. If it is similar to the two attached ones -- i.e. goes through "vm_clone_area (nearest)" (which is actually delete_area()) -- a "call 16 -2" (16: the line number in the stack trace, 2: the number of arguments we want to see) will yield the function's parameters: the address of the address space and of the area (in case of gcc 4 they might be invalid). An "area address <area address>" should produce information on the area.
  • If the thread is a userland thread, the recent syscall history of the thread might be interesting: "traced 0 -10 -1 and thread <thread ID> or #syscall #team".
  • Info on the cache: "cache <cache address>" (cache address as in the panic message).
  • Info on the page: "page <page address>" (page address as in the panic message). Since the page has mappings, there should be at least one entry under "area mappings:". Use "area address <area address>" with the address listed to get information on the area. A cache will be listed. Use "cache <cache address>" and "cache_tree <cache address>".
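
By way of illustration, using the addresses and thread ID from the panic in the description (they will of course differ in a new session), the sequence of commands would look something like:

  kdebug> call 16 -2
  kdebug> traced 0 -10 -1 and thread 3129 or #syscall #team
  kdebug> cache 0xd19e7680
  kdebug> page 0x82868278
  kdebug> area address <area address from the page's mapping list>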

This was the straightforward part. The rest is working with the tracing entries, which depends mainly on the information retrieved so far and further information from the tracing. It's best if I give a bit of background first: An area has a (VM) cache which is the top of a stack of caches (actually part of a tree, but let's keep it simple). A cache contains pages. A page represents a page of physical memory and can live in at most one cache. An area can map any of the pages of the caches in its cache stack -- those and only those.
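
To make the relationships a bit more concrete, here is a deliberately simplified sketch -- a mental model only, not the real VMArea/VMCache/vm_page declarations in the kernel:

  #include <cstdint>

  // Simplified mental model -- field names are illustrative.
  struct Cache;

  struct Page {
      Cache*  cache;       // a page lives in at most one cache
      Page*   next;        // sibling in its cache's page list
      int32_t wiredCount;  // non-zero while wired mappings of the page exist
  };

  struct Cache {
      Page*   pages;       // pages currently living in this cache
      Cache*  source;      // next cache down the stack (NULL at the bottom)
      int32_t refCount;    // held by areas and consumer caches while in use
  };

  struct Area {
      Cache*  topCache;    // the area may map only pages found in topCache
                           // or in a cache reachable via the source chain
  };

The wiredCount here corresponds to the page's wired_count; together with the explicit list of area mappings it is what the "page still has mappings" check inspects when a cache is deleted.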

The panic this ticket is about is triggered when a no longer used cache is destroyed although it contains a page that is still mapped by some area. Normally that cannot happen, since an area that is still used has references to its cache (indirectly to all caches in the stack) and therefore the cache would not be destroyed.

There are two kinds of bugs that could trigger the panic: Either the cache reference counting is broken and the cache, although still in use, is destroyed; or the page in question has been erroneously moved to another cache (or the complete cache has been moved). Since the ref counting is relatively simple, I'd consider the latter far more likely. With the information retrieved above that can easily be verified: In the expected case the output of "cache_tree" for the cache of the area the page is still mapped by will not contain the cache that is currently being destroyed. IOW the area has mapped a page that is no longer in one of its caches. It is relatively safe to assume that at the time the page was mapped it was still in the right cache (i.e. one referenced by the area).

The first step toward understanding the bug is to find out at which point exactly the page left the "legal reach" of the area, i.e. when it was removed from a cache in the area's stack and inserted into an unrelated one, or when the cache containing the page was removed from the area's cache stack. In either case the approach is the same: First find the point at which the page was first inserted into a cache of the area, then trace the page and the caches it lives in until hitting the main point of interest. Both can be a bit tedious, since there can be quite a bit of fluctuation wrt. the caches of an area (fork()s and mmap()s can add new caches, exec*() and area deletion can cause removal of caches).

If the area in question (still talking about the one with the page mapping) is a userland area, its point of creation should be easy to find, since there are only two syscalls that create areas:

traced 0 -1 -1 and and #syscall #<area ID> or #create_area #map_file

<area ID> must be the area ID in hexadecimal notation with leading "0x". Note the tracing entry index and search forward from that point looking for the page:

traced <index> 10 -1 and #page #<page address>

We're looking for entries saying "vm cache insert page:...". Start with the first entry: It needs to be verified that the printed cache belongs to the area at this point:

cache_stack area <area address> <index2>

If the cache isn't part of the printed list, continue with the next entry, until finding a cache that is (always use the index of the tracing entry that is examined).

After having found the point where the page was inserted into one of the area's caches, both the cache and the page need to be tracked forward:

traced <index2> 10 -1 and "#vm cache" #<cache address>

This will turn up tracing entries that involve the cache. The interesting ones are: "vm cache remove page:..." for our page and "vm cache remove consumer:..." with another cache being the consumer. If the page is removed from the cache, find out where it is inserted next:

traced <index3> 10 -1 and "#vm cache" #<page address>

It must be verified that the new cache is in our area's stack at that point. If a consumer is removed from the cache, it must be checked whether the removed consumer is in the area's cache stack at that point (if not, the entry can be ignored) and if so, whether shortly after another cache of the area's stack is added as a consumer again. Finding out whether a cache is in the area's cache stack at a point works just as above, i.e. run

cache_stack area <area address> <index4>

and check whether the cache's address is printed. If the cache is not in the area's cache stack, we have found the point where things go wrong. Otherwise continue following the cache containing the page.

At the point of interest (i.e. the tracing entry where the page or its containing cache was removed from the area's cache stack) -- let's call it <index5> -- a bit of context would be interesting:

traced --stacktrace <index5> -20
traced --stacktrace <index5> 20

(will probably be longish)

Please attach the complete KDL session.

A general hint for capturing KDL sessions: Make sure the serial output is processed by a terminal, since otherwise edited prompt lines will be hard to read. I.e. if you redirect the output directly into a file, "cat" the file in a terminal when you're done and copy the lines from the terminal (the scrollback history must be big enough, of course).

comment:6 by mmadia, 14 years ago

following up:

  • Somewhere between hrev34645 and hrev34655 is when the KDL started appearing readily.
  • Applying these initial suggestions to hrev34753 resulted in the KDL vanishing.
  • A normal hrev34784 build was confirmed to still exhibit the KDL.
  • As per off-list suggestions, tracing_config.h has been modified to these non-zero values:
    #	define ENABLE_TRACING 1
    #	define MAX_TRACE_SIZE (300 * 1024 * 1024)
    #define SYSCALL_TRACING_IGNORE_KTRACE_OUTPUT	1
    #define VM_CACHE_TRACING			2
    #define VM_CACHE_TRACING_STACK_TRACE		5
    

After 3 hours of continually running jam -aq @alpha-raw in hrev34784, no KDL has occurred yet.

comment:7 by bonefish, 14 years ago

Have you also tried with VM_CACHE_TRACING_STACK_TRACE disabled completely? The stack traces would be nice, but VM_CACHE_TRACING alone might already be helpful. If that doesn't work either, we'll have to add some hand-tailored debug code and hope that it doesn't make the bug disappear as well.

comment:8 by mmadia, 14 years ago

VM_CACHE_TRACING_STACK_TRACE disabled completely doesn't help. This is still with 34784... I haven't tested newer yet.

comment:9 by bonefish, 14 years ago

Blocking: 5216 added

(In #5216) Fixed the second problem in hrev34958. That was just incorrect debug code. The first issue is probably a duplicate of #5138.

comment:10 by bonefish, 14 years ago

During my optimization tests I've run into hard system freezes several times, with seemingly increasing frequency lately. This might be the same problem just combined with some effect that prevents the system from entering the kernel debugger. Unfortunately I'm still waiting for the serial port for this machine to be shipped, so it's pretty hard for me to debug ATM. Might also be related to the freeze-on-boot problem Rene reported on the commit list (as a follow-up to commit hrev34615).

comment:11 by anevilyak, 14 years ago

I haven't yet run into that problem with a newer revision here, even while doing -j2 / -j4 builds. Are you doing anything particular that triggers it or just high load/activity?

in reply to:  11 ; comment:12 by bonefish, 14 years ago

Replying to anevilyak:

I haven't yet run into that problem with a newer revision here, even while doing -j2 / -j4 builds. Are you doing anything particular that triggers it or just high load/activity?

No, I'm only running image builds in a KDEBUG=0 installation. IIRC I've never encountered a freeze during the first build, though. My usual procedure is to do one complete build, then twice remove the objects and capture scheduling information during the first 10 s (after the header scanning) of a fresh (but warmed-up) image build. The freeze has so far always happened during one of those scheduling info capture runs (it tends to be the second run, I'd say).

While at the beginning of my optimization endeavors I encountered a freeze only every few days, I've seen them several times a day lately. A possible explanation -- besides me screwing up more and more things in the kernel -- would be that the improved concurrency in the kernel makes it more likely to run into the race condition this bug is probably caused by. Maybe Matt can confirm that the panic() has become easier to reproduce with a current revision.

I'm hoping that further optimization might make the bug even easier to reproduce, so that someone (*duck* :-)) can tackle it. Especially the fact that the system freezes and doesn't enter KDL is extremely annoying.

comment:13 by anevilyak, 14 years ago

Ahh. I have to admit I've never tried it with the scheduling recorder; I'll see if that makes any difference here. There's no possibility it could be related to whatever mechanism it uses to capture data, is there? I must confess that right now I have no idea how its information capture works.

comment:14 by anevilyak, 14 years ago

Update: I likewise managed to hit a hang with the scheduling recorder. However, I did eventually panic in acquire_spinlock:

PANIC: acquire_spinlock(): Failed to acquire spinlock 0x80130678 for a long time!
kdebug> bt
stack trace for thread 5205 "scheduling_recorder"
    kernel stack: 0x91cd7000 to 0x91cdb000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 91cdaae8 (+  32) 8006b5c1   <kernel_x86> invoke_command_trampoline(void*: 0x91cdab68) + 0x0015
 1 91cdab08 (+  12) 800d8173   <kernel_x86>:arch_debug_call_with_fault_handler + 0x001b
 2 91cdab14 (+  48) 8006946a   <kernel_x86>:debug_call_with_fault_handler + 0x0051
 3 91cdab44 (+  64) 8006b96a   <kernel_x86>:invoke_debugger_command + 0x00bb
 4 91cdab84 (+  48) 8006ba87   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x8012aa22, int32: 0, char*: NULL) + 0x0083
 5 91cdabb4 (+  32) 8006bb4f   <kernel_x86>:invoke_debugger_command_pipe + 0x008b
 6 91cdabd4 (+ 128) 8006f8ee   <kernel_x86> ExpressionParser<0x91cdaca4>::_ParseCommandPipe(int&: 0x91cdaca0) + 0x0aae
 7 91cdac54 (+  48) 800720b7   <kernel_x86> ExpressionParser<0x91cdaca4>::EvaluateCommand(char const*: 0x8012aa20 "bt", int&: 0x91cdaca0) + 0x06df
 8 91cdac84 (+ 192) 80072230   <kernel_x86>:evaluate_debug_command + 0x0084
 9 91cdad44 (+  96) 8006a3ba   <kernel_x86> kernel_debugger_internal(char const*: 0x1 "<???>", int32: -1848791600) + 0x03a7
10 91cdada4 (+  16) 8006a51b   <kernel_x86>:kernel_debugger + 0x0019
11 91cdadb4 (+ 160) 8006a5f5   <kernel_x86>:panic + 0x002a
12 91cdae54 (+  48) 80056d13   <kernel_x86>:acquire_spinlock + 0x003a
13 91cdae84 (+  48) 80041990   <kernel_x86> ConditionVariable<0x81bc8734>::Add(ConditionVariableEntry*: 0x91cdaed8) + 0x0022
14 91cdaeb4 (+  80) 80075a37   <kernel_x86> SystemProfiler<0x81bc86c0>::NextBuffer(uint32: 0xbf280 (782976), unsigned long long*: 0x91cdaf2c) + 0x00af
15 91cdaf04 (+  64) 8007669d   <kernel_x86>:_user_system_profiler_next_buffer + 0x00a2
16 91cdaf44 (+ 100) 800d8761   <kernel_x86>:handle_syscall + 0x00be
user iframe at 0x91cdafa8 (end = 0x91cdb000)
 eax 0xd8           ebx 0x2039fc        ecx 0x7ffeee4c   edx 0xffff0114
 esi 0xbf280        edi 0x7ffeef48      ebp 0x7ffeef68   esp 0x91cdafdc
 eip 0xffff0114  eflags 0x246      user esp 0x7ffeee4c
 vector: 0x63, error code: 0x0
17 91cdafa8 (+   0) ffff0114   <commpage>:commpage_syscall + 0x0004
18 7ffeef68 (+  52) 00201363   <_APP_>:_start + 0x0053
19 7ffeef9c (+  64) 001052c3   </boot/system/runtime_loader@0x00100000>:unknown + 0x52c3
20 7ffeefdc (+   0) 7ffeefec   172038:scheduling_recorder_main_stack@0x7efef000 + 0xffffec

The exact invocation was rm -rf generated/objects ; scheduling_recorder ~/sched_data jam -qj2

Interestingly I didn't hit this problem when recording just jam -qj2 kernel, and I also note that it was towards the end of the "patience..." portion of jam. Anything worth investigating if I hit it again?

in reply to:  12 ; comment:15 by mmadia, 14 years ago

Replying to bonefish:

Maybe Matt can confirm that the panic() has become easier to reproduce with a current revision.

I'm hoping that further optimization might make the bug even easier to reproduce, so that someone (*duck* :-)) can tackle it. Especially the fact that the system freezes and doesn't enter KDL is extremely annoying.

Yup, one build cycle of jam -aqj3 @alpha-raw in hrev34965-gcc2hybrid was enough to generate it. The uptime was between 20~30 minutes. I'll build the image with just

  #define ENABLE_TRACING 1
  #define MAX_TRACE_SIZE (300 * 1024 * 1024)
  #define SYSCALL_TRACING_IGNORE_KTRACE_OUTPUT 1
  #define VM_CACHE_TRACING 2

in reply to:  14 comment:16 by bonefish, 14 years ago

Replying to anevilyak:

Update: I likewise managed to hit a hang with the scheduling recorder. However, I did eventually panic in acquire_spinlock:

Sounds unrelated, so it's probably better to open a new ticket.

[...] Anything worth investigating if I hit it again?

Sure. If you have DEBUG_SPINLOCKS enabled, "spinlock ..." might tell you which function last called acquire_spinlock() on the spinlock in question. If it's not just an unbalanced call, a stack trace from the other CPU will help tremendously.
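
For instance, with the spinlock address from the panic above, the check would look something like this (a hypothetical session, assuming DEBUG_SPINLOCKS is switched on in the kernel debug configuration):

  kdebug> spinlock 0x80130678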

in reply to:  15 comment:17 by bonefish, 14 years ago

Replying to mmadia:

Replying to bonefish:

Maybe Matt can confirm that the panic() has become easier to reproduce with a current revision.

I'm hoping that further optimization might make the bug even easier to reproduce, so that someone (*duck* :-)) can tackle it. Especially the fact that the system freezes and doesn't enter KDL is extremely annoying.

Yup, one build cycle of jam -aqj3 @alpha-raw in hrev34965-gcc2hybrid was enough to generate it. The uptime was between 20~30 minutes. I'll build the image with just

  #define ENABLE_TRACING 1
  #define MAX_TRACE_SIZE (300 * 1024 * 1024)
  #define SYSCALL_TRACING_IGNORE_KTRACE_OUTPUT 1
  #define VM_CACHE_TRACING 2

Great! So if you can now reproduce the bug with VM cache tracing enabled, it would be great if you found the time to track it down as described in comment 5. Even better if it could be done with VM_CACHE_TRACING_STACK_TRACE enabled.

by mmadia, 14 years ago

Attachment: minicom.cap added

hooray! kdl with minimal debug tracing

comment:18 by mmadia, 14 years ago

"page <page address>" didn't list any area mappings. so, i included "areas" just in case. After continuing for a few times, a different KDL appeared and required a hard reboot. I'll rebuild 34965 with VM_CACHE_TRACING_STACK_TRACE enabled.

in reply to:  18 comment:19 by bonefish, 14 years ago

Replying to mmadia:

"page <page address>" didn't list any area mappings. so, i included "areas" just in case. After continuing for a few times, a different KDL appeared and required a hard reboot. I'll rebuild 34965 with VM_CACHE_TRACING_STACK_TRACE enabled.

There are two different mechanisms to keep track of page mappings. One is an explicit list of area mappings (which is empty in this case), the other is the page's "wired_count". The wired_count is 1 for the page, so it is (or at least thinks it is) indeed mapped. It's just harder to find out where. In hrev34979 I added the switch "-m" to the "page" command, which searches through all teams' address spaces to find the mappings for the page. That should list the area in your case too -- unless it belongs to a team that is already in the process of being destroyed, that is.
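
For example, with the page address from the panic in the description (the address will of course differ in a new session), the lookup would be:

  kdebug> page -m 0x82868278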

Since static symbols are available again, the stack trace contains a line like:

15 ffffbee8 (+  64) 800d311e   <kernel_x86> delete_area(VMAddressSpace*: 0x80fbe000, VMArea*: 0x817a95a0) + 0x014e

Please print the info for that area too ("area address ..."). It's probably not the area the page is still mapped in, but might be of interest nonetheless. (The "call ..." commands to get the parameters of the function aren't necessary in this case.)

The complete area listing is not so interesting.

comment:20 by mmadia, 14 years ago

No luck in hrev34979 with VM_CACHE_TRACING_STACK_TRACE as "10" or "5". The latter ran for 4 hours of continuous jam -aqj3 @alpha-raw with no KDL. I'm now testing VM_CACHE_TRACING_STACK_TRACE 5 together with kernel_debug_config.h's #define KDEBUG_LEVEL 1. If this doesn't KDL, I'll go back to VM_CACHE_TRACING_STACK_TRACE 0 and try your instructions above.

comment:21 by mmadia, 14 years ago

Just a follow-up on activities: I tested hrev34979 with VM_CACHE_TRACING_STACK_TRACE 3, but no luck. Instead of retesting hrev34979 with VM_CACHE_TRACING_STACK_TRACE 0, I updated to hrev35012 and ran into #5242. Then I updated to hrev35020 with VM_CACHE_TRACING_STACK_TRACE 0, and that failed to produce a KDL. As per an off-list suggestion, I tested hrev35020 with KDEBUG_LEVEL 0. However, that also fails to produce this KDL.

Should I reproduce the KDL in 34979 + VM_CACHE_TRACING_STACK_TRACE 0, and grab information from page -m ?

in reply to:  21 comment:22 by bonefish, 14 years ago

Replying to mmadia:

Should I reproduce the KDL in 34979 + VM_CACHE_TRACING_STACK_TRACE 0, and grab information from page -m ?

Yep, that seems the only option to get any more information ATM.

comment:23 by mmadia, 14 years ago

hrev34979 + VM_CACHE_TRACING_STACK_TRACE 0 = no KDL
hrev34979 + VM_CACHE_TRACING_STACK_TRACE 0, KDEBUG_LEVEL 0 = also no KDL
The last productive session was hrev34965 + VM_CACHE_TRACING_STACK_TRACE 0, whose output is minicom.cap. Right now I'm building hrev34965 + the diffs for [34977, 34978, 34979] + VM_CACHE_TRACING_STACK_TRACE 0, KDEBUG_LEVEL 0. Hopefully those diffs will allow the 'page -m'. If that doesn't work, I'll try hrev34980.

comment:24 by mmadia, 14 years ago

hrev34979 + VM_CACHE_TRACING_STACK_TRACE 0 , KDEBUG_LEVEL 0

while building @alpha-raw, Terminal would spit out "could not create mutex". At this point Vision would disconnect. This tends to occur while executing the build_haiku_image script. The jam process then needs to be forcefully killed. Prior to killing the process, KDL cannot be entered.

relevant syslog snippet:

KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x0, ip 0x24c284, write 1, user 1, thread 0x26eb5
KERN: vm_page_fault: thread "bfs_shell" (159413) in team "bfs_shell" (159413) tried to write address 0x0, ip 0x24c284 ("bfs_shell_seg0ro" +0x4c284)
KERN: thread_hit_serious_debug_event(): Failed to install debugger: thread: 159413: Out of memory

in reply to:  24 comment:25 by bonefish, 14 years ago

Replying to mmadia:

hrev34979 + VM_CACHE_TRACING_STACK_TRACE 0 , KDEBUG_LEVEL 0

while building @alpha-raw, Terminal would spit out "could not create mutex".

Please create a new ticket for that -- it doesn't seem related to this one. The system has apparently run out of semaphores. Not sure why that is or why that would prevent entering KDL (USB keyboard?). Normally a panic() like the above would deliberately crash the fs_shell, but I suppose the debug server needs semaphores to show the debug dialog and just hangs if it doesn't get any.

comment:26 by bonefish, 14 years ago

Probably fixed in hrev35195.

Fortunately, an ASSERT() I introduced a few days ago was triggered today while running a Haiku build. It showed the race condition caused by the (now fixed) missing locking in action: The pages of the kernel stack for a new userland thread were just being incorrectly mapped into a heap area. After continuing I got the "page still has mappings" panic() in the undertaker. Since a page in that area was not mapped correctly, its wired_count wouldn't be decremented, thus leading to the panic().

Matt, please give the new revision a whirl with a configuration likely to reproduce the bug (i.e. no tracing) and close the ticket if you can't reproduce it anymore.

comment:27 by mmadia, 14 years ago

Resolution: fixed
Status: new → closed

No KDL nor instant reboot after 9+ hours of jam -aqj4 @alpha-raw in a standard hrev35200 @alpha-raw gcc2hybrid build. I'll let this run for another day or so, but will close the ticket now. Thank you!
