Opened 17 years ago
Closed 9 years ago
#1512 closed bug (fixed)
deadlock after clicking on Deskbar
Reported by: | marcusoverhagen | Owned by: | axeld |
---|---|---|---|
Priority: | critical | Milestone: | R1 |
Component: | System/Kernel | Version: | R1/pre-alpha1 |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
After clicking on Deskbar (which doesn't react), the mouse continues to work for about 1 second, then everything freezes reproduceable.
Occured with hrev22394
PANIC: keyboard requested halt.
Welcome to Kernel Debugging Land... Running on CPU 0 kdebug> wait 4 thread id state sem/cv cpu pri stack team name 0x90ae9800 82 waiting 4 - 104 0x94649000 47 AT Keyboard 1 watcher 0x90afb800 85 waiting 4 - 15 0x94668000 54 w>Deskbar 0x9090a800 29 waiting 4 - 5 0x803c0000 1 syslog sender 0x90a5d000 69 waiting 4 - 10 0x90703000 56 big brother is watching you 0x90a29000 4d waiting 4 - 103 0x906cb000 47 _input_server_event_loop_ 0x90a2d000 51 waiting 4 - 90 0x906db000 37 event loop 0x90988000 32 waiting 4 - 10 0x803d3000 2f timer_thread 0x9098e000 36 waiting 4 - 10 0x803e3000 36 net_server kdebug> sem 4 SEM: 0x93f5f0b0 id: 0x4 name: 'kernel_aspacelock' owner: 0x0 count: 0xfffffff8 queue: 85 51 4d 32 69 36 82 29 last acquired by: 133, count: 1024 last released by: 133, count: 1 kdebug>
Attachments (3)
Change History (29)
comment:1 by , 17 years ago
comment:2 by , 17 years ago
This might be an SMP problem. This is a core 2 duo sytem, running at 2.4 GHz. I "downclocked" it to 900 MHz and now I'm able to use Deskbar for a few seconds, can even run the About app. However, it deadlocks soon, with the same semaphore.
comment:3 by , 17 years ago
Priority: | normal → critical |
---|
Can you reproduce this with GCC 2.95.3? Also, can you provide a stack trace from the deskbar thread?
comment:4 by , 17 years ago
I couldn't exaclty reproduce it. A similar deadlock occured now, but the thread that acquired the semaphore seems to be gone. Please see attached logfile.
by , 17 years ago
Attachment: | deadlock3.txt added |
---|
comment:5 by , 17 years ago
I think I got it reproduced now. This time the Deskbar's menu tracking thread was the holder of sem 4. See attached logfile.
by , 17 years ago
Attachment: | deadlock_menutracking.txt added |
---|
comment:7 by , 17 years ago
According to deadlock_menutracking.txt the thread is creating the kernel stack area for an new thread. In insert_area() it apparently accesses invalid memory. If vm.cpp was compiled with gcc 4 and no debugging the instruction would try to access vm_area::base with %edx (0xe8458d00) being the area pointer.
deadlock3.txt is interesting too: The thread that holds the kernel address space lock still lives -- it is still queued in the semaphore -- but it's no longer in the thread hash table. So it is apparently in the last phase of its death. Probably it is currently deleting its kernel stack area and accesses invalid memory. This would at least allow for the same explanation as in the other case.
So, supposedly the area/address space structures become invalid at some point. Given that it doesn't seem to be reproducible on a single CPU, a SMP-only race condition seems likely.
Browsing through some code, I found at least one of those, though it seems a little unlikely to be that well reproducible: In thread_exit2() we call put_death_stack(), but don't hold (and cannot hold) the thread spinlock. Between releasing the spinlock in put_death_stack() and reacquiring it in thread_exit2() another thread could grab our still used death stack.
comment:8 by , 17 years ago
I've fixed the death stack issue in hrev22403. I don't think it's likely that this was the cause of the problem, but it wouldn't harm to check.
comment:10 by , 17 years ago
Milestone: | R1 → R1/alpha |
---|
follow-up: 12 comment:11 by , 17 years ago
I just experienced a full system deadlock, fresh start, on my first click on the leaf menu. Hitting F12 to KDL worked. It never happened to me, seems hard to reproduce.
Since i don't really know what to do in kdl, and vmware suspend works thanks to Ingo (#985), i've put the vmdk and vmss here : http://haikubeat.free.fr/files/testing/suspended_vmware-image_deadlock.zip , i never tried to move a vmss though, we'll see if it works. If if you can resume the image on your side and you think it's another bug, just tell :)
comment:12 by , 17 years ago
Ohh, and as you might find it's disk related, i must add it was a first boot (freshly built image).
follow-up: 14 comment:13 by , 17 years ago
Since this bug doesn't show up with the "default" compiler, has it really have to be targeted to R1/alpha ?
comment:14 by , 17 years ago
Replying to jackburton:
Since this bug doesn't show up with the "default" compiler, has it really have to be targeted to R1/alpha ?
Hmm, not really sure it is the same bug, but it happened to me with GCC 2.
comment:15 by , 16 years ago
I just had this bug happen to me on gcc2 with a single non-HT CPU, fyi. Though the rw_lock seems to have been renamed in the meantime. If I see it again I'll try and gather more detailed data on it, at first I thought it might be related to swap and just rebooted to try disabling swap and seeing if it worked, and I haven't been able to duplicate it again since, with or without.
comment:16 by , 16 years ago
If you can reproduce this, please attach an up-to-date backtrace. Also, it would be helpful if you tried removing the FireWire driver (in case you have FireWire hardware as Marcus has).
comment:17 by , 16 years ago
I haven't seen this one in quite a while now here, I may have disabled my FW controller though. Will check later.
comment:18 by , 16 years ago
OK...with my FW controller enabled, I now hang when the cursor appears on the desktop, though I can still enter KDL. Interestingly it appears to be an interrupt sharing problem, as the last few messages in syslog are:
More than 99-188498180nterrupts of vector 3 are unhandled
Along with some complaints by the radeon driver. Not sure what broke with the dprintf that's used for that message though, I thought it used to say 99% correctly. In any case that appears to be unrelated to this ticket.
comment:19 by , 16 years ago
I can't seem to duplicate this any more, and I've actively tried to abuse deskbar during startup and various other high activity situations. Unless someone else still sees it, I think it might meanwhile have been fixed.
comment:20 by , 16 years ago
Since this appears so hard to reproduce (I never encountered it myself), it should probably be at least moved out of the alpha/1 target.
comment:21 by , 16 years ago
Milestone: | R1/alpha1 → R1 |
---|
comment:22 by , 16 years ago
I just saw this one again while messing with IMKit...one of its teams seemed to be unkillable. Dropped to KDL and found the following:
The deadlocking team:
team 164512 TEAM: 0xb1f54000 id: 164512 (0x282a0) name: 'ProtocolLoader' args: '/boot/apps/im_kit/ProtocolLoader /boot/common/add-ons/im_kit/pr' next: 0x00000000 parent: 0x81192198 (id = 1) children: 0x00000000 num_threads: 2 state: 0 flags: 0x0 io_context: 0x96ac7e80 address_space: 0x94008d20 main_thread: 0x9055d000 thread_list: 0x905c8800 group_id: 164504 session_id: 160047 kdebug> threads 164512 thread id state wait for object cpu pri stack team name 0x905c8800 164529 waiting rwlock 0x94008d28 - 10 0x95eec000164512 user thread 0x9055d000 164512 waiting rwlock 0x94008d28 - 20 0x95ebc000164512 MSN
The rw_lock mentioned:
rw lock 0x94008d28: name: address space holder: -1 reader count: 1 writer count: 1 owner count: 0 flags: 0x0 waiting threads: 164512/w 164529/r 164575/r 164602/r
The mentioned threads:
thread 164512 THREAD: 0x9055d000 id: 164512 (0x282a0) name: "MSN" all_next: 0x905db800 team_next: 0x00000000 q_next: 0x905d4800 priority: 20 (next 20, I/O: -1) state: waiting next_state: waiting cpu: 0x00000000 sig_pending: 0x104100 (blocked: 0x0) in_kernel: 1 waiting for: rwlock 0x94008d28 fault_handler: 0x00000000 args: 0x813f055c 0x00000000 entry: 0x8004d4fc team: 0xb1f54000, "ProtocolLoader" exit.sem: 45164700 exit.status: 0x0 (No error) exit.reason: 0x1 exit.signal: 0x0 exit.waiters: kernel_stack_area: 736446 kernel_stack_base: 0x95ebc000 user_stack_area: -1 user_stack_base: 0x7efef000 user_local_storage: 0x7ffef000 kernel_errno: 0x0 (No error) kernel_time: 6022 user_time: 5048 flags: 0x1 architecture dependant section: esp: 0x95ebfc08 ss: 0x00000010 fpu_state at 0x9055d3d0
I find it interesting that this particular one is marked as user_stack area -1, which seems incorrect.
The other 3 threads waiting on read locks belong to ProcessController, and were me attempting to invoke its context menu. Backtrace of 164512:
stack trace for thread 164512 "MSN" kernel stack: 0x95ebc000 to 0x95ec0000 user stack: 0x7efef000 to 0x7ffef000 frame caller <image>:function + offset 0 95ebfc64 (+ 48) 80057bc8 <kernel_x86> context_switch(thread*: 0x9055d000, thread*: 0x905d4800) + 0x003c 1 95ebfc94 (+ 64) 80057eaf <kernel_x86> simple_reschedule() + 0x029f 2 95ebfcd4 (+ 48) 8003c8ad <kernel_x86> rw_lock_wait(rw_lock*: 0x94008d28, true) + 0x00c9 3 95ebfd04 (+ 64) 8003cf12 <kernel_x86>:rw_lock_write_lock + 0x010a 4 95ebfd44 (+ 48) 800b6c20 <kernel_x86> AddressSpaceWriteLocker<0x95ebfdfc>::SetFromArea(int32: 164512, int32: 736448, false, vm_area&: 0x95ebfdf8) + 0x008c 5 95ebfd74 (+ 64) 800b6ca9 <kernel_x86> AddressSpaceWriteLocker<0x95ebfdfc>::SetFromArea(int32: 164512, int32: 736448, vm_area&: 0x95ebfdf8) + 0x0025 6 95ebfdb4 (+ 80) 800b99c7 <kernel_x86>:vm_delete_area + 0x0027 7 95ebfe04 (+ 192) 80052c7b <kernel_x86>:thread_exit + 0x009f 8 95ebfec4 (+ 64) 800484e8 <kernel_x86>:handle_signals + 0x0454 9 95ebff04 (+ 64) 80053384 <kernel_x86>:thread_at_kernel_exit + 0x0090 10 95ebff44 (+ 100) 800d0152 <kernel_x86>:kernel_exit_handle_signals + 0x0006 user iframe at 0x95ebffa8 (end = 0x95ec0000) eax 0x0 ebx 0x55f580 ecx 0x7ffeef40 edx 0x246 esi 0x7ffef538 edi 0x7ffef54c ebp 0x7ffeef6c esp 0x95ebffdc eip 0xffff0104 eflags 0x200207 user esp 0x7ffeef40 vector: 0x63, error code: 0x0 11 95ebffa8 (+ 0) ffff0104 <commpage>:commpage_syscall + 0x0004 12 7ffeef6c (+ 64) 00202998 <_APP_>:_start + 0x0064 13 7ffeefac (+ 48) 0010090a </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x090a 14 7ffeefdc (+ 0) 7ffeefec 736448:ProtocolLoader_main_stack@0x7efef000 + 0xffffec
The areas mentioned in the call to AddressSpaceWriteLocker:
kdebug> area 164512 could not find area 164512 (164512) kdebug> area 736448 AREA: 0x8134aeb0 name: 'ProtocolLoader_main_stack' owner: 0x282a0 id: 0xb3cc0 base: 0x7efef000 size: 0x1001000 protection: 0x3b wiring: 0x0 memory_type: 0x0 cache: 0x94008f00 cache_type: RAM cache_offset: 0x0 cache_next: 0x00000000 cache_prev: 0x00000000 page mappings: 8
Not sure what else to look up right now, leaving this system up and running though, please let me know if there's any other information I can dig up from its current state.
comment:23 by , 10 years ago
No news for the last 6 years here, is it ok to assume this is fixed now?
comment:24 by , 10 years ago
I am running hrev48487 and I do not see this error at all.
I would assume as well that this has been fixed.
comment:26 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
hrev22392 works fine here (hrev22393 and hrev22394 don't seem to change anything important). No AHCI here, though, so it might be related to that.