Ticket #1512 (new bug)

Opened 8 months ago

Last modified 3 months ago

deadlock after clicking on Deskbar

Reported by: marcusoverhagen
Assigned to: axeld
Priority: critical
Milestone: R1/alpha1
Component: System/Kernel
Version: R1 development
Cc:
Platform: All

Description

After clicking on Deskbar (which doesn't react),
the mouse continues to work for about 1 second,
then everything freezes. This is reproducible.

Occurred with r22394.

PANIC: keyboard requested halt.

Welcome to Kernel Debugging Land...
Running on CPU 0
kdebug> wait 4
thread id state sem/cv cpu pri stack team name
0x90ae9800 82 waiting 4 - 104 0x94649000 47 AT Keyboard 1 watcher
0x90afb800 85 waiting 4 - 15 0x94668000 54 w>Deskbar
0x9090a800 29 waiting 4 - 5 0x803c0000 1 syslog sender
0x90a5d000 69 waiting 4 - 10 0x90703000 56 big brother is watching you
0x90a29000 4d waiting 4 - 103 0x906cb000 47 _input_server_event_loop_
0x90a2d000 51 waiting 4 - 90 0x906db000 37 event loop
0x90988000 32 waiting 4 - 10 0x803d3000 2f timer_thread
0x9098e000 36 waiting 4 - 10 0x803e3000 36 net_server
kdebug> sem 4
SEM: 0x93f5f0b0
id: 0x4
name: 'kernel_aspacelock'
owner: 0x0
count: 0xfffffff8
queue: 85 51 4d 32 69 36 82 29
last acquired by: 133, count: 1024
last released by: 133, count: 1
kdebug>

Attachments

deadlock.txt (155.2 kB) - added by marcusoverhagen on 10/01/07 07:19:18.
serial debug output
deadlock3.txt (196.8 kB) - added by marcusoverhagen on 10/01/07 08:22:38.
deadlock_menutracking.txt (23.5 kB) - added by marcusoverhagen on 10/01/07 08:33:31.

Change History

10/01/07 06:37:53 changed by jackburton

r22392 works fine here (r22393 and r22394 don't seem to change anything important). No AHCI here, though, so it might be related to that.

10/01/07 07:15:37 changed by marcusoverhagen

This might be an SMP problem. This is a Core 2 Duo system running at 2.4 GHz.
I "downclocked" it to 900 MHz, and now I'm able to use Deskbar for a few seconds
and can even run the About app. However, it soon deadlocks again, on the same semaphore.

10/01/07 07:19:18 changed by marcusoverhagen

  • attachment deadlock.txt added.

serial debug output

10/01/07 07:42:03 changed by axeld

  • priority changed from normal to critical.

Can you reproduce this with GCC 2.95.3?
Also, can you provide a stack trace from the deskbar thread?

10/01/07 08:22:01 changed by marcusoverhagen

I couldn't reproduce it exactly. A similar deadlock occurred this time,
but the thread that acquired the semaphore seems to be gone.
Please see the attached logfile.

10/01/07 08:22:38 changed by marcusoverhagen

  • attachment deadlock3.txt added.

10/01/07 08:32:37 changed by marcusoverhagen

I think I got it reproduced now. This time the Deskbar's menu
tracking thread was the holder of sem 4. See attached logfile.

10/01/07 08:33:31 changed by marcusoverhagen

  • attachment deadlock_menutracking.txt added.

10/01/07 16:02:30 changed by umccullough

I'm guessing #1509 is a duplicate of this.

10/01/07 16:46:35 changed by bonefish

According to deadlock_menutracking.txt, the thread is creating the kernel stack area for a new thread. In insert_area() it apparently accesses invalid memory. If vm.cpp was compiled with GCC 4 and without debugging, the faulting instruction would be accessing vm_area::base, with %edx (0xe8458d00) as the area pointer.

deadlock3.txt is interesting too: The thread that holds the kernel address space lock still lives -- it is still queued in the semaphore -- but it's no longer in the thread hash table. So it is apparently in the last phase of its death. Probably it is currently deleting its kernel stack area and accesses invalid memory. This would at least allow for the same explanation as in the other case.

So, presumably the area/address space structures become invalid at some point. Given that it doesn't seem to be reproducible on a single CPU, an SMP-only race condition seems likely.

Browsing through some code, I found at least one such race, though it seems a little unlikely that it would be this reliably reproducible: in thread_exit2() we call put_death_stack(), but don't hold (and cannot hold) the thread spinlock. Between releasing the spinlock in put_death_stack() and reacquiring it in thread_exit2(), another thread could grab the death stack we are still running on.

10/01/07 17:48:45 changed by bonefish

I've fixed the death stack issue in r22403. I don't think it's likely that this was the cause of the problem, but it wouldn't harm to check.

10/01/07 18:36:53 changed by marcusoverhagen

This problem is not reproducible with GCC 2.95.3.

10/02/07 05:52:00 changed by axeld

  • milestone changed from R1 to R1/alpha.

02/27/08 06:49:55 changed by aldeck

I just experienced a full system deadlock right after a fresh start, on my first click on the leaf menu. Pressing F12 to enter KDL worked.
It had never happened to me before; it seems hard to reproduce.

Since I don't really know what to do in KDL, and VMware suspend works thanks to Ingo (#985), I've put the .vmdk and .vmss files here: http://haikubeat.free.fr/files/testing/suspended_vmware-image_deadlock.zip. I've never tried moving a .vmss file, though, so we'll see if it works.
If you can resume the image on your side and think it's another bug, just tell me :)

02/27/08 06:58:22 changed by aldeck

Oh, and in case it turns out to be disk related, I should add that it was a first boot (freshly built image).

02/28/08 02:48:49 changed by jackburton

Since this bug doesn't show up with the "default" compiler, does it really have to be targeted to R1/alpha?

02/28/08 03:35:27 changed by aldeck

Replying to jackburton:

Since this bug doesn't show up with the "default" compiler, does it really have to be targeted to R1/alpha?

Hmm, I'm not really sure it's the same bug, but it happened to me with GCC 2.