Opened 17 years ago

Closed 9 years ago

#1512 closed bug (fixed)

deadlock after clicking on Deskbar

Reported by: marcusoverhagen Owned by: axeld
Priority: critical Milestone: R1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

After clicking on Deskbar (which doesn't react), the mouse continues to work for about 1 second, then everything freezes reproduceable.

Occured with hrev22394

PANIC: keyboard requested halt.

Welcome to Kernel Debugging Land... Running on CPU 0 kdebug> wait 4 thread id state sem/cv cpu pri stack team name 0x90ae9800 82 waiting 4 - 104 0x94649000 47 AT Keyboard 1 watcher 0x90afb800 85 waiting 4 - 15 0x94668000 54 w>Deskbar 0x9090a800 29 waiting 4 - 5 0x803c0000 1 syslog sender 0x90a5d000 69 waiting 4 - 10 0x90703000 56 big brother is watching you 0x90a29000 4d waiting 4 - 103 0x906cb000 47 _input_server_event_loop_ 0x90a2d000 51 waiting 4 - 90 0x906db000 37 event loop 0x90988000 32 waiting 4 - 10 0x803d3000 2f timer_thread 0x9098e000 36 waiting 4 - 10 0x803e3000 36 net_server kdebug> sem 4 SEM: 0x93f5f0b0 id: 0x4 name: 'kernel_aspacelock' owner: 0x0 count: 0xfffffff8 queue: 85 51 4d 32 69 36 82 29 last acquired by: 133, count: 1024 last released by: 133, count: 1 kdebug>

Attachments (3)

deadlock.txt (155.2 KB ) - added by marcusoverhagen 17 years ago.
serial debug output
deadlock3.txt (196.8 KB ) - added by marcusoverhagen 17 years ago.
deadlock_menutracking.txt (23.5 KB ) - added by marcusoverhagen 17 years ago.

Download all attachments as: .zip

Change History (29)

comment:1 by jackburton, 17 years ago

hrev22392 works fine here (hrev22393 and hrev22394 don't seem to change anything important). No AHCI here, though, so it might be related to that.

comment:2 by marcusoverhagen, 17 years ago

This might be an SMP problem. This is a core 2 duo sytem, running at 2.4 GHz. I "downclocked" it to 900 MHz and now I'm able to use Deskbar for a few seconds, can even run the About app. However, it deadlocks soon, with the same semaphore.

by marcusoverhagen, 17 years ago

Attachment: deadlock.txt added

serial debug output

comment:3 by axeld, 17 years ago

Priority: normalcritical

Can you reproduce this with GCC 2.95.3? Also, can you provide a stack trace from the deskbar thread?

comment:4 by marcusoverhagen, 17 years ago

I couldn't exaclty reproduce it. A similar deadlock occured now, but the thread that acquired the semaphore seems to be gone. Please see attached logfile.

by marcusoverhagen, 17 years ago

Attachment: deadlock3.txt added

comment:5 by marcusoverhagen, 17 years ago

I think I got it reproduced now. This time the Deskbar's menu tracking thread was the holder of sem 4. See attached logfile.

by marcusoverhagen, 17 years ago

Attachment: deadlock_menutracking.txt added

comment:6 by umccullough, 17 years ago

I'm guessing #1509 is a duplicate of this.

comment:7 by bonefish, 17 years ago

According to deadlock_menutracking.txt the thread is creating the kernel stack area for an new thread. In insert_area() it apparently accesses invalid memory. If vm.cpp was compiled with gcc 4 and no debugging the instruction would try to access vm_area::base with %edx (0xe8458d00) being the area pointer.

deadlock3.txt is interesting too: The thread that holds the kernel address space lock still lives -- it is still queued in the semaphore -- but it's no longer in the thread hash table. So it is apparently in the last phase of its death. Probably it is currently deleting its kernel stack area and accesses invalid memory. This would at least allow for the same explanation as in the other case.

So, supposedly the area/address space structures become invalid at some point. Given that it doesn't seem to be reproducible on a single CPU, a SMP-only race condition seems likely.

Browsing through some code, I found at least one of those, though it seems a little unlikely to be that well reproducible: In thread_exit2() we call put_death_stack(), but don't hold (and cannot hold) the thread spinlock. Between releasing the spinlock in put_death_stack() and reacquiring it in thread_exit2() another thread could grab our still used death stack.

comment:8 by bonefish, 17 years ago

I've fixed the death stack issue in hrev22403. I don't think it's likely that this was the cause of the problem, but it wouldn't harm to check.

comment:9 by marcusoverhagen, 17 years ago

This problem is not reproduceable with GCC 2.95.3

comment:10 by axeld, 17 years ago

Milestone: R1R1/alpha

comment:11 by aldeck, 17 years ago

I just experienced a full system deadlock, fresh start, on my first click on the leaf menu. Hitting F12 to KDL worked. It never happened to me, seems hard to reproduce.

Since i don't really know what to do in kdl, and vmware suspend works thanks to Ingo (#985), i've put the vmdk and vmss here : http://haikubeat.free.fr/files/testing/suspended_vmware-image_deadlock.zip , i never tried to move a vmss though, we'll see if it works. If if you can resume the image on your side and you think it's another bug, just tell :)

in reply to:  11 comment:12 by aldeck, 17 years ago

Ohh, and as you might find it's disk related, i must add it was a first boot (freshly built image).

comment:13 by jackburton, 17 years ago

Since this bug doesn't show up with the "default" compiler, has it really have to be targeted to R1/alpha ?

in reply to:  13 comment:14 by aldeck, 17 years ago

Replying to jackburton:

Since this bug doesn't show up with the "default" compiler, has it really have to be targeted to R1/alpha ?

Hmm, not really sure it is the same bug, but it happened to me with GCC 2.

comment:15 by anevilyak, 16 years ago

I just had this bug happen to me on gcc2 with a single non-HT CPU, fyi. Though the rw_lock seems to have been renamed in the meantime. If I see it again I'll try and gather more detailed data on it, at first I thought it might be related to swap and just rebooted to try disabling swap and seeing if it worked, and I haven't been able to duplicate it again since, with or without.

comment:16 by axeld, 16 years ago

If you can reproduce this, please attach an up-to-date backtrace. Also, it would be helpful if you tried removing the FireWire driver (in case you have FireWire hardware as Marcus has).

comment:17 by anevilyak, 16 years ago

I haven't seen this one in quite a while now here, I may have disabled my FW controller though. Will check later.

comment:18 by anevilyak, 16 years ago

OK...with my FW controller enabled, I now hang when the cursor appears on the desktop, though I can still enter KDL. Interestingly it appears to be an interrupt sharing problem, as the last few messages in syslog are:

More than 99-188498180nterrupts of vector 3 are unhandled

Along with some complaints by the radeon driver. Not sure what broke with the dprintf that's used for that message though, I thought it used to say 99% correctly. In any case that appears to be unrelated to this ticket.

comment:19 by anevilyak, 16 years ago

I can't seem to duplicate this any more, and I've actively tried to abuse deskbar during startup and various other high activity situations. Unless someone else still sees it, I think it might meanwhile have been fixed.

comment:20 by stippi, 16 years ago

Since this appears so hard to reproduce (I never encountered it myself), it should probably be at least moved out of the alpha/1 target.

comment:21 by stippi, 16 years ago

Milestone: R1/alpha1R1

comment:22 by anevilyak, 16 years ago

I just saw this one again while messing with IMKit...one of its teams seemed to be unkillable. Dropped to KDL and found the following:

The deadlocking team:

team 164512
TEAM: 0xb1f54000
id:          164512 (0x282a0)
name:        'ProtocolLoader'
args:        '/boot/apps/im_kit/ProtocolLoader /boot/common/add-ons/im_kit/pr'
next:        0x00000000
parent:      0x81192198 (id = 1)
children:    0x00000000
num_threads: 2
state:       0
flags:       0x0
io_context:  0x96ac7e80
address_space: 0x94008d20
main_thread: 0x9055d000
thread_list: 0x905c8800
group_id:    164504
session_id:  160047
kdebug> threads 164512
thread         id  state     wait for   object  cpu pri  stack      team  name
0x905c8800 164529  waiting   rwlock 0x94008d28    -  10  0x95eec000164512  user thread
0x9055d000 164512  waiting   rwlock 0x94008d28    -  20  0x95ebc000164512  MSN

The rw_lock mentioned:

rw lock 0x94008d28:
  name:            address space
  holder:          -1
  reader count:    1
  writer count:    1
  owner count:      0
  flags:           0x0
  waiting threads: 164512/w 164529/r 164575/r 164602/r

The mentioned threads:

thread 164512
THREAD: 0x9055d000
id:                 164512 (0x282a0)
name:               "MSN"
all_next:           0x905db800
team_next:          0x00000000
q_next:             0x905d4800
priority:           20 (next 20, I/O: -1)
state:              waiting
next_state:         waiting
cpu:                0x00000000
sig_pending:        0x104100 (blocked: 0x0)
in_kernel:          1
waiting for:        rwlock 0x94008d28
fault_handler:      0x00000000
args:               0x813f055c 0x00000000
entry:              0x8004d4fc
team:               0xb1f54000, "ProtocolLoader"
  exit.sem:         45164700
  exit.status:      0x0 (No error)
  exit.reason:      0x1
  exit.signal:      0x0
  exit.waiters:
kernel_stack_area:  736446
kernel_stack_base:  0x95ebc000
user_stack_area:    -1
user_stack_base:    0x7efef000
user_local_storage: 0x7ffef000
kernel_errno:       0x0 (No error)
kernel_time:        6022
user_time:          5048
flags:              0x1
architecture dependant section:
        esp: 0x95ebfc08
        ss: 0x00000010
        fpu_state at 0x9055d3d0

I find it interesting that this particular one is marked as user_stack area -1, which seems incorrect.

The other 3 threads waiting on read locks belong to ProcessController, and were me attempting to invoke its context menu. Backtrace of 164512:

stack trace for thread 164512 "MSN"
    kernel stack: 0x95ebc000 to 0x95ec0000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 95ebfc64 (+  48) 80057bc8   <kernel_x86> context_switch(thread*: 0x9055d000, thread*: 0x905d4800) + 0x003c
 1 95ebfc94 (+  64) 80057eaf   <kernel_x86> simple_reschedule() + 0x029f
 2 95ebfcd4 (+  48) 8003c8ad   <kernel_x86> rw_lock_wait(rw_lock*: 0x94008d28, true) + 0x00c9
 3 95ebfd04 (+  64) 8003cf12   <kernel_x86>:rw_lock_write_lock + 0x010a
 4 95ebfd44 (+  48) 800b6c20   <kernel_x86> AddressSpaceWriteLocker<0x95ebfdfc>::SetFromArea(int32: 164512, int32: 736448, false, vm_area&: 0x95ebfdf8) + 0x008c
 5 95ebfd74 (+  64) 800b6ca9   <kernel_x86> AddressSpaceWriteLocker<0x95ebfdfc>::SetFromArea(int32: 164512, int32: 736448, vm_area&: 0x95ebfdf8) + 0x0025
 6 95ebfdb4 (+  80) 800b99c7   <kernel_x86>:vm_delete_area + 0x0027
 7 95ebfe04 (+ 192) 80052c7b   <kernel_x86>:thread_exit + 0x009f
 8 95ebfec4 (+  64) 800484e8   <kernel_x86>:handle_signals + 0x0454
 9 95ebff04 (+  64) 80053384   <kernel_x86>:thread_at_kernel_exit + 0x0090
10 95ebff44 (+ 100) 800d0152   <kernel_x86>:kernel_exit_handle_signals + 0x0006
user iframe at 0x95ebffa8 (end = 0x95ec0000)
 eax 0x0            ebx 0x55f580        ecx 0x7ffeef40   edx 0x246
 esi 0x7ffef538     edi 0x7ffef54c      ebp 0x7ffeef6c   esp 0x95ebffdc
 eip 0xffff0104  eflags 0x200207   user esp 0x7ffeef40
 vector: 0x63, error code: 0x0
11 95ebffa8 (+   0) ffff0104   <commpage>:commpage_syscall + 0x0004
12 7ffeef6c (+  64) 00202998   <_APP_>:_start + 0x0064
13 7ffeefac (+  48) 0010090a   </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x090a
14 7ffeefdc (+   0) 7ffeefec   736448:ProtocolLoader_main_stack@0x7efef000 + 0xffffec

The areas mentioned in the call to AddressSpaceWriteLocker:

kdebug> area 164512
could not find area 164512 (164512)
kdebug> area 736448
AREA: 0x8134aeb0
name:           'ProtocolLoader_main_stack'
owner:          0x282a0
id:             0xb3cc0
base:           0x7efef000
size:           0x1001000
protection:     0x3b
wiring:         0x0
memory_type:    0x0
cache:          0x94008f00
cache_type:     RAM
cache_offset:   0x0
cache_next:     0x00000000
cache_prev:     0x00000000
page mappings:  8

Not sure what else to look up right now, leaving this system up and running though, please let me know if there's any other information I can dig up from its current state.

comment:23 by pulkomandy, 10 years ago

No news for the last 6 years here, is it ok to assume this is fixed now?

comment:24 by DarkmatterVale, 10 years ago

I am running hrev48487 and I do not see this error at all.

I would assume as well that this has been fixed.

comment:25 by mmu_man, 10 years ago

I'd vote for closing it as well.

comment:26 by kallisti5, 9 years ago

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.