Opened 11 years ago

Closed 11 years ago

#2893 closed bug (fixed)

System hanging at desktop with r28244

Reported by: anevilyak Owned by: euan
Priority: normal Milestone: R1
Component: Drivers/Graphics/radeon Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

My Athlon64 hangs after the terminal launches with hrev28244. Syslog attached. Interestingly, the keyboard still responds normally, but hitting ctrl+alt+delete drops me into a gdb session for input_server (via fake_app_server). KDL indicates the following teams active:

team           id  parent      name
0x8117d198      1  0x00000000  kernel_team
0x812aa000    157  0x812307f8  sh
0x812304c8     95  0x8117d198  Deskbar
0x81230660     96  0x8117d198  sh
0x812307f8     97  0x8117d198  Terminal
0x81230b28     99  0x8117d198  media_server
0x81230cc0    100  0x8117d198  midi_server
0x81230e58    101  0x8117d198  print_server
0x81230990    165  0x81230660  ProcessController
0x8117db28     79  0x8117d198  syslog_daemon
0x81230330     83  0x8117d990  input_server
0x8117d4c8     52  0x8117d198  registrar
0x8117d330    115  0x81230b28  media_addon_server
0x8117d660     59  0x8117d198  debug_server
0x8117d7f8     60  0x8117d198  net_server
0x81230198     92  0x8117d198  Tracker
0x8117d990     61  0x8117d198  app_server

Also indicated is that the sh for Bootscript is waiting on a child. Interestingly, when returning from KDL, the desktop isn't repainted like it usually is. The backtrace from ctrl+alt+del crashing input_server is as follows:

debug_server: Thread 164 entered the debugger: Debugger call: `getNumAvailable() == 0'
stack trace, current PC 0xffff0104  :
  (0x701c5c9c)  0x4ea6d9  __assert_fail + 0x51
  (0x701c5ccc)  0x4f47fc  malloc__Q28BPrivate10threadHeapUl + 0x4a0
  (0x701c5d1c)  0x4f5150  malloc + 0x154
  (0x701c5d4c)  0x4df6fd  __builtin_new + 0x29
  (0x701c5d7c)  0x3911be  _InitData__7BWindowG5BRectPCc11window_look11window_feelUlUll + 0x4f2
  (0x701c5ddc)  0x38aadc  __7BWindowG5BRectPCc11window_look11window_feelUlUl + 0xcc
  (0x701c5e2c)  0x5b3cc0  __8TMWindow + 0x5c
  (0x701c5eec)  0x5b1cbb  _DeviceWatcher__19KeyboardInputDevicePv + 0x227
  (0x701c5fac)  0x4e58fc  thread_entry + 0x20

Not really sure what's going on here, but 28243 at least worked fine apart from the issue mentioned in ticket #2866. Disabling SMP makes no difference for this bug (though I'd assume it shouldn't anyways given lack of hyperthreading and/or > 1 CPU / core). Let me know what else would be of interest please.

Attachments (1)

syslog_a64_r28244_19102008.log (92.1 KB ) - added by anevilyak 11 years ago.
Boot syslog leading up to hang

Download all attachments as: .zip

Change History (27)

by anevilyak, 11 years ago

Boot syslog leading up to hang

comment:1 by bonefish, 11 years ago

The Bootscript sh is already gone. It would have a team ID smaller than all of the servers. The sh with the ID 96 should be the sub shell that starts ProcessController. I guess it just hangs due to the same problem everything is hanging.

As always in situations when something hangs, I'd try to analyze in the kernel debugger what the concerned threads are waiting for and whether they may have deadlocked somehow. You could start with ProcessController, but it's probably waiting for an app server reply. So checking the app server might be the better approach.

comment:2 by anevilyak, 11 years ago

hrev28246 still exhibits this problem, will try to analyze it later today.

comment:3 by anevilyak, 11 years ago

I suspect this might not be directly related to the change and instead might be exposing some app_server / input_server race, it's inconsistent in terms of when it locks, and in some respects sounds suspiciously like ticket #2894. I had another instance of it locking where it made it all the way through booting but then froze when I actually tried to click the Terminal window. Note though that it's not 100% frozen: keyboard still responds normally (as in numlock, f12, etc. work), and also the mouse pointer still moves correctly. Redraw on exit from KDL is still broken though, as is any other form of screen update at this point. Will let you know when I have further information.

comment:4 by anevilyak, 11 years ago

Maybe not related to ticket 2894, tried reverting the changes in 28241 and made no difference, except that input_server doesn't crash into gdb on ctrl+alt+del any more.

comment:5 by anevilyak, 11 years ago

Possibly relevant observation: removing the radeon accelerant makes the system work correctly with current revisions... would the VM changes made in hrev26244 affect the memory mapping related ops that the app_server / accelerants use? I double checked and hrev26243 definitely works perfectly.

in reply to:  5 comment:6 by bonefish, 11 years ago

Replying to anevilyak:

Possibly relevant observation: removing the radeon accelerant makes the system work correctly with current revisions... would the VM changes made in hrev26244 affect the memory mapping related ops that the app_server / accelerants use? I double checked and hrev26243 definitely works perfectly.

Device drivers usually use map_physical_memory() to permanently map the physical addresses that are interesting for them (registers, graphic memory). Although it sounds similar the physical page mapping stuff I've changed is unrelated. It is only used when someone needs temporary access to a physical page.

The observation that it is radeon related is interesting. Checking what the app server threads are waiting for might give some further clue.

comment:7 by anevilyak, 11 years ago

Will check on the thread states tonight. In any case, it succeeded in completing a full build without issues when using VESA, and the speed difference at build was quite impressive, great work!

comment:8 by anevilyak, 11 years ago

OK, had a look at the thread states.

kdebug> threads 61
thread         id  state     wait for   object  cpu pri  stack      team  name
0x838cf800    156  waiting   sem           749    -  15  0x84e38000   61  w:93:Deskbar
0x83910000    191  waiting   sem          2361    -  15  0x84e73000   61  a:189:x-vnd.Geb-ProcessControll
0x838db000    129  waiting   sem          1541    -  15  0x84d60000   61  a:92:x-vnd.Be-TRAK
0x838cc000    130  waiting   sem          1559    -  15  0x84dc3000   61  a:93:x-vnd.Be-TSKB
0x838c8000    103  waiting   sem          1077    -  90  0x84d4c000   61  event loop
0x838c4000    135  waiting   sem          1628    -  15  0x84d48000   61  a:96:x-vnd.Haiku-Terminal
0x838d1800    104  waiting   sem          1076    -  95  0x84d54000   61  cursor loop
0x838f4000    137  waiting   sem          1616    -  15  0x84dff000   61  w:93:Twitcher
0x838ce800    106  waiting   sem          1148    -  15  0x84d58000   61  a:98:x-vnd.Haiku-midi_server
0x838f5800    139  waiting   sem          1605    -  15  0x84def000   61  w:92:Tracker Status
0x8388f800     77  waiting   sem           707    -  10  0x84192000   61  Font Manager
0x838f3800    140  waiting   sem           749    -  15  0x84df3000   61  w:96:Terminal 1
0x838cb000    109  waiting   sem          1172    -  15  0x84d5c000   61  a:99:x-vnd.Be-PSRV
0x8389c000     78  waiting   sem           744    -  10  0x84d04000   61  screen manager
0x838f8000    142  waiting   sem          1694    -  15  0x84dfb000   61  w:93:offscreen
0x838dd000    111  waiting   sem          1228    -  15  0x84d8f000   61  a:97:x-vnd.Be.media-server
0x838fc800    147  waiting   sem             0    -  15  0x84e18000   61  w:92:Desktop
0x838a2800     85  waiting   sem           832    -  15  0x84d10000   61  d:0:baron
0x838a3000     86  waiting   sem           895    -  15  0x84d14000   61  a:79:x-vnd.Haiku-SystemLogger
0x838a3800     87  waiting   sem           906    -  15  0x84d18000   61  a:83:x-vnd.Be-input_server
0x838ec800    119  waiting   sem          1329    -  15  0x84da7000   61  a:116:x-vnd.Be.addon-host
0x83866000     61  waiting   sem           702    -  10  0x80233000   61  picasso

Several threads are waiting on the sem "floating overlays lock", including the windows that appear frozen (i.e. Deskbar and Terminal):

SEM: 0x8312a508
id:      749 (0x2ed)
name:    'floating overlays lock'
owner:   61
count:   -2
queue:   140 156

This sem is held by the Picasso thread, which in turn appears to have recursively locked the "haiku app_server" sem" and is stuck waiting there.

kdebug> sem 702
SEM: 0x83129db0
id:      702 (0x2be)
name:    'haiku app_server'
owner:   -1
count:   -1
queue:   61
last acquired by: 61

If of interest:

THREAD: 0x83866000
id:                 61 (0x3d)
name:               "picasso"
all_next:           0x00000000
team_next:          0x00000000
q_next:             0x838a6800
priority:           10 (next 10, I/O: -1)
state:              waiting
next_state:         waiting
cpu:                0x00000000
sig_pending:        0x0 (blocked: 0x0)
in_kernel:          1
waiting for:        semaphore 702
fault_handler:      0x00000000
args:               0x811ea708 0x00000000
entry:              0x8004b700
team:               0x8117b990, "app_server"
  exit.sem:         545
  exit.status:      0x0 (No error)
  exit.reason:      0x0
  exit.signal:      0x0
  exit.waiters:
kernel_stack_area:  3605
kernel_stack_base:  0x80233000
user_stack_area:    3641
user_stack_base:    0x7efef000
user_local_storage: 0x7ffef000
kernel_errno:       0x80006003 (No such file or directory)
kernel_time:        1060540
user_time:          143514
flags:              0x0
architecture dependant section:
        esp: 0x80236d28
        ss: 0x00000010
        fpu_state at 0x838663d0

What also seems suspicious to me though is that the desktop window is indicated as waiting on semaphore 0, which doesn't exist, unless this is a special indicator of some kind that I'm not aware of.

comment:9 by anevilyak, 11 years ago

Ran through this a few more times, and while the other thread states vary somewhat, the root cause reliably seems to be picasso deadlocking while holding other locks, and always on the same sem. Picasso's backtrace:

stack trace for thread 61 "picasso"
    kernel stack: 0x80233000 to 0x80237000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 80236d84 (+  48) 80055318   <kernel_x86> context_switch(thread*: 0x83868000, thread*: 0x838cb800) + 0x003c
 1 80236db4 (+  64) 800555ff   <kernel_x86> simple_reschedule() + 0x029f
 2 80236df4 (+  64) 80044bed   <kernel_x86>:switch_sem_etc + 0x0319
 3 80236e34 (+  64) 800448a6   <kernel_x86>:acquire_sem_etc + 0x0026
 4 80236e74 (+  80) 8004204a   <kernel_x86>:_get_port_message_info_etc + 0x012e
 5 80236ec4 (+  80) 80041f0d   <kernel_x86>:port_buffer_size_etc + 0x0025
 6 80236f14 (+  48) 80042d99   <kernel_x86>:_user_port_buffer_size_etc + 0x008d
 7 80236f44 (+ 100) 800cae52   <kernel_x86>:handle_syscall + 0x00af
user iframe at 0x80236fa8 (end = 0x80237000)
 eax 0xc3           ebx 0x77e71c        ecx 0x7ffeecd0   edx 0xffff0104
 esi 0xffffffff     edi 0x7fffffff      ebp 0x7ffeecfc   esp 0x80236fdc
 eip 0xffff0104  eflags 0x212      user esp 0x7ffeecd0
 vector: 0x63, error code: 0x0
 8 80236fa8 (+   0) ffff0104   <commpage>:commpage_syscall + 0x0004
 9 7ffeecfc (+  48) 003c6e9d   <libbe.so> BPrivate12LinkReceiverx<0x1801e3a0>::AdjustReplyBuffer(0xffffffff) + 0x0045
10 7ffeed2c (+  64) 003c6f5f   <libbe.so> BPrivate12LinkReceiverx<0x1801e3a0>::ReadFromPort(0xffffffff) + 0x003b
11 7ffeed6c (+  48) 003c6d43   <libbe.so> BPrivate12LinkReceiverRlx<0x1801e3a0>::GetNextMessage(0x7ffeedd8) + 0x0047
12 7ffeed9c (+ 320) 00266f41   <_APP_> MessageLooper<0x1801a300>::_MessageLooper(0x18023ef0) + 0x0051
13 7ffeeedc (+  48) 00267028   <_APP_> MessageLooper<0x1801a300>::_message_thread(NULL) + 0x0028
14 7ffeef0c (+  64) 0023f519   <_APP_> AppServer<0x1801a300>::RunLooper(0x7ffef120) + 0x003d
15 7ffeef4c (+  48) 0023fb23   <_APP_>:main + 0x0093
16 7ffeef7c (+  48) 0023e6d7   <_APP_>:_start + 0x005b
17 7ffeefac (+  48) 001008ea   </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x08ea
18 7ffeefdc (+   0) 7ffeefec   3643:app_server_main_stack@0x7efef000 + 0xffffec

comment:10 by anevilyak, 11 years ago

The Desktop is also consistently on sem 0 when this hang happens, regardless of the state of the other threads btw.

comment:11 by stippi, 11 years ago

Thanks for all the info. The stack trace for picasso is interesting, since it just seems to be in it's message loop waiting for the next message. That would indicate that the lock was leaked in the processing of an earlier message, no?

The "floating overlays lock" is used to protect the visible cursor shape and the stored background behind the cursor. When something renders to the screen that would overdraw the cursor, then the lock comes into play. The code is using auto lockers, see DrawingEngine. What programs are you running at startup? (Trying to figure out which messages may have been processed earlier.) If we can't spot this ourselves, maybe we need to enable some tracing mode in app_server to see what was going on before. Perhaps something that uses the kernel tracing would be most convenient.

comment:12 by axeld, 11 years ago

The most interesting stack trace should indeed be the one of the Desktop - there is no semaphore '0', so this looks like a kernel bug.

Since the lockers are all benaphores, you can't make any conclusions from the info you gave above. "owner" just means the team, not the thread that locked it last. picasso just doesn't do much more than reading messages from its port, and in that process, it will lock the same semaphore over and over (it's released only when sending data from the other thread).

comment:13 by bonefish, 11 years ago

Semaphore 0 looks bad indeed. Besides the stack trace of the Desktop thread it would be good to probe with "call" whether the 0 was supplied intentionally or whether something went wrong later. A look at the semaphore table would be interesting, too. A dw [symbol sSems] 1 gives its address and another "dw" for that address with at least count 13 will print the data structure for the 0th semaphore entry.

in reply to:  11 comment:14 by anevilyak, 11 years ago

Replying to stippi:

The "floating overlays lock" is used to protect the visible cursor shape and the stored background behind the cursor. When something renders to the screen that would overdraw the cursor, then the lock comes into play. The code is using auto lockers, see DrawingEngine. What programs are you running at startup? (Trying to figure out which messages may have been processed earlier.) If we can't spot this ourselves, maybe we need to enable some tracing mode in app_server to see what was going on before. Perhaps something that uses the kernel tracing would be most convenient.

I actually don't have a chance to run any program other than what the image runs by default (ergo Terminal, post install script, etc)., it hangs right around when Tracker, Deskbar and Terminal are starting.

comment:15 by anevilyak, 11 years ago

I tried altering the debugging levels in the radeon driver to get more information, but that didn't change much other than that it locks sooner. Now I have the deskbar's window thread in the semaphore 0 state, backtrace as follows:

stack trace for thread 157 "w:93:Deskbar"
    kernel stack: 0x91645000 to 0x91649000
      user stack: 0x704d3000 to 0x70513000
frame               caller     <image>:function + offset
 0 91648c04 (+  48) 8005704c   <kernel_x86> context_switch(thread*: 0x900f3800, thread*: 0x900d7800) + 0x003c
 1 91648c34 (+  64) 80057333   <kernel_x86> simple_reschedule() + 0x029f
 2 91648c74 (+  64) 800464ed   <kernel_x86>:switch_sem_etc + 0x0319
 3 91648cb4 (+  48) 8004617c   <kernel_x86>:acquire_sem + 0x001c
 4 91648ce4 (+  48) 908a365d   </boot/beos/system/add-ons/kernel/drivers/dev/graphics/radeon>:Radeon_WaitForIdle + 0x0051
 5 91648d14 (+  64) 9089e971   </boot/beos/system/add-ons/kernel/drivers/dev/graphics/radeon>:control_hook + 0x0221
 6 91648d54 (+ 320) 8006b2bf   <kernel_x86> devfs_ioctl(fs_volume*: 0x81170284, fs_vnode*: 0x812026e8, 0x81229210, uint32: 0x2716 (10006), 0x705129a4, uint32: 0x8 (8)) + 0x0217
 7 91648e94 (+  64) 80098e17   <kernel_x86> common_ioctl(file_descriptor*: 0x812273e8, uint32: 0x2716 (10006), 0x705129a4, uint32: 0x8 (8)) + 0x0043
 8 91648ed4 (+  48) 80086b4b   <kernel_x86> fd_ioctl(false, int32: 9, uint32: 0x2716 (10006), 0x705129a4, uint32: 0x8 (8)) + 0x005f
 9 91648f04 (+  64) 80087a09   <kernel_x86>:_user_ioctl + 0x0051
10 91648f44 (+ 100) 800ceec1   <kernel_x86>:handle_syscall + 0x00be
user iframe at 0x91648fa8 (end = 0x91649000)
 eax 0x85           ebx 0x77e2fc        ecx 0x70512950   edx 0xffff0104
 esi 0x9            edi 0x180489d4      ebp 0x7051297c   esp 0x91648fdc
 eip 0xffff0104  eflags 0x206      user esp 0x70512950
 vector: 0x63, error code: 0x0
11 91648fa8 (+   0) ffff0104   <commpage>:commpage_syscall + 0x0004
12 7051297c (+  48) 007fd363   <radeon.accelerant>:Radeon_WaitForIdle + 0x0037
13 705129ac (+  80) 007f910f   <radeon.accelerant>:SYNC_TO_TOKEN + 0x004f
14 705129fc (+  48) 007f8f45   <radeon.accelerant>:ACQUIRE_ENGINE + 0x0059
15 70512a2c (+  48) 002942e3   <_APP_> AccelerantHWInterface<0x18048918>::FillRegion(BRegion&: 0x70512aec, rgb_color&: 0x70512ab8, true) + 0x004f
16 70512a5c (+ 176) 002981ff   <_APP_> DrawingEngine<0x18056648>::FillRect(BRect: 0x70512c34) + 0x028f
17 70512b0c (+ 320) 0028149e   <_APP_> ServerWindow<0x180bd600>::_DispatchViewDrawingMessage(int32: 157, BPrivate12LinkReceiver&: 0x180bf960) + 0x051e
18 70512c4c (+ 416) 00280d2e   <_APP_> ServerWindow<0x180bd600>::_DispatchViewMessage(int32: 157, BPrivate12LinkReceiver&: 0x180bf960) + 0x2132
19 70512dec (+ 288) 0027eb41   <_APP_> ServerWindow<0x180bd600>::_DispatchMessage(int32: 157, BPrivate12LinkReceiver&: 0x180bf960, 0x7fffffff, int32: 3084832, 0x900f3800, 0x0) + 0x117d
20 70512f0c (+ 112) 00284a16   <_APP_> ServerWindow<0x180bd600>::_MessageLooper(0x0) + 0x0282
21 70512f7c (+  48) 00267028   <_APP_> MessageLooper<0x180bd600>::_message_thread(NULL) + 0x0028
22 70512fac (+  48) 006f28fc   <libroot.so>:_get_next_team_info (nearest) + 0x005c
23 70512fdc (+   0) 70512fec   5197:w:93:Deskbar_157_stack@0x704d3000 + 0x3ffec

call 157 3 -1
thread 157, w:93:Deskbar
91648cb4 8004617c   <kernel_x86>:acquire_sem(0x0 (0))

kdebug> dw [symbol sSems] 1
0x8010c614     4 kernel_x86:sSems
0x8010c618     1 kernel_x86:sSemsActive
0x8010c624     4 kernel_x86:sSemsSpinlock
[0x8010c624]  ....   00000000
kdebug> dw 0x8010c614 13
[0x8010c614]  .p.....°¹.q.   8f927000 00000001 8f93b9b0 8f937180
[0x8010c624]  .....S.........   00000000 800e53ea 00000000 00000000
[0x8010c634]  ................   00000000 00000000 00000000 00000000
[0x8010c644]  ....               00000000

Hope that helps, need to head to work now so any further info will have to wait until later in the day.

comment:16 by anevilyak, 11 years ago

By the way, is there a difference in how acquire_sem validates semaphores vs the kernel debugger? Since acquire_sem seems to have decided it's ok to try and grab semaphore 0, while the kernel debugger tells me that's not a valid sem id.

in reply to:  15 comment:17 by bonefish, 11 years ago

Replying to anevilyak:

kdebug> dw [symbol sSems] 1
0x8010c614     4 kernel_x86:sSems
0x8010c618     1 kernel_x86:sSemsActive
0x8010c624     4 kernel_x86:sSemsSpinlock
[0x8010c624]  ....   00000000
kdebug> dw 0x8010c614 13
[0x8010c614]  .p.....°¹.q.   8f927000 00000001 8f93b9b0 8f937180
[0x8010c624]  .....S.........   00000000 800e53ea 00000000 00000000
[0x8010c634]  ................   00000000 00000000 00000000 00000000
[0x8010c644]  ....               00000000

That went slightly wrong. I didn't think "symbol" would be so "helpful" not only to return the exact name match. Hence the first "dw" printed data from the wrong address. sSems contains the pointer to the semaphore entry table, so we need the data at 0x8f927000 in this case.

Replying to anevilyak:

By the way, is there a difference in how acquire_sem validates semaphores vs the kernel debugger?

Not really. Both just check the ID in the respective semaphore table entry.

Since acquire_sem seems to have decided it's ok to try and grab semaphore 0, while the kernel debugger tells me that's not a valid sem id.

Yep, that's weird indeed. Maybe table entry 0 can tell us more.

You could add a panic() in switch_sem_etc() when someone tries to acquire semaphore 0 (after the gKernelStartup check). That would drop the thread into the kernel debugger right when it happens.

The acquire_sem() is invoked by Radeon_WaitForIdle(), probably by the following construct:

	if( acquire_lock )
		ACQUIRE_BEN( di->si->cp.lock );

The lock is initialized in Radeon_InitCP(). Printing the semaphore ID in both functions might verify whether or not it is modified at some point. If it is, a kernel watchpoint should help to track the perpetrator down.

comment:18 by anevilyak, 11 years ago

How does one go about setting kernel watchpoints? Will check the other items in your list after work.

in reply to:  18 comment:19 by bonefish, 11 years ago

Replying to anevilyak:

How does one go about setting kernel watchpoints? Will check the other items in your list after work.

First of all enable them (KERNEL_BREAKPOINTS in kernel_debug_config.h). Then either use the kernel debugger command "watchpoint" -- watchpoint <address> 4 sets a 4-byte write watchpoint -- or set them programatically, i.e. include <arch/user_debugger.h> and call arch_set_kernel_watchpoint(address, B_DATA_WRITE_WATCHPOINT, 4);.

comment:20 by anevilyak, 11 years ago

It gets stranger: Radeon_InitCP is never actually called. I even put panics at the beginning of the function and after initializing that sem, they never showed up. However, at the point of calling Radeon_WaitForIdle(), di->si->cp.lock.sem is 0. Sems output as follows:

kdebug> symbol sSems
0x80117614     4 kernel_x86:sSems
0x80117618     1 kernel_x86:sSemsActive
0x80117624     4 kernel_x86:sSemsSpinlock
kdebug> dw 0x80117614
[0x80117614]  .`¥.....Ø6..3..   8fa56000 00000001 8fa736d8 8fa73398
kdebug> dw 0x8fa56000 13
[0x8fa56000]  ................   00000000 00000000 fffffffe 00000000
[0x8fa56010]  .r..............   81177200 00000001 00000000 00000009
[0x8fa56020]  ............<.   00000001 00000089 00000001 91a03ca0
[0x8fa56030]  d..               8018af64

I then tried adding panics to switch_sem_etc with id 0. First hit was early in the boot loader:

stack trace for thread 9 "heap grower"
    kernel stack: 0x80187000 to 0x8018b000
frame               caller     <image>:function + offset
 0 8018ac08 (+  48) 8005dad1   <kernel_x86> invoke_debugger_command() + 0x00f5
 1 8018ac38 (+  64) 8005d8c1   <kernel_x86> invoke_pipe_segment__FP21debugger_command_pipelPc() + 0x0079
 2 8018ac78 (+  64) 8005dc48   <kernel_x86> invoke_debugger_command_pipe() + 0x009c
 3 8018acb8 (+  48) 8005f1d0   <kernel_x86> _ParseCommandPipe__16ExpressionParserRi() + 0x0234
 4 8018ace8 (+  64) 8005e60a   <kernel_x86> EvaluateCommand__16ExpressionParserPCcRi() + 0x02ba
 5 8018ad28 (+ 224) 800605f8   <kernel_x86> evaluate_debug_command() + 0x0088
 6 8018ae08 (+  64) 8005ba92   <kernel_x86> kernel_debugger_loop__Fv() + 0x01ae
 7 8018ae48 (+  32) 8005c8fd   <kernel_x86> kernel_debugger() + 0x004d
 8 8018ae68 (+ 192) 8005c8a5   <kernel_x86> panic() + 0x0029
 9 8018af28 (+  80) 80047fdf   <kernel_x86> switch_sem_etc() + 0x012f
10 8018af78 (+  48) 80047e58   <kernel_x86> acquire_sem() + 0x001c
11 8018afa8 (+  48) 8003a9a6   <kernel_x86> heap_grow_thread__FPv() + 0x0016
12 8018afd8 (+  32) 80052b17   <kernel_x86> _create_kernel_thread_kentry__Fv() + 0x001b
13 8018aff8 (+2145865736) 80052ab4   <kernel_x86> thread_kthread_exit__Fv() + 0x0000
kdebug> sem 0
sem "0" doesn't exist!

kdebug> symbol sSems
0x80117614     4 kernel_x86:sSems
0x80117618     1 kernel_x86:sSemsActive
0x80117624     4 kernel_x86:sSemsSpinlock
kdebug> dw 0x80117614
[0x80117614]  .`¥.....ìg¥.Ì_..   8fa56000 00000001 8fa567ec 900d5fcc
kdebug> dw 0x8fa56000 13
[0x8fa56000]  ................   00000000 00000001 00000000 00000000
[0x8fa56010]  .r..............   81177200 00000001 00000000 00000000
[0x8fa56020]  ................   00000000 00000000 00000000 00000000
[0x8fa56030]  ....               00000000

The heap grower calls switch_sem_etc with id 0 quite a few times, though this seems to not prevent the boot process from moving along. Next ideas?

comment:21 by anevilyak, 11 years ago

Ah...after some further investigation, Radeon_InitCP is only called if DMA acceleration is enabled (it's disabled by default since it doesn't work correctly on some chipsets, including mine (X800)). If I force the driver to enable DMA acceleration, then the hang doesn't happen, but I consequently have corrupted graphics. I'm now even more confused as to why hrev28244 affected it at all, since it'd appear as if it might've been trying to lock an invalid/uninitialized semaphore all along.

comment:22 by anevilyak, 11 years ago

With respect to sem ids there seems to be an inconsistency between the kernel debugger dump_sem_info command and switch_sem_etc with respect to validity checking by the way. switch_sem_etc automatically assumes a sem is invalid only if its id is < 0 (sem.cpp line 728), whilst dump_sem_info only checks the semaphore table if the id is > 0 (sem.cpp line 224). So is 0 an allowed semaphore id or not?

comment:23 by bonefish, 11 years ago

Component: System/KernelDrivers/Graphics/radeon
Owner: changed from axeld to euan

0 is definitely allowed, though I was convinced that the first semaphore ID we give out is 1. Obviously I was wrong; the first semaphore created (for the heap grower) has ID 0. The check in the debugger command function was incorrect.

That explains pretty much everything. The heap grower semaphore is only released when the heap is in need for growing, which may just not happen at that point. Before hrev28244 the driver might have gotten lucky in that heap growing was necessary and the semaphore was released. hrev28244 has probably changed the heap usage just enough that this doesn't happen anymore.

At any rate, this is obviously a bug in the driver. Changing component and hoping someone familiar with the driver will fix it.

comment:24 by axeld, 11 years ago

Maybe we should change the kernel to give out IDs starting with 1 only? IIRC we already changed this for some other things.

Rene, if you want to have a look at the Radeon driver, please do so, if not, I could take care of it next week.

in reply to:  24 comment:25 by anevilyak, 11 years ago

Replying to axeld:

Maybe we should change the kernel to give out IDs starting with 1 only? IIRC we already changed this for some other things.

Rene, if you want to have a look at the Radeon driver, please do so, if not, I could take care of it next week.

Fixed in hrev28295, let me know if you see any problems with my solution. It works correctly here in any case. Much thanks to all of you for your patience helping track this down, I promise I won't need as much handholding with the kernel debugger eventually :)

comment:26 by anevilyak, 11 years ago

Resolution: fixed
Status: newclosed
Note: See TracTickets for help on using tickets.