Opened 16 years ago
Closed 16 years ago
#2893 closed bug (fixed)
System hanging at desktop with r28244
Reported by: | anevilyak | Owned by: | euan |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | Drivers/Graphics/radeon | Version: | R1/pre-alpha1 |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
My Athlon64 hangs after the terminal launches with hrev28244. Syslog attached. Interestingly, the keyboard still responds normally, but hitting ctrl+alt+delete drops me into a gdb session for input_server (via fake_app_server). KDL indicates the following teams active:
team id parent name 0x8117d198 1 0x00000000 kernel_team 0x812aa000 157 0x812307f8 sh 0x812304c8 95 0x8117d198 Deskbar 0x81230660 96 0x8117d198 sh 0x812307f8 97 0x8117d198 Terminal 0x81230b28 99 0x8117d198 media_server 0x81230cc0 100 0x8117d198 midi_server 0x81230e58 101 0x8117d198 print_server 0x81230990 165 0x81230660 ProcessController 0x8117db28 79 0x8117d198 syslog_daemon 0x81230330 83 0x8117d990 input_server 0x8117d4c8 52 0x8117d198 registrar 0x8117d330 115 0x81230b28 media_addon_server 0x8117d660 59 0x8117d198 debug_server 0x8117d7f8 60 0x8117d198 net_server 0x81230198 92 0x8117d198 Tracker 0x8117d990 61 0x8117d198 app_server
Also indicated is that the sh for Bootscript is waiting on a child. Interestingly, when returning from KDL, the desktop isn't repainted like it usually is. The backtrace from ctrl+alt+del crashing input_server is as follows:
debug_server: Thread 164 entered the debugger: Debugger call: `getNumAvailable() == 0' stack trace, current PC 0xffff0104 : (0x701c5c9c) 0x4ea6d9 __assert_fail + 0x51 (0x701c5ccc) 0x4f47fc malloc__Q28BPrivate10threadHeapUl + 0x4a0 (0x701c5d1c) 0x4f5150 malloc + 0x154 (0x701c5d4c) 0x4df6fd __builtin_new + 0x29 (0x701c5d7c) 0x3911be _InitData__7BWindowG5BRectPCc11window_look11window_feelUlUll + 0x4f2 (0x701c5ddc) 0x38aadc __7BWindowG5BRectPCc11window_look11window_feelUlUl + 0xcc (0x701c5e2c) 0x5b3cc0 __8TMWindow + 0x5c (0x701c5eec) 0x5b1cbb _DeviceWatcher__19KeyboardInputDevicePv + 0x227 (0x701c5fac) 0x4e58fc thread_entry + 0x20
Not really sure what's going on here, but 28243 at least worked fine apart from the issue mentioned in ticket #2866. Disabling SMP makes no difference for this bug (though I'd assume it shouldn't anyways given lack of hyperthreading and/or > 1 CPU / core). Let me know what else would be of interest please.
Attachments (1)
Change History (27)
by , 16 years ago
Attachment: | syslog_a64_r28244_19102008.log added |
---|
comment:1 by , 16 years ago
The Bootscript sh is already gone. It would have a team ID smaller than all of the servers. The sh with the ID 96 should be the sub shell that starts ProcessController. I guess it just hangs due to the same problem everything is hanging.
As always in situations when something hangs, I'd try to analyze in the kernel debugger what the concerned threads are waiting for and whether they may have deadlocked somehow. You could start with ProcessController, but it's probably waiting for an app server reply. So checking the app server might be the better approach.
comment:2 by , 16 years ago
hrev28246 still exhibits this problem, will try to analyze it later today.
comment:3 by , 16 years ago
I suspect this might not be directly related to the change and instead might be exposing some app_server / input_server race, it's inconsistent in terms of when it locks, and in some respects sounds suspiciously like ticket #2894. I had another instance of it locking where it made it all the way through booting but then froze when I actually tried to click the Terminal window. Note though that it's not 100% frozen: keyboard still responds normally (as in numlock, f12, etc. work), and also the mouse pointer still moves correctly. Redraw on exit from KDL is still broken though, as is any other form of screen update at this point. Will let you know when I have further information.
comment:4 by , 16 years ago
Maybe not related to ticket 2894, tried reverting the changes in 28241 and made no difference, except that input_server doesn't crash into gdb on ctrl+alt+del any more.
follow-up: 6 comment:5 by , 16 years ago
comment:6 by , 16 years ago
Replying to anevilyak:
Possibly relevant observation: removing the radeon accelerant makes the system work correctly with current revisions... would the VM changes made in hrev26244 affect the memory mapping related ops that the app_server / accelerants use? I double checked and hrev26243 definitely works perfectly.
Device drivers usually use map_physical_memory() to permanently map the physical addresses that are interesting for them (registers, graphic memory). Although it sounds similar the physical page mapping stuff I've changed is unrelated. It is only used when someone needs temporary access to a physical page.
The observation that it is radeon related is interesting. Checking what the app server threads are waiting for might give some further clue.
comment:7 by , 16 years ago
Will check on the thread states tonight. In any case, it succeeded in completing a full build without issues when using VESA, and the speed difference at build was quite impressive, great work!
comment:8 by , 16 years ago
OK, had a look at the thread states.
kdebug> threads 61 thread id state wait for object cpu pri stack team name 0x838cf800 156 waiting sem 749 - 15 0x84e38000 61 w:93:Deskbar 0x83910000 191 waiting sem 2361 - 15 0x84e73000 61 a:189:x-vnd.Geb-ProcessControll 0x838db000 129 waiting sem 1541 - 15 0x84d60000 61 a:92:x-vnd.Be-TRAK 0x838cc000 130 waiting sem 1559 - 15 0x84dc3000 61 a:93:x-vnd.Be-TSKB 0x838c8000 103 waiting sem 1077 - 90 0x84d4c000 61 event loop 0x838c4000 135 waiting sem 1628 - 15 0x84d48000 61 a:96:x-vnd.Haiku-Terminal 0x838d1800 104 waiting sem 1076 - 95 0x84d54000 61 cursor loop 0x838f4000 137 waiting sem 1616 - 15 0x84dff000 61 w:93:Twitcher 0x838ce800 106 waiting sem 1148 - 15 0x84d58000 61 a:98:x-vnd.Haiku-midi_server 0x838f5800 139 waiting sem 1605 - 15 0x84def000 61 w:92:Tracker Status 0x8388f800 77 waiting sem 707 - 10 0x84192000 61 Font Manager 0x838f3800 140 waiting sem 749 - 15 0x84df3000 61 w:96:Terminal 1 0x838cb000 109 waiting sem 1172 - 15 0x84d5c000 61 a:99:x-vnd.Be-PSRV 0x8389c000 78 waiting sem 744 - 10 0x84d04000 61 screen manager 0x838f8000 142 waiting sem 1694 - 15 0x84dfb000 61 w:93:offscreen 0x838dd000 111 waiting sem 1228 - 15 0x84d8f000 61 a:97:x-vnd.Be.media-server 0x838fc800 147 waiting sem 0 - 15 0x84e18000 61 w:92:Desktop 0x838a2800 85 waiting sem 832 - 15 0x84d10000 61 d:0:baron 0x838a3000 86 waiting sem 895 - 15 0x84d14000 61 a:79:x-vnd.Haiku-SystemLogger 0x838a3800 87 waiting sem 906 - 15 0x84d18000 61 a:83:x-vnd.Be-input_server 0x838ec800 119 waiting sem 1329 - 15 0x84da7000 61 a:116:x-vnd.Be.addon-host 0x83866000 61 waiting sem 702 - 10 0x80233000 61 picasso
Several threads are waiting on the sem "floating overlays lock", including the windows that appear frozen (i.e. Deskbar and Terminal):
SEM: 0x8312a508 id: 749 (0x2ed) name: 'floating overlays lock' owner: 61 count: -2 queue: 140 156
This sem is held by the Picasso thread, which in turn appears to have recursively locked the "haiku app_server" sem" and is stuck waiting there.
kdebug> sem 702 SEM: 0x83129db0 id: 702 (0x2be) name: 'haiku app_server' owner: -1 count: -1 queue: 61 last acquired by: 61
If of interest:
THREAD: 0x83866000 id: 61 (0x3d) name: "picasso" all_next: 0x00000000 team_next: 0x00000000 q_next: 0x838a6800 priority: 10 (next 10, I/O: -1) state: waiting next_state: waiting cpu: 0x00000000 sig_pending: 0x0 (blocked: 0x0) in_kernel: 1 waiting for: semaphore 702 fault_handler: 0x00000000 args: 0x811ea708 0x00000000 entry: 0x8004b700 team: 0x8117b990, "app_server" exit.sem: 545 exit.status: 0x0 (No error) exit.reason: 0x0 exit.signal: 0x0 exit.waiters: kernel_stack_area: 3605 kernel_stack_base: 0x80233000 user_stack_area: 3641 user_stack_base: 0x7efef000 user_local_storage: 0x7ffef000 kernel_errno: 0x80006003 (No such file or directory) kernel_time: 1060540 user_time: 143514 flags: 0x0 architecture dependant section: esp: 0x80236d28 ss: 0x00000010 fpu_state at 0x838663d0
What also seems suspicious to me though is that the desktop window is indicated as waiting on semaphore 0, which doesn't exist, unless this is a special indicator of some kind that I'm not aware of.
comment:9 by , 16 years ago
Ran through this a few more times, and while the other thread states vary somewhat, the root cause reliably seems to be picasso deadlocking while holding other locks, and always on the same sem. Picasso's backtrace:
stack trace for thread 61 "picasso" kernel stack: 0x80233000 to 0x80237000 user stack: 0x7efef000 to 0x7ffef000 frame caller <image>:function + offset 0 80236d84 (+ 48) 80055318 <kernel_x86> context_switch(thread*: 0x83868000, thread*: 0x838cb800) + 0x003c 1 80236db4 (+ 64) 800555ff <kernel_x86> simple_reschedule() + 0x029f 2 80236df4 (+ 64) 80044bed <kernel_x86>:switch_sem_etc + 0x0319 3 80236e34 (+ 64) 800448a6 <kernel_x86>:acquire_sem_etc + 0x0026 4 80236e74 (+ 80) 8004204a <kernel_x86>:_get_port_message_info_etc + 0x012e 5 80236ec4 (+ 80) 80041f0d <kernel_x86>:port_buffer_size_etc + 0x0025 6 80236f14 (+ 48) 80042d99 <kernel_x86>:_user_port_buffer_size_etc + 0x008d 7 80236f44 (+ 100) 800cae52 <kernel_x86>:handle_syscall + 0x00af user iframe at 0x80236fa8 (end = 0x80237000) eax 0xc3 ebx 0x77e71c ecx 0x7ffeecd0 edx 0xffff0104 esi 0xffffffff edi 0x7fffffff ebp 0x7ffeecfc esp 0x80236fdc eip 0xffff0104 eflags 0x212 user esp 0x7ffeecd0 vector: 0x63, error code: 0x0 8 80236fa8 (+ 0) ffff0104 <commpage>:commpage_syscall + 0x0004 9 7ffeecfc (+ 48) 003c6e9d <libbe.so> BPrivate12LinkReceiverx<0x1801e3a0>::AdjustReplyBuffer(0xffffffff) + 0x0045 10 7ffeed2c (+ 64) 003c6f5f <libbe.so> BPrivate12LinkReceiverx<0x1801e3a0>::ReadFromPort(0xffffffff) + 0x003b 11 7ffeed6c (+ 48) 003c6d43 <libbe.so> BPrivate12LinkReceiverRlx<0x1801e3a0>::GetNextMessage(0x7ffeedd8) + 0x0047 12 7ffeed9c (+ 320) 00266f41 <_APP_> MessageLooper<0x1801a300>::_MessageLooper(0x18023ef0) + 0x0051 13 7ffeeedc (+ 48) 00267028 <_APP_> MessageLooper<0x1801a300>::_message_thread(NULL) + 0x0028 14 7ffeef0c (+ 64) 0023f519 <_APP_> AppServer<0x1801a300>::RunLooper(0x7ffef120) + 0x003d 15 7ffeef4c (+ 48) 0023fb23 <_APP_>:main + 0x0093 16 7ffeef7c (+ 48) 0023e6d7 <_APP_>:_start + 0x005b 17 7ffeefac (+ 48) 001008ea </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x08ea 18 7ffeefdc (+ 0) 7ffeefec 3643:app_server_main_stack@0x7efef000 + 0xffffec
comment:10 by , 16 years ago
The Desktop is also consistently on sem 0 when this hang happens, regardless of the state of the other threads btw.
follow-up: 14 comment:11 by , 16 years ago
Thanks for all the info. The stack trace for picasso is interesting, since it just seems to be in it's message loop waiting for the next message. That would indicate that the lock was leaked in the processing of an earlier message, no?
The "floating overlays lock" is used to protect the visible cursor shape and the stored background behind the cursor. When something renders to the screen that would overdraw the cursor, then the lock comes into play. The code is using auto lockers, see DrawingEngine. What programs are you running at startup? (Trying to figure out which messages may have been processed earlier.) If we can't spot this ourselves, maybe we need to enable some tracing mode in app_server to see what was going on before. Perhaps something that uses the kernel tracing would be most convenient.
comment:12 by , 16 years ago
The most interesting stack trace should indeed be the one of the Desktop - there is no semaphore '0', so this looks like a kernel bug.
Since the lockers are all benaphores, you can't make any conclusions from the info you gave above. "owner" just means the team, not the thread that locked it last. picasso just doesn't do much more than reading messages from its port, and in that process, it will lock the same semaphore over and over (it's released only when sending data from the other thread).
comment:13 by , 16 years ago
Semaphore 0 looks bad indeed. Besides the stack trace of the Desktop thread it would be good to probe with "call" whether the 0 was supplied intentionally or whether something went wrong later. A look at the semaphore table would be interesting, too. A dw [symbol sSems] 1
gives its address and another "dw" for that address with at least count 13 will print the data structure for the 0th semaphore entry.
comment:14 by , 16 years ago
Replying to stippi:
The "floating overlays lock" is used to protect the visible cursor shape and the stored background behind the cursor. When something renders to the screen that would overdraw the cursor, then the lock comes into play. The code is using auto lockers, see DrawingEngine. What programs are you running at startup? (Trying to figure out which messages may have been processed earlier.) If we can't spot this ourselves, maybe we need to enable some tracing mode in app_server to see what was going on before. Perhaps something that uses the kernel tracing would be most convenient.
I actually don't have a chance to run any program other than what the image runs by default (ergo Terminal, post install script, etc)., it hangs right around when Tracker, Deskbar and Terminal are starting.
follow-up: 17 comment:15 by , 16 years ago
I tried altering the debugging levels in the radeon driver to get more information, but that didn't change much other than that it locks sooner. Now I have the deskbar's window thread in the semaphore 0 state, backtrace as follows:
stack trace for thread 157 "w:93:Deskbar" kernel stack: 0x91645000 to 0x91649000 user stack: 0x704d3000 to 0x70513000 frame caller <image>:function + offset 0 91648c04 (+ 48) 8005704c <kernel_x86> context_switch(thread*: 0x900f3800, thread*: 0x900d7800) + 0x003c 1 91648c34 (+ 64) 80057333 <kernel_x86> simple_reschedule() + 0x029f 2 91648c74 (+ 64) 800464ed <kernel_x86>:switch_sem_etc + 0x0319 3 91648cb4 (+ 48) 8004617c <kernel_x86>:acquire_sem + 0x001c 4 91648ce4 (+ 48) 908a365d </boot/beos/system/add-ons/kernel/drivers/dev/graphics/radeon>:Radeon_WaitForIdle + 0x0051 5 91648d14 (+ 64) 9089e971 </boot/beos/system/add-ons/kernel/drivers/dev/graphics/radeon>:control_hook + 0x0221 6 91648d54 (+ 320) 8006b2bf <kernel_x86> devfs_ioctl(fs_volume*: 0x81170284, fs_vnode*: 0x812026e8, 0x81229210, uint32: 0x2716 (10006), 0x705129a4, uint32: 0x8 (8)) + 0x0217 7 91648e94 (+ 64) 80098e17 <kernel_x86> common_ioctl(file_descriptor*: 0x812273e8, uint32: 0x2716 (10006), 0x705129a4, uint32: 0x8 (8)) + 0x0043 8 91648ed4 (+ 48) 80086b4b <kernel_x86> fd_ioctl(false, int32: 9, uint32: 0x2716 (10006), 0x705129a4, uint32: 0x8 (8)) + 0x005f 9 91648f04 (+ 64) 80087a09 <kernel_x86>:_user_ioctl + 0x0051 10 91648f44 (+ 100) 800ceec1 <kernel_x86>:handle_syscall + 0x00be user iframe at 0x91648fa8 (end = 0x91649000) eax 0x85 ebx 0x77e2fc ecx 0x70512950 edx 0xffff0104 esi 0x9 edi 0x180489d4 ebp 0x7051297c esp 0x91648fdc eip 0xffff0104 eflags 0x206 user esp 0x70512950 vector: 0x63, error code: 0x0 11 91648fa8 (+ 0) ffff0104 <commpage>:commpage_syscall + 0x0004 12 7051297c (+ 48) 007fd363 <radeon.accelerant>:Radeon_WaitForIdle + 0x0037 13 705129ac (+ 80) 007f910f <radeon.accelerant>:SYNC_TO_TOKEN + 0x004f 14 705129fc (+ 48) 007f8f45 <radeon.accelerant>:ACQUIRE_ENGINE + 0x0059 15 70512a2c (+ 48) 002942e3 <_APP_> AccelerantHWInterface<0x18048918>::FillRegion(BRegion&: 0x70512aec, rgb_color&: 0x70512ab8, true) + 0x004f 16 70512a5c (+ 176) 002981ff <_APP_> DrawingEngine<0x18056648>::FillRect(BRect: 0x70512c34) + 0x028f 17 70512b0c (+ 320) 0028149e <_APP_> ServerWindow<0x180bd600>::_DispatchViewDrawingMessage(int32: 157, BPrivate12LinkReceiver&: 0x180bf960) + 0x051e 18 70512c4c (+ 416) 00280d2e <_APP_> ServerWindow<0x180bd600>::_DispatchViewMessage(int32: 157, BPrivate12LinkReceiver&: 0x180bf960) + 0x2132 19 70512dec (+ 288) 0027eb41 <_APP_> ServerWindow<0x180bd600>::_DispatchMessage(int32: 157, BPrivate12LinkReceiver&: 0x180bf960, 0x7fffffff, int32: 3084832, 0x900f3800, 0x0) + 0x117d 20 70512f0c (+ 112) 00284a16 <_APP_> ServerWindow<0x180bd600>::_MessageLooper(0x0) + 0x0282 21 70512f7c (+ 48) 00267028 <_APP_> MessageLooper<0x180bd600>::_message_thread(NULL) + 0x0028 22 70512fac (+ 48) 006f28fc <libroot.so>:_get_next_team_info (nearest) + 0x005c 23 70512fdc (+ 0) 70512fec 5197:w:93:Deskbar_157_stack@0x704d3000 + 0x3ffec call 157 3 -1 thread 157, w:93:Deskbar 91648cb4 8004617c <kernel_x86>:acquire_sem(0x0 (0)) kdebug> dw [symbol sSems] 1 0x8010c614 4 kernel_x86:sSems 0x8010c618 1 kernel_x86:sSemsActive 0x8010c624 4 kernel_x86:sSemsSpinlock [0x8010c624] .... 00000000 kdebug> dw 0x8010c614 13 [0x8010c614] .p.....°¹.q. 8f927000 00000001 8f93b9b0 8f937180 [0x8010c624] .....S......... 00000000 800e53ea 00000000 00000000 [0x8010c634] ................ 00000000 00000000 00000000 00000000 [0x8010c644] .... 00000000
Hope that helps, need to head to work now so any further info will have to wait until later in the day.
comment:16 by , 16 years ago
By the way, is there a difference in how acquire_sem validates semaphores vs the kernel debugger? Since acquire_sem seems to have decided it's ok to try and grab semaphore 0, while the kernel debugger tells me that's not a valid sem id.
comment:17 by , 16 years ago
Replying to anevilyak:
kdebug> dw [symbol sSems] 1 0x8010c614 4 kernel_x86:sSems 0x8010c618 1 kernel_x86:sSemsActive 0x8010c624 4 kernel_x86:sSemsSpinlock [0x8010c624] .... 00000000 kdebug> dw 0x8010c614 13 [0x8010c614] .p.....°¹.q. 8f927000 00000001 8f93b9b0 8f937180 [0x8010c624] .....S......... 00000000 800e53ea 00000000 00000000 [0x8010c634] ................ 00000000 00000000 00000000 00000000 [0x8010c644] .... 00000000
That went slightly wrong. I didn't think "symbol" would be so "helpful" not only to return the exact name match. Hence the first "dw" printed data from the wrong address. sSems contains the pointer to the semaphore entry table, so we need the data at 0x8f927000 in this case.
Replying to anevilyak:
By the way, is there a difference in how acquire_sem validates semaphores vs the kernel debugger?
Not really. Both just check the ID in the respective semaphore table entry.
Since acquire_sem seems to have decided it's ok to try and grab semaphore 0, while the kernel debugger tells me that's not a valid sem id.
Yep, that's weird indeed. Maybe table entry 0 can tell us more.
You could add a panic() in switch_sem_etc() when someone tries to acquire semaphore 0 (after the gKernelStartup check). That would drop the thread into the kernel debugger right when it happens.
The acquire_sem() is invoked by Radeon_WaitForIdle(), probably by the following construct:
if( acquire_lock ) ACQUIRE_BEN( di->si->cp.lock );
The lock is initialized in Radeon_InitCP(). Printing the semaphore ID in both functions might verify whether or not it is modified at some point. If it is, a kernel watchpoint should help to track the perpetrator down.
follow-up: 19 comment:18 by , 16 years ago
How does one go about setting kernel watchpoints? Will check the other items in your list after work.
comment:19 by , 16 years ago
Replying to anevilyak:
How does one go about setting kernel watchpoints? Will check the other items in your list after work.
First of all enable them (KERNEL_BREAKPOINTS in kernel_debug_config.h). Then either use the kernel debugger command "watchpoint" -- watchpoint <address> 4
sets a 4-byte write watchpoint -- or set them programatically, i.e. include <arch/user_debugger.h> and call arch_set_kernel_watchpoint(address, B_DATA_WRITE_WATCHPOINT, 4);
.
comment:20 by , 16 years ago
It gets stranger: Radeon_InitCP is never actually called. I even put panics at the beginning of the function and after initializing that sem, they never showed up. However, at the point of calling Radeon_WaitForIdle(), di->si->cp.lock.sem is 0. Sems output as follows:
kdebug> symbol sSems 0x80117614 4 kernel_x86:sSems 0x80117618 1 kernel_x86:sSemsActive 0x80117624 4 kernel_x86:sSemsSpinlock kdebug> dw 0x80117614 [0x80117614] .`¥.....Ø6..3.. 8fa56000 00000001 8fa736d8 8fa73398 kdebug> dw 0x8fa56000 13 [0x8fa56000] ................ 00000000 00000000 fffffffe 00000000 [0x8fa56010] .r.............. 81177200 00000001 00000000 00000009 [0x8fa56020] ............<. 00000001 00000089 00000001 91a03ca0 [0x8fa56030] d.. 8018af64
I then tried adding panics to switch_sem_etc with id 0. First hit was early in the boot loader:
stack trace for thread 9 "heap grower" kernel stack: 0x80187000 to 0x8018b000 frame caller <image>:function + offset 0 8018ac08 (+ 48) 8005dad1 <kernel_x86> invoke_debugger_command() + 0x00f5 1 8018ac38 (+ 64) 8005d8c1 <kernel_x86> invoke_pipe_segment__FP21debugger_command_pipelPc() + 0x0079 2 8018ac78 (+ 64) 8005dc48 <kernel_x86> invoke_debugger_command_pipe() + 0x009c 3 8018acb8 (+ 48) 8005f1d0 <kernel_x86> _ParseCommandPipe__16ExpressionParserRi() + 0x0234 4 8018ace8 (+ 64) 8005e60a <kernel_x86> EvaluateCommand__16ExpressionParserPCcRi() + 0x02ba 5 8018ad28 (+ 224) 800605f8 <kernel_x86> evaluate_debug_command() + 0x0088 6 8018ae08 (+ 64) 8005ba92 <kernel_x86> kernel_debugger_loop__Fv() + 0x01ae 7 8018ae48 (+ 32) 8005c8fd <kernel_x86> kernel_debugger() + 0x004d 8 8018ae68 (+ 192) 8005c8a5 <kernel_x86> panic() + 0x0029 9 8018af28 (+ 80) 80047fdf <kernel_x86> switch_sem_etc() + 0x012f 10 8018af78 (+ 48) 80047e58 <kernel_x86> acquire_sem() + 0x001c 11 8018afa8 (+ 48) 8003a9a6 <kernel_x86> heap_grow_thread__FPv() + 0x0016 12 8018afd8 (+ 32) 80052b17 <kernel_x86> _create_kernel_thread_kentry__Fv() + 0x001b 13 8018aff8 (+2145865736) 80052ab4 <kernel_x86> thread_kthread_exit__Fv() + 0x0000 kdebug> sem 0 sem "0" doesn't exist! kdebug> symbol sSems 0x80117614 4 kernel_x86:sSems 0x80117618 1 kernel_x86:sSemsActive 0x80117624 4 kernel_x86:sSemsSpinlock kdebug> dw 0x80117614 [0x80117614] .`¥.....ìg¥.Ì_.. 8fa56000 00000001 8fa567ec 900d5fcc kdebug> dw 0x8fa56000 13 [0x8fa56000] ................ 00000000 00000001 00000000 00000000 [0x8fa56010] .r.............. 81177200 00000001 00000000 00000000 [0x8fa56020] ................ 00000000 00000000 00000000 00000000 [0x8fa56030] .... 00000000
The heap grower calls switch_sem_etc with id 0 quite a few times, though this seems to not prevent the boot process from moving along. Next ideas?
comment:21 by , 16 years ago
Ah...after some further investigation, Radeon_InitCP is only called if DMA acceleration is enabled (it's disabled by default since it doesn't work correctly on some chipsets, including mine (X800)). If I force the driver to enable DMA acceleration, then the hang doesn't happen, but I consequently have corrupted graphics. I'm now even more confused as to why hrev28244 affected it at all, since it'd appear as if it might've been trying to lock an invalid/uninitialized semaphore all along.
comment:22 by , 16 years ago
With respect to sem ids there seems to be an inconsistency between the kernel debugger dump_sem_info command and switch_sem_etc with respect to validity checking by the way. switch_sem_etc automatically assumes a sem is invalid only if its id is < 0 (sem.cpp line 728), whilst dump_sem_info only checks the semaphore table if the id is > 0 (sem.cpp line 224). So is 0 an allowed semaphore id or not?
comment:23 by , 16 years ago
Component: | System/Kernel → Drivers/Graphics/radeon |
---|---|
Owner: | changed from | to
0 is definitely allowed, though I was convinced that the first semaphore ID we give out is 1. Obviously I was wrong; the first semaphore created (for the heap grower) has ID 0. The check in the debugger command function was incorrect.
That explains pretty much everything. The heap grower semaphore is only released when the heap is in need for growing, which may just not happen at that point. Before hrev28244 the driver might have gotten lucky in that heap growing was necessary and the semaphore was released. hrev28244 has probably changed the heap usage just enough that this doesn't happen anymore.
At any rate, this is obviously a bug in the driver. Changing component and hoping someone familiar with the driver will fix it.
follow-up: 25 comment:24 by , 16 years ago
Maybe we should change the kernel to give out IDs starting with 1 only? IIRC we already changed this for some other things.
Rene, if you want to have a look at the Radeon driver, please do so, if not, I could take care of it next week.
comment:25 by , 16 years ago
Replying to axeld:
Maybe we should change the kernel to give out IDs starting with 1 only? IIRC we already changed this for some other things.
Rene, if you want to have a look at the Radeon driver, please do so, if not, I could take care of it next week.
Fixed in hrev28295, let me know if you see any problems with my solution. It works correctly here in any case. Much thanks to all of you for your patience helping track this down, I promise I won't need as much handholding with the kernel debugger eventually :)
comment:26 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Boot syslog leading up to hang