#4360 closed bug (fixed)

KDL during bootscript on waitfor on SMP x86 machine

Reported by:	phoudoin	Owned by:	axeld
Priority:	blocker	Milestone:	R1
Component:	System/Kernel	Version:	R1/pre-alpha1
Keywords:		Cc:
Blocked By:		Blocking:
Platform:	x86

Description

For the past few days, any Haiku x86gcc4hybrid (ATA stack) built from trunk has dropped into KDL at bootscript time: the default desktop screen shows up (app_server started), immediately followed by this KDL message:

PANIC: page fault, but interrupts were disabled. 
Touching address 0x00020036 from eip 0x800d6d79

Welcome to Kernel Debugging Land...
Thread 65 "waitfor", on CPU 0.
kdebug>

Seen again this morning with hrev32762, but it has been the case for a few days already. Trying every safe option except "disable SMP" changes nothing. With only the "disable SMP" option, everything works fine. Except SMP ;-).

Haiku is installed natively. The machine is a Core 2 Quad Q6600 @ 2.4 GHz with a Radeon HD4870 (-> vesa driver), 2 SATA disks (controller in SATA mode), a USB keyboard and mouse, a USB Volito tablet and a USB fingerprint reader.

Unfortunately, my Logitech Illuminated USB keyboard, connected to the onboard EHCI/UHCI controller, doesn't work while in KDL, so I can't investigate further from KDL :-\

I'll check with a r1a1 raw image tonight.

Change History (13)

follow-up: 2 comment:1 by axeld, 16 years ago

It would be very interesting to know which revision brought that regression.

in reply to: 1 comment:2 by phoudoin, 16 years ago

Replying to axeld:

It would be very interesting to know which revision brought that regression.

I don't know exactly, but it was within the last two weeks, no more.

I'll test tonight with a clean r1a1 image written to my boot partition. Maybe I'm just doing something nuts in my UserBuildConfig or build config or, worse, I just didn't notice that svn update merged some forgotten but evil changes I keep secret (for good reason, this will be the proof!) deep in my working copy.

follow-up: 4 comment:3 by mmlr, 16 years ago

That's interesting. Marcus reported having to disable SMP on his quad as well while he was testing ATA yesterday. I have a Q6600 here as well, it works fine, but is only at hrev32438 right now. I'll use my second partition and update that one to a current revision and see if I can reproduce it.

The keyboard not working in KDL is unfortunate, but it is caused by the fact that usb_hid needs to be active to register the USB keyboard. This is only the case after the input_server has been started and usb_hid is in use. Besides that, there's a bug somewhere that causes USB keyboards to not work if you don't enter KDL manually first. That one is solvable though and on my TODO list.

in reply to: 3 comment:4 by phoudoin, 16 years ago

I'll plug in my old PS/2 keyboard so I can report more (hopefully) useful info on that KDL.

follow-up: 6 comment:5 by phoudoin, 16 years ago

Also, not that I'm paranoid or nothing, but I'll try to revert hrev32503 and hrev32554.

in reply to: 5 comment:6 by phoudoin, 16 years ago

Replying to phoudoin:

Also, not that I'm paranoid or nothing, but I'll try to revert hrev32503 and hrev32554.

Changed nothing.

comment:7 by phoudoin, 16 years ago

A PS/2 keyboard, a second computer and a USB-serial adapter gave me more info:

allocate MTRR slot 0, base = 7ff00000, length = 100000, type=0x0
allocate MTRR slot 1, base = 0, length = 80000000, type=0x6
kernel debugger extension "debugger/disasm/v1": loaded
kernel debugger extension "debugger/hangman/v1": loaded
kernel debugger extension "debugger/invalidate_on_exit/v1": loaded
kernel debugger extension "debugger/run_on_exit/v1": loaded
kernel debugger extension "debugger/usb_keyboard/v1": loaded
allocate MTRR slot 2, base = e0000000, length = 800000, type=0x1
acpi: ACPI disabled
ahci: ahci_supports_device
PANIC: page fault, but interrupts were disabled. Touching address 0x00020036 from eip 0x800d6d79

Welcome to Kernel Debugging Land...
Thread 65 "waitfor" running on CPU 0
kdebug> sc
stack trace for thread 65 "waitfor"
    kernel stack: 0x8020e000 to 0x80212000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 80211ab0 (+  32) 80065155   <kernel_x86> invoke_command_trampoline(void*: 0x80211b30) + 0x0015
 1 80211ad0 (+  12) 800caec3   <kernel_x86>:arch_debug_call_with_fault_handler + 0x001b
 2 80211adc (+  48) 8006349b   <kernel_x86>:debug_call_with_fault_handler + 0x004c
 3 80211b0c (+  64) 80065523   <kernel_x86>:invoke_debugger_command + 0x00bb
 4 80211b4c (+  48) 80065640   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x8011b5c2, int32: 0, char*: NULL) + 0x0083
 5 80211b7c (+  32) 80065708   <kernel_x86>:invoke_debugger_command_pipe + 0x008b
 6 80211b9c (+ 128) 800695e7   <kernel_x86> ExpressionParser<0x80211c6c>::_ParseCommandPipe(int&: 0x80211c68) + 0x0aa3
 7 80211c1c (+  48) 8006bd87   <kernel_x86> ExpressionParser<0x80211c6c>::EvaluateCommand(char const*: 0x8011b5c0 "sc", int&: 0x80211c68) + 0x06d5
 8 80211c4c (+ 192) 8006bf00   <kernel_x86>:evaluate_debug_command + 0x0084
 9 80211d0c (+  96) 80064495   <kernel_x86> kernel_debugger_internal(char const*: 0x819f4800 "", int32: -2145313176) + 0x0395
10 80211d6c (+  16) 8006461c   <kernel_x86>:kernel_debugger + 0x003f
11 80211d7c (+ 160) 800646d9   <kernel_x86>:panic + 0x002a
12 80211e1c (+  64) 800c7ebe   <kernel_x86> page_fault_exception(iframe*: 0x80211e68) + 0x011e
13 80211e5c (+  12) 800cb26d   <kernel_x86>:int_bottom + 0x003d
kernel iframe at 0x80211e68 (end = 0x80211eb8)
 eax 0x20036        ebx 0x20002         ecx 0x0          edx 0x80120060
 esi 0x80211f24     edi 0x20036         ebp 0x80211ec4   esp 0x80211e9c
 eip 0x800d6d79  eflags 0x10086
 vector: 0xe, error code: 0x0
14 80211e68 (+  92) 800d6d79   <kernel_x86>:strcmp + 0x0011
15 80211ec4 (+  64) 80058cb7   <kernel_x86>:find_thread + 0x006c
16 80211f04 (+  64) 80058d7f   <kernel_x86>:_user_find_thread + 0x0049
17 80211f44 (+ 100) 800cb4a2   <kernel_x86>:handle_syscall + 0x00af
user iframe at 0x80211fa8 (end = 0x80212000)
 eax 0x2d           ebx 0x2c5e48        ecx 0x7ffeef1c   edx 0xffff0114
 esi 0x7ffef538     edi 0x7ffef544      ebp 0x7ffeef38   esp 0x80211fdc
 eip 0xffff0114  eflags 0x216      user esp 0x7ffeef1c
 vector: 0x63, error code: 0x0
18 80211fa8 (+   0) ffff0114   <commpage>:commpage_syscall + 0x0004
19 7ffeef38 (+  48) 00200813   <_APP_>:main + 0x0057
20 7ffeef68 (+  52) 0020069d   <_APP_>:_start + 0x0051
21 7ffeef9c (+  64) 0010525b   </boot/system/runtime_loader@0x00100000>:unknown + 0x525b
22 7ffeefdc (+   0) 7ffeefec   1095:waitfor_main_stack@0x7efef000 + 0xffffec
kdebug> teams
team           id  parent      name
0x811b7000      1  0x00000000  kernel_team
0x811b7330     64  0x811b7198  registrar
0x811b74c8     65  0x811b7198  waitfor
0x811b7198     55  0x811b7000  sh
kdebug> threads
thread         id  state     wait for   object  cpu pri  stack      team  name
0x819f8000     31  waiting   sem           115    -  20  0x809c3000    1  uhci finish thread
0x819f8800     32  waiting   sem           116    -  10  0x809c7000    1  uhci cleanup thread
0x801241a0      1  ready             -            -   0  0x80201000    1  idle thread 1
0x848fe000     64  ready             -            -  10  0x80189000   64  registrar
0x819f9000     33  waiting   sem           123    -  20  0x809cb000    1  uhci isochronous finish thread
0x80124780      2  running           -            2   0  0x80980000    1  idle thread 2
0x819f4800     65  running           -            0  10  0x8020e000   65  waitfor
0x81a0a000     34  waiting   sem           128    -  20  0x809d0000    1  uhci finish thread
0x80120060 -1073430524  UNKNOWN           -
[*** READ FAULT at 0xd508d508, pc: 0x800594ca ***]
kdebug>

Something is corrupting the thread list.
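
As a reading aid for the dump above, here is a minimal Python sketch that reinterprets two of the logged values. It assumes only the standard x86 EFLAGS layout, where bit 9 is the interrupt-enable flag, and 32-bit two's-complement integers. The helper names `as_u32_hex` and `interrupts_enabled` are made up for illustration and are not part of Haiku or KDL.

```python
# Reading aid only: decodes values copied from the KDL output above.
# Helper names are hypothetical and do not come from the Haiku sources.

EFLAGS_IF = 1 << 9  # x86 interrupt-enable flag (bit 9 of EFLAGS)


def as_u32_hex(value: int) -> str:
    """Reinterpret a signed 32-bit integer as unsigned and format it as hex."""
    return f"0x{value & 0xFFFFFFFF:08x}"


def interrupts_enabled(eflags: int) -> bool:
    """Return True if the interrupt-enable flag is set in an EFLAGS value."""
    return bool(eflags & EFLAGS_IF)


if __name__ == "__main__":
    # Thread id printed for the bogus entry at 0x80120060 in the "threads" list.
    print(as_u32_hex(-1073430524))      # 0xc004c004
    # eflags from the kernel iframe (at the faulting strcmp).
    print(interrupts_enabled(0x10086))  # False
    # eflags from the user iframe (at the syscall entry).
    print(interrupts_enabled(0x216))    # True
```

Read this way, the bogus thread id is the repeating pattern 0xc004c004, which looks like garbage rather than a real id. The kernel iframe has the interrupt flag cleared while the user iframe has it set, which matches the "interrupts were disabled" wording of the panic. Neither observation says where the corruption comes from.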

follow-ups: 9 11 comment:8 by mmlr, 16 years ago

I've seen the exact same panic happen here when using a kernel revision that was incompatible with libroot concerning the size of the DIR cookie, due to the recent addition of seekdir/telldir support fields. Is it possible that your kernel is out of sync somehow? Did you update the kernel separately (like I do often)?

in reply to: 8 comment:9 by phoudoin, 16 years ago

Replying to mmlr:

Is it possible that your kernel is out of sync somehow? Did you update the kernel separately (like I do often)?

No. Only when I tried to revert the scheduler changes. But the above KDL output was from a hrev32798 gcc4 build with a full svn update of my working copy (no changes pending) and jam -qa @disk run directly against the target partition. Kernel and libroot should be in sync in that case, right?

I still have to test with a nightly build r1a1 gcc2 raw image, BTW.

comment:10 by mmlr, 16 years ago

Yeah, you should be fine there. You could try reverting just hrev32679, just in case.

in reply to: 8 comment:11 by bonefish, 16 years ago

Replying to mmlr:

I've seen the exact same panic happen here when using a kernel revision that was incompatible with libroot concerning the size of the DIR cookie, due to the recent addition of seekdir/telldir support fields. Is it possible that your kernel is out of sync somehow? Did you update the kernel separately (like I do often)?

DIR cookies never cross the libroot-kernel boundary, so regarding this change it really shouldn't matter if kernel and userland are not in sync.

The stack trace suggests that a thread structure, or the thread table itself, is corrupt. So a scheduler-related problem seems more likely, particularly since things work fine with SMP disabled. That hrev32503 also has the problem speaks against the theory. Are you sure you have correctly updated to that revision (to be safe, the complete system, so that you don't miss any of kernel, boot loader, runtime loader or any lib that does syscalls)?

follow-up: 13 comment:12 by marcusoverhagen, 16 years ago

This is probably fixed in hrev32817. Please test.

in reply to: 12 comment:13 by phoudoin, 16 years ago

Resolution:	→ fixed
Status:	new → closed

Replying to marcusoverhagen:

This is probably fixed in hrev32817. Please test.

Confirmed. Nice job.
