#4399 closed bug (fixed)
[kernel] reboot before 1st icon after BIOS update Intel DQ35JO
Reported by: | diver | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
I've had some problems with BIOS such as slow boot and slow UI redraw and I decided to update it.
Motherboard description
BIOS release notes
BIOS version before update: 0559
BIOS version after update: 1109
I can enter haiku boot loader, but neither safe mode option change anything :-(
I have Ubuntu installed on the same hard drive, but it works ok after bios update.
This board lacks serial port, but as I build haiku myself I could try anything you want to fix it.
Attachments (5)
Change History (47)
follow-up: 2 comment:1 by , 15 years ago
comment:2 by , 15 years ago
comment:3 by , 15 years ago
Is there anything I could try before I flash my bios back to get a working haiku again?
comment:4 by , 15 years ago
I find it very odd that it reboots at that place. From what you tell you can enter the boot loader menu, so the bootloader, including disk access to look for possibly bootable partitions, seems to work. That rules out quite a few parts like ACPI table parsing to detect MP configuration, disk access using BIOS calls, initial VESA setup, or getting loaded in the first place. Some of these parts are known to have issues on some BIOSes. What it doesn't rule out is the initial SMP setup, but if you say no safemode options change anything, including disabling SMP, then I guess this is unlikely as well. The only thing you could do is to enable early debug output by pressing the escape key instead of space at the time the boot loader starts. It should then print out messages from the early boot stages. If you can capture this info using a camera for example, or by just looking at it, you could see if it gets to the kernel at all. The output should look something like this:
APM version 1.2 available, flags 3. smp: using ACPI to detect MP configuration smp: local apic address is 0xfee00000 smp: found local APIC with id 0 smp: found io APIC with id 1 and address 0xfec00000 VESA version = 2.0, capabilities 0 OEM string: VGABIOS Cirrus extension 101: 640 x 480 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1) 110: 640 x 480 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 111: 640 x 480 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 112: 640 x 480 x 24 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 103: 800 x 600 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1) 113: 800 x 600 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 114: 800 x 600 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 115: 800 x 600 x 24 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 105: 1024 x 768 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1) 116: 1024 x 768 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 117: 1024 x 768 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 118: 1024 x 768 x 24 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 107: 1280 x 1024 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1) 119: 1280 x 1024 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) 11a: 1280 x 1024 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1) VESA compatible graphics! Welcome to the Haiku boot loader! legacy base address 1f0 ATA device, master number of drives: 1 add_partitions_for(0x001051cc, mountFS = no) add_partitions_for(fd = 0, mountFS = no) 0x00105320 Partition::Partition 0x00105320 Partition::Scan() check for partitioning_system: EFI GUID Partition Map check for partitioning_system: Intel Partition Map priority: 500 check for partitioning_system: Intel Extended Partition boot partition offset: 0 0x00105320 Partition::_Mount check for file_system: BFS Filesystem load kernel... video mode: 1024x768x24 smp: found 1 cpu smp: apic_phys = 0xfee00000 smp: ioapic_phys = 0xfec00000 smp: apic = 0x806ca000 smp: ioapic = 0x806cb000 APIC ticks/sec = 998576174 kernel entry at 80043a50
If you get down to the kernel entry line then it'd be a problem in the kernel. If not it'd be somewhere in the boot loader. The closer you can narrow it down the easier it will be to identify the offending part.
If all of this doesn't yield anything I don't think there's much more you can do. If you are going to flash the BIOS anyway though it would be interesting to know what update caused that behaviour. From the BIOS release notes it looks like you updated from the very first BIOS version straight up to the newest one. This makes pinpointing something specific very hard as there are a lot of changes. If you don't fear killing your BIOS flash binary searching the BIOS version that causes this could help narrow down the issue as well.
comment:5 by , 15 years ago
Thanks mmlr, I didn't knew that early debug output trick with Esc key. BTW it would be useful if it would turn onscreen debug output too, is that possible?
Unfortunately the only thing I could see before reboot is "APM version".
So I decided to downgrade BIOS, starting with previous bios version, so...
Last working BIOS version: 1056
I now have working haiku again and almost newest bios revision. Please yell if you want me to try something else, I will flash my bios again if needed.
Thanks again!
comment:6 by , 15 years ago
The changes from there to newest don't really ring a bell for me. MTRRs were an issue some time ago, but not that early in the boot process anyway. You could build a debug version of the bootloader that adds a delay after outputting a line of info. I did that here to debug similar rebooting issues. If you don't want to build it yourself I could attach such a modified bootloader.
Concerning enabling on-screen debug output, I'm not sure, but I thought it did that some time ago. Could certainly be done and would make sense.
comment:7 by , 15 years ago
I've attached a loader which adds a one second pause after each line of debug output. Just in case you want to try that again.
comment:9 by , 15 years ago
Ok, I've flashed my bios again to the problematic version and applied your patch.
The last line before reboot is kernel entry at something. So it seems that boot loader is ok, and reboot happens just after kernel is loaded. So now I need a debug version of the kernel? Thanks you so far!
by , 15 years ago
Attachment: | kernel_slow_debug_output.diff added |
---|
comment:10 by , 15 years ago
I've added a patch that enables the on-screen debug output and adds the same 1 second pause per line. Note that the debug output can only be enabled after initializing the vm. So it's possible that the reset happens before the first line. That'd tell us something as well though.
comment:11 by , 15 years ago
I've applied your new patch and tested it in qemu to be sure that I did it right and it works (it is!).
Then i've rebooted but as you said reset happened before the first line. Doest is tell you something?
comment:13 by , 15 years ago
Well it tells me that it happens quite early. The easiest would be to add an infinite loop to the kernel main, then moving it around until you found the subsystem init that causes the problem. If you got to src/system/kernel/main.cpp you see the _start() function. Now add an infinite loop just before the "arch_platform_init" call (just add a line "while(1);"). Then check if this will hang instead of reboot. If it hangs then move it after the "arch_platform_init", then after "debug_init", then after "cpu_init", "cpu_init_percpu", "int_init" and finally "vm_init". Then tell where it starts to reboot instead of hang.
comment:14 by , 15 years ago
I've just updated my DG33TL to the newest BIOS as that one has pretty similar changes listed in the release notes, but I am not able to reproduce the issue over here.
comment:15 by , 15 years ago
Here is what I did:
Index: src/system/kernel/main.cpp =================================================================== --- src/system/kernel/main.cpp (revision 32949) +++ src/system/kernel/main.cpp (working copy) @@ -102,6 +102,7 @@ // if we're not a boot cpu, spin here until someone wakes us up if (smp_trap_non_boot_cpus(currentCPU)) { // init platform + while(1); // diver arch_platform_init(&sKernelArgs); // setup debug output
Testing it in qemu make it wait forever showing boot logo.
Testing on real iron make it show boot logo (it didn't before!) BUT!!! after 5 seconds reboot happens.
comment:16 by , 15 years ago
Very odd. Since the boot CPU is spinning, and the other CPUs should be trapped as well, this should definitely not happen. Since this only happens with that BIOS version applied and other OSes don't seem to have a problem, broken hardware is unlikely. If you move the infinite loop before that (smp_trap_non_boot_cpus), does it then at least hang solid? If not then there's something wrong with SMP setup maybe so that not all processors enter the kernel correctly. If it hangs, then something with the inter-CPU communication is messed up. If it doesn't, please try to move the loop upwards until you find out where it stops rebooting.
comment:17 by , 15 years ago
Placing
while(1);
before
smp_set_num_cpus(bootKernelArgs->num_cpus);
make it hang solid and doesn't reboot.
comment:18 by , 15 years ago
That's interesting. This call doesn't actually do much. It just sets the static variable to the CPU count. Maybe the CPU count is messed up? You could do something like:
for (uint32 i = 0; i < bootKernelArgs->num_cpus; i++) spin(10 * 1000 * 1000);
This would add a 10 second delay for every CPU. Then you could simply time the time it takes to reboot and deduce the count from that. Can you also post a picture of the early debug output you get when pressing escape? The numbers should also be there.
comment:19 by , 15 years ago
With this change qemu and real hardware are stuck showing boot logo.
Am I doing something worng here?
svn diff
Index: src/system/kernel/main.cpp =================================================================== --- src/system/kernel/main.cpp (revision 32984) +++ src/system/kernel/main.cpp (working copy) @@ -83,7 +83,7 @@ "kernel!\n"); return -1; } - +for (uint32 i = 0; i < bootKernelArgs->num_cpus; i++) spin(10 * 1000 * 1000); smp_set_num_cpus(bootKernelArgs->num_cpus); // wait for all the cpus to get here
comment:20 by , 15 years ago
If the CPU count is messed up then it will likely wait for a very long time. I'm pretty sure this is the case. The reboot then happens because of spinlock debugging that panics if it can't get a spinlock for a long time (5 seconds in your case). Can you please get a picture of the initial debug output (by pressing escape) or transcribe here all the CPU/SMP related info. Relevant are CPUs found, APIC addresses and such. Everything with the "smp:" prefix basically.
comment:21 by , 15 years ago
svn up
At revision 33013.
svn diff
Index: src/system/kernel/main.cpp =================================================================== --- src/system/kernel/main.cpp (revision 33010) +++ src/system/kernel/main.cpp (working copy) @@ -83,7 +83,7 @@ "kernel!\n"); return -1; } - +while(1); smp_set_num_cpus(bootKernelArgs->num_cpus); // wait for all the cpus to get here Index: src/system/boot/platform/bios_ia32/console.cpp =================================================================== --- src/system/boot/platform/bios_ia32/console.cpp (revision 33010) +++ src/system/boot/platform/bios_ia32/console.cpp (working copy) @@ -71,9 +71,10 @@ return bufferSize; for (uint32 i = 0; i < bufferSize; i++) { - if (string[0] == '\n') + if (string[0] == '\n') { sScreenOffset += sScreenWidth - (sScreenOffset % sScreenWidth); - else + spin(1 * 1000 * 1000); + } else sScreenBase[sScreenOffset++] = sColor | string[0]; if (sScreenOffset >= sScreenWidth * sScreenHeight)
Initial debug output
APM version 1.2 available, flags 3. smp: using acpi to detect MP configuration smp: local apic address os 0xfee00000 smp: found local APIC with id 0 smp: found local APIC with id 2 smp: found local APIC with id 1 smp: found local APIC with id 3 smp: found io APIC with id 2 and address 0xfec00000
At this point screen is cleared (ouch!) and starts again with vesa and partition scanning stuff, so it's skipped.
load kernel... smp: found 4 cpus smp: apic_phys = 0xfee00000 smp: ioapic_phys = 0xfee00000 smp: apic = 0x845c000 smp: ioapic = 0x845d000 APIC ticks/sec = 332394184 trampolining other cpus wait for delivery deassert INIT wait for delivery num startup = 2 send STARTUP wait for delivery send STARTUP wait for delivery wait for delivery deassert INIT wait for delivery num startup = 2 send STARTUP wait for delivery send STARTUP wait for delivery wait for delivery deassert INIT wait for delivery num startup = 2 send STARTUP wait for delivery send STARTUP wait for delivery done trampolining kernel entry at 80043e5c
HANG
comment:22 by , 15 years ago
And this really is a quad core? It looks pretty much fine so far. I'm a bit clueless. If the CPU count gets messed up this only leaves some memory corruption or an incompatible kernel args structure. But since it doesn't happen with the older BIOS this would seem very strange.
comment:24 by , 15 years ago
I can only imagine it to be a MTRR issue. If some of the variables involved in startup code are set to cached then this can cause problems as soon as more than one CPU is involved. Adding an MTRR dump to the early output could verify that. Let me prepare a patch.
by , 15 years ago
Attachment: | boot_mtrr_dump.diff added |
---|
comment:25 by , 15 years ago
Can you please apply the boot_mtrr_dump patch and capture the early debug output again? It should be the very first thing dumped. Just the lines starting with "mtrr" are of interest.
comment:26 by , 15 years ago
Sorry for the long delay, here it is:
mtrr: 7 variable ranges mtrr: default type: 0xc06 (write-back, variable enabled, fixed enabled) mtrr: entry 0 : base: 0xcf300000; lenght: 0x1007ff; type: 0 uncacheable mtrr: entry 1 : base: 0xcf400000; lenght: 0x4007ff; type: 0 uncacheable mtrr: entry 2 : base: 0xcf800000; lenght: 0x8007ff; type: 0 uncacheable mtrr: entry 3 : base: 0xd0000000; lenght: 0x100007ff; type: 0 uncacheable mtrr: entry 4 : base: 0xe0000000; lenght: 0x200007ff; type: 0 uncacheable mtrr: entry 5 : base: 0xcf278000; lenght: 0x17ff; type: 6 write-back mtrr: entry 6 : empty
by , 15 years ago
Attachment: | mtrr_uncachable.diff added |
---|
comment:27 by , 15 years ago
It looks like the MTRRs are basically fine. It seems common for newer Intel BIOSes to use the "cachable default and then mask out uncachable parts"-approach. Traditionally this has been the other way around. But since the ranges marked uncacheable look fine it shouldn't pose a problem. I've added another patch that sets the default to uncacheable in any case. It likely won't change anything as it will only update the MTRRs on the boot CPU which probably isn't at fault here. A possibility is that the non-boot CPUs have set their MTRRs differently or even invalid, which is hard to debug or correct though.
Still it'd seem strange that the CPU count would go corrupt due to caching behaviour, as the values are filled from the boot CPU which seems to be fine in this regard. I see the problem is with the spin loop as spin does relay on system_time() which isn't really set up this early. I'm running out of options on how the get debug output from such a stage though. There's no serial so the early output won't go anywhere and there is no on-screen debug yet either. What you could try is adding a reboot in case an unexpected CPU count is determined like so:
if (bootKernelArgs->num_cpus != 4) panic("will reset"); while (true);
Adding that to the place you had the loop before in _start(). This will either panic (and therefore reboot because nothing is setup to handle the panic) or it will hang. If it hangs then the cpu count is setup fine, if it reboots something's up.
comment:28 by , 15 years ago
After adding
if (bootKernelArgs->num_cpus != 4) panic("will reset"); while (true);
It hangs now. Should I apply mtrr_uncachable.diff?
comment:30 by , 15 years ago
Is it correct that the apic_phys and ioapic_phys are the same address ? I mean:
smp: apic_phys = 0xfee00000 smp: ioapic_phys = 0xfee00000
comment:33 by , 15 years ago
So you typed that stuff yourself ? Sorry, I thought you got that from serial output. Then could be a typo, indeed.
comment:34 by , 15 years ago
mmlr, is there anything else I could do? Current revision is still no-go.
comment:35 by , 15 years ago
Version: | R1/pre-alpha1 → R1/Development |
---|
comment:37 by , 15 years ago
Replying to diver:
Just my usual nudging once in 3 month :-)
I'm still rather clueless, sorry. I presume there hasn't been another BIOS update yet?
comment:39 by , 15 years ago
comment:40 by , 15 years ago
Awesome! I just updated to hrev35758 and it doesn't reboot anymore! It KDL'ed somewhere near the last icon with some MTRR message, but this would be another problem. Please close this one. Thanks a lot!
comment:42 by , 15 years ago
Forgot to mention: The MTRR panic() should be continuable (might happen several times till fully booted, though). When you open a new ticket for it, a syslog would be helpful.
Hi, have you ruled out the cause mentioned in ticket #4325?