Opened 15 years ago

Closed 15 years ago

Last modified 15 years ago

#4399 closed bug (fixed)

[kernel] reboot before 1st icon after BIOS update Intel DQ35JO

Reported by: diver Owned by: axeld
Priority: normal Milestone: R1
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

I've had some problems with BIOS such as slow boot and slow UI redraw and I decided to update it.

Motherboard description
BIOS release notes

BIOS version before update: 0559
BIOS version after update: 1109

I can enter haiku boot loader, but neither safe mode option change anything :-(
I have Ubuntu installed on the same hard drive, but it works ok after bios update.

This board lacks serial port, but as I build haiku myself I could try anything you want to fix it.

Attachments (5)

haiku_loader.zip (112.8 KB ) - added by mmlr 15 years ago.
loader with added pauses after debug lines
pause.diff (621 bytes ) - added by mmlr 15 years ago.
Diff for the paused debug output.
kernel_slow_debug_output.diff (1010 bytes ) - added by mmlr 15 years ago.
boot_mtrr_dump.diff (3.2 KB ) - added by mmlr 15 years ago.
mtrr_uncachable.diff (4.0 KB ) - added by mmlr 15 years ago.

Download all attachments as: .zip

Change History (47)

comment:1 by luroh, 15 years ago

Hi, have you ruled out the cause mentioned in ticket #4325?

in reply to:  1 comment:2 by diver, 15 years ago

Replying to luroh:

Hi, have you ruled out the cause mentioned in ticket #4325?

Sure, I only have 1 partition with haiku, so that's not my case.

comment:3 by diver, 15 years ago

Is there anything I could try before I flash my bios back to get a working haiku again?

comment:4 by mmlr, 15 years ago

I find it very odd that it reboots at that place. From what you tell you can enter the boot loader menu, so the bootloader, including disk access to look for possibly bootable partitions, seems to work. That rules out quite a few parts like ACPI table parsing to detect MP configuration, disk access using BIOS calls, initial VESA setup, or getting loaded in the first place. Some of these parts are known to have issues on some BIOSes. What it doesn't rule out is the initial SMP setup, but if you say no safemode options change anything, including disabling SMP, then I guess this is unlikely as well. The only thing you could do is to enable early debug output by pressing the escape key instead of space at the time the boot loader starts. It should then print out messages from the early boot stages. If you can capture this info using a camera for example, or by just looking at it, you could see if it gets to the kernel at all. The output should look something like this:

APM version 1.2 available, flags 3.
smp: using ACPI to detect MP configuration
smp: local apic address is 0xfee00000
smp: found local APIC with id 0
smp: found io APIC with id 1 and address 0xfec00000
VESA version = 2.0, capabilities 0
OEM string: VGABIOS Cirrus extension
  101: 640 x 480 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1)
  110: 640 x 480 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  111: 640 x 480 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  112: 640 x 480 x 24 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  103: 800 x 600 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1)
  113: 800 x 600 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  114: 800 x 600 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  115: 800 x 600 x 24 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  105: 1024 x 768 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1)
  116: 1024 x 768 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  117: 1024 x 768 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  118: 1024 x 768 x 24 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  107: 1280 x 1024 x 8 (a = 187, mem = 4, phy = f0000000, p = 1, b = 1)
  119: 1280 x 1024 x 15 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
  11a: 1280 x 1024 x 16 (a = 187, mem = 6, phy = f0000000, p = 1, b = 1)
VESA compatible graphics!
Welcome to the Haiku boot loader!
legacy base address 1f0
ATA device, master
number of drives: 1
add_partitions_for(0x001051cc, mountFS = no)
add_partitions_for(fd = 0, mountFS = no)
0x00105320 Partition::Partition
0x00105320 Partition::Scan()
check for partitioning_system: EFI GUID Partition Map
check for partitioning_system: Intel Partition Map
  priority: 500
check for partitioning_system: Intel Extended Partition
boot partition offset: 0
0x00105320 Partition::_Mount check for file_system: BFS Filesystem
load kernel...
video mode: 1024x768x24
smp: found 1 cpu
smp: apic_phys = 0xfee00000
smp: ioapic_phys = 0xfec00000
smp: apic = 0x806ca000
smp: ioapic = 0x806cb000
APIC ticks/sec = 998576174
kernel entry at 80043a50

If you get down to the kernel entry line then it'd be a problem in the kernel. If not it'd be somewhere in the boot loader. The closer you can narrow it down the easier it will be to identify the offending part.

If all of this doesn't yield anything I don't think there's much more you can do. If you are going to flash the BIOS anyway though it would be interesting to know what update caused that behaviour. From the BIOS release notes it looks like you updated from the very first BIOS version straight up to the newest one. This makes pinpointing something specific very hard as there are a lot of changes. If you don't fear killing your BIOS flash binary searching the BIOS version that causes this could help narrow down the issue as well.

comment:5 by diver, 15 years ago

Thanks mmlr, I didn't knew that early debug output trick with Esc key. BTW it would be useful if it would turn onscreen debug output too, is that possible?
Unfortunately the only thing I could see before reboot is "APM version".
So I decided to downgrade BIOS, starting with previous bios version, so...

Last working BIOS version: 1056

I now have working haiku again and almost newest bios revision. Please yell if you want me to try something else, I will flash my bios again if needed.
Thanks again!

comment:6 by mmlr, 15 years ago

The changes from there to newest don't really ring a bell for me. MTRRs were an issue some time ago, but not that early in the boot process anyway. You could build a debug version of the bootloader that adds a delay after outputting a line of info. I did that here to debug similar rebooting issues. If you don't want to build it yourself I could attach such a modified bootloader.

Concerning enabling on-screen debug output, I'm not sure, but I thought it did that some time ago. Could certainly be done and would make sense.

by mmlr, 15 years ago

Attachment: haiku_loader.zip added

loader with added pauses after debug lines

comment:7 by mmlr, 15 years ago

I've attached a loader which adds a one second pause after each line of debug output. Just in case you want to try that again.

by mmlr, 15 years ago

Attachment: pause.diff added

Diff for the paused debug output.

comment:8 by mmlr, 15 years ago

Attached a patch in case you want to build it yourself.

comment:9 by diver, 15 years ago

Ok, I've flashed my bios again to the problematic version and applied your patch.
The last line before reboot is kernel entry at something. So it seems that boot loader is ok, and reboot happens just after kernel is loaded. So now I need a debug version of the kernel? Thanks you so far!

by mmlr, 15 years ago

comment:10 by mmlr, 15 years ago

I've added a patch that enables the on-screen debug output and adds the same 1 second pause per line. Note that the debug output can only be enabled after initializing the vm. So it's possible that the reset happens before the first line. That'd tell us something as well though.

comment:11 by diver, 15 years ago

I've applied your new patch and tested it in qemu to be sure that I did it right and it works (it is!).
Then i've rebooted but as you said reset happened before the first line. Doest is tell you something?

comment:12 by diver, 15 years ago

Sorry for the typos, I wish I could edit my own comments, oh well.

comment:13 by mmlr, 15 years ago

Well it tells me that it happens quite early. The easiest would be to add an infinite loop to the kernel main, then moving it around until you found the subsystem init that causes the problem. If you got to src/system/kernel/main.cpp you see the _start() function. Now add an infinite loop just before the "arch_platform_init" call (just add a line "while(1);"). Then check if this will hang instead of reboot. If it hangs then move it after the "arch_platform_init", then after "debug_init", then after "cpu_init", "cpu_init_percpu", "int_init" and finally "vm_init". Then tell where it starts to reboot instead of hang.

comment:14 by mmlr, 15 years ago

I've just updated my DG33TL to the newest BIOS as that one has pretty similar changes listed in the release notes, but I am not able to reproduce the issue over here.

comment:15 by diver, 15 years ago

Here is what I did:

Index: src/system/kernel/main.cpp
===================================================================
--- src/system/kernel/main.cpp  (revision 32949)
+++ src/system/kernel/main.cpp  (working copy)
@@ -102,6 +102,7 @@
        // if we're not a boot cpu, spin here until someone wakes us up
        if (smp_trap_non_boot_cpus(currentCPU)) {
                // init platform
+               while(1); // diver
                arch_platform_init(&sKernelArgs);

                // setup debug output

Testing it in qemu make it wait forever showing boot logo.
Testing on real iron make it show boot logo (it didn't before!) BUT!!! after 5 seconds reboot happens.

comment:16 by mmlr, 15 years ago

Very odd. Since the boot CPU is spinning, and the other CPUs should be trapped as well, this should definitely not happen. Since this only happens with that BIOS version applied and other OSes don't seem to have a problem, broken hardware is unlikely. If you move the infinite loop before that (smp_trap_non_boot_cpus), does it then at least hang solid? If not then there's something wrong with SMP setup maybe so that not all processors enter the kernel correctly. If it hangs, then something with the inter-CPU communication is messed up. If it doesn't, please try to move the loop upwards until you find out where it stops rebooting.

comment:17 by diver, 15 years ago

Placing

while(1);

before

smp_set_num_cpus(bootKernelArgs->num_cpus);

make it hang solid and doesn't reboot.

comment:18 by mmlr, 15 years ago

That's interesting. This call doesn't actually do much. It just sets the static variable to the CPU count. Maybe the CPU count is messed up? You could do something like:

for (uint32 i = 0; i < bootKernelArgs->num_cpus; i++) spin(10 * 1000 * 1000);

This would add a 10 second delay for every CPU. Then you could simply time the time it takes to reboot and deduce the count from that. Can you also post a picture of the early debug output you get when pressing escape? The numbers should also be there.

comment:19 by diver, 15 years ago

With this change qemu and real hardware are stuck showing boot logo.
Am I doing something worng here?

svn diff

Index: src/system/kernel/main.cpp
===================================================================
--- src/system/kernel/main.cpp  (revision 32984)
+++ src/system/kernel/main.cpp  (working copy)
@@ -83,7 +83,7 @@
                        "kernel!\n");
                return -1;
        }
-
+for (uint32 i = 0; i < bootKernelArgs->num_cpus; i++) spin(10 * 1000 * 1000);
        smp_set_num_cpus(bootKernelArgs->num_cpus);

        // wait for all the cpus to get here

comment:20 by mmlr, 15 years ago

If the CPU count is messed up then it will likely wait for a very long time. I'm pretty sure this is the case. The reboot then happens because of spinlock debugging that panics if it can't get a spinlock for a long time (5 seconds in your case). Can you please get a picture of the initial debug output (by pressing escape) or transcribe here all the CPU/SMP related info. Relevant are CPUs found, APIC addresses and such. Everything with the "smp:" prefix basically.

comment:21 by diver, 15 years ago

svn up

At revision 33013.

svn diff

Index: src/system/kernel/main.cpp
===================================================================
--- src/system/kernel/main.cpp  (revision 33010)
+++ src/system/kernel/main.cpp  (working copy)
@@ -83,7 +83,7 @@
                        "kernel!\n");
                return -1;
        }
-
+while(1);
        smp_set_num_cpus(bootKernelArgs->num_cpus);

        // wait for all the cpus to get here
Index: src/system/boot/platform/bios_ia32/console.cpp
===================================================================
--- src/system/boot/platform/bios_ia32/console.cpp      (revision 33010)
+++ src/system/boot/platform/bios_ia32/console.cpp      (working copy)
@@ -71,9 +71,10 @@
                return bufferSize;

        for (uint32 i = 0; i < bufferSize; i++) {
-               if (string[0] == '\n')
+               if (string[0] == '\n') {
                        sScreenOffset += sScreenWidth - (sScreenOffset % sScreenWidth);
-               else
+                       spin(1 * 1000 * 1000);
+               } else
                        sScreenBase[sScreenOffset++] = sColor | string[0];

                if (sScreenOffset >= sScreenWidth * sScreenHeight)

Initial debug output

APM version 1.2 available, flags 3.
smp: using acpi to detect MP configuration
smp: local apic address os 0xfee00000
smp: found local APIC with id 0
smp: found local APIC with id 2
smp: found local APIC with id 1
smp: found local APIC with id 3
smp: found io APIC with id 2 and address 0xfec00000

At this point screen is cleared (ouch!) and starts again with vesa and partition scanning stuff, so it's skipped.

load kernel...
smp: found 4 cpus
smp: apic_phys = 0xfee00000
smp: ioapic_phys = 0xfee00000
smp: apic = 0x845c000
smp: ioapic = 0x845d000
APIC ticks/sec = 332394184
trampolining other cpus
wait for delivery
deassert INIT
wait for delivery
num startup = 2
send STARTUP
wait for delivery
send STARTUP
wait for delivery
wait for delivery
deassert INIT
wait for delivery
num startup = 2
send STARTUP
wait for delivery
send STARTUP
wait for delivery
wait for delivery
deassert INIT
wait for delivery
num startup = 2
send STARTUP
wait for delivery
send STARTUP
wait for delivery
done trampolining
kernel entry at 80043e5c

HANG

comment:22 by mmlr, 15 years ago

And this really is a quad core? It looks pretty much fine so far. I'm a bit clueless. If the CPU count gets messed up this only leaves some memory corruption or an incompatible kernel args structure. But since it doesn't happen with the older BIOS this would seem very strange.

comment:23 by diver, 15 years ago

Yes, it's quad core, see ticket #4427. Anything else I could try?

comment:24 by mmlr, 15 years ago

I can only imagine it to be a MTRR issue. If some of the variables involved in startup code are set to cached then this can cause problems as soon as more than one CPU is involved. Adding an MTRR dump to the early output could verify that. Let me prepare a patch.

by mmlr, 15 years ago

Attachment: boot_mtrr_dump.diff added

comment:25 by mmlr, 15 years ago

Can you please apply the boot_mtrr_dump patch and capture the early debug output again? It should be the very first thing dumped. Just the lines starting with "mtrr" are of interest.

comment:26 by diver, 15 years ago

Sorry for the long delay, here it is:

mtrr: 7 variable ranges
mtrr: default type: 0xc06 (write-back, variable enabled, fixed enabled)
mtrr: entry 0 : base:  0xcf300000; lenght: 0x1007ff; type: 0 uncacheable
mtrr: entry 1 : base:  0xcf400000; lenght: 0x4007ff; type: 0 uncacheable
mtrr: entry 2 : base:  0xcf800000; lenght: 0x8007ff; type: 0 uncacheable
mtrr: entry 3 : base:  0xd0000000; lenght: 0x100007ff; type: 0 uncacheable
mtrr: entry 4 : base:  0xe0000000; lenght: 0x200007ff; type: 0 uncacheable
mtrr: entry 5 : base:  0xcf278000; lenght: 0x17ff; type: 6 write-back
mtrr: entry 6 : empty

by mmlr, 15 years ago

Attachment: mtrr_uncachable.diff added

comment:27 by mmlr, 15 years ago

It looks like the MTRRs are basically fine. It seems common for newer Intel BIOSes to use the "cachable default and then mask out uncachable parts"-approach. Traditionally this has been the other way around. But since the ranges marked uncacheable look fine it shouldn't pose a problem. I've added another patch that sets the default to uncacheable in any case. It likely won't change anything as it will only update the MTRRs on the boot CPU which probably isn't at fault here. A possibility is that the non-boot CPUs have set their MTRRs differently or even invalid, which is hard to debug or correct though.

Still it'd seem strange that the CPU count would go corrupt due to caching behaviour, as the values are filled from the boot CPU which seems to be fine in this regard. I see the problem is with the spin loop as spin does relay on system_time() which isn't really set up this early. I'm running out of options on how the get debug output from such a stage though. There's no serial so the early output won't go anywhere and there is no on-screen debug yet either. What you could try is adding a reboot in case an unexpected CPU count is determined like so:

if (bootKernelArgs->num_cpus != 4) panic("will reset");
while (true);

Adding that to the place you had the loop before in _start(). This will either panic (and therefore reboot because nothing is setup to handle the panic) or it will hang. If it hangs then the cpu count is setup fine, if it reboots something's up.

comment:28 by diver, 15 years ago

After adding

if (bootKernelArgs->num_cpus != 4) panic("will reset");
while (true);

It hangs now. Should I apply mtrr_uncachable.diff?

comment:29 by diver, 15 years ago

ping!

comment:30 by jackburton, 15 years ago

Is it correct that the apic_phys and ioapic_phys are the same address ? I mean:

smp: apic_phys = 0xfee00000
smp: ioapic_phys = 0xfee00000

comment:31 by diver, 15 years ago

It could be a type, do you want me to check again?

comment:32 by diver, 15 years ago

aghrrr, type==typo

comment:33 by jackburton, 15 years ago

So you typed that stuff yourself ? Sorry, I thought you got that from serial output. Then could be a typo, indeed.

comment:34 by diver, 15 years ago

mmlr, is there anything else I could do? Current revision is still no-go.

comment:35 by diver, 15 years ago

Version: R1/pre-alpha1R1/Development

comment:36 by diver, 15 years ago

Just my usual nudging once in 3 month :-)

in reply to:  36 comment:37 by mmlr, 15 years ago

Replying to diver:

Just my usual nudging once in 3 month :-)

I'm still rather clueless, sorry. I presume there hasn't been another BIOS update yet?

comment:38 by diver, 15 years ago

There is another update - 1126, but it behaves the same unfortunately.

comment:39 by bonefish, 15 years ago

It doesn't seem related, but I guess it doesn't harm to test after the changes made in hrev35726 and hrev35736.

comment:40 by diver, 15 years ago

Awesome! I just updated to hrev35758 and it doesn't reboot anymore! It KDL'ed somewhere near the last icon with some MTRR message, but this would be another problem. Please close this one. Thanks a lot!

comment:41 by bonefish, 15 years ago

Resolution: fixed
Status: newclosed

Thanks for the update!

comment:42 by bonefish, 15 years ago

Forgot to mention: The MTRR panic() should be continuable (might happen several times till fully booted, though). When you open a new ticket for it, a syslog would be helpful.

Note: See TracTickets for help on using tickets.