Opened 13 years ago

Closed 13 years ago

#8085 closed bug (fixed)

SMI storm on USB handover on AMD 970/SB950 AMD AM3+ UEFI motherboard

Reported by: kallisti5 Owned by: mmlr
Priority: normal Milestone: R1/alpha4
Component: System/Kernel Version: R1/Development
Keywords: SB950, UEFI, AHCI, APIC, IRQ Cc: greggd@…
Blocked By: Blocking: #8456
Platform: All

Description (last modified by kallisti5)

Booting latest hrev43091 gcc4 image results in multiple problems on an Asus M5A97 EVO UEFI mainboard.

last log message seen before lockup:

ACPI Error: [RAMB] Namespace lookup failure, AE_NOT_FOUND
.
.
usb ohci -1: smm is in control of the host controller

This board doesn't have a serial port built-in so logs will be difficult given the lock up is so early.

Disabling APIC and ACPI gives a continuous loop of...

ahci: AHCIPort::InterruptErrorHandler port 0, fCommandsActive 0x00000000, is 0x00000040, ci 0x00000000
ahci: ssts 0x00000001
ahci: sctl 0x00000301
ahci: serr 0x40800000
ahci: sact 0x00000000
ahci: Port Connect Change
ahci: AHCIPort::InterruptErrorHandler port 0, fCommandsActive 0x00000000, is 0x00000040, ci 0x00000000
ahci: ssts 0x00000000
ahci: sctl 0x00000301
ahci: serr 0x40800000
ahci: sact 0x00000000
ahci: Port Connect Change

This may reflect another issue completely and may not be related.

Attachments (14)

normalboot1.jpg (235.7 KB ) - added by kallisti5 13 years ago.
error on normal boot top
normalboot2.jpg (168.6 KB ) - added by kallisti5 13 years ago.
error on normal boot bottom
acpi+apic-disabled.jpg (164.9 KB ) - added by kallisti5 13 years ago.
acpi + apic disabled result in looping ahci errors
lspciverbose.txt (19.5 KB ) - added by kallisti5 13 years ago.
verbose lspci output from linux on SB950 board
dmidecode.txt (14.1 KB ) - added by kallisti5 13 years ago.
dmidecode output
dsdt (31.2 KB ) - added by kallisti5 13 years ago.
system dsdt obtained from linux
normalboot-ahci-fail.log (33.9 KB ) - added by kallisti5 13 years ago.
normal boot - ahci - failure lockup
bootlog-noapic-fail.log (107.1 KB ) - added by kallisti5 13 years ago.
no-local-apic - ahci - ahci boot errors / loop
bootlog-ide-napic-success.log (109.2 KB ) - added by kallisti5 13 years ago.
no-local-apic - IDE - successful boot
debug-stock-boot.cap (69.9 KB ) - added by kallisti5 13 years ago.
stock boot with extra tracing enabled.
debug-ide-no-local-apic.cap (138.5 KB ) - added by kallisti5 13 years ago.
successful boot with no-local-apic and ide, extra tracing enabled.
ohciFix.diff (1001 bytes ) - added by kallisti5 13 years ago.
Haiku version of ReactOS hrev55170
usbOHCIStartupV1.diff (3.1 KB ) - added by kallisti5 13 years ago.
usbOHCIStartupV2.diff (3.2 KB ) - added by kallisti5 13 years ago.
better ensure interrupts are disabled on smm takeover in case of reset

Download all attachments as: .zip

Change History (35)

by kallisti5, 13 years ago

Attachment: normalboot1.jpg added

error on normal boot top

by kallisti5, 13 years ago

Attachment: normalboot2.jpg added

error on normal boot bottom

comment:1 by kallisti5, 13 years ago

Description: modified (diff)
Keywords: AHCI ACPI added

by kallisti5, 13 years ago

Attachment: acpi+apic-disabled.jpg added

acpi + apic disabled result in looping ahci errors

by kallisti5, 13 years ago

Attachment: lspciverbose.txt added

verbose lspci output from linux on SB950 board

by kallisti5, 13 years ago

Attachment: dmidecode.txt added

dmidecode output

by kallisti5, 13 years ago

Attachment: dsdt added

system dsdt obtained from linux

comment:2 by kallisti5, 13 years ago

Just noticed same AE_NOT_FOUND error on linux, so potentially it's not related:

[    0.343440] ACPI Error (psargs-0359): [RAMB] Namespace lookup failure, AE_NOT_FOUND
[    0.343444] ACPI Exception: AE_NOT_FOUND, Could not execute arguments for [RAMW] (Region) (20090903/nsinit-338)

comment:3 by kallisti5, 13 years ago

Component: Drivers/ACPISystem/Kernel
Keywords: APIC IRQ added; ACPI removed
Owner: changed from tqh to axeld
Summary: Severe boot issues on AMD AM3+ UEFI motherboard AMD 970/SB950Interrupt routing issues on AMD 970/SB950 AMD AM3+ UEFI motherboard

comment:4 by kallisti5, 13 years ago

Attaching 3 logs:

  • normal boot - ahci - failure lockup
  • no-local-apic - ahci - ahci boot errors / loop
  • no-local-apic - IDE - successful boot

by kallisti5, 13 years ago

Attachment: normalboot-ahci-fail.log added

normal boot - ahci - failure lockup

by kallisti5, 13 years ago

Attachment: bootlog-noapic-fail.log added

no-local-apic - ahci - ahci boot errors / loop

by kallisti5, 13 years ago

no-local-apic - IDE - successful boot

comment:5 by kallisti5, 13 years ago

Cc: mmlr@… added

I removed ohci from the latest Haiku revision as per mmlr. Haiku now hangs during normal boot on the following message:

module: Search for busses/usb/ohci failed.
add_memory_type_range(137, 0xfeb09000, 0x100, 0)
usb ehci -1: the host controller is bios owned, claiming ownership
usb ehci -1: controller is still bios owned, waiting
Last edited 13 years ago by kallisti5 (previous) (diff)

in reply to:  5 comment:6 by mmlr, 13 years ago

Cc: mmlr@… removed
Owner: changed from axeld to mmlr
Status: newin-progress
Summary: Interrupt routing issues on AMD 970/SB950 AMD AM3+ UEFI motherboardSMI storm on USB handover on AMD 970/SB950 AMD AM3+ UEFI motherboard

Replying to kallisti5:

usb ehci -1: controller is still bios owned, waiting

So again it is the BIOS handover. Both the OHCI and the EHCI handover code is pretty much by the book as per the specs with some tweaks to work around problems from the BSDs and Linux. So I can't say I would know anything obvious to do from here on out. It is obvious that an SMI hits as soon as the legacy handover is attempted for both OHCI and EHCI. There's really nothing we can do about those, as the system management code is priviledged. The only thing we can do is trying to work around the issue that triggers it. Without having the hardware it's very hard to do anything of course, but I'll try to come up with some patches that you could try and see.

BTW it is enough to add the username to the CC field for it to work, but in my case, since I'm on the bugs mailing list getting all ticket changes anyway, it isn't needed to CC me at all.

by kallisti5, 13 years ago

Attachment: debug-stock-boot.cap added

stock boot with extra tracing enabled.

by kallisti5, 13 years ago

Attachment: debug-ide-no-local-apic.cap added

successful boot with no-local-apic and ide, extra tracing enabled.

comment:7 by kallisti5, 13 years ago

Two new debugs attached, with a little extra tracing as seen below...

diff --git a/src/system/boot/platform/bios_ia32/acpi.cpp b/src/system/boot/platform/bios_ia32/acpi.cpp
index d9207fd..d18b689 100644
--- a/src/system/boot/platform/bios_ia32/acpi.cpp
+++ b/src/system/boot/platform/bios_ia32/acpi.cpp
@@ -20,7 +20,7 @@
 #include <arch/x86/arch_acpi.h>
 
 
-//#define TRACE_ACPI
+#define TRACE_ACPI
 #ifdef TRACE_ACPI
 #      define TRACE(x) dprintf x
 #else
diff --git a/src/system/kernel/arch/x86/ioapic.cpp b/src/system/kernel/arch/x86/ioapic.cpp
index 78f6129..00eccb4 100644
--- a/src/system/kernel/arch/x86/ioapic.cpp
+++ b/src/system/kernel/arch/x86/ioapic.cpp
@@ -24,7 +24,7 @@
 #include "acpi.h"
 
 
-//#define TRACE_IOAPIC
+#define TRACE_IOAPIC
 #ifdef TRACE_IOAPIC
 #      define TRACE(x) dprintf x
 #else
diff --git a/src/system/kernel/arch/x86/irq_routing_table.cpp b/src/system/kernel/arch/x86/irq_routing_table.cpp
index a9e46c7..e0ec967 100644
--- a/src/system/kernel/arch/x86/irq_routing_table.cpp
+++ b/src/system/kernel/arch/x86/irq_routing_table.cpp
@@ -16,7 +16,7 @@
 #include <PCI.h>
 
 
-//#define TRACE_PRT
+#define TRACE_PRT
 #ifdef TRACE_PRT
 #      define TRACE(x...) dprintf("IRQRoutingTable: "x)
 #else

comment:8 by cb88, 13 years ago

I have that same hang BIOS handoff hang on my laptop as of the last time I checked should I rebuild a more recent revision and test that out?

I have an 880G laptop which I believe has an rs780 pci bridge I was having a KDL at boot #5815 but that may have cleared up as I am now having the same error as reported above.

usb ehci -1: controller is still bios owned, waiting

comment:9 by kallisti5, 13 years ago

10:08 <@mmlr_thinkpad> it's not that the problem isn't understood, really
10:09 <@mmlr_thinkpad> the bios still has an interrupt handler due to legacy support being still enabled
10:09 <@mmlr_thinkpad> and doing pretty much anything to the controller will result in those being triggered
10:09 <@mmlr_thinkpad> if the bios isn't clever enough to handle the situation then things stall
Last edited 13 years ago by kallisti5 (previous) (diff)

in reply to:  10 comment:11 by umccullough, 13 years ago

Replying to kvdman:

related?

http://svn.reactos.org/svn/reactos/?view=rev&revision=55170

Very possible. Amine Khaldi from ReactOS told me that the USB devs on their end suggested that we apply their patches from hrev55170 and hrev55171

comment:12 by kallisti5, 13 years ago

attaching the Haiku version of ReactOS hrev55170.. although I'm not sure about removing that reset in ReactOS hrev55170. Need to research 55171 as well.

by kallisti5, 13 years ago

Attachment: ohciFix.diff added

Haiku version of ReactOS hrev55170

comment:13 by kallisti5, 13 years ago

patch: 01

comment:14 by kallisti5, 13 years ago

I'm working on this.. there are lots of cases our code doesn't check for.

I'm pretty much implementing what ReactOS does, then I can test on my hardware at home.

comment:15 by mmlr, 13 years ago

While I agree that disabling all interrupts may very well be a problem, not disabling them at all is not an option. The reason why they are disabled is that we don't want to get interrupts while we haven't set up an interrupt handler yet and therefore would cause an interrupt storm on our end when the handover happens and doesn't clear the interrupts. Hence disabling everything besides the ownership change would possibly work. The sequence itself has been carefully crafted, so please don't just shoot into it (it is mostly the same as FreeBSD as the comment there suggests). Checking for the run state, additionally to the interrupt routing state, may make sense as well.

Regarding "many cases we don't handle": Quite a few of these aren't relevant as we always reset the controller. And, as the comment there suggests, this has been noticed as needed on some controllers at least, so just throwing that out isn't necessarily a good idea IMO.

in reply to:  15 ; comment:16 by kallisti5, 13 years ago

Replying to mmlr:

While I agree that disabling all interrupts may very well be a problem, not disabling them at all is not an option. The reason why they are disabled is that we don't want to get interrupts while we haven't set up an interrupt handler yet and therefore would cause an interrupt storm on our end when the handover happens and doesn't clear the interrupts. Hence disabling everything besides the ownership change would possibly work.

Yup.. Attaching the current patch I'm playing with. Don't worry, I wouldn't commit anything like this until extensive testing and review :)

Pretty much I'm waiting until after requesting control of usb from SMM to disable interrupts, or disabling them before a USB resume / reset (depending on the situation)

Last edited 13 years ago by kallisti5 (previous) (diff)

by kallisti5, 13 years ago

Attachment: usbOHCIStartupV1.diff added

by kallisti5, 13 years ago

Attachment: usbOHCIStartupV2.diff added

better ensure interrupts are disabled on smm takeover in case of reset

in reply to:  16 comment:17 by mmlr, 13 years ago

Replying to kallisti5:

Don't worry, I wouldn't commit anything like this until extensive testing and review :)

Yet the patches do exactly the two things I mentioned as not necessarily being a good idea...

Pretty much I'm waiting until after requesting control of usb from SMM to disable interrupts, or disabling them before a USB resume / reset (depending on the situation)

As I said above, this is not really an option. You can't delay it until after requesting control. This is a race condition that depends entirely on how the startup is timed. If there is a device and the SMM is in control, then that device can (and will) cause interrupts. If you request control now you may very well end up with interrupts being delivered before you even return to your code. Those aren't handled and will cause an interrupt storm. Hence why I suggested to leave the code in the place where it is, but exclude disabling of the ownership change interrupt only.

Avoiding the reset is certainly a nice thing considering the delays the reset introduces. But if it doesn't work on some controllers then either we need to blacklist/whitelist them or we should IMO remain with the more conservative approach of always resetting.

Just keep in mind that the ReactOS implementation is new. It is in part based on other implementations (including ours), but basically one has to assume that it isn't broadly tested just yet. So doing the same thing as a younger implementation instead of doing what more established ones came to do in the end just doesn't seem like a reasonable approach to me. Our implementation started out "by the book" originally and the handover code was later adjusted based on input taken from Linux and FreeBSD.

comment:18 by kallisti5, 13 years ago

Yup. I tested the patches last night, and while they enabled us to take control of the usb host controller, it also caused an interrupt storm as you said causing random lockups shortly afterward.

A note to anyone looking at this ticket: don't commit usbOHCIStartupV1/usbOHCIStartupV2

comment:19 by diver, 13 years ago

Blocking: 8456 added

(In #8456) Most likely a dupe of #8085.

comment:20 by HAL, 13 years ago

Cc: greggd@… added

comment:21 by kallisti5, 13 years ago

Milestone: R1R1/alpha4
Resolution: fixed
Status: in-progressclosed

Verified the severe blocker issue no longer exists after hrev44025. I still need to disable AHCI to boot... however that can be another issue (#7562) :)

Note: See TracTickets for help on using tickets.