Opened 13 years ago
Closed 13 years ago
#8085 closed bug (fixed)
SMI storm on USB handover on AMD 970/SB950 AMD AM3+ UEFI motherboard
Reported by: | kallisti5 | Owned by: | mmlr |
---|---|---|---|
Priority: | normal | Milestone: | R1/alpha4 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | SB950, UEFI, AHCI, APIC, IRQ | Cc: | greggd@… |
Blocked By: | Blocking: | #8456 | |
Platform: | All |
Description (last modified by )
Booting latest hrev43091 gcc4 image results in multiple problems on an Asus M5A97 EVO UEFI mainboard.
last log message seen before lockup:
ACPI Error: [RAMB] Namespace lookup failure, AE_NOT_FOUND . . usb ohci -1: smm is in control of the host controller
This board doesn't have a serial port built-in so logs will be difficult given the lock up is so early.
Disabling APIC and ACPI gives a continuous loop of...
ahci: AHCIPort::InterruptErrorHandler port 0, fCommandsActive 0x00000000, is 0x00000040, ci 0x00000000 ahci: ssts 0x00000001 ahci: sctl 0x00000301 ahci: serr 0x40800000 ahci: sact 0x00000000 ahci: Port Connect Change ahci: AHCIPort::InterruptErrorHandler port 0, fCommandsActive 0x00000000, is 0x00000040, ci 0x00000000 ahci: ssts 0x00000000 ahci: sctl 0x00000301 ahci: serr 0x40800000 ahci: sact 0x00000000 ahci: Port Connect Change
This may reflect another issue completely and may not be related.
Attachments (14)
Change History (35)
by , 13 years ago
Attachment: | normalboot1.jpg added |
---|
comment:1 by , 13 years ago
Description: | modified (diff) |
---|---|
Keywords: | AHCI ACPI added |
by , 13 years ago
Attachment: | acpi+apic-disabled.jpg added |
---|
acpi + apic disabled result in looping ahci errors
by , 13 years ago
Attachment: | lspciverbose.txt added |
---|
verbose lspci output from linux on SB950 board
comment:2 by , 13 years ago
Just noticed same AE_NOT_FOUND error on linux, so potentially it's not related:
[ 0.343440] ACPI Error (psargs-0359): [RAMB] Namespace lookup failure, AE_NOT_FOUND [ 0.343444] ACPI Exception: AE_NOT_FOUND, Could not execute arguments for [RAMW] (Region) (20090903/nsinit-338)
comment:3 by , 13 years ago
Component: | Drivers/ACPI → System/Kernel |
---|---|
Keywords: | APIC IRQ added; ACPI removed |
Owner: | changed from | to
Summary: | Severe boot issues on AMD AM3+ UEFI motherboard AMD 970/SB950 → Interrupt routing issues on AMD 970/SB950 AMD AM3+ UEFI motherboard |
comment:4 by , 13 years ago
Attaching 3 logs:
- normal boot - ahci - failure lockup
- no-local-apic - ahci - ahci boot errors / loop
- no-local-apic - IDE - successful boot
by , 13 years ago
Attachment: | bootlog-noapic-fail.log added |
---|
no-local-apic - ahci - ahci boot errors / loop
by , 13 years ago
Attachment: | bootlog-ide-napic-success.log added |
---|
no-local-apic - IDE - successful boot
follow-up: 6 comment:5 by , 13 years ago
Cc: | added |
---|
I removed ohci from the latest Haiku revision as per mmlr. Haiku now hangs during normal boot on the following message:
module: Search for busses/usb/ohci failed. add_memory_type_range(137, 0xfeb09000, 0x100, 0) usb ehci -1: the host controller is bios owned, claiming ownership usb ehci -1: controller is still bios owned, waiting
comment:6 by , 13 years ago
Cc: | removed |
---|---|
Owner: | changed from | to
Status: | new → in-progress |
Summary: | Interrupt routing issues on AMD 970/SB950 AMD AM3+ UEFI motherboard → SMI storm on USB handover on AMD 970/SB950 AMD AM3+ UEFI motherboard |
Replying to kallisti5:
usb ehci -1: controller is still bios owned, waiting
So again it is the BIOS handover. Both the OHCI and the EHCI handover code is pretty much by the book as per the specs with some tweaks to work around problems from the BSDs and Linux. So I can't say I would know anything obvious to do from here on out. It is obvious that an SMI hits as soon as the legacy handover is attempted for both OHCI and EHCI. There's really nothing we can do about those, as the system management code is priviledged. The only thing we can do is trying to work around the issue that triggers it. Without having the hardware it's very hard to do anything of course, but I'll try to come up with some patches that you could try and see.
BTW it is enough to add the username to the CC field for it to work, but in my case, since I'm on the bugs mailing list getting all ticket changes anyway, it isn't needed to CC me at all.
by , 13 years ago
Attachment: | debug-ide-no-local-apic.cap added |
---|
successful boot with no-local-apic and ide, extra tracing enabled.
comment:7 by , 13 years ago
Two new debugs attached, with a little extra tracing as seen below...
diff --git a/src/system/boot/platform/bios_ia32/acpi.cpp b/src/system/boot/platform/bios_ia32/acpi.cpp index d9207fd..d18b689 100644 --- a/src/system/boot/platform/bios_ia32/acpi.cpp +++ b/src/system/boot/platform/bios_ia32/acpi.cpp @@ -20,7 +20,7 @@ #include <arch/x86/arch_acpi.h> -//#define TRACE_ACPI +#define TRACE_ACPI #ifdef TRACE_ACPI # define TRACE(x) dprintf x #else diff --git a/src/system/kernel/arch/x86/ioapic.cpp b/src/system/kernel/arch/x86/ioapic.cpp index 78f6129..00eccb4 100644 --- a/src/system/kernel/arch/x86/ioapic.cpp +++ b/src/system/kernel/arch/x86/ioapic.cpp @@ -24,7 +24,7 @@ #include "acpi.h" -//#define TRACE_IOAPIC +#define TRACE_IOAPIC #ifdef TRACE_IOAPIC # define TRACE(x) dprintf x #else diff --git a/src/system/kernel/arch/x86/irq_routing_table.cpp b/src/system/kernel/arch/x86/irq_routing_table.cpp index a9e46c7..e0ec967 100644 --- a/src/system/kernel/arch/x86/irq_routing_table.cpp +++ b/src/system/kernel/arch/x86/irq_routing_table.cpp @@ -16,7 +16,7 @@ #include <PCI.h> -//#define TRACE_PRT +#define TRACE_PRT #ifdef TRACE_PRT # define TRACE(x...) dprintf("IRQRoutingTable: "x) #else
comment:8 by , 13 years ago
I have that same hang BIOS handoff hang on my laptop as of the last time I checked should I rebuild a more recent revision and test that out?
I have an 880G laptop which I believe has an rs780 pci bridge I was having a KDL at boot #5815 but that may have cleared up as I am now having the same error as reported above.
usb ehci -1: controller is still bios owned, waiting
comment:9 by , 13 years ago
10:08 <@mmlr_thinkpad> it's not that the problem isn't understood, really 10:09 <@mmlr_thinkpad> the bios still has an interrupt handler due to legacy support being still enabled 10:09 <@mmlr_thinkpad> and doing pretty much anything to the controller will result in those being triggered 10:09 <@mmlr_thinkpad> if the bios isn't clever enough to handle the situation then things stall
follow-up: 11 comment:10 by , 13 years ago
comment:11 by , 13 years ago
Replying to kvdman:
related?
http://svn.reactos.org/svn/reactos/?view=rev&revision=55170
Very possible. Amine Khaldi from ReactOS told me that the USB devs on their end suggested that we apply their patches from hrev55170 and hrev55171
comment:12 by , 13 years ago
comment:13 by , 13 years ago
patch: | 0 → 1 |
---|
comment:14 by , 13 years ago
I'm working on this.. there are lots of cases our code doesn't check for.
I'm pretty much implementing what ReactOS does, then I can test on my hardware at home.
follow-up: 16 comment:15 by , 13 years ago
While I agree that disabling all interrupts may very well be a problem, not disabling them at all is not an option. The reason why they are disabled is that we don't want to get interrupts while we haven't set up an interrupt handler yet and therefore would cause an interrupt storm on our end when the handover happens and doesn't clear the interrupts. Hence disabling everything besides the ownership change would possibly work. The sequence itself has been carefully crafted, so please don't just shoot into it (it is mostly the same as FreeBSD as the comment there suggests). Checking for the run state, additionally to the interrupt routing state, may make sense as well.
Regarding "many cases we don't handle": Quite a few of these aren't relevant as we always reset the controller. And, as the comment there suggests, this has been noticed as needed on some controllers at least, so just throwing that out isn't necessarily a good idea IMO.
follow-up: 17 comment:16 by , 13 years ago
Replying to mmlr:
While I agree that disabling all interrupts may very well be a problem, not disabling them at all is not an option. The reason why they are disabled is that we don't want to get interrupts while we haven't set up an interrupt handler yet and therefore would cause an interrupt storm on our end when the handover happens and doesn't clear the interrupts. Hence disabling everything besides the ownership change would possibly work.
Yup.. Attaching the current patch I'm playing with. Don't worry, I wouldn't commit anything like this until extensive testing and review :)
Pretty much I'm waiting until after requesting control of usb from SMM to disable interrupts, or disabling them before a USB resume / reset (depending on the situation)
by , 13 years ago
Attachment: | usbOHCIStartupV1.diff added |
---|
by , 13 years ago
Attachment: | usbOHCIStartupV2.diff added |
---|
better ensure interrupts are disabled on smm takeover in case of reset
comment:17 by , 13 years ago
Replying to kallisti5:
Don't worry, I wouldn't commit anything like this until extensive testing and review :)
Yet the patches do exactly the two things I mentioned as not necessarily being a good idea...
Pretty much I'm waiting until after requesting control of usb from SMM to disable interrupts, or disabling them before a USB resume / reset (depending on the situation)
As I said above, this is not really an option. You can't delay it until after requesting control. This is a race condition that depends entirely on how the startup is timed. If there is a device and the SMM is in control, then that device can (and will) cause interrupts. If you request control now you may very well end up with interrupts being delivered before you even return to your code. Those aren't handled and will cause an interrupt storm. Hence why I suggested to leave the code in the place where it is, but exclude disabling of the ownership change interrupt only.
Avoiding the reset is certainly a nice thing considering the delays the reset introduces. But if it doesn't work on some controllers then either we need to blacklist/whitelist them or we should IMO remain with the more conservative approach of always resetting.
Just keep in mind that the ReactOS implementation is new. It is in part based on other implementations (including ours), but basically one has to assume that it isn't broadly tested just yet. So doing the same thing as a younger implementation instead of doing what more established ones came to do in the end just doesn't seem like a reasonable approach to me. Our implementation started out "by the book" originally and the handover code was later adjusted based on input taken from Linux and FreeBSD.
comment:18 by , 13 years ago
Yup. I tested the patches last night, and while they enabled us to take control of the usb host controller, it also caused an interrupt storm as you said causing random lockups shortly afterward.
A note to anyone looking at this ticket: don't commit usbOHCIStartupV1/usbOHCIStartupV2
comment:20 by , 13 years ago
Cc: | added |
---|
comment:21 by , 13 years ago
Milestone: | R1 → R1/alpha4 |
---|---|
Resolution: | → fixed |
Status: | in-progress → closed |
error on normal boot top