Opened 15 years ago

Last modified 2 months ago

#4555 assigned bug

PANIC: Fatal Exception "NMI Interrupt" occured

Reported by: MasterNetra Owned by: mmlr
Priority: normal Milestone: R1
Component: System/Kernel Version: R1/alpha1
Keywords: boot-failure Cc:
Blocked By: Blocking: #8097, #8741, #11684, #17271, #18506
Platform: x86

Description

At bootup I get this after after dot load: Panic: Fatal Exception "NMI Interrupt" has occured. Error code: 0x0 (I think I got it as displayed, had to memorize it)

Attachments (6)

nmi_syslog (4.2 KB ) - added by Duggan 13 years ago.
snippet of a syslog showing general area where NMI occurs as well as KDL output
nmi_syslog_2 (122.4 KB ) - added by Duggan 13 years ago.
Here's a much larger chunk of syslog that should cover the info you need... Thanks, mmlr!
change_pci_bridge_config_order.diff (458 bytes ) - added by mmlr 13 years ago.
Change the order of error clearing and enabling error reporting.
syslog_lenovo (245.8 KB ) - added by dsjonny 11 years ago.
kdl.jpg (505.4 KB ) - added by luroh 10 years ago.
T42 KDL hrev47815
syslog.txt (338.4 KB ) - added by jens 19 months ago.

Download all attachments as: .zip

Change History (35)

comment:1 by stippi, 15 years ago

Component: SystemSystem/Kernel

These "should" not happen. That's a so called non-maskable interrupt. You should be able to type co[ntinue] at the KDL prompt. Does that work?

comment:2 by MasterNetra, 15 years ago

Actually yes it does.

comment:3 by stippi, 15 years ago

Cool. I am not sure, but IIRC, the NMI panic is kind of a warning, and we may be able to just ignore the NMI. But kernel devs would know this stuff much better than myself.

comment:4 by vodoomoth, 14 years ago

Hope I am not hijacking this ticket but I came to report the same 'PANIC: fatal exception "NMI Interrupt" occurred! Error code 0x0' error and did a search first.

This happened after I inserted the R1Alpha2 (not alpha1 as was the case for the original poster) CD in my laptop in order to install Haiku to a partition created with GParted. That same laptop which will also receive FreeBSD 8.1 and Ubuntu 10.10 in the coming days also features a freshly (re)installed Vista SP2.

Laptop: Fujitsu Siemens Amilo Xi2528, 2 Ghz Core 2 Duo, 2x250 GB HDD (freshly unRAIDed), 2 GB of RAM. This 2008 laptop can't boot any bootable CD from my external DVD drive although it boots from USB keys...

Anyway, I followed the 'press shift for fail-safe mode' tip but the message is the same. I also used "co[ntinue]" as offered by @stippi: command unknown. I ended up using "continue" but this got the PC locked: Ctrl+Alt+Del does nothing and doesn't reboot. Will wait for the next release.

comment:5 by Duggan, 13 years ago

I have this problem as well. I have an HP Pavilion dv6000 series (not sure the exact number) laptop with a Core2 Duo processor. This has been a problem for me as long as I've been booting on native hw. Pre-A3 the NMI occurred after the 3rd icon lit up on boot, post-A3 it now occurs before the first icon lights up. Typing 'es' works and the NMI only pops up once. I'll get a syslog posted up in a day or two and I'll try to hunt down the exact revision where the time the NMI happens changes.

in reply to:  5 comment:6 by mmlr, 13 years ago

Replying to Duggan:

Pre-A3 the NMI occurred after the 3rd icon lit up on boot, post-A3 it now occurs before the first icon lights up. ... and I'll try to hunt down the exact revision where the time the NMI happens changes.

Most probably due to the IO-APIC changes. The IO-APIC isn't causing it, but causes the PCI module to be loaded and initialized earlier in boot which probably triggers this. In either case a syslog would confirm.

comment:7 by mmlr, 13 years ago

Keywords: NMI added
Owner: changed from axeld to mmlr
Status: newassigned
Summary: Unable Run on Dell Latitude D530PANIC: Fatal Exception "NMI Interrupt" has occured

I'm going to generalize this ticket to NMI (presumably at PCI init time) so that it can more easily be found. I'll also try to look into it, but can't really promise anything as I can't reproduce the issue over here.

comment:8 by mmlr, 13 years ago

Blocking: 8097 added

(In #8097) I'm going to close this as a duplicate of #4555, which I just made more general so that it can more easily be found. The stacktrace in this screenshot clearly points at PCI init, so this helps already narrowing it down.

by Duggan, 13 years ago

Attachment: nmi_syslog added

snippet of a syslog showing general area where NMI occurs as well as KDL output

comment:9 by Duggan, 13 years ago

Hey mmlr, I attached a small snipet of my syslog confirming that it's PCI related. If you want the entire syslog let me know and I'll try to get a clean one up soon.

in reply to:  9 comment:10 by mmlr, 13 years ago

Replying to Duggan:

Hey mmlr, I attached a small snipet of my syslog confirming that it's PCI related. If you want the entire syslog let me know and I'll try to get a clean one up soon.

That's fine, the attached one is great, no need to dig further for now. I've browsed through the PCI specs to see if there was anything obvious and saw that the parity error handling might be to blame, as that one can be configured to trigger an SERR assert that'd lead to the NMIs. Though even if that was the case we'd still need to investigate/fix the reason for the parity error in the first place. Also SERR may be asserted in other situations, like incompatible transactions (as in lower bit width busses that can't complete input addressing) or other errors. We might not account for things like that, but I haven't yet looked into our code to check.

by Duggan, 13 years ago

Attachment: nmi_syslog_2 added

Here's a much larger chunk of syslog that should cover the info you need... Thanks, mmlr!

by mmlr, 13 years ago

Change the order of error clearing and enabling error reporting.

comment:11 by mmlr, 13 years ago

Can you please try the attached patch and see if that gets things going? Basically it causes error conditions to be cleared first and only then enables error reporting on the PCI bridges. This may avoid reporting for still pending errors at configuration time that could lead to the NMIs.

comment:12 by Duggan, 13 years ago

Sure :) I'll do that a little later today. Is that a proper fix? I don't know alot about this stuff, but should those errors be handled? Anyway, I'll let you know how it works. Thanks!

comment:13 by Duggan, 13 years ago

Nope, that didn't do it :/

in reply to:  13 comment:14 by mmlr, 13 years ago

Replying to Duggan:

Nope, that didn't do it :/

Ah well, at least that means it's not just some lingering error condition that triggers as soon as error reporting is enabled (or that our error reset doesn't actually work). I guess some more debug output is needed to determine what's going on exactly.

In either case the error condition seems to come from the PCIe port that has the wired ethernet controller attached (PCI device 28 function 5 with secondary bus 8). From the data available alone it doesn't really look any different from the other PCIe ports...

That reminds me though that the error condition clearing only every uses kprintf to output the data to the kernel debugger. What you could do when the NMI kicks in is execute the "pcistatus" KDL command, which should then list any present error conditions before trying to clear them. If you could attach that output (you can simply continue after executing that command) as well, that might give some clue.

comment:15 by diver, 12 years ago

Summary: PANIC: Fatal Exception "NMI Interrupt" has occuredPANIC: Fatal Exception "NMI Interrupt" occured

by dsjonny, 11 years ago

Attachment: syslog_lenovo added

comment:16 by dsjonny, 11 years ago

I have the same problem on 2 Lenovo desktop PC at my workplace. I have attached the syslog.

KERN: PANIC: Fatal exception "NMI Interrupt" occurred! Error code: 0x0
KERN: 
KERN: Welcome to Kernel Debugging Land...
KERN: Thread 1 "idle thread 1" running on CPU 0
KERN: stack trace for thread 1 "idle thread 1"
KERN:     kernel stack: 0x81001000 to 0x81005000
KERN: frame               caller     <image>:function + offset
KERN:  0 81004be8 (+  32) 8013313e   <kernel_x86> arch_debug_stack_trace() + 0x12
KERN:  1 81004c08 (+  16) 80093353   <kernel_x86> stack_trace_trampoline__FPv() + 0x0b
KERN:  2 81004c18 (+  12) 80125bd2   <kernel_x86> arch_debug_call_with_fault_handler() + 0x1b
KERN:  3 81004c24 (+  48) 80094df6   <kernel_x86> debug_call_with_fault_handler() + 0x5e
KERN:  4 81004c54 (+  64) 80093573   <kernel_x86> kernel_debugger_loop__FPCcT0Pcl() + 0x21b
KERN:  5 81004c94 (+  48) 800938d7   <kernel_x86> kernel_debugger_internal__FPCcT0Pcl() + 0x53
KERN:  6 81004cc4 (+  48) 80095182   <kernel_x86> panic() + 0x36
KERN:  7 81004cf4 (+  64) 801344c8   <kernel_x86> x86_fatal_exception() + 0x30
KERN:  8 81004d34 (+  12) 80128680   <kernel_x86> int_bottom() + 0x30
KERN: kernel iframe at 0x81004d40 (end = 0x81004d90)
KERN:  eax 0x817e86f0    ebx 0x816ec434     ecx 0xf0        edx 0xb2
KERN:  esi 0xb2          edi 0x8            ebp 0x81004d90  esp 0x81004d74
KERN:  eip 0x817e864a eflags 0x210046  
KERN:  vector: 0x2, error code: 0x0
KERN:  9 81004d40 (+  80) 817e864a   <pci> pci_write_io_8() + 0x0a
KERN: 10 81004d90 (+  48) 816bf252   <acpi> AcpiOsWritePort() + 0x4a
KERN: 11 81004dc0 (+  48) 816c4f31   <acpi> AcpiHwWritePort() + 0x3d
KERN: 12 81004df0 (+  48) 816c44e0   <acpi> AcpiHwSetMode() + 0x8c
KERN: 13 81004e20 (+  48) 816c30b0   <acpi> AcpiEnable() + 0x30
KERN: 14 81004e50 (+  32) 816cebe2   <acpi> AcpiEnableSubsystem() + 0x22
KERN: 15 81004e70 (+  96) 816bf8e8   <acpi> acpi_std_ops__Fle() + 0x278
KERN: 16 81004ed0 (+  64) 8005fc3b   <kernel_x86> get_module() + 0x15b
KERN: 17 81004f10 (+  80) 80139649   <kernel_x86> ioapic_init__FP11kernel_args() + 0x71
KERN: 18 81004f60 (+  64) 80134a6c   <kernel_x86> arch_int_init_io() + 0x1c
KERN: 19 81004fa0 (+  32) 80059a96   <kernel_x86> int_init_io() + 0x12
KERN: 20 81004fc0 (+  48) 8005c239   <kernel_x86> _start() + 0x279

Using hrev46023 nightly anyboot.

Last edited 11 years ago by dsjonny (previous) (diff)

comment:17 by ptaylor1982, 10 years ago

KERN: Welcome to Kernel Debugging Land... KERN: Thread 3518 "kernel_debugger" running on CPU 0 KERN: kdebug> pcistatusdomain 0, bus 0, dev 0, func 0, PCI device status 0x3090 KERN: Received Master-Abort KERN: Received Target-Abort KERN: domain 0, bus 0, dev 2, func 0, PCI device status 0x0090 KERN: domain 0, bus 0, dev 2, func 1, PCI device status 0x0090 KERN: domain 0, bus 0, dev 27, func 0, PCI device status 0x0018 KERN: domain 0, bus 0, dev 28, func 0, PCI device status 0x0010 KERN: domain 0, bus 0, dev 28, func 0, PCI bridge secondary status 0x2000 KERN: Received Master-Abort KERN: domain 0, bus 0, dev 28, func 0, PCI bridge control 0x0003 KERN: domain 0, bus 0, dev 28, func 1, PCI device status 0x0010 KERN: domain 0, bus 0, dev 28, func 1, PCI bridge secondary status 0x0000 KERN: domain 0, bus 0, dev 28, func 1, PCI bridge control 0x0003 KERN: domain 0, bus 12, dev 0, func 0, PCI device status 0x0018 KERN: domain 0, bus 0, dev 29, func 0, PCI device status 0x0280 KERN: domain 0, bus 0, dev 29, func 1, PCI device status 0x0280 KERN: domain 0, bus 0, dev 29, func 2, PCI device status 0x0280 KERN: domain 0, bus 0, dev 29, func 3, PCI device status 0x0280 KERN: domain 0, bus 0, dev 29, func 7, PCI device status 0x0290 KERN: domain 0, bus 0, dev 30, func 0, PCI device status 0x4810 KERN: Signalled System Error KERN: Signalled Target-Abort KERN: domain 0, bus 0, dev 30, func 0, PCI bridge secondary status 0xb380 KERN: Detected Parity Error KERN: Received Master-Abort KERN: Received Target-Abort KERN: Data Parity Reported KERN: domain 0, bus 0, dev 30, func 0, PCI bridge control 0x0823 KERN: domain 0, bus 2, dev 0, func 0, PCI device status 0x0810 KERN: Signalled Target-Abort KERN: domain 0, bus 2, dev 1, func 0, PCI device status 0x0410 KERN: domain 0, bus 2, dev 1, func 4, PCI device status 0x0218 KERN: domain 0, bus 0, dev 31, func 0, PCI device status 0x0210 KERN: domain 0, bus 0, dev 31, func 2, PCI device status 0x02b8 KERN: domain 0, bus 0, dev 31, func 3, PCI device status 0x0280 KERN: kdebug> es[iprowifi3945] (wpi) fatal firmware error

the following machine does the same nmi interrupt error on boot and goes into kdl.

Version 0, edited 10 years ago by ptaylor1982 (next)

comment:18 by luroh, 10 years ago

This seems to happen on my IBM T42, hrev47815. This laptop could be shipped if someone is curious to have a look at it.

by luroh, 10 years ago

Attachment: kdl.jpg added

T42 KDL hrev47815

comment:19 by waddlesplash, 6 years ago

Keywords: boot-failure added; NMI removed

comment:20 by waddlesplash, 6 years ago

Blocking: 8741 added

comment:21 by waddlesplash, 6 years ago

Still an issue?

comment:22 by waddlesplash, 6 years ago

Blocking: 11684 added

comment:23 by diver, 3 years ago

Blocking: 17271 added

comment:24 by jens, 3 years ago

Yes, still an issue, either a regression or another instance on different hardware.

This is a fresh install from haiku-r1beta3-x86_64-anyboot.iso , and the machine is a Lenovo ThinkCentre 7360 whp. I can provide logs and other information as required.

Typing "co" or disabling ACPI in the boot options allows the boot to complete.

comment:25 by jens, 19 months ago

Update, I booted R1/beta4 from a USB drive and got the same exception.

comment:26 by pulkomandy, 19 months ago

yes, a syslog would be helpful to see where this is happening. You can find it in /var/log/syslog once the system is booted

by jens, 19 months ago

Attachment: syslog.txt added

comment:27 by jens, 19 months ago

Added a new syslog

comment:28 by waddlesplash, 17 months ago

Blocking: 18506 added

comment:29 by pulkomandy, 2 months ago

Several of these syslogs seem to be during ACPI tables access.

In the last syslog attached, it is in AcpiHwSetMode, which does a PCI IO on an address computed from the FADT table. So, if the table is invalid or mis-parsed, it could end up accessing an invalid address.

I'm not sure if we can sanitize this by checking if the IO access is in a valid range at the PCI level? And ignore it (with a syslog message) instead of doing a panic?

Note: See TracTickets for help on using tickets.