Opened 11 years ago

Closed 5 years ago

#3113 closed bug (invalid)

Fatal Exception "NMI Interrupt" on kernel boot

Reported by: graham
Owned by: axeld
Priority: normal
Milestone: R1
Component: System/Kernel
Version: R1/pre-alpha1
Keywords:
Cc: marcusoverhagen
Blocked By:
Blocking:
Has a Patch: no
Platform: x86

Description

Recently I tried to boot Haiku off a USB hard drive using my laptop.

Upon loading the kernel the splash screen remains grey (no color in icons) for 5 seconds (even the virtual machine boots faster), whereupon it jumps to the 3rd icon and the kernel debugger starts.

As the kernel isn't fully loaded yet I cannot save the output to a file so I had to manually write down the text and reproduce the output.

As it looks, the error seems to be with the PCI module. I tried to use this bug as an opportunity to learn the internals of the kernel and have learned more about programming a kernel in the past week than I did from a year of reading about it in books.

Unfortunately I got lost at the device tree code (where are the preloaded modules listed?) and since I don't have experience in programming hardware beyond basic assembly code I suspect I jumped in at the deep end. So I will leave this to you.

I (hopefully) have uploaded a typed version of the kernel debugger output and a basic profile of my hardware including the controllers and devices on the PCI bus.

This was done using a Windows program but if you require more information I can boot Arch Linux off another USB drive and give you another profile.

Attachments (6)

hdebug.txt (2.1 KB ) - added by graham 11 years ago.
hardware.html (95.8 KB ) - added by graham 11 years ago.
syslog_pcistatus.zip (62.9 KB ) - added by graham 11 years ago.
pcistatus.txt (2.3 KB ) - added by graham 11 years ago.
syslog.txt (103.2 KB ) - added by graham 11 years ago.
pcistatus boot.txt (2.4 KB ) - added by graham 11 years ago.


Change History (30)

by graham, 11 years ago

Attachment: hdebug.txt added

by graham, 11 years ago

Attachment: hardware.html added

comment:1 by graham, 11 years ago

Forgot to mention that the image I used was: haiku-pre-alpha-hrev28555-raw.zip

This bug isn't version specific however.

comment:2 by diver, 11 years ago

The backtrace looks like the one in #2680.

comment:3 by graham, 11 years ago

diver: Yes it does look similar after the driver is removed from his build. I can't really tell if it's related though.

I noticed someone else had used the 'pcistatus' command but it hasn't been registered at the point where I get dropped into the kernel debugger.

comment:4 by marcusoverhagen, 11 years ago

Cc: marcusoverhagen added

comment:5 by marcusoverhagen, 11 years ago

Hint: Parity error reporting gets enabled at PCI::_ConfigureBridges() in pci.cpp

in reply to:  5 comment:6 by graham, 11 years ago

Replying to marcusoverhagen:

Hint: Parity error reporting gets enabled at PCI::_ConfigureBridges() in pci.cpp

I've learned a bit more now and found where the modules get loaded.

I looked over my attached list of PCI devices and all of the PCI to PCI bridges are configured judging by the syslog.

So the error is probably in PCI::ClearDeviceStatus()

I wonder if it's worthwhile recompiling it with dumpstatus = true set?

comment:7 by graham, 11 years ago

After a fiasco with updating Arch Linux I finally managed to build an image in Ubuntu with dumpstatus = true.

No difference in the syslog.

I'm going to try it with some dprintf's to trace how far it gets.

comment:8 by graham, 11 years ago

Sorry for the delay.

As far as I can make out the error occurs just after it outputs the configuration of the last PCI to PCI bridge. It doesn't get as far as the part where the children of the bus are checked.

Any advice on how I should proceed next?

comment:9 by graham, 11 years ago

Looked up what that bridge actually is and it is the South bridge.

The plot thickens!

comment:10 by graham, 11 years ago

If I'm not mistaken the status of the PCI to PCI bridges looks rather suspicious. Take a look at the attached PDF and note the red values.

Is this the correct behaviour?

comment:11 by graham, 11 years ago

Looks like the file uploader is broken. =(

PDF is now uploaded here:

http://binxtrone.110mb.com/tmp/PCI%20Bridge%20Control%20Status.pdf

in reply to:  5 comment:12 by graham, 11 years ago

Replying to marcusoverhagen:

Hint: Parity error reporting gets enabled at PCI::_ConfigureBridges() in pci.cpp

Hi. I managed to get Haiku to boot perfectly by disabling the SERR# Enable bit in the control register.

Sure enough checking the PCI status after boot yields:

domain 0, bus 0, dev 30, func 0, PCI bridge secondary status 0x4280

Received System Error

domain 0, bus 0, dev 30, func 0, PCI bridge control 0x0825

I haven't a clue why it's signalling a system error. =(
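
For readers following along, the two register values quoted above can be decoded bit by bit. This is a minimal C++ sketch, not code from Haiku's pci.cpp: the bit positions follow the standard PCI-to-PCI bridge register layout, while BitName, DecodeBits, DevselTiming and the two tables are hypothetical helpers made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: one flag bit and its human-readable name.
struct BitName {
	std::uint16_t mask;
	const char* name;
};

// Flag bits of the bridge "secondary status" register.
// Bits 10:9 (DEVSEL timing) are a field, not a flag; see DevselTiming().
static const BitName kSecondaryStatusBits[] = {
	{0x0020, "66 MHz Capable"},
	{0x0080, "Fast Back-to-Back Capable"},
	{0x0100, "Master Data Parity Error"},
	{0x0800, "Signaled Target Abort"},
	{0x1000, "Received Target Abort"},
	{0x2000, "Received Master Abort"},
	{0x4000, "Received System Error"},
	{0x8000, "Detected Parity Error"},
};

// Flag bits of the "bridge control" register.
static const BitName kBridgeControlBits[] = {
	{0x0001, "Parity Error Response Enable"},
	{0x0002, "SERR# Enable"},
	{0x0004, "ISA Enable"},
	{0x0008, "VGA Enable"},
	{0x0020, "Master Abort Mode"},
	{0x0040, "Secondary Bus Reset"},
	{0x0080, "Fast Back-to-Back Enable"},
	{0x0100, "Primary Discard Timer"},
	{0x0200, "Secondary Discard Timer"},
	{0x0400, "Discard Timer Status"},
	{0x0800, "Discard Timer SERR# Enable"},
};

// Return the names of all flags from `table` that are set in `value`.
template <std::size_t N>
std::vector<std::string>
DecodeBits(std::uint16_t value, const BitName (&table)[N])
{
	std::vector<std::string> names;
	for (const BitName& bit : table) {
		if ((value & bit.mask) != 0)
			names.push_back(bit.name);
	}
	return names;
}

// Extract the 2-bit DEVSEL timing field (0 = fast, 1 = medium, 2 = slow).
inline unsigned
DevselTiming(std::uint16_t status)
{
	return (status >> 9) & 0x3;
}
```

Under these assumptions, 0x4280 decodes to Fast Back-to-Back Capable plus Received System Error (with medium DEVSEL timing), and 0x0825 decodes to Parity Error Response Enable, ISA Enable, Master Abort Mode and Discard Timer SERR# Enable. SERR# Enable (0x0002) is clear, which matches the workaround described above.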

You can download the specification for the ICH8 here:

http://www.intel.com/design/chipsets/datashts/313056.htm

According to the device list I originally attached I have the Base Mobile Version.

Do check out the strange configuration behaviour shown in the PDF I uploaded yesterday though.

comment:13 by marcusoverhagen, 11 years ago

Please attach a complete syslog, I'm interested in the PCI configuration. Please also attach the complete PCI status (you have to enter KDL once for that)

Especially what is the secondary bus number of bridge on bus 0, dev 30, func 0, and what devices are on that bus. Has any of the devices a pending error status?

comment:14 by graham, 11 years ago

Replying to marcusoverhagen:

Please attach a complete syslog, I'm interested in the PCI configuration. Please also attach the complete PCI status (you have to enter KDL once for that)

Especially what is the secondary bus number of bridge on bus 0, dev 30, func 0, and what devices are on that bus. Has any of the devices a pending error status?


Whew! I had to take a picture of the pcistatus output and type it but I managed to extract the syslog files using the skyfs/befs viewer. This is the output with the SERR# Enable bit disabled. Files are zipped because the syslog file is bigger than the maximum upload limit.

It's probably also worth mentioning that the mouse seems to lock up a minute or so after the boot completes. This didn't happen when I modified the code to skip bus 0, dev 30.

comment:15 by graham, 11 years ago

Ignore the last character of "KERN: PCI: dom 0, bus 0, dev 28, func 5, changed PCI bridge control from 0x0004 to 0x0007_"

That was a conversion error when I changed the line breaks. =)
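
The syslog line quoted above shows a bridge control register going from 0x0004 to 0x0007. Here is a small sketch of which bits that change turns on, assuming the standard bridge control layout; the constant and function names are hypothetical and do not come from Haiku.

```cpp
#include <cstdint>

// Illustrative masks for the low bits of the PCI bridge control register.
constexpr std::uint16_t kParityErrorResponse = 0x0001;
constexpr std::uint16_t kSerrEnable = 0x0002;
constexpr std::uint16_t kIsaEnable = 0x0004;

// Bits that are set in `after` but were clear in `before`.
constexpr std::uint16_t
NewlySetBits(std::uint16_t before, std::uint16_t after)
{
	return static_cast<std::uint16_t>(after & ~before);
}

// The change in the log line enables exactly these two bits.
static_assert(NewlySetBits(0x0004, 0x0007)
		== (kParityErrorResponse | kSerrEnable),
	"0x0004 -> 0x0007 should enable parity error response and SERR#");
```

So the bridge started with only ISA Enable set and ended up with Parity Error Response Enable and SERR# Enable as well, which is consistent with the hint in comment:5.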

by graham, 11 years ago

Attachment: syslog_pcistatus.zip added

comment:16 by graham, 11 years ago

Never mind. I noticed a mistake in the other file so I just reuploaded the corrected version.

comment:17 by graham, 11 years ago

No wonder the syslog is so huge. It has the output of the last couple of boots I did.

Ignore the zip file. I'll post the smaller versions. Grrr.

by graham, 11 years ago

Attachment: pcistatus.txt added

by graham, 11 years ago

Attachment: syslog.txt added

comment:18 by graham, 11 years ago

Ok, I re-copied my image onto the USB disk and my mouse problem and lockups disappeared, so ignore what I said about those.

Again this is the output with the SERR# Enable bit disabled.

by graham, 11 years ago

Attachment: pcistatus boot.txt added

in reply to:  14 comment:19 by graham, 11 years ago

Replying to marcusoverhagen:

Please attach a complete syslog, I'm interested in the PCI configuration. Please also attach the complete PCI status (you have to enter KDL once for that)

Especially what is the secondary bus number of bridge on bus 0, dev 30, func 0, and what devices are on that bus. Has any of the devices a pending error status?


I changed the ClearDeviceStatus() code to dprintf's so that it output the status during the PCI module initialization and extracted this from the resulting syslog, since the status is cleared during initialization.

(bus 7, dev 9) looks the same as the after-boot status but the other bridges have devices that signalled system errors. How come they don't generate an NMI since they still have the SERR# Enable bit set?
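
As background on why the status has to be logged before it is cleared: the error bits in a PCI status register are write-one-to-clear, so once they have been cleared during initialization the earlier state is gone. The sketch below models that with a fake register so it can run without hardware. FakeStatusRegister and ReadAndClearStatus are hypothetical names, and this is not how Haiku's ClearDeviceStatus() is actually written.

```cpp
#include <cstdint>

// Error bits of a PCI status register (bit 8 and bits 11 to 15). Writing 1
// to one of these bits clears it; writing 0 leaves it unchanged.
constexpr std::uint16_t kStatusErrorBits = 0xF900;

// Stand-in for a hardware status register, so the sketch needs no real
// config space access.
struct FakeStatusRegister {
	std::uint16_t value;

	std::uint16_t Read() const { return value; }

	void Write(std::uint16_t data)
	{
		// Only the error bits react to writes, with write-1-to-clear
		// semantics; the capability bits stay as they are.
		value = static_cast<std::uint16_t>(
			value & ~(data & kStatusErrorBits));
	}
};

// Read the status, clear whatever error bits were pending, and return the
// value seen before clearing so the caller can log it.
inline std::uint16_t
ReadAndClearStatus(FakeStatusRegister& reg)
{
	std::uint16_t status = reg.Read();
	reg.Write(static_cast<std::uint16_t>(status & kStatusErrorBits));
	return status;
}
```

With the register preset to 0x4280, ReadAndClearStatus() returns 0x4280 and leaves 0x0280 behind: the Received System Error bit is gone, which is why printing the value at clear time is the only way to see it.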

comment:20 by graham, 11 years ago

Tried setting the SERR# Response bit to 0 in the command register of devices on bus 7 but no effect.

comment:21 by graham, 11 years ago

By that I mean the functions of device 9.
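
A note on terminology: the standard PCI command register has a Parity Error Response bit (bit 6) and a SERR# Enable bit (bit 8), so it is not certain which one "SERR# Response" refers to. The sketch below assumes SERR# Enable was meant and shows the mask operation for clearing it. The names are hypothetical, and a real driver would read and write the register through the PCI config space API.

```cpp
#include <cstdint>

// Bits of the PCI command register relevant to error reporting.
constexpr std::uint16_t kCommandParityErrorResponse = 0x0040;	// bit 6
constexpr std::uint16_t kCommandSerrEnable = 0x0100;			// bit 8

// Return `command` with SERR# Enable cleared and all other bits untouched.
constexpr std::uint16_t
WithoutSerrEnable(std::uint16_t command)
{
	return static_cast<std::uint16_t>(command & ~kCommandSerrEnable);
}

// Example: 0x0147 has both bits set; only SERR# Enable is removed.
static_assert(WithoutSerrEnable(0x0147) == 0x0047,
	"only bit 8 should be cleared");
```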

comment:22 by scottmc, 8 years ago

Can you recheck this with a recent Haiku build? It may have been fixed recently.

comment:23 by graham, 8 years ago

The GPU of that machine no longer works thanks to poor quality control and poor customer support.

I have a machine with that same bridge in the same position that might be of some use so I'll check it with haiku-pre-alpha-hrev28555-raw.zip to see if I encounter the original bug.

If I get that error I'll test it with a more recent build.

comment:24 by tqh, 5 years ago

Resolution: invalid
Status: new → closed

Closing this as it is six years old, with no activity in three years.
