Opened 11 years ago

Closed 5 years ago

#2243 closed bug (fixed)

Firewire driver provokes PCI parity error and KDL upon boot (before Tracker loads)

Reported by: koki
Owned by: modeenf
Priority: high
Milestone: R1
Component: Drivers/FireWire
Version: R1/pre-alpha1
Keywords:
Cc: marcusoverhagen, anevilyak
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description

Haiku hrev25566 on an HP Pavilion zv5400us laptop.

Haiku KDLs upon boot, after the Deskbar & Terminal are loaded, and before Tracker is run. 100% reproducible.

Haiku finishes loading Tracker and runs OK after the "Continue" command is entered in KDL.

pcistatus/backtraces/listdev outputs and syslog attached.

Attachments (7)

pcistatus.jpg (319.9 KB ) - added by koki 11 years ago.
pcistatus output
kdl.jpg (25.8 KB ) - added by koki 11 years ago.
KDL initial screen
bt.jpg (194.9 KB ) - added by koki 11 years ago.
Backtrace
listdev.txt (3.0 KB ) - added by koki 11 years ago.
listdev output
syslog.txt (107.9 KB ) - added by koki 11 years ago.
Syslog
broken_hardware_patch (941 bytes ) - added by absabs 11 years ago.
patch2 (2.2 KB ) - added by absabs 11 years ago.


Change History (34)

by koki, 11 years ago

Attachment: pcistatus.jpg added

pcistatus output

by koki, 11 years ago

Attachment: kdl.jpg added

KDL initial screen

by koki, 11 years ago

Attachment: bt.jpg added

Backtrace

by koki, 11 years ago

Attachment: listdev.txt added

listdev output

by koki, 11 years ago

Attachment: syslog.txt added

Syslog

comment:1 by koki, 11 years ago

Cc: marcusoverhagen added

Forgot to mention: Haiku does not KDL if fw_raw driver is removed. FWIW.

comment:2 by mmlr, 11 years ago

As the firewire driver seems to cause instability for GCC4 builds too, I'd vote for removing it from the image until the issues have been sorted out.

comment:3 by axeld, 11 years ago

Priority: normal → high

+1! I have a machine that also only boots okay when I remove the firewire driver. I've removed it in hrev25572 for now.

comment:4 by marcusoverhagen, 11 years ago

The firewire controller seems to generate a parity error when DMA is enabled in fwohci_rx_enable().

in reply to:  description ; comment:5 by absabs, 11 years ago

First, there's no problem with my 1394 card on my box.

This problem may be due to the PCI subsystem changeset 25550 from two days ago. The PCI bus code now enables Parity Error Response and SERR by default. Some broken PCI-1394 cards do not clear all of their on-chip memory during boot (hardware reset?), which leads to PCI bus parity errors and then an NMI. The broken_hardware_patch may fix the bug; if so, please close this ticket. Could someone help test it?

PS: Can we just enable the bus master bit by default?

Regards, JiSheng
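
For readers unfamiliar with the bits mentioned above, here is a minimal sketch of what "keep bus mastering, drop parity and SERR# reporting" means for a single device's PCI command register. It is not the contents of broken_hardware_patch (which is only attached to this ticket); the helper name and the sample value are hypothetical, and only the bit positions come from the PCI Local Bus specification.

```c
#include <stdint.h>

/* Bits of the 16-bit PCI command register (config space offset 0x04). */
#define PCI_CMD_BUS_MASTER      0x0004 /* bit 2: device may act as bus master */
#define PCI_CMD_PARITY_RESPONSE 0x0040 /* bit 6: device reacts to parity errors */
#define PCI_CMD_SERR_ENABLE     0x0100 /* bit 8: device may assert SERR# */

/*
 * Hypothetical helper: returns the command register value with bus
 * mastering enabled and parity/SERR# reporting disabled. All other
 * bits are left untouched.
 */
static uint16_t
quirk_command_for_broken_device(uint16_t command)
{
	command |= PCI_CMD_BUS_MASTER;
	command &= (uint16_t)~(PCI_CMD_PARITY_RESPONSE | PCI_CMD_SERR_ENABLE);
	return command;
}
```

With a made-up starting value of 0x0146 (memory space, bus master, parity response, SERR#), the helper yields 0x0006. In a real driver the value would be read from and written back to config space rather than passed around like this.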

by absabs, 11 years ago

Attachment: broken_hardware_patch added

in reply to:  5 comment:6 by absabs, 11 years ago

If you hit a bug, please report it so I can dig into it and fix it.

PS: Thanks to mmu_man for reminding me what may have caused the bug.

comment:7 by marcusoverhagen, 11 years ago

Some more info.

With hrev25550 I enabled the PCI-PCI bridge reporting of parity errors on its secondary side. That means that when a PCI device attached to the bridge generates a parity error, the bridge will report it as SERR, usually generating an NMI.

PCI bridge configuration happens here:
  97 KERN: PCI: dom 0, bus 0, dev 10, func 0, changed PCI bridge control from 0x0200 to 0x0823
  98 KERN: PCI: dom 0, bus 0, dev 11, func 0, changed PCI bridge control from 0x000f to 0x082f

The bridge is:
  243 KERN: PCI: [dom 0, bus 0] bus 0, device 10, function 0: vendor 10de, device 00dd, revision a2

and the secondary bus is number 2:
  250 KERN: PCI: primary_bus 00, secondary_bus 02, subordinate_bus 02, secondary_latency 80

Where the firewire controller is located:
  262 KERN: PCI: [dom 0, bus 2] bus 2, device 0, function 0: vendor 104c, device 8026, revision 00
  264 KERN: PCI: vendor 104c: Texas Instruments
  265 KERN: PCI: device 8026: TSB43AB21 IEEE-1394a-2000 Controller (PHY/Link)

While it might be possible to not enable parity error reporting at all, or to disable it for a blacklist of broken devices, I'm not sure that the firewire driver isn't the guilty party here. Masking errors usually only leads to undetected data corruption.
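
The bridge control values quoted from the syslog can be decoded by hand. The sketch below shows which bits the change from 0x0200 to 0x0823 (and from 0x000f to 0x082f) turns on. The bit names follow the PCI-to-PCI bridge specification; the helper functions are hypothetical and exist only for illustration, not in the Haiku PCI bus manager.

```c
#include <stdint.h>
#include <stdio.h>

/* Selected bits of the PCI-to-PCI bridge control register (offset 0x3e). */
#define BRIDGE_CTL_PARITY_RESPONSE 0x0001 /* bit 0: parity error response enable */
#define BRIDGE_CTL_SERR_ENABLE     0x0002 /* bit 1: forward secondary SERR# */
#define BRIDGE_CTL_MASTER_ABORT    0x0020 /* bit 5: master-abort mode */
#define BRIDGE_CTL_DISCARD_SERR    0x0800 /* bit 11: discard timer SERR# enable */

/* Returns the bits that are set in `after` but were clear in `before`. */
static uint16_t
newly_set_bits(uint16_t before, uint16_t after)
{
	return (uint16_t)(after & ~before);
}

/* Prints one line for each of the bits named above that is set in `bits`. */
static void
print_bridge_control_bits(uint16_t bits)
{
	if (bits & BRIDGE_CTL_PARITY_RESPONSE)
		printf("  parity error response enable\n");
	if (bits & BRIDGE_CTL_SERR_ENABLE)
		printf("  SERR# enable\n");
	if (bits & BRIDGE_CTL_MASTER_ABORT)
		printf("  master-abort mode\n");
	if (bits & BRIDGE_CTL_DISCARD_SERR)
		printf("  discard timer SERR# enable\n");
}
```

For the bridge at device 10, newly_set_bits(0x0200, 0x0823) is 0x0823: parity error response, SERR# enable, master-abort mode and discard timer SERR# enable are all newly set. For device 11, which already had the low four bits set, only 0x0820 is new.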

comment:10 by anevilyak, 11 years ago

Cc: anevilyak added

in reply to:  9 comment:11 by absabs, 11 years ago

Yep. There's the same bug in FreeBSD's stack. Perhaps FreeBSD now disables parity errors (I'm not sure), so the line was removed.

comment:12 by axeld, 11 years ago

Just for the record, while I didn't have a KDL, when the FW driver is installed, the system hangs completely. I haven't yet tested again with the parity check enabled.

comment:13 by marcusoverhagen, 11 years ago

Summary: KDL upon boot (before Tracker loads) → Firewire driver provokes PCI parity error and KDL upon boot (before Tracker loads)

Regarding this issue in general, why is the firewire device transmitting that data using DMA to the system RAM?

As I understand it, the data still has bad parity, because its RAM has never been written to before (assuming the above idea is correct). That seems to happen when receiving is enabled in fwohci_rx_enable().

comment:14 by absabs, 11 years ago

Because after a bus reset, all self-ID (sid) packets, including the controller's own, will be received. DMA is used to transfer these packets.

IMO, it is still the parity problem, because the firewire stack was OK before on koki's box. I need to find a PC with the same problem to test on, since there's no problem on my box. Any suggestions?

axeld, what's the serial debug information when the system hangs?

comment:15 by axeld, 11 years ago

IIRC it didn't dump anything helpful, and I couldn't even enter KDL. If you have any idea on how I can dig into this more, let me know.

Is it possible to gracefully handle a parity problem by turning the check off and dump a warning to syslog? If that is not possible, I think the only solution would be to turn parity checking off by default, and make it available via a config setting only (that defaults to off).
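
The two options raised in this comment (a setting that defaults to off, and turning the check off with a warning once an error is seen) could look roughly like the sketch below. Everything here is hypothetical: the structure, the function names and the way the setting is passed in are assumptions for illustration, and a real implementation would write the PCI bridge control register and log to the syslog instead of working on a cached value and stderr.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Parity error response (bit 0) and SERR# enable (bit 1) of bridge control. */
#define BRIDGE_CTL_PARITY_BITS 0x0003

/* Hypothetical policy state for one PCI-PCI bridge. */
struct parity_policy {
	bool     check_enabled;  /* false unless the setting turns it on */
	uint16_t bridge_control; /* cached copy of the bridge control register */
};

/* Applies the setting: parity reporting is only enabled when asked for. */
static void
parity_policy_init(struct parity_policy* policy, uint16_t bridgeControl,
	bool settingEnabled)
{
	policy->check_enabled = settingEnabled;
	policy->bridge_control = bridgeControl;
	if (settingEnabled)
		policy->bridge_control |= BRIDGE_CTL_PARITY_BITS;
	else
		policy->bridge_control &= (uint16_t)~BRIDGE_CTL_PARITY_BITS;
}

/*
 * Called when a parity error has been reported. Turns the check off and
 * emits a warning. Returns true if this call disabled the check, false
 * if it was already off.
 */
static bool
parity_policy_handle_error(struct parity_policy* policy)
{
	if (!policy->check_enabled)
		return false;

	policy->bridge_control &= (uint16_t)~BRIDGE_CTL_PARITY_BITS;
	policy->check_enabled = false;
	fprintf(stderr, "PCI: parity error reported, disabling parity checking\n");
	return true;
}
```

The trade-off is the one pointed out earlier in the thread: once the check is off, later parity errors go unnoticed.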

comment:16 by koki, 11 years ago

> Because the firewire stack is ok before on koki's box.

FWIW, I actually don't know if the FW stack was OK before, as I never used it for anything. What I can say is that, if there was a problem with FW before hrev25566, it did not manifest itself the way it does now.

in reply to:  16 ; comment:17 by absabs, 11 years ago

It means that the firewire stack initialized OK.

> FWIW, I actually don't know if the FW stack was OK before, as I never used it for anything. What I can say is that, if there was a problem with FW before hrev25566, it did not manifest itself the way it does now.

> Is it possible to gracefully handle a parity problem by turning the check off and dump a warning to syslog? If that is not possible, I think the only solution would be to turn parity checking off by default, and make it available via a config setting only (that defaults to off).

The broken_hardware_patch should turn off the parity check for the PCI-1394 card. But the result is that there is a parity error with the PCI bridge.

Could someone help test this new patch?

by absabs, 11 years ago

Attachment: patch2 added

in reply to:  17 comment:18 by absabs, 11 years ago

> would someone help test this new patch

I mean only patch2. It fixes a few lines and enables the postedWriteEnable bit of the HCControl register.

Thanks to DeadYak for explaining what "posted" means ;)
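
As background for the postedWriteEnable bit: OHCI 1394 controllers expose HCControl as a set/clear register pair, so a bit is turned on by writing it to the "set" address rather than by read-modify-write. The sketch below models that behaviour with a fake register. It is not patch2; the offsets (0x50 and 0x54) and the bit position (18) are given from memory of the OHCI 1394 register layout and should be checked against the specification, and all names are hypothetical.

```c
#include <stdint.h>

/* Assumed OHCI 1394 register offsets and bit (verify against the spec). */
#define OHCI_HCCONTROL_SET         0x50
#define OHCI_HCCONTROL_CLEAR       0x54
#define OHCI_HCC_POSTED_WRITE_EN   (1u << 18)

/* Stand-in for the memory-mapped controller; holds only HCControl. */
struct fake_ohci {
	uint32_t hc_control;
};

/* Models set/clear semantics: written 1 bits are set or cleared. */
static void
fake_ohci_write(struct fake_ohci* controller, uint32_t offset, uint32_t value)
{
	if (offset == OHCI_HCCONTROL_SET)
		controller->hc_control |= value;
	else if (offset == OHCI_HCCONTROL_CLEAR)
		controller->hc_control &= ~value;
}

/* Enables posted writes without disturbing other HCControl bits. */
static void
enable_posted_writes(struct fake_ohci* controller)
{
	fake_ohci_write(controller, OHCI_HCCONTROL_SET, OHCI_HCC_POSTED_WRITE_EN);
}
```

A real driver would issue the same single write through its register access macro instead of the fake structure used here.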

comment:19 by koki, 11 years ago

I tried a build with the patched fw_raw (thanks Deadyak!), and it still KDLs.

in reply to:  19 comment:20 by absabs, 11 years ago

Replying to koki:

> I tried a build with the patched fw_raw (thanks Deadyak!), and it still KDLs.

Hmmm. The patched module is firewire, not fw_raw. Was the wrong file replaced?

comment:21 by anevilyak, 11 years ago

The correct file was patched, I sent him a new image with your patch applied, he didn't just replace one file.

in reply to:  15 comment:22 by absabs, 11 years ago

I examined about 10 boxes and found a Linux PC with a similar problem. But the Linux kernel's default NMI behavior is to emit a message and continue, so there is usually no problem. I sent a patch to lkml: http://lkml.org/lkml/2008/5/21/163. But after a lot of testing today, I found that the problem persists even with the patch applied.

As for FreeBSD, the NMI ISR just panics, but their PCI code only enables the master bit by default (not sure). So there is usually no problem there either.

IMHO, could we either enable only the PCI master bit, or emit a message when an NMI happens?

comment:23 by absabs, 11 years ago

Hmmm, there is no such problem with the 1394 card in my box, so I think it is a problem with the Linux PC's PCI slot. I can't install another OS on that box because it is very important. I need to find another box.

comment:24 by koki, 11 years ago

I submitted this bug report, but unfortunately the laptop that was showing the problem has died and I don't have it anymore. So, unfortunately, I will not be able to provide any feedback. Sorry.

comment:25 by modeenf, 7 years ago

This patch was added to trunk long ago.

So no one can test this?

comment:26 by modeenf, 7 years ago

Owner: changed from absabs to modeenf
Status: new → assigned

comment:27 by tqh, 5 years ago

Resolution: fixed
Status: assigned → closed

Not tested, but it seems it was fixed two or more years ago.
