Opened 11 years ago

Closed 5 years ago

#2243 closed bug (fixed)

Firewire driver provokes PCI parity error and KDL upon boot (before Tracker loads)

Reported by: koki Owned by: modeenf
Priority: high Milestone: R1
Component: Drivers/FireWire Version: R1/pre-alpha1
Keywords: Cc: marcusoverhagen, anevilyak
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

Haiku hrev25566 on an HP Pavilion zv5400us laptop.

Haiku KDLs upon boot, after the Deskbar & Terminal are loaded, and before Tracker is run. 100% reproducible.

Haiku finishes loading Tracker and runs OK after entering the "Continue" command in KDL.

pcistatus/backtraces/listdev outputs and syslog attached.

Attachments (7)

pcistatus.jpg (319.9 KB) - added by koki 11 years ago.
pcistatus output
kdl.jpg (25.8 KB) - added by koki 11 years ago.
KDL initial screen
bt.jpg (194.9 KB) - added by koki 11 years ago.
Backtrace
listdev.txt (3.0 KB) - added by koki 11 years ago.
listdev output
syslog.txt (107.9 KB) - added by koki 11 years ago.
Syslog
broken_hardware_patch (941 bytes) - added by absabs 11 years ago.
patch2 (2.2 KB) - added by absabs 11 years ago.


Change History (34)

Changed 11 years ago by koki

Attachment: pcistatus.jpg added

pcistatus output

Changed 11 years ago by koki

Attachment: kdl.jpg added

KDL initial screen

Changed 11 years ago by koki

Attachment: bt.jpg added

Backtrace

Changed 11 years ago by koki

Attachment: listdev.txt added

listdev output

Changed 11 years ago by koki

Attachment: syslog.txt added

Syslog

comment:1 Changed 11 years ago by koki

Cc: marcusoverhagen added

Forgot to mention: Haiku does not KDL if fw_raw driver is removed. FWIW.

comment:2 Changed 11 years ago by mmlr

As the firewire driver seems to cause instability for GCC4 builds too, I'd vote for removing it from the image until the issues have been sorted out.

comment:3 Changed 11 years ago by axeld

Priority: changed from normal to high

+1! I have a machine that also only boots okay when I remove the firewire driver. I've removed it in hrev25572 for now.

comment:4 Changed 11 years ago by marcusoverhagen

The firewire controller seems to generate a parity error, when DMA is enabled in fwohci_rx_enable().

comment:5 in reply to:  description ; Changed 11 years ago by absabs

First, there's no problem with my 1394 card on my box.

This problem may be due to the PCI subsystem changeset 25550 from two days ago. The PCI bus now enables Parity Error and SERR by default. Some broken PCI-1394 cards do not clear all of their on-chip memory during boot (hardware reset?), so PCI bus parity errors occur and raise an NMI. The broken_hardware_patch may fix the bug; if so, please close this ticket. Could someone help test it?

PS: Can we just enable the bus master bit by default?

Regards, JiSheng
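The question about enabling only the bus master bit can be made concrete with the PCI command register layout. The following is a minimal sketch, not the content of broken_hardware_patch: the macro names and the helper `quiet_command()` are hypothetical, and the bit positions are those of the standard PCI command register (bus master at bit 2, parity error response at bit 6, SERR# enable at bit 8).

```c
#include <stdint.h>

/* Standard PCI command register bits. The macro names are illustrative
 * and are not taken from the Haiku sources. */
#define PCI_CMD_IO_SPACE      0x0001u
#define PCI_CMD_MEMORY_SPACE  0x0002u
#define PCI_CMD_BUS_MASTER    0x0004u
#define PCI_CMD_PARITY_RESP   0x0040u
#define PCI_CMD_SERR_ENABLE   0x0100u

/* Hypothetical helper: given the current command register value, return
 * a value with bus mastering enabled and parity/SERR reporting disabled.
 * All other bits are left unchanged. */
uint16_t quiet_command(uint16_t command)
{
    command |= PCI_CMD_BUS_MASTER;
    command &= (uint16_t)~(PCI_CMD_PARITY_RESP | PCI_CMD_SERR_ENABLE);
    return command;
}
```

For example, a device whose command register reads 0x0146 (memory space, bus master, parity response, SERR#) would be rewritten to 0x0006 by this helper.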

Changed 11 years ago by absabs

Attachment: broken_hardware_patch added

comment:6 in reply to:  5 Changed 11 years ago by absabs

If you find a bug, please report it so I can dig into it and fix it.

PS: Thanks to mmu_man for reminding me of what may have caused the bug.

comment:7 Changed 11 years ago by marcusoverhagen

Some more info.

With hrev25550 I enabled the PCI-PCI bridge reporting of parity errors on its secondary side. That means, when a PCI device attached to the bridge generates a parity error, the bridge will report it as SERR, usually generating an NMI.

PCI bridge configuration happens here:
KERN: PCI: dom 0, bus 0, dev 10, func 0, changed PCI bridge control from 0x0200 to 0x0823
KERN: PCI: dom 0, bus 0, dev 11, func 0, changed PCI bridge control from 0x000f to 0x082f

The bridge is:
KERN: PCI: [dom 0, bus 0] bus 0, device 10, function 0: vendor 10de, device 00dd, revision a2

and the secondary bus is number 2:
KERN: PCI: primary_bus 00, secondary_bus 02, subordinate_bus 02, secondary_latency 80

Where the firewire controller is located:
KERN: PCI: [dom 0, bus 2] bus 2, device 0, function 0: vendor 104c, device 8026, revision 00
KERN: PCI: vendor 104c: Texas Instruments
KERN: PCI: device 8026: TSB43AB21 IEEE-1394a-2000 Controller (PHY/Link)
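To read the bridge control values in the log, it can help to decode them bit by bit. The sketch below is only an illustration: the table and the function `describe_bridge_control()` are hypothetical names, and the bit assignments are those of the standard PCI-to-PCI bridge control register (bit 0 parity error response, bit 1 SERR# enable, bit 5 master abort mode, bits 8 to 11 discard timer controls).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct bridge_ctl_bit {
    uint16_t mask;
    const char *name;
};

/* Bridge control register bits; the short names are illustrative. */
static const struct bridge_ctl_bit kBridgeCtlBits[] = {
    {0x0001, "parity-error-response"},
    {0x0002, "serr-enable"},
    {0x0004, "isa-enable"},
    {0x0008, "vga-enable"},
    {0x0020, "master-abort-mode"},
    {0x0040, "secondary-bus-reset"},
    {0x0080, "fast-back-to-back"},
    {0x0100, "primary-discard-timer"},
    {0x0200, "secondary-discard-timer"},
    {0x0400, "discard-timer-status"},
    {0x0800, "discard-timer-serr-enable"},
};

/* Write the space-separated names of all set bits into out and return
 * out. Stops early if the buffer is too small. */
char *describe_bridge_control(uint16_t value, char *out, size_t size)
{
    size_t used = 0;
    size_t count = sizeof(kBridgeCtlBits) / sizeof(kBridgeCtlBits[0]);

    if (size == 0)
        return out;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        if ((value & kBridgeCtlBits[i].mask) == 0)
            continue;
        int n = snprintf(out + used, size - used, "%s%s",
            used > 0 ? " " : "", kBridgeCtlBits[i].name);
        if (n < 0 || (size_t)n >= size - used)
            break;
        used += (size_t)n;
    }
    return out;
}
```

Under these bit assignments, 0x0823 has parity error response, SERR# enable, master abort mode and discard timer SERR# enable set, while the previous value 0x0200 had only the secondary discard timer bit set.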

While it might be possible to not enable parity error reporting at all, or to disable it for a blacklist of broken devices, I'm not sure that the firewire driver isn't the guilty party here. Masking errors usually only leads to undetected data corruption.
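The blacklist idea mentioned above could look roughly like the following. This is a sketch under assumptions: the table `kNoParityReporting` and the function `parity_reporting_allowed()` are hypothetical, and the single entry is just the controller from the syslog in this ticket, which has not been shown to be broken.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Hypothetical blacklist. The only entry is the controller reported in
 * this ticket's syslog; whether it is really at fault is unresolved. */
static const struct pci_id kNoParityReporting[] = {
    {0x104c, 0x8026},   /* Texas Instruments TSB43AB21 */
};

/* Return false if parity error reporting should stay disabled for the
 * given device, true otherwise. */
bool parity_reporting_allowed(uint16_t vendor, uint16_t device)
{
    size_t count = sizeof(kNoParityReporting) / sizeof(kNoParityReporting[0]);

    for (size_t i = 0; i < count; i++) {
        if (kNoParityReporting[i].vendor == vendor
            && kNoParityReporting[i].device == device)
            return false;
    }
    return true;
}
```

The trade-off raised in the comment still applies: every entry in such a list hides errors for that device, so it only makes sense for hardware known to report spurious parity errors.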

comment:10 Changed 11 years ago by anevilyak

Cc: anevilyak added

comment:11 in reply to:  9 Changed 11 years ago by absabs

Yep. The same bug exists in FreeBSD's stack. Perhaps FreeBSD now disables parity errors (I'm not sure), so the line was removed.

comment:12 Changed 11 years ago by axeld

Just for the record: I didn't get a KDL, but when the FW driver is installed, the system hangs completely. I haven't yet tested again with the parity check enabled.

comment:13 Changed 11 years ago by marcusoverhagen

Summary: changed from "KDL upon boot (before Tracker loads)" to "Firewire driver provokes PCI parity error and KDL upon boot (before Tracker loads)"

Regarding this issue in general, why is the firewire device transmitting that data using DMA to the system RAM?

As I understand it, it is still with bad parity, because its RAM has never been written to before (assuming the above idea is correct). That seems to happen when receiving is enabled in fwohci_rx_enable().

comment:14 Changed 11 years ago by absabs

Because after a bus reset, all SID (self-ID) packets, including the controller's own, are received. DMA is used to transfer these packets.

IMO, it is still the parity problem, because the firewire stack was OK before on koki's box. I need to find a PC with the same problem to test on, since there's no problem on my box. Any suggestions?

axeld, what's the serial debug information when the system hangs?

comment:15 Changed 11 years ago by axeld

IIRC it didn't dump anything helpful, and I couldn't even enter KDL. If you have any idea on how I can dig into this more, let me know.

Is it possible to gracefully handle a parity problem by turning the check off and dumping a warning to syslog? If that is not possible, I think the only solution would be to turn parity checking off by default, and make it available via a config setting only (that defaults to off).
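The "turn the check off and warn" idea can be sketched as a small piece of state. This is purely illustrative and assumes nothing about how Haiku handles NMIs: `struct parity_policy` and `on_parity_error()` are hypothetical, and the real work of touching the hardware and writing to syslog is left to the caller.

```c
#include <stdbool.h>

/* Hypothetical policy state for one bridge or device. */
struct parity_policy {
    bool check_enabled;     /* is parity checking currently on? */
    unsigned errors_seen;   /* total parity errors observed */
};

/* Record a parity error. Returns true exactly once, on the first error
 * seen while checking is enabled; the caller should then disable the
 * check in hardware and log a warning. Later errors are only counted. */
bool on_parity_error(struct parity_policy *policy)
{
    policy->errors_seen++;
    if (!policy->check_enabled)
        return false;
    policy->check_enabled = false;
    return true;
}
```

With a default of `check_enabled = false`, this also covers the fallback described above, where parity checking is off unless a config setting turns it on.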

comment:16 Changed 11 years ago by koki

Because the firewire stack is ok before on koki's box.

FWIW, I actually don't know if the FW stack was OK before, as I never used it for anything. What I can say is that, if there was a problem with FW before hrev25566, it did not manifest itself the way it does now.

comment:17 in reply to:  16 ; Changed 11 years ago by absabs

it means that the firewire stack is initialized OK.

> FWIW, I actually don't know if the FW stack was OK before, as I never used it for anything. What I can say is that, if there was a problem with FW before hrev25566, it did not manifest itself the way it does now.

> Is it possible to gracefully handle a parity problem by turning the check off and dump a warning to syslog? If that is not possible, I think the only solution would be to turn parity checking off by default, and make it available via a config setting only (that defaults to off).

The broken_hardware_patch should turn off the parity check for the PCI-1394 card. But the result is that there is a parity error with the PCI bridge.

Could someone help test this new patch?

Changed 11 years ago by absabs

Attachment: patch2 added

comment:18 in reply to:  17 Changed 11 years ago by absabs

would someone help test this new patch

I mean only patch2. It fixes a few lines and enables the postedWriteEnable bit of the HCControl register.

Thanks to DeadYak for explaining what "posted" means ;)
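For readers unfamiliar with the term: a "posted" write is one that is acknowledged before it has actually completed at its destination. The following sketch models how an OHCI-style set/clear register pair toggles such a bit. It is not the content of patch2; `struct fake_ohci` and both helpers are hypothetical, and the bit position (bit 18, 0x00040000) is an assumption based on the 1394 OHCI register layout as used by other drivers.

```c
#include <stdint.h>

/* Assumed position of postedWriteEnable in HCControl (bit 18). */
#define HC_CONTROL_POSTED_WRITE_ENABLE 0x00040000u

/* Stand-in for the controller: real hardware exposes separate "set" and
 * "clear" registers, where writing a 1 bit sets or clears that bit. */
struct fake_ohci {
    uint32_t hc_control;
};

/* Model of a write to the HCControl "set" register. */
void hc_control_set(struct fake_ohci *hc, uint32_t bits)
{
    hc->hc_control |= bits;
}

/* Model of a write to the HCControl "clear" register. */
void hc_control_clear(struct fake_ohci *hc, uint32_t bits)
{
    hc->hc_control &= ~bits;
}
```

Because set and clear are separate operations, a driver can change one bit without a read-modify-write cycle on the register.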

comment:19 Changed 11 years ago by koki

I tried a build with the patched fw_raw (thanks Deadyak!), and it still KDLs.

comment:20 in reply to:  19 Changed 11 years ago by absabs

Replying to koki:

I tried a build with the patched fw_raw (thanks Deadyak!), and it still KDLs.

Hmm. The patched module is firewire, not fw_raw. Was the wrong file replaced?

comment:21 Changed 11 years ago by anevilyak

The correct file was patched. I sent him a new image with your patch applied; he didn't just replace one file.

comment:22 in reply to:  15 Changed 11 years ago by absabs

I examined about 10 boxes and found a Linux PC with a similar problem. But the Linux kernel's default NMI behavior is to emit a message and continue, so there is usually no problem. I sent a patch to lkml: http://lkml.org/lkml/2008/5/21/163. But after a lot of testing today, I found that the problem persists even with the patch applied.

As for FreeBSD, the NMI ISR simply panics, but their PCI code only enables the master bit by default (not sure). So there is usually no problem there either.

IMHO, could we enable only the PCI master bit, or just emit a message when an NMI occurs?

comment:23 Changed 11 years ago by absabs

Hmm, there is no such problem with the 1394 card on my box, so I think it is a problem with the Linux PC's PCI slot. I can't install another OS on that box because it is very important. I need to find another box.

comment:24 Changed 10 years ago by koki

I submitted this bug report, but unfortunately the laptop that was showing the problem has died and I don't have it anymore. So, unfortunately, I will not be able to provide any feedback. Sorry.

comment:25 Changed 7 years ago by modeenf

This patch was added to trunk long ago.

So no one can test this?

comment:26 Changed 7 years ago by modeenf

Owner: changed from absabs to modeenf
Status: changed from new to assigned

comment:27 Changed 5 years ago by tqh

Resolution: fixed
Status: changed from assigned to closed

Not tested, but it seems it was fixed. Two or more years ago.
