Opened 8 months ago

Closed 3 months ago

#14506 closed bug (not reproducible)

XHCI: faulty event ring handling?

Reported by: smallstepforman
Owned by: waddlesplash
Priority: normal
Milestone: R1/beta2
Component: Drivers/USB/XHCI
Version: R1/Development
Keywords: XHCI KDL
Cc:
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description

I've finally found a way to reproduce the XHCI KDL on my box 100% of the time. Before, it was hit or miss, with the typical symptoms: first the mouse would die, then the keyboard would repeat constantly, and eventually the USB hard disk would die. I now have a KDL screenshot with XHCI errors. Maybe investigating this issue will finally expose the XHCI timing bug affecting everyone.

Steps to reproduce:

#include <assert.h>

int main(int argc, char **argv) {
    // During setup of my video editor, after setting up OpenGL/windows/Media Kit
    // ...

    // A simple assert will KDL
    assert(0);
}

On my box, 100% KDL with XHCI errors.

KDL trace (see screenshot), summarized in text form below:

<kernel_x86_64> memcpy + 0x51

<xhci> XHCI::ReadDescriptorChain(xhci_td*, iovec*, unsigned long) + 0xa2

<xhci> XHCI::FinishTransfers() + 0x1d1

<xhci> XHCI::FinishThread(void *) + 0x09

<kernel_x86_64> common_thread_entry(void *) + 0x37
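
A fault inside memcpy called from ReadDescriptorChain is what a copy into a stale or undersized iovec would look like. The following is only a minimal sketch of a bounds-checked scatter copy, not the driver's actual code: CopyToVectors and its parameters are hypothetical names chosen for illustration, and it assumes the destination buffers are described by standard POSIX iovec entries.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/uio.h>    // struct iovec

// Hypothetical helper (not part of the XHCI driver): copy up to `length`
// bytes from `source` into the buffers described by `vectors`, never
// writing past any iov_len. Returns the number of bytes actually copied.
static size_t
CopyToVectors(const void* source, size_t length, const iovec* vectors,
    size_t vectorCount)
{
    if (source == nullptr || vectors == nullptr)
        return 0;

    const char* from = static_cast<const char*>(source);
    size_t copied = 0;
    for (size_t i = 0; i < vectorCount && copied < length; i++) {
        if (vectors[i].iov_base == nullptr)
            break;    // refuse to write through a missing buffer

        // Clamp each chunk to the space this vector really has.
        size_t chunk = length - copied;
        if (chunk > vectors[i].iov_len)
            chunk = vectors[i].iov_len;

        memcpy(vectors[i].iov_base, from + copied, chunk);
        copied += chunk;
    }
    return copied;
}
```

A caller can compare the return value against the expected transfer length to detect a short copy instead of faulting.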

Attachments (3)

xhci.jpg (1.7 MB) - added by smallstepforman 8 months ago.
Screenshot XHCI KDL
syslog (145.8 KB) - added by smallstepforman 8 months ago.
syslog (after 52357)
IMG-6112.JPG (2.6 MB) - added by smallstepforman 7 months ago.
Similar KDL, 2 byte difference in memcpy

Change History (12)

Changed 8 months ago by smallstepforman

Attachment: xhci.jpg added

Screenshot XHCI KDL

comment:1 Changed 8 months ago by smallstepforman

Looking at XHCI::FinishTransfers(), could we be dealing with a threading issue, where transfer->Vector() is modified by another thread? That would invalidate transfer->VectorCount().
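
To illustrate the kind of race I mean, here is a minimal sketch, not the real Transfer class. SharedVectors, SetVectors and SnapshotVectors are hypothetical names, and std::mutex stands in for whatever lock the USB stack would use. The point is that the pointer and the count are read together under one lock, so they can never disagree.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>
#include <sys/uio.h>    // struct iovec

// Hypothetical stand-in for the vector list owned by a transfer.
class SharedVectors {
public:
    // Writer side: replace the whole list while holding the lock.
    void SetVectors(const std::vector<iovec>& vectors)
    {
        std::lock_guard<std::mutex> locker(fLock);
        fVectors = vectors;
    }

    // Reader side: return a private copy, so the data pointer and the
    // element count seen by the caller always belong together.
    std::vector<iovec> SnapshotVectors() const
    {
        std::lock_guard<std::mutex> locker(fLock);
        return fVectors;
    }

private:
    mutable std::mutex fLock;
    std::vector<iovec> fVectors;
};
```

A finisher thread would then iterate over the snapshot rather than calling two separate accessors that another thread can change in between.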

Also, we never check the return value of Transfer::PrepareKernelAccess(), which is asking for trouble.
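
A minimal sketch of the check I have in mind follows. MockTransfer and FinishOne are hypothetical stand-ins, not the driver's code, and kOK / kError stand in for Haiku's B_OK and B_ERROR. The only point is that the copy is skipped when preparing kernel access fails.

```cpp
#include <cassert>
#include <cstdint>

typedef int32_t status_t;               // same underlying type Haiku uses
static const status_t kOK = 0;          // stands in for B_OK
static const status_t kError = -1;      // stands in for B_ERROR

// Hypothetical stand-in for a transfer whose buffers may fail to map.
struct MockTransfer {
    bool mappable;

    status_t PrepareKernelAccess()
    {
        return mappable ? kOK : kError;
    }
};

// Hypothetical finisher step: only "copy" (here, count a copy) when the
// buffers were successfully prepared. Returns true if the copy happened.
static bool
FinishOne(MockTransfer& transfer, int& copies)
{
    status_t status = transfer.PrepareKernelAccess();
    if (status != kOK)
        return false;    // do not touch the vectors at all

    copies++;
    return true;
}
```

In the real driver the failure path would presumably also report the error to the transfer's callback, but that part is left out here.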

comment:2 Changed 8 months ago by smallstepforman

Seems to have been fixed in hrev52357. The assert(0) will no longer trigger the KDL.

comment:3 Changed 8 months ago by mmlr

That would suggest that there is an interrupt-related issue that leads the XHCI driver to do something (finish a descriptor chain early?) which is then repeated. The fact that the IO-APIC is now probably used apparently masks the issue. It should still be fixed though, as it may come up in a different incarnation. Can you please provide a syslog with the new revision and try to reproduce the issue with IO-APICs disabled from the boot menu?

Changed 8 months ago by smallstepforman

Attachment: syslog added

syslog (after 52357)

comment:4 Changed 8 months ago by smallstepforman

Sadly, the XHCI issue still exists (losing mouse/keyboard) even without disabling IO-APIC. Michael, you were right, it just masked the issue from my initial reproducible scenario. I no longer get a KDL after the assert(0) however, so there is a small benefit :)

See attached syslog (above). Lots of USB resets. I use an external USB2 hard disk to boot Haiku on a 2014 MacBookPro (11.3).

comment:5 Changed 7 months ago by smallstepforman

Another KDL with an almost identical stack trace, except this time memcpy() is 2 bytes away from the previous crash. See attachment (uploading).

Changed 7 months ago by smallstepforman

Attachment: IMG-6112.JPG added

Similar KDL, 2 byte difference in memcpy

comment:6 Changed 7 months ago by pulkomandy

Milestone: changed from Unscheduled to R1/beta2

comment:7 Changed 3 months ago by waddlesplash

USB resets were fixed in master (but not in beta1). Random disconnects (pipe stalls) may now be fixed as of db360a20648 & hrev52890, according to various reports, so please upgrade and test.

comment:8 Changed 3 months ago by waddlesplash

Owner: changed from nobody to waddlesplash
Status: changed from new to assigned
Summary: changed from "XHCI KDL - possible root cause" to "XHCI: faulty event ring handling?"

There were some locking issues in HandleTransfersComplete that I've fixed in hrev52931, which may have been the cause of this.

However, I haven't yet audited our Event Ring handling code, which may be the source of these issues. So I'll leave this ticket open until I do.

comment:9 Changed 3 months ago by waddlesplash

Resolution: not reproducible
Status: changed from assigned to closed

The Event Ring code looks perfectly fine. So, closing this as not reproducible; we can make a new ticket if it reappears.
