Opened 6 years ago

Closed 5 years ago

Last modified 4 years ago

#14506 closed bug (not reproducible)

XHCI: faulty event ring handling?

Reported by: smallstepforman
Owned by: waddlesplash
Priority: normal
Milestone: R1/beta2
Component: Drivers/USB/XHCI
Version: R1/Development
Keywords: XHCI KDL
Cc:
Blocked By:
Blocking:
Platform: All

Description

I've finally found a way to reproduce the XHCI KDL on my box 100% of the time. Before, it was hit or miss, with the typical symptoms: first the mouse would die, or the keyboard would repeat constantly, and eventually the USB hard disk would die. I now have a KDL screenshot with XHCI errors. Maybe investigating this issue will finally expose the XHCI timing bug affecting everyone.

Steps to reproduce:

#include <assert.h>

int main(int argc, char** argv) {
    // During setup of my video editor, after setting up OpenGL/windows/Media Kit:
    // ...

    // A simple assert will KDL.
    assert(0);
}

On my box, 100% KDL with XHCI errors.

KDL trace (see screenshot); a summary in text format follows:

<kernel_x86_64> memcpy + 0x51

<xhci> XHCI::ReadDescriptorChain(xhci_td*, iovec*, unsigned long) + 0xa2

<xhci> XHCI::FinishTransfers() + 0x1d1

<xhci> XHCI::FinishThread(void *) + 0x09

<kernel_x86_64> common_thread_entry(void *) + 0x37

Attachments (3)

xhci.jpg (1.7 MB) - added by smallstepforman 6 years ago.
Screenshot XHCI KDL
syslog (145.8 KB) - added by smallstepforman 6 years ago.
syslog (after 52357)
IMG-6112.JPG (2.6 MB) - added by smallstepforman 6 years ago.
Similar KDL, 2 byte difference in memcpy

Change History (13)

by smallstepforman, 6 years ago

Attachment: xhci.jpg added

Screenshot XHCI KDL

comment:1 by smallstepforman, 6 years ago

Looking at XHCI::FinishTransfers(), could we be dealing with a threading issue, where transfer->Vector() is modified by another thread? That would invalidate transfer->VectorCount().

Also, we never check the return value of Transfer::PrepareKernelAccess(), which is asking for trouble.
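
To illustrate the two concerns above, here is a minimal, self-contained sketch. MockTransfer, IoVec, SnapshotVectors() and CopyIntoTransfer() are hypothetical stand-ins invented for this example and are not the Haiku USB stack's types; only the name PrepareKernelAccess() is borrowed from the ticket, and its behaviour here is an assumption. The sketch checks the result of the prepare step before touching any buffer, and takes the vector list and its count as one snapshot under a lock so the two cannot disagree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; NOT the Haiku USB stack types.
struct IoVec {
    void*  base;
    size_t length;
};

enum Status { kOk = 0, kError = -1 };

class MockTransfer {
public:
    MockTransfer(std::vector<IoVec> vectors, bool accessible)
        : fVectors(std::move(vectors)), fAccessible(accessible) {}

    // Assumed behaviour: a preparation step that is allowed to fail.
    Status PrepareKernelAccess() const { return fAccessible ? kOk : kError; }

    // Return the entries and their count as one consistent copy.
    std::vector<IoVec> SnapshotVectors() const {
        std::lock_guard<std::mutex> guard(fLock);
        return fVectors;
    }

    // Another thread may replace the vectors; it takes the same lock.
    void SetVectors(std::vector<IoVec> vectors) {
        std::lock_guard<std::mutex> guard(fLock);
        fVectors = std::move(vectors);
    }

private:
    mutable std::mutex fLock;
    std::vector<IoVec> fVectors;
    bool fAccessible;
};

// Copies up to sourceLength bytes into the transfer's buffers.
// Returns the number of bytes copied, or -1 if the prepare step failed.
long CopyIntoTransfer(const MockTransfer& transfer, const void* source,
    size_t sourceLength)
{
    if (transfer.PrepareKernelAccess() != kOk)
        return -1;  // do not touch the buffers if preparation failed

    const std::vector<IoVec> vectors = transfer.SnapshotVectors();
    const char* in = static_cast<const char*>(source);
    size_t copied = 0;
    for (const IoVec& vec : vectors) {
        if (copied == sourceLength)
            break;
        if (vec.base == nullptr)
            continue;  // skip unusable entries instead of faulting
        const size_t chunk = std::min(vec.length, sourceLength - copied);
        std::memcpy(vec.base, in + copied, chunk);
        copied += chunk;
    }
    return static_cast<long>(copied);
}
```

Whether the real driver needs a snapshot, a longer-held lock, or neither depends on who is actually allowed to modify the transfer while it is in flight, which this sketch does not model.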

comment:2 by smallstepforman, 6 years ago

Seems to have been fixed in hrev52357. The assert(0) will no longer trigger the KDL.

comment:3 by mmlr, 6 years ago

That would suggest that there is an interrupt-related issue that leads the XHCI driver to do something (finish a descriptor chain early?) which is then repeated. The fact that the IO-APIC is now probably used apparently masks the issue. It should still be fixed, though, as it may come up in a different incarnation. Can you please provide a syslog with the new revision and try to reproduce the issue with IO-APICs disabled from the boot menu?

by smallstepforman, 6 years ago

Attachment: syslog added

syslog (after 52357)

comment:4 by smallstepforman, 6 years ago

Sadly, the XHCI issue still exists (losing mouse/keyboard) even without disabling IO-APIC. Michael, you were right, it just masked the issue from my initial reproducible scenario. I no longer get a KDL after the assert(0), however, so there is a small benefit :)

See attached syslog (above). Lots of USB resets. I use an external USB2 hard disk to boot Haiku on a 2014 MacBookPro (11.3).

comment:5 by smallstepforman, 6 years ago

Another KDL with an almost identical stack trace, except this time memcpy() is 2 bytes away from the previous crash. See attachment (uploading).

by smallstepforman, 6 years ago

Attachment: IMG-6112.JPG added

Similar KDL, 2 byte difference in memcpy

comment:6 by pulkomandy, 6 years ago

Milestone: Unscheduled → R1/beta2

comment:7 by waddlesplash, 5 years ago

USB resets were fixed in master (but not in beta1). According to various reports, random disconnects (pipe stalls) may now be fixed as of db360a20648 & hrev52890, so please upgrade and test.

comment:8 by waddlesplash, 5 years ago

Owner: changed from nobody to waddlesplash
Status: new → assigned
Summary: XHCI KDL - possible root cause → XHCI: faulty event ring handling?

There were some locking issues in HandleTransfersComplete that I've fixed in hrev52931, which may have been the cause of this.

However, I haven't yet audited our Event Ring handling code, which may be the source of these issues. So I'll leave this ticket open until I do.
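
For context on what such an audit looks at, the general consumer pattern for an xHCI-style event ring can be sketched as follows. This is a simplified model under stated assumptions, not the Haiku driver's code: EventTrb and EventRingConsumer are hypothetical names, the ring is a single segment, and the TRB is reduced to a payload plus a cycle flag (a real event TRB is 16 bytes with the cycle bit in its control word, and a real driver would also write the dequeue pointer back to the controller). The consumer only accepts entries whose cycle flag matches its own cycle state, and flips that state each time it wraps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical event entry: payload plus the producer's cycle flag.
struct EventTrb {
    uint32_t payload;
    bool     cycle;
};

class EventRingConsumer {
public:
    // The ring starts zeroed (cycle == false), so the consumer's initial
    // cycle state is true: only freshly written entries match it.
    explicit EventRingConsumer(std::vector<EventTrb>& ring)
        : fRing(ring), fDequeue(0), fCycle(true) {}

    // Collects the payloads of all entries currently owned by software.
    std::vector<uint32_t> Drain() {
        std::vector<uint32_t> events;
        if (fRing.empty())
            return events;

        // At most one full lap per call, as a guard against spinning forever.
        for (size_t n = 0; n < fRing.size(); n++) {
            if (fRing[fDequeue].cycle != fCycle)
                break;  // stale entry: the producer has not written it yet
            events.push_back(fRing[fDequeue].payload);
            if (++fDequeue == fRing.size()) {
                fDequeue = 0;
                fCycle = !fCycle;  // wrapped: expect the opposite flag now
            }
        }
        return events;
    }

private:
    std::vector<EventTrb>& fRing;
    size_t fDequeue;
    bool   fCycle;
};
```

A mismatch between the consumer's cycle state and the producer's, or a dequeue index that is advanced without the matching wrap toggle, would make the consumer either miss events or replay stale ones, which is the class of bug an event ring audit is meant to rule out.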

comment:9 by waddlesplash, 5 years ago

Resolution: not reproducible
Status: assigned → closed

The Event Ring code looks perfectly fine. So, closing this as not reproducible; we can make a new ticket if it reappears.

comment:10 by nielx, 4 years ago

Remove milestone for tickets with status = closed and resolution != fixed
