Opened 3 years ago

Closed 22 months ago

Last modified 22 months ago

#13343 closed bug (fixed)

QEMU XHCI Devices not bootable

Reported by: kallisti5 Owned by: nobody
Priority: high Milestone: R1/beta1
Component: Drivers/USB/XHCI Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #12895
Has a Patch: yes Platform: All

Description

Haiku isn't able to detect partitions on XHCI boot media.

This is a high priority bug because most users could easily try to boot the haiku installation media over USB 3.

This can be reproduced in qemu:

XHCI (failure)

  1. plug a USB 3 flash device into a usb 3 port on your machine running linux.
  2. write haiku-anyboot to a USB 3 flash device
  3. start up a new QEMU vm attached to the USB vendor/product:
sudo qemu-system-x86_64 -device nec-usb-xhci,id=xhci,addr=0x5 -device usb-host,bus=xhci.0,vendorid=0x13fe,productid=0x5500 -net none --enable-kvm -m 2048 -serial stdio

haiku_loader can't find any partitions.

(XHCI_BOOT attached)

EHCI (Success)

  1. plug your USB 3 flash device into a usb 2 port on your machine running linux.
  2. start up a new QEMU vm attached to the USB vendor/product with an ehci bus:
sudo qemu-system-x86_64 -device usb-ehci,id=ehci,addr=0x5 -device usb-host,bus=ehci.0,vendorid=0x13fe,productid=0x5500 -net none --enable-kvm -m 2048 -serial stdio

Boots as expected.

(EHCI_BOOT attached)

Attachments (8)

XHCI_BOOT.txt (29.9 KB ) - added by kallisti5 3 years ago.
Failed XHCI boot
EHCI_BOOT.txt (55.2 KB ) - added by kallisti5 3 years ago.
Successful EHCI boot
XHCI_BOOT_QEMUDBG.txt (4.0 MB ) - added by kallisti5 3 years ago.
Failed XHCI boot w/qemu xhci debug inline
XHCI_BOOT_HKUDBG_QEMUDBG.txt (4.0 MB ) - added by kallisti5 3 years ago.
Failed XHCI boot w/qemu xhci debug inline and Haiku usb tracing
meld.png (311.1 KB ) - added by kallisti5 2 years ago.
meld of two allocatedevice calls on same port. first works, second fails.
xhci-out.txt (46.5 KB ) - added by kallisti5 2 years ago.
haiku syslogs with XHCI debug
xhci-qemudbg.txt (765.7 KB ) - added by kallisti5 2 years ago.
qemu logs with qemu DEBUG_XHCI compiled in.
0001-xhci-Transfer-complete-semaphore-was-not-being-relea.patch (6.0 KB ) - added by GregCrain 22 months ago.
A patch for xhci driver, but not directly impacting QEMU.

Change History (37)

by kallisti5, 3 years ago

Attachment: XHCI_BOOT.txt added

Failed XHCI boot

by kallisti5, 3 years ago

Attachment: EHCI_BOOT.txt added

Successful EHCI boot

comment:1 by kallisti5, 3 years ago

#13062 could be related... however these logs show a different error.

comment:2 by korli, 3 years ago

Summary: XHCI Devices not bootableQEMU XHCI Devices not bootable

comment:3 by korli, 3 years ago

Well you need to provide the Haiku hrev, the QEMU and SeaBIOS versions... Support for XHCI USB controllers appeared in SeaBIOS 1.7.4...

comment:4 by kallisti5, 3 years ago

Keep in mind this issue closely matches the issue we see on most moden real hardware... thus this is a good low-hanging fruit to troubleshoot some of our xhci boot issues.

comment:5 by kallisti5, 3 years ago

Blocking: 12895 added

#12895 is definitely related. This shows how to replicate it in qemu.

by kallisti5, 3 years ago

Attachment: XHCI_BOOT_QEMUDBG.txt added

Failed XHCI boot w/qemu xhci debug inline

comment:6 by kallisti5, 3 years ago

I've attached a debug log of Haiku's syslog output and QEMU's XHCI debugging inline

sudo qemu-system-x86_64 -device nec-usb-xhci,id=xhci,addr=0x5 -device usb-host,bus=xhci.0,vendorid=0x13fe,productid=0x5500 -net none --enable-kvm -m 2048 -serial stdio -d trace:usb_xhci_* > ~/XHCI_BOOT_QEMUDBG.txt 2>&1

Should be *really* interesting to see what Haiku is actually doing to the XHCI controller :-)

Last edited 3 years ago by kallisti5 (previous) (diff)

comment:8 by korli, 3 years ago

Keep in mind that FreeBSD couldn't boot on QEMU via XHCI at least in 2015. So having it working might not be so blocking in the end.

by kallisti5, 3 years ago

Failed XHCI boot w/qemu xhci debug inline and Haiku usb tracing

comment:9 by diver, 3 years ago

Component: Drivers/USBDrivers/USB/XHCI
Owner: changed from mmlr to nobody

comment:10 by kallisti5, 2 years ago

After looking at this for a while, I think part of the issue is the SetAddress fails:

usb xhci -1: Set Address
usb xhci -1: command[4] = b (0x00000000066bb000, 0x00000000, 0x02002c00)
usb xhci -1: Ding Dong! slot:0 endpoint 0
2055@1487952628.192117:usb_xhci_doorbell_write off 0x0000, val 0x00000000
2055@1487952628.192123:usb_xhci_fetch_trb addr 00000000064fed80, CR_ADDRESS_DEVICE, p 00000000066bb000, s 00000000, c 0x02002c01
2055@1487952628.192131:usb_xhci_slot_address slotid 2, port 1
2055@1487952628.192136:usb_xhci_queue_event v 0, idx 7, ER_COMMAND_COMPLETE, CC_TRB_ERROR, p 00000000064fed80, s 05000000, c 0x02008401
2055@1487952628.192141:usb_xhci_irq_msi nr 0
2055@1487952628.192147:usb_xhci_fetch_trb addr 00000000064fed90, TRB_RESERVED, p 0000000000000000, s 00000000, c 0x00000000
2055@1487952628.192161:usb_xhci_oper_read off 0x0004, ret 0x00000008
2055@1487952628.192172:usb_xhci_runtime_read off 0x0020, ret 0x00000003
2055@1487952628.192183:usb_xhci_oper_write off 0x0004, val 0x00000008
2055@1487952628.192194:usb_xhci_runtime_write off 0x0020, val 0x00000003
usb xhci -1: Event Interrupt
usb xhci -1: event[7] = 33 (0x00000000064fed80 0x05000000 0x02008401)
usb xhci -1: event[7] = 33 (0x00000000064fed80 0x05000000 0x02008401)
usb xhci -1: Received command event
usb xhci -1: event[8] = 0 (0x0000000000000000 0x00000000 0x00000000)
2055@1487952628.196027:usb_xhci_runtime_write off 0x0038, val 0x064fe0c8
2055@1487952628.196037:usb_xhci_runtime_write off 0x003c, val 0x00000000
2055@1487952628.196066:usb_xhci_doorbell_read off 0x0000, ret 0x00000000
usb xhci -1: Command Complete
usb error xhci -1: unsuccessful command TRB (5)
usb xhci -1: Storing trb 0x05000000 0x02008401
usb error xhci -1: unable to set address
usb xhci roothub: request: 1
usb xhci roothub: clear feature: 1

SetAddress() only gets called once and fails with an CC_TRB_ERROR from the XHCI host controller.

comment:11 by kallisti5, 2 years ago

More interesting, AllocateDevice gets called twice on hubAddress 1, hubPort 1 speed 3. The first time is successful. Second time fails.

I diffed' the similar calls and took of screenshot of the differences in meld.

by kallisti5, 2 years ago

Attachment: meld.png added

meld of two allocatedevice calls on same port. first works, second fails.

comment:12 by kallisti5, 2 years ago

Milestone: UnscheduledR1/beta1

comment:13 by kallisti5, 2 years ago

Looking through the XHCI tickets. We commonly see the following errors per the XHCI spec:

  • 5 TRB Error
    • Asserted when a TRB parameter error condition (e.g., out of range or invalid parameter) is detected in a TRB. Refer to section 4.10.2.2 for examples

      4.10.2.2 TRB Error A TRB Error indicates the TRB field values are out of range or that the xHC has determined that a TRB is incorrectly formed. This error condition may be reported in a Transfer Event or a Command Completion Event due to an error detected on a Transfer or Command TRB, respectively. This error will not be reported in any other Event TRB types. Note: A Transfer Ring TRB Error should transition an endpoint to the Error state (refer to section 4.8.3), however an xHC implementation may assert HCE18 due to the detection of TRB Error related error conditions. It is the responsibility of software to always present correctly formed TRBs to the xHC.

  • 17 Parameter Error
    • Asserted by a command if a Context parameter is invalid.

All of the errors we see seem to align to an incorrectly set parameter field.

comment:14 by kallisti5, 2 years ago

Interestingly, on qemu we see the following:

usb xhci -1: successfully started the controller
usb xhci -1: No-Op test...
usb xhci -1: Noop
usb xhci -1: command[0] = 17 (0x0000000000000000, 0x00000000, 0x00005c00)
usb xhci -1: Ding Dong! slot:0 endpoint 0
usb xhci -1: Event Interrupt
usb xhci -1: event[0] = 33 (0x0000000006503d40 0x05000000 0x00008401)
usb xhci -1: event[0] = 33 (0x0000000006503d40 0x05000000 0x00008401)
usb xhci -1: Received command event
usb xhci -1: event[1] = 0 (0x0000000000000000 0x00000000 0x00000000)
usb xhci -1: Command Complete. Result: 5
usb error xhci -1: unsuccessful command TRB (5)
usb xhci -1: Storing trb 0x05000000 0x00008401
usb xhci -1: No-Op unsuccessful
usb stack 0: module busses/usb/xhci successfully loaded

So:

  • Noop command issued
  • "something" sends command 17 (reset). 0x00005c00 is the reset TRB with everything else 0'ed out.
    • I say something because in these logs I never see the "XHCI::ResetDevice" trace message with tracing enabled.
  • We see successful results, followed by the initial No-Op being unsuccessful (DoCommand error)

comment:15 by kallisti5, 2 years ago

Ok.. that command 17 was actually (0x17) so 23 :-| (fixed via hrev51050)

This all makes a lot more sense now:

usb hub 2: initialised ok
usb xhci -1: successfully started the controller
usb xhci -1: No-Op test...
usb xhci -1: Issue No-Op
usb xhci -1: command[0] = 23 (0x0000000000000000, 0x00000000, 0x00005c00)
usb xhci -1: Ding Dong! slot:0 endpoint 0
usb xhci -1: Event Interrupt
usb xhci -1: event[0] = 33 (0x0000000006503d40 0x05000000 0x00008401)
usb xhci -1: event[0] = 33 (0x0000000006503d40 0x05000000 0x00008401)
usb xhci -1: Received command event
usb xhci -1: event[1] = 0 (0x0000000000000000 0x00000000 0x00000000)
usb xhci -1: Command Complete. Result: 5
usb error xhci -1: unsuccessful command TRB (5)
usb xhci -1: Storing trb 0x05000000 0x00008401
usb xhci -1: No-Op unsuccessful
usb xhci -1:   command[0] = 23 (0x0000000000000000, 0x00000000, 0x00005c01)
usb xhci -1:   command[1] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[2] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[3] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[4] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[5] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[6] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[7] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[8] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[9] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[10] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[11] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[12] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[13] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb xhci -1:   command[14] = 0 (0x0000000000000000, 0x00000000, 0x00000000)
usb stack 0: module busses/usb/xhci successfully loaded

So we're inserting the No-Op, and the XHCI controller is reporting it failed. (updating the cycle bit to 1 as expected)

by kallisti5, 2 years ago

Attachment: xhci-out.txt added

haiku syslogs with XHCI debug

comment:16 by kallisti5, 2 years ago

Has a Patch: set

by kallisti5, 2 years ago

Attachment: xhci-qemudbg.txt added

qemu logs with qemu DEBUG_XHCI compiled in.

comment:17 by pulkomandy, 2 years ago

Has a Patch: unset

by GregCrain, 22 months ago

A patch for xhci driver, but not directly impacting QEMU.

comment:18 by GregCrain, 22 months ago

Has a Patch: set

comment:19 by GregCrain, 22 months ago

I attached a patch that seems to get things working on my NEC/Renesas chipsets. I'm sure some of my findings will impact several tickets. It is still unstable, especially for pre rev 1.0 chipsets.

The most critical item is that there was an difference between an offset value compared to trb value, which was always off by 1. The semaphore for transfers was never being released and timed out. Actual command transfers, such as the NoOp() worked consistently, and didn't rely on that semaphore mechanism. The driver never completed getting the device descriptors.

There is also a post in the discussion forum. Maybe I should continue the topic in the development board.

comment:20 by korli, 22 months ago

Thanks for the patch. Please consider splitting the patch in fixes and logging parts. It's then easier to review and commit separately if needed.

Last edited 22 months ago by korli (previous) (diff)

comment:21 by waddlesplash, 22 months ago

@korli: Huh? This is already in "git format-patch" format!

@Greg: Wow, thanks for investigating this!

For the "early revision" chipsets -- do you mean that this causes a regression, and that after this patch they no longer work? Or are they just unstable, and didn't work at all before this? (Do they even work on Linux?)

in reply to:  21 comment:22 by GregCrain, 22 months ago

Replying to waddlesplash:

For the "early revision" chipsets -- do you mean that this causes a regression, and that after this patch they no longer work? Or are they just unstable, and didn't work at all before this? (Do they even work on Linux?)

For me, nothing worked at all previously in Haiku. No detection of anything. I haven't tried any other OS. It would fail trying to get the device descriptors, and therefore, couldn't even detect anything connected. Now, it gets the device descriptors, and Haiku see's connected devices. When you try to mount one of those devices, it fails, and the interrupts from the host stop. So trying to access the published device on xhci with the early chipsets causes Haiku to slowdown for a minute or something. I think it does recover, but for a casual user, it could be frustrating. But this is only for NEC/Renesas. I don't have any other hardware to test. Maybe it's ok for others.

comment:23 by waddlesplash, 22 months ago

Testing on a recent Linux or FreeBSD might be nice, but not necessary.

Thanks for this! I've got it in my commitqueue, but waiting on @kallisti5's feedback before I push it anywhere.

comment:24 by kallisti5, 22 months ago

Great work!

Some feedback:

  • WriteOpReg(XHCI_CRCR_LO, CRCR_CA );

Style, remove space after "_CA"

  • TRACE_ERROR("AllocateDevice() failed enable slot %d\n",slot);

Style, add space after ,

  • TRACE_ERROR("_LinkDescriptorForPipe max transfers count exceeded %d\n",endpoint->used );

Style, add space after , Style, remove space after "used"

  • offset = offset / sizeof(xhci_trb) +1;

Style, Add space after +

  • #ifdef TRACE_USB

This seems a little messy. For now it's fine though.

  • - ReadOpReg(portRegister);
  • - ReadOpReg(portRegister);

A lot of times, the documentation for XHCI says a read is required after making adjustments. Was there a reason to remove these?

  • Make sure Command Ring is stopped
  • WriteOpReg(XHCI_CRCR_LO, CRCR_CA );

Good catch here. Looks like FreeBSD unsets CS and CA to stop though? (via the ~ mask of the bits) https://github.com/freebsd/freebsd/blob/master/sys/dev/usb/controller/xhci.c#L294

EDIT: Actually, no, you were correct. They do later set CS and CA to stop. We might want to follow suite and mimic them more? (aka, set CS bit, write HI 0, set CA bit, write HI 0) Keep in mind this is hardware, which can have lots of non-nonsensical writes.

Last edited 22 months ago by kallisti5 (previous) (diff)

comment:25 by waddlesplash, 22 months ago

korli committed one part of this (the 'zero struct field' bit: http://cgit.haiku-os.org/haiku/commit/?id=5ed6ef8e18df700101160a4fd5da7edb73408411)

in reply to:  24 comment:26 by GregCrain, 22 months ago

Replying to Alexander von Gluck:

Some feedback:

Ok, thanks for feedback!

  • - ReadOpReg(portRegister);
  • - ReadOpReg(portRegister);

A lot of times, the documentation for XHCI says a read is required after making adjustments. Was there a reason to remove these?

I checked the spec and there is no behavior on the bits in this register that change based on a read. Further, I compared to the FreeBSD code, the similar function returned the read value, where we are just doing a read. So this seemed like superfluous code. But I can't seem to find it in the FreeBSD code at the moment.

  • Make sure Command Ring is stopped
  • WriteOpReg(XHCI_CRCR_LO, CRCR_CA );

Good catch here. Looks like FreeBSD unsets CS and CA to stop though? (via the ~ mask of the bits) https://github.com/freebsd/freebsd/blob/master/sys/dev/usb/controller/xhci.c#L294

EDIT: Actually, no, you were correct. They do later set CS and CA to stop. We might want to follow suite and mimic them more? (aka, set CS bit, write HI 0, set CA bit, write HI 0) Keep in mind this is hardware, which can have lots of non-nonsensical writes.

I can't prove that this was a problem, but I spent ages trying to figure out why I wasn't able to read back the address that was written. I finally found a note in the spec that said all reads of the CRCR register return 0. Then the spec also stated that all writes are ignored if it is running. So I thought a good idea to add this.

The MAIN issue I found was the function 'HandleTransferComplete', this statement:

offset = offset / sizeof(xhci_trb);

when that offset value was used in a comparision, it always failed, and therefore, would never reach the code:

release_sem(fFinishTransfersSem);

And then the semaphore would time out, and you can see lots of statements in debug logs:

usb xhci -1: cancel queued transfers for pipe 0xdfc65da0 (2)

usb_disk: acquire_sem failed while waiting for data transfer: Operation timed out

Offset was 0, when the trb_count would be 1. I added a correction,

offset = offset / sizeof(xhci_trb) + 1;

I checked with FreeBSD driver, and they did something similar, but in a different way:

for (i = (offset / 16) + 1; i < td->ntrb; i++) {

But their code structure and ours is a bit different in this area, so it seems ok for now. But please review !

comment:27 by kallisti5, 22 months ago

Ah yeah, ok. That all makes sense. It looks like korli jumped in and did the CS/CA fixes. (with you maintained as the author in git of course)

Nice catch on bumping the offset. I think that's the final part of these changes that'll need to be re-based and applied. I'll run some tests after the + 1 to see if there is an improvement (beyond the improvements you've already witnessed)

comment:28 by kallisti5, 22 months ago

Resolution: fixed
Status: newclosed

Pushed in hrev51535 I tested this in qemu (like the original tests) and it works! qemu can now boot from Haiku via XHCI with minimal errors.

usb xhci -1: interface version: 0x0100
usb xhci -1: structural parameters: 1:0x08001040 2:0x0000000f 3:0x00000000
usb xhci -1: capability params: 0x00087001
allocate_io_interrupt_vectors: allocated 1 vectors starting from 24
msi_allocate_vectors: allocated 1 vectors starting from 24
msi enabled: 0x0089
usb xhci -1: using message signaled interrupts
usb xhci -1: starting XHCI host controller
usb xhci -1: successfully started the controller
usb uhci: no devices found
usb ohci: no devices found
usb ehci: no devices found
usb hub 2: port 0: new device connected
usb error xhci -1: unsuccessful command TRB (5)
usb error xhci -1: unsuccessful command TRB (5)
usb xhci -1: cancel queued transfers for pipe 0xffffffff812cce30 (0)
usb hub 2: port 1: device removed
usb hub 2: port 2: device removed
usb hub 2: port 3: device removed
usb hub 2: port 4: device removed
usb hub 2: port 5: device removed
usb hub 2: port 6: device removed
usb hub 2: port 7: device removed
usb_disk: device reports a lun count of 1
usb_disk: vendor_identification    "        "
usb_disk: product_identification   "                "
usb_disk: product_revision_level   "PMAP"
get_boot_partitions(): boot volume message:

comment:29 by kallisti5, 22 months ago

Greg: Also, if you're seeing issues with your hardware with an early XHCI rev, could you open an additional ticket with information? I wasn't trying to cut you off closing this one... the original issue described is just now solved with your changes :-)

Note: See TracTickets for help on using tickets.