Opened 6 years ago

Last modified 3 years ago

#14835 new bug

libusb syscall stalls when USB errors

Reported by: pulkomandy Owned by: mmlr
Priority: normal Milestone: Unscheduled
Component: Drivers/USB Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

https://github.com/haikuports/haikuports/issues/1345

KERN: usb_disk: got device name "Canon MP620 series 0109": No error
KERN: usb error ehci -1: qtd (0xe4c7000) error: 0x40000148
KERN: usb error ehci -1: qtd (0xe30fa80) error: 0x801f8049
KERN: usb_disk: sending the command block wrapper failed: Device check-sum error
KERN: usb error ehci -1: qtd (0xe30fc00) error: 0x00080248
KERN: usb error ehci -1: qtd (0xe30fe00) error: 0x00080248
KERN: usb error ehci -1: qtd (0xe310000) error: 0x00080248
KERN: usb hub 18: port 4: device removed

Error 0x80248 means we failed sending an 8 byte setup token, transaction halted because device is in error state.

The transfer should either be retried, or the error raised to the calling code by the mean of the USB transfer completion callback. But it seems this never happens. As a result the ioctl syscall doing the operation never returns.

From a quick look at the EHCI code, it appears that we don't call the callback for cancelled transfers. We cancel transfers without calling the callback if doing a "force" cancel, which happens in bus manager Pipe destructor.

So basically, we could end up in this situation if the Pipe was deleted while transfers where queued, and they were not properly cancelled.

I can produce a similar issue with sigrok-cli and an fx2lafw device.

I will be studying the bus manager code, but if someone with more knowledge of the architecture could have a look, that would be great.

Change History (11)

comment:1 by waddlesplash, 6 years ago

I guess it makes sense to force-cancel transfers from Pipe destructor; but I'm not sure why the transfers were not cancelled earlier. Pipe destruction happens during Device teardown; likely the Device teardown should cancel any queued transfers "normally"?

comment:2 by waddlesplash, 6 years ago

http://xref.plausible.coop/source/xref/haiku/src/add-ons/kernel/bus_managers/usb/Device.cpp#556

Instead of just "delete"ing the Pipe here, try calling CancelQueuedTransfers with "force" as "false" before doing so. I think it is still acceptable to do that at this stage.

comment:3 by pulkomandy, 5 years ago

While running sigrok-cli:

KERN: usb xhci 0: KERN: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: TRB 0xb3b3da0 was not found in the endpoint!
KERN: usb xhci 0: KERN: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: TRB 0xb3b3de0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b3e20 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b3e60 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: TRB 0xb3b3ea0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: TRB 0xb3b3ee0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b3f20 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: TRB 0xb3b3f60 was not found in the endpoint!
KERN: usb xhci 0: KERN: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: TRB 0xb3b40a0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b40e0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b4120 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b4160 was not found in the endpoint!
KERN: usb xhci 0: KERN: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b41a0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b41e0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b4220 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b4260 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b42a0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b42e0 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b4320 was not found in the endpoint!
KERN: usb xhci 0: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 0: KERN: TRB 0xb3b4360 was not found in the endpoint!

At this point sigrok-cli says there is an error and stalls (I can't kill it with control C, the main thread is gone but there are worker threads still running).

Here I unplug the device:

KERN: usb hub 7: KERN: port 2: device removed
KERN: usb xhci 1: cancel queued transfers for pipe 0xf5719c38 (0)
KERN: usb hub 7: port 2: new device connected
KERN: usb xhci 1: KERN: cancel queued transfers for pipe 0x81feeeb4 (0)

And a little later:

KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: TRB 0xb3b5260 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b52a0 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b52e0 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b5320 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b5360 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b53a0 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b5460 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b54a0 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b54e0 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb error xhci 1: KERN: TRB 0xb3b5520 was not found in the endpoint!
KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb hub 7: KERN: port 2: device removed
KERN: usb xhci 1: cancel queued transfers for pipe 0xf974d648 (0)
KERN: usb xhci 1: KERN: cancel queued transfers for pipe 0xf4840c40 (2)
KERN: usb error xhci 1: unsuccessful command 15, error Context state (19)

comment:4 by pulkomandy, 5 years ago

I tried a second time and got less errors:

Connecting the device:

KERN: usb hub 7: port 2: new device connected
KERN: usb xhci 1: cancel queued transfers for pipe 0x81feeeb4 (0)

Running the command:

KERN: usb hub 7: port 2: device removed
KERN: usb xhci 1: cancel queued transfers for pipe 0xf6c46410 (0)
KERN: usb hub 7: port 2: new device connected
KERN: usb xhci 1: cancel queued transfers for pipe 0x81feeeb4 (0)

Unplugging the device:

KERN: usb xhci 1: got an interrupt for a non-Event Data TRB!
KERN: usb hub 7: port 2: device removed
KERN: usb xhci 1: cancel queued transfers for pipe 0xf970cb60 (0)
KERN: usb xhci 1: KERN: cancel queued transfers for pipe 0xf7824910 (2)
KERN: usb error xhci 1: unsuccessful command 15, error Context state (19)

comment:5 by pulkomandy, 5 years ago

Sigrok log for reference:

sr: [00:00.000002] log: libsigrok loglevel set to 5.
sr: [00:00.000033] backend: libsigrok 0.5.1/5:0:1 (rt: 0.5.1/5:0:1).
sr: [00:00.000062] backend: Libs: glib 2.57.1 (rt: 2.57.1/5701:0), libzip 1.2.0, libusb-1.0 1.0.22.11312 API 0x01000106, libftdi 1.4.
sr: [00:00.000070] backend: Host: i586-pc-haiku, little-endian.
sr: [00:00.000083] backend: SCPI backends: TCP, USBTMC.
sr: [00:00.000092] backend: Firmware search paths:
sr: [00:00.000129] backend:  - /boot/home/config/non-packaged/data/sigrok-firmware
sr: [00:00.000138] backend:  - /packages/libsigrok_x86-0.5.1-4/.self/data/sigrok-firmware
sr: [00:00.000159] backend:  - /boot/system/non-packaged/data/sigrok-firmware
sr: [00:00.000196] backend:  - /boot/system/data/sigrok-firmware
sr: [00:00.000255] backend: Sanity-checking all drivers.
sr: [00:00.000266] backend: Sanity-checking all input modules.
sr: [00:00.000281] backend: Sanity-checking all output modules.
sr: [00:00.000297] backend: Sanity-checking all transform modules.
sr: [00:00.001151] ezusb: uploading firmware to device on 1.3
sr: [00:00.001196] ezusb: setting CPU reset mode on...
sr: [00:00.001779] resource: Attempt to open '/boot/home/config/non-packaged/data/sigrok-firmware/fx2lafw-saleae-logic.fw' failed: No such file or directory
sr: [00:00.001831] resource: Opened '/packages/libsigrok_x86-0.5.1-4/.self/data/sigrok-firmware/fx2lafw-saleae-logic.fw'.
sr: [00:00.001863] ezusb: Uploading firmware 'fx2lafw-saleae-logic.fw'.
sr: [00:00.010282] ezusb: Uploaded 4096 bytes.
sr: [00:00.018498] ezusb: Uploaded 4024 bytes.
sr: [00:00.018511] ezusb: Firmware upload done.
sr: [00:00.018521] ezusb: setting CPU reset mode off...
sr: [00:00.019287] hwdriver: Scan found 1 devices (fx2lafw).
sr: [00:00.019341] device: fx2lafw: Opening device instance.
sr: [00:00.019350] fx2lafw: Waiting for device to reset.
sr: [00:00.419390] fx2lafw: Waited 400ms.
sr: [00:00.519414] fx2lafw: Waited 500ms.
sr: [00:00.619440] fx2lafw: Waited 600ms.
sr: [00:00.719464] fx2lafw: Waited 700ms.
sr: [00:00.819486] fx2lafw: Waited 800ms.
sr: [00:00.919510] fx2lafw: Waited 900ms.
sr: [00:01.019539] fx2lafw: Waited 1000ms.
sr: [00:01.119574] fx2lafw: Waited 1100ms.
sr: [00:01.219604] fx2lafw: Waited 1200ms.
sr: [00:01.319628] fx2lafw: Waited 1300ms.
sr: [00:01.419655] fx2lafw: Waited 1400ms.
sr: [00:01.519690] fx2lafw: Waited 1500ms.
sr: [00:01.619715] fx2lafw: Waited 1600ms.
sr: [00:01.719739] fx2lafw: Waited 1700ms.
sr: [00:01.819759] fx2lafw: Waited 1800ms.
sr: [00:01.919782] fx2lafw: Waited 1900ms.
sr: [00:02.019814] fx2lafw: Waited 2000ms.
sr: [00:02.119846] fx2lafw: Waited 2100ms.
sr: [00:02.219880] fx2lafw: Waited 2200ms.
sr: [00:02.319920] fx2lafw: Waited 2300ms.
sr: [00:02.419955] fx2lafw: Waited 2400ms.
sr: [00:02.519991] fx2lafw: Waited 2500ms.
sr: [00:02.620027] fx2lafw: Waited 2600ms.
sr: [00:02.720063] fx2lafw: Waited 2700ms.
sr: [00:02.721193] fx2lafw: Opened device on 1.3 (logical) / usb/1-3 (physical), interface 0, firmware 1.3.
sr: [00:02.721209] fx2lafw: Detected REVID=1, it's a Cypress CY7C68013A (FX2LP).
sr: [00:02.721217] fx2lafw: Device came back after 2700ms.
sr: [00:02.721308] session: Using thread-default main context.
sr: [00:02.721326] session: Starting.
sr: [00:02.721339] hwdriver: fx2lafw: Starting acquisition.
sr: [00:02.721371] fx2lafw: submitting transfer: 0
sr: [00:02.721384] fx2lafw: submitting transfer: 1
sr: [00:02.721394] fx2lafw: submitting transfer: 2
sr: [00:02.721403] fx2lafw: submitting transfer: 3
sr: [00:02.721413] fx2lafw: submitting transfer: 4
sr: [00:02.721422] fx2lafw: submitting transfer: 5
sr: [00:02.721430] fx2lafw: submitting transfer: 6
sr: [00:02.721438] fx2lafw: submitting transfer: 7
sr: [00:02.721446] fx2lafw: submitting transfer: 8
sr: [00:02.721454] fx2lafw: submitting transfer: 9
sr: [00:02.721462] fx2lafw: submitting transfer: 10
sr: [00:02.721472] fx2lafw: submitting transfer: 11
sr: [00:02.721483] fx2lafw: submitting transfer: 12
sr: [00:02.721493] fx2lafw: submitting transfer: 13
sr: [00:02.721511] fx2lafw: submitting transfer: 14
sr: [00:02.721520] fx2lafw: submitting transfer: 15
sr: [00:02.721528] fx2lafw: submitting transfer: 16
sr: [00:02.721537] fx2lafw: submitting transfer: 17
sr: [00:02.721547] fx2lafw: submitting transfer: 18
sr: [00:02.721579] session: bus: Received SR_DF_HEADER packet.
cli: Received SR_DF_HEADER.
sr: [00:02.721677] hwdriver: sr_config_get(): key 30000 (samplerate) sdi 0x4b0b40 cg NULL -> uint64 20000
sr: [00:02.721695] fx2lafw: GPIF delay = 1499, clocksource = 30MHz.
sr: [00:02.821749] fx2lafw: Unable to send start command: LIBUSB_ERROR_TIMEOUT.
sr: [00:02.821773] session: Could not start fx2lafw device usb/1-3 acquisition.
sr: [00:02.821780] hwdriver: fx2lafw: Stopping acquisition.
Failed to start session.

comment:6 by waddlesplash, 5 years ago

https://review.haiku-os.org/c/haiku/+/1424 was the fix for this. I don't know if anyone ever tested it however.

comment:7 by pulkomandy, 5 years ago

I will check if the lower polling rate in the ehci driver helps with this, when I get back home and have access to the hardware again.

comment:8 by pulkomandy, 5 years ago

Still broken. As for the change in 1424, you called it "obviously wrong", so I'm happy to test a less wrong version if you can provide one :)

Version 0, edited 5 years ago by pulkomandy (next)

comment:9 by waddlesplash, 5 years ago

The first version of the change in 1424 was wrong, however, I think the current version is completely different and actually correct?

comment:10 by waddlesplash, 3 years ago

Component: Drivers/USB/EHCIDrivers/USB
Summary: libusb syscall stalls when EHCI errorslibusb syscall stalls when USB errors

This still happens, and under XHCI too. Example backtrace (trimmed):

stack trace for thread 20814 "cameralooper"
    kernel stack: 0xffffffff81f7b000 to 0xffffffff81f80000
      user stack: 0x00007fd327bf6000 to 0x00007fd327c36000
frame                       caller             <image>:function + offset
 0 ffffffff81f7fcc0 (+ 128) ffffffff800997b3   <kernel_x86_64> reschedule(int) + 0x433
 1 ffffffff81f7fd40 (+  48) ffffffff80089626   <kernel_x86_64> thread_block + 0xc6
 2 ffffffff81f7fd70 (+ 160) ffffffff80071195   <kernel_x86_64> switch_sem_etc + 0x3e5
 3 ffffffff81f7fe10 (+ 176) ffffffff8166436b   </boot/system/add-ons/kernel/drivers/dev/bus/usb_raw> usb_raw_ioctl(void*, unsigned int, void*, unsigned long) + 0x33b
 4 ffffffff81f7fec0 (+  64) ffffffff800e79da   <kernel_x86_64> fd_ioctl(bool, int, unsigned int, void*, unsigned long) + 0x5a
 5 ffffffff81f7ff00 (+  32) ffffffff800e86aa   <kernel_x86_64> _user_ioctl + 0x3a
 6 ffffffff81f7ff20 (+  16) ffffffff8014622f   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff81f7ff30 (end = 0xffffffff81f7fff8)
 rax 0x92                  rbx 0x1109fcd14420        rcx 0x8d6eeb410c
 rdx 0x7fd327c35490        rsi 0x4002                rdi 0xe
 rbp 0x7fd327c35480         r8 0x7fd327c35770         r9 0x1109fcc88340
 r10 0x30                  r11 0x202                 r12 0xc
 r13 0x7fd327c354f0        r14 0x1109fcc88340        r15 0x1109fcc883c8
 rip 0x8d6eeb410c          rsp 0x7fd327c35468     rflags 0x202
 vector: 0x63, error code: 0x0
 7 ffffffff81f7ff30 (+140546996393296) 0000008d6eeb410c   <libroot.so> _kern_ioctl + 0x0c
 8 00007fd327c35480 (+  96) 0000013f725c7eef   <libdevice.so> BUSBEndpoint::BulkTransfer const(void*, unsigned long) + 0x4f

There are various instances of this in the syslog:

usb xhci 0: transfer error on slot 2 endpoint 1: Stall
usb error control pipe 15: timeout waiting for queued request to complete
usb xhci 0: cancel queued transfers (1) for pipe 0xffffffffe5874e00 (0)
usb error xhci 0: cancel queued transfers: halted endpoint, reset!

However, the device was not actually disconnected, so we should not be stalled here.

comment:11 by waddlesplash, 3 years ago

I improved things a bit in hrev55806; at least applications are now killable without removing the USB devices entirely.

Note: See TracTickets for help on using tickets.