Opened 5 years ago

Closed 17 months ago

#15569 closed bug (fixed)

usb_disk stalls & unmounts when large I/O transfers are not queued

Reported by: X512 Owned by: nobody
Priority: normal Milestone: R1/beta5
Component: Drivers/Disk/USB Version: R1/Development
Keywords: Cc: pulkomandy, mmlr
Blocked By: Blocking: #17044
Platform: All

Description (last modified by X512)

This is hrev53639 x86_64.

System has 8GB of RAM and page file is disabled.

When attempting to run Blender directly after boot, executable loading stops at "loading executable image", and trying to open new windows or context menus causes a permanent freeze of the window thread. The system becomes almost uncontrollable, and the normal shutdown process does not work.

strace:

~/Blender> strace -l blender -p 10 10 800 600
[  1393] create_area("rld heap", 0x7f364467c978, 0x6, 0x10000, 0x0, 0x3) = 0x39de (3 us)
[  1393] create_area("_rld_debug_", 0x7f364467c9a8, 0x6, 0x1000, 0x0, 0x103) = 0x39df (5 us)
[  1393] open(0xffffffff, "./blender", 0x0, 0x0) = 0x3 (4 us)
[  1393] normalize_path("/boot/home/Blender/blender", true, 0x7f364467b510) = 0x0 No error (9 us)
[  1393] read(0x3, 0x0, 0x7f364467b4d0, 0x40) = 0x40 (3 us)
[  1393] read(0x3, 0x40, 0x7f364467b910, 0xe0) = 0xe0 (1 us)
[  1393] reserve_address_range([0x293a648000], 0x7, 0x402c000) = 0x0 No error (1 us)
[  1393] map_file("blender_seg0ro", 0x7f364467b3e8, 0x1, 0x2f5e000, 0x1003, 0x1, true, 0x0, 0x7f8276c008) = 0x39e0 (125 us)

But after cat blender > /dev/null is executed, blender runs normally.

Attachments (1)

CIMG4588.JPG (4.7 MB) - added by X512 5 years ago.
Stack trace of blender thread after freeze.

Change History (30)

comment:1 by X512, 5 years ago

Description: modified (diff)

by X512, 5 years ago

Attachment: CIMG4588.JPG added

Stack trace of blender thread after freeze.

comment:2 by diver, 5 years ago

Did you run Blender from a USB 3 drive?

comment:3 by X512, 5 years ago

Yes. The system booted from a USB 3 flash drive, and Blender is located on that drive.

comment:4 by waddlesplash, 5 years ago

It's likely trying to read the entire file at once, which requires more physical memory than the USB stack's PhysicalMemoryAllocator can handle. The Wait() there has a 2-second timeout, but it appears an infinite number of retries is allowed as long as waiters keep returning, and they likely do, hence why this locks up.

I guess the USB disk driver just holds some lock and does not release it while waiting for the transfer to come back, hence why the whole system locks up...
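A minimal sketch of the retry pattern being described, with hypothetical helper names (the actual PhysicalMemoryAllocator code is structured differently):

#include <OS.h>

// Hypothetical stand-ins for the allocator's internals.
status_t TryAllocate(size_t size, void** _address);
status_t WaitForFreedMemory(bigtime_t timeout);

// Bounding the retries means the allocation eventually fails with
// B_NO_MEMORY instead of spinning forever while callers hold locks.
status_t
AllocateWithBoundedRetries(size_t size, void** _address)
{
    const int kMaxRetries = 3;
    for (int i = 0; i < kMaxRetries; i++) {
        if (TryAllocate(size, _address) == B_OK)
            return B_OK;
        // Wait up to 2 seconds for other consumers to release memory.
        if (WaitForFreedMemory(2000000) != B_OK)
            break;
    }
    return B_NO_MEMORY;
}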

comment:5 by waddlesplash, 5 years ago

Cc: pulkomandy mmlr added

The deadlock inside the PhysicalMemoryAllocator is fixed in hrev53651; it will now fail to allocate memory in those situations instead of hanging.

However, this now leads to another problem: after a transfer fails to be queued, the USB disk device stalls (this is in QEMU, for whatever it's worth) and must be reset, leading to the drive being unmounted; of course, attempting to run any programs, read/write, etc. then fails:

usb error xhci 0: unable to allocate space for a buffer (size 4096, count 2304)
usb_disk: failed to queue data transfer: Out of memory
usb_disk: read failed: Out of memory
usb xhci 0: transfer error on slot 1 endpoint 4: Stall
usb_disk: sending the command block wrapper failed: Device stalled
usb xhci 0: cancel queued transfers (1) for pipe 0xffffffff82249a08 (2)
usb_disk: acquire_sem failed while waiting for data transfer: Operation timed out
usb_disk: write failed: Operation timed out

Obviously, the usb_disk driver needs to break up long read/write requests (probably anything > 1MB should be segmented). However, a NO_MEMORY error on the part of the PMA should not cause a device stall. I'm kind of lost as to how this is occurring; from what I can tell, nothing had been submitted to the device at the time the NO_MEMORY error occurred?

CC'ing PulkoMandy and mmlr, who actually know about usb_disk operation.

comment:6 by X512, 5 years ago

This bug is getting annoying. Now it unmounts the boot volume when a large binary is executed. I tried to call llvm-objdump to inspect generated binaries of the ARM haiku-loader.efi, and this bug appeared.

I am wondering why nobody else is experiencing this bug.

comment:7 by waddlesplash, 5 years ago

I do not believe any of the developers run a primary install off of a USB disk (we all have installs on internal HDDs under SATA, etc.), only the usb_disk driver seems unable to read > ~12MB of data at once, and only mmap() seems to trigger such reads under normal circumstances (reads of smaller files go through the disk cache and are segmented.)

comment:8 by waddlesplash, 5 years ago

Component: System/Kernel → Drivers/Disk/USB
Summary: System freeze when attempting to execute large binary not loaded to disk cache → usb_disk stalls & unmounts when large I/O transfers are not queued

As I noted in the comment above, there are 3 problems occurring here:

1) The USB stack cannot run transfers > ~12MB in size, as it fails to allocate physical memory. (Not really a "bug": there is an upper limit on the USB stack's physical buffer pool, and IIRC this is actually ~12-16MB of individual pages, i.e. if the driver requested memory in larger chunks it might work.)

2) usb_disk stalls when the USB stack fails to queue a transfer (that is too large). I'm not really sure why this is.

3) usb_disk does not segment I/O.

Realistically, we should solve both (2) and (3).
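For (3), a minimal sketch of manual segmentation in the driver, assuming a hypothetical 1MB cap and a hypothetical single-transfer helper (not the actual usb_disk code):

#include <OS.h>

// Hypothetical cap; a real implementation would query or derive the limit.
static const size_t kMaxChunkSize = 1 * 1024 * 1024;

// Stand-in for the driver's existing single-transfer path.
status_t usb_disk_transfer_chunk(off_t position, void* buffer, size_t length);

status_t
usb_disk_segmented_transfer(off_t position, void* buffer, size_t* _length)
{
    size_t remaining = *_length;
    uint8* out = (uint8*)buffer;
    while (remaining > 0) {
        size_t chunk = min_c(remaining, kMaxChunkSize);
        status_t status = usb_disk_transfer_chunk(position, out, chunk);
        if (status != B_OK) {
            *_length -= remaining; // report what actually transferred
            return status;
        }
        position += chunk;
        out += chunk;
        remaining -= chunk;
    }
    return B_OK;
}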

in reply to:  7 comment:9 by X512, 5 years ago

Replying to waddlesplash:

I do not believe any of the developers run a primary install off of a USB disk

I use a USB 3 disk for my x86_64 Haiku installation. x86 Haiku and most of my data are stored on a SATA HDD.

comment:10 by X512, 5 years ago

I can try to fix (3), but I am not familiar with the USB stack, so I can't help with (1) and (2).

comment:11 by pulkomandy, 5 years ago

As to why the usb_disk device stalls:

https://git.haiku-os.org/haiku/tree/src/add-ons/kernel/drivers/disk/usb/usb_disk/usb_disk.cpp#n695

Here you can see that USB mass storage has a 2-stage process: first it sends a command, and then it reads or writes the data. What happens here is that the first step succeeds, but the second one fails. This leaves the device waiting for the data transfer forever.
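For context, a condensed sketch of that two-stage (strictly, three-stage) Bulk-Only Transport flow, with hypothetical helpers in place of the driver's real functions:

#include <OS.h>

// Hypothetical bulk pipe helpers.
status_t bulk_out(const void* data, size_t length);
status_t bulk_in(void* data, size_t length);

status_t
do_scsi_command(const void* commandWrapper, size_t commandLength,
    void* dataBuffer, size_t dataLength, bool directionIn)
{
    // Stage 1: send the Command Block Wrapper (CBW) with the SCSI command.
    status_t status = bulk_out(commandWrapper, commandLength);
    if (status != B_OK)
        return status;

    // Stage 2: the data stage. This is the step that failed to queue in
    // this ticket; after a successful CBW the device expects data, so
    // failing here leaves it waiting forever unless the driver recovers.
    status = directionIn
        ? bulk_in(dataBuffer, dataLength)
        : bulk_out(dataBuffer, dataLength);
    if (status != B_OK)
        return status; // stall recovery / reset would have to happen here

    // Stage 3: read the Command Status Wrapper (CSW) with the result.
    uint8 statusWrapper[13];
    return bulk_in(statusWrapper, sizeof(statusWrapper));
}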

We can do better error handling to recover from the stall, I guess, rather than unmounting the disk. However, it's hard to know the state of the device then. Did it "gracefully" fail just that transfer? Or did it crash and also lose all the data that was still in its cache?

I don't see why it should be usb_disk's responsibility to split the transfer into smaller chunks. It can't know the limitations of the underlying levels, and should let those decide how to handle things. At the hardware level there does not appear to be any constraint, so if we know we will be reading a very large chunk, I don't see why we shouldn't do that in a single go.

There can be two reasons to limit the size, and both are out of control of usb_disk:

  • Limitations of the underlying USB stack: if we can't remove them, there should be a way to ask for the maximum allowed size; I would not hardcode it in usb_disk (see the sketch after this list).
  • Reducing latency (as large transfers will keep the device busy for some time): this should be done in the io_scheduler; the usb_disk driver is not the right place to make such decisions.
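The first option could look something like this. Note that get_max_transfer_size() is entirely hypothetical; no such query exists in the USB stack at this point (gUSBModule, usb_device, and usb_pipe are the driver's existing handles and types):

// Hypothetical query API sketch; nothing like this exists today.
static size_t
query_max_transfer_size(usb_device device, usb_pipe pipe)
{
    size_t maxTransferSize = 0;
    if (gUSBModule->get_max_transfer_size(device, pipe,
            &maxTransferSize) != B_OK)
        maxTransferSize = 1 * 1024 * 1024; // conservative fallback
    return maxTransferSize;
}
// usb_disk would then split its requests to at most this many bytes.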

comment:12 by waddlesplash, 5 years ago

At the hardware level there does not appear to be any constraint, so, if we know we will be reading a very large chunk, I don't see why we shouldn't do that in a single go.

Well, for XHCI at least, "event data" notifications have a 24-bit size value (16MB max when unsigned), and after that it wraps. We could restructure the XHCI driver to insert more than one "event data" item as necessary, but this would require doing pretty different accounting vs. what we are doing now.

What happens here is that the first step succeeds, but the second one fails. This leaves the device waiting for the data transfer forever.

Is there any chance to send more than one transfer to handle one SCSI command? Or is that not allowed by the specification?

I don't see why it should be usb_disk responsibility to split the transfer in smaller chunks. It can't know the limitations of the underlying levels, and should let those decide how to handle things.

Well, I don't know what SATA does, but NVMe at least has a "max transfer size" and the driver internally breaks up transfers larger than that.

Also, why should it be the I/O scheduler's job to take care of this? If the I/O scheduler wants to transfer 16MB of data, the driver needs to decide the most efficient way to do that, not the I/O scheduler. All drivers currently handle this in the way that is most efficient for them (I hope), so usb_disk should do the same here.

comment:13 by waddlesplash, 5 years ago

Also, it's probably possible to exhaust the physical memory pool simply by having a lot of USB drives (a dozen or so, probably) and doing "regular" I/O on them all at once. A small transfer could then fail due to being out of memory, so the driver needs to have some plan for this regardless.

comment:14 by waddlesplash, 5 years ago

It appears FreeBSD's storage layer has a "maxio" field, which the USB storage driver (umass) sets to MAXPHYS, which is hard-coded to (128 * 1024). It appears their "CAM" layer takes care of the segmentation.

comment:15 by pulkomandy, 5 years ago

Yes, what FreeBSD does makes more sense, I think. The "cam" layer in FreeBSD is basically an abstracted storage device system which is more or less based on what SCSI provides in terms of commands. Underlying drivers implement the actual communication with devices. We have a somewhat similar scsi kernel module but currently usb_disk does not use it.

It is of course not possible (nor desirable) to transparently split transfers at the xhci level beyond its maximum size. But from usb_disk's point of view, the most efficient thing to do is "as large as possible" (if we consider bandwidth the thing we want to optimize). So, ideally, there should be a way to know the limit of the underlying usb module, and usb_disk would split transfers to that size.

As for the splitting, the simplest thing to do is probably to cap the block count in usb_disk_needs_partial_buffer. The read() call of usb_disk will then do a partial read (it adjusts its returned length), and the upper layer should then call it again as needed (assuming it handles partial reads properly).
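A sketch of that capping approach, assuming a hypothetical block limit and simplified signatures (the real usb_disk functions differ):

#include <OS.h>

// Hypothetical cap: 2048 blocks of 512 bytes = 1MB per transfer.
static const uint32 kMaxBlocksPerTransfer = 2048;
static const size_t kBlockSize = 512;

// Stand-in for the existing uncapped transfer path.
status_t usb_disk_transfer(off_t position, void* buffer, size_t length);

status_t
usb_disk_read(off_t position, void* buffer, size_t* _length)
{
    size_t length = *_length;
    if (length > kMaxBlocksPerTransfer * kBlockSize)
        length = kMaxBlocksPerTransfer * kBlockSize; // partial read

    status_t status = usb_disk_transfer(position, buffer, length);
    if (status != B_OK)
        return status;

    // The caller sees transferred < requested and must retry for the rest.
    *_length = length;
    return B_OK;
}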

comment:16 by waddlesplash, 5 years ago

It appears our SCSI layer exports B_DMA_MAX_TRANSFER_BLOCKS: https://xref.landonf.org/source/xref/haiku/src/add-ons/kernel/bus_managers/scsi/devices.cpp#133

Which is handled by DMAResource: https://xref.landonf.org/source/xref/haiku/src/system/kernel/device_manager/dma_resources.cpp#134

usb_disk, however, does not use DMAResource, and even if it used the SCSI stack it probably would bypass DMAResource anyway.
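For reference, expressing such a cap through DMAResource looks roughly like this; the restriction field comes from dma_resources.h, but treat the exact Init() signature and the chosen cap as assumptions:

#include <new>
// Kernel-internal headers assumed: dma_resources.h (paths illustrative).

static status_t
create_dma_resource(uint32 blockSize, uint32 bufferCount,
    DMAResource** _resource)
{
    dma_restrictions restrictions = {};
    restrictions.max_transfer_size = 16 * B_PAGE_SIZE; // hypothetical cap

    DMAResource* resource = new(std::nothrow) DMAResource;
    if (resource == NULL)
        return B_NO_MEMORY;
    status_t status = resource->Init(restrictions, blockSize, bufferCount, 1);
    if (status != B_OK) {
        delete resource;
        return status;
    }
    // I/O requests larger than max_transfer_size are then split by
    // DMAResource before they reach the driver.
    *_resource = resource;
    return B_OK;
}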

comment:17 by waddlesplash, 5 years ago

Yes, what FreeBSD does makes more sense, I think.

Except FreeBSD has a global hard-coded limit of 128kB, and it seems that no matter what the underlying device reports, CAM caps I/O at this, so I guess it must do segmentation somewhere if reads are requested > 128kB.

and the upper layer should then call it again as needed (assuming it handles partial reads properly).

Most upper layers interpret transferred < requested as "the end of the drive was reached", which is not a good idea.

Since usb_disk is the only driver that does not use DMAResource (SCSI, etc.) or segment transfers manually (NVMe), it probably should be changed to segment transfers manually.
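To illustrate the distinction: an upper layer that handled partial reads properly would loop like this (hypothetical helper, matching the partial-read sketch above); since most upper layers do not do this, segmenting inside usb_disk is the safer route:

#include <OS.h>

// Hypothetical partial-read primitive; on return, *_length holds the
// number of bytes actually transferred.
status_t usb_disk_read(off_t position, void* buffer, size_t* _length);

status_t
read_fully(off_t position, void* buffer, size_t length)
{
    uint8* out = (uint8*)buffer;
    while (length > 0) {
        size_t transferred = length;
        status_t status = usb_disk_read(position, out, &transferred);
        if (status != B_OK)
            return status;
        if (transferred == 0)
            return B_ERROR; // genuinely at the end of the device
        position += transferred;
        out += transferred;
        length -= transferred;
    }
    return B_OK;
}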

comment:18 by X512, 5 years ago

Can this be fixed before the Beta 2 release? This is a critical issue that blocks proper Haiku usage from a USB disk. I think hardcoded I/O request size splitting is fine for a start; it can be improved later.

comment:19 by waddlesplash, 5 years ago

Well, this has been a problem for over a decade, and it's only now beginning to be noticed (probably because USB 2 drives were so much slower than SATA that it did not make sense to run an install off them all the time.)

I think problem (2) above, namely that usb_disk can't recover from a self-incurred stall, should be solved first; that won't help with running large binaries, but at least it will fix the most annoying part of this bug. Since the problem reproduces in QEMU with emulated USB drives, experimenting there may help. The driver already has some stall recovery mechanisms in it, so figuring out why those are insufficient is the place to start.

I don't think any of the developers have, or plan to allocate, time to work on this between now and the beta though.

comment:20 by waddlesplash, 4 years ago

Can you retest this after hrev54101? There was a case where file maps were reading too much. It of course does not fix the underlying usb_disk problems, but that may have been the culprit in this specific case.

comment:21 by X512, 4 years ago

Running blender (not packaged) without the cat > /dev/null workaround still causes the boot volume to unmount in hrev54241.

comment:22 by X512, 4 years ago

I also get this when calling git status on the USB disk.

comment:23 by pulkomandy, 3 years ago

Here is how it is done in MMC devices:

I see the USB disk driver currently does not use the IO scheduler. It re-implements a large part of the same functionality (bounce buffers, waiting for commands to execute) on its own. I think switching to the IO scheduler would solve this problem and also make the code simpler.

As I have said above, this allows us to:

  • Have the constraints defined in the right place (the xhci driver, which is the one imposing the limit)
  • Have the requests split by the IO scheduler, inside the usb_disk driver, before they are submitted to the lower layers (see the sketch below)

This way the usb_disk driver will automatically adjust to the constraints of the underlying device. And if we change the limitations in xhci, or replace it with something else someday, usb_disk automatically benefits from the changes.
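A rough sketch of that wiring, loosely modeled on the scheduler interfaces in the kernel (IOSchedulerSimple, IOOperation); the exact signatures, the disk_device struct, and perform_transfer() are assumptions:

#include <new>
// Kernel-internal headers assumed: IOSchedulerSimple.h, IORequest.h.

// Hypothetical device state and transfer helper.
struct disk_device;
status_t perform_transfer(disk_device* device, IOOperation* operation);

static status_t
usb_disk_do_io(void* cookie, IOOperation* operation)
{
    // Each IOOperation arrives already split to fit the DMA restrictions
    // declared for the device, so no manual segmentation is needed here.
    return perform_transfer((disk_device*)cookie, operation);
}

static status_t
init_scheduler(disk_device* device, DMAResource* resource,
    IOScheduler** _scheduler)
{
    IOScheduler* scheduler = new(std::nothrow) IOSchedulerSimple(resource);
    if (scheduler == NULL)
        return B_NO_MEMORY;
    status_t status = scheduler->Init("usb_disk scheduler");
    if (status != B_OK) {
        delete scheduler;
        return status;
    }
    scheduler->SetCallback(usb_disk_do_io, device);
    *_scheduler = scheduler;
    return B_OK;
}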

comment:24 by waddlesplash, 3 years ago

I think the IOScheduler requires that physical maps be used? At present, the USB stack requires virtual memory buffers to be passed to it. Changing that may be a good idea but would require breaking the USB driver ABI (probably not that big of an issue?)

comment:25 by waddlesplash, 3 years ago

hrev55441 makes this happen less often, but it still may occur with especially large (hundreds of MB, etc.) transfers. The long-term solution of using the IO scheduler is still a necessity.

comment:26 by waddlesplash, 3 years ago

Blocking: 17044 added

comment:27 by waddlesplash, 18 months ago

hrev57037 makes the usb_disk driver recover from stalls properly, but it's still possible to cause problems with really large reads.

comment:29 by waddlesplash, 17 months ago

Milestone: Unscheduled → R1/beta5
Resolution: fixed
Status: new → closed

Fixed in hrev57048.
