Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#16049 closed bug (fixed)

NVMe KDL, page fault in nvme_disk_io->_mutex_unlock

Reported by: KapiX
Owned by: waddlesplash
Priority: normal
Milestone: R1/beta2
Component: Drivers/Disk/NVMe
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description

I can browse the drive, but as soon as I run git status, it KDLs.

Drive: Intel 760p 512GB.

hrev54185 x86_64

Attachments (1)

20200517_062733.jpg (850.1 KB) - added by KapiX 4 years ago.


Change History (10)

by KapiX, 4 years ago

Attachment: 20200517_062733.jpg added

comment:1 by waddlesplash, 4 years ago

<KapiX> nothing in bootlog besides two timeouts waiting for interrupt
<KapiX> and as for first fail
<KapiX> KERN: nvme_disk: attempt to queue read I/O at LBA 273955312 of 3400 blocks failed!
<KapiX> is the best I can get you

comment:2 by waddlesplash, 4 years ago

Interrupts not occurring is also reported in #15874, though in that ticket the boot just locks up altogether unless the driver is blacklisted. Missing interrupts are pretty bad; that deserves investigation.

3400 blocks may exceed the maximum I/O size, but the driver should already be able to break such a request up, so it seems strange that it failed to queue that I/O. Enabling libnvme tracing may help pin down the exact problem.
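To illustrate what "breaking it up" means here, below is a minimal sketch of splitting one large request into pieces that each fit under a per-command limit. It is not the nvme_disk driver's code: the names split_io, submit_fn, log_submit and io_record, and the limit EXAMPLE_MAX_BLOCKS (256 blocks), are all hypothetical and chosen only for illustration. A real driver would take the limit from what the controller reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-command limit in blocks. This value is an assumption,
 * not something taken from the driver or from the drive. */
#define EXAMPLE_MAX_BLOCKS 256u

/* Hypothetical submit hook: returns 0 on success, nonzero on failure. */
typedef int (*submit_fn)(uint64_t lba, uint32_t blocks, void *cookie);

/* Split [lba, lba + blocks) into pieces of at most max_blocks each and
 * hand every piece to submit(). Returns the number of pieces submitted,
 * or -1 as soon as one submit fails. */
static int
split_io(uint64_t lba, uint32_t blocks, uint32_t max_blocks,
	submit_fn submit, void *cookie)
{
	assert(submit != NULL && max_blocks > 0);

	int pieces = 0;
	while (blocks > 0) {
		uint32_t chunk = blocks < max_blocks ? blocks : max_blocks;
		if (submit(lba, chunk, cookie) != 0)
			return -1;
		lba += chunk;
		blocks -= chunk;
		pieces++;
	}
	return pieces;
}

/* Example hook that only records what it was asked to submit. */
struct io_record {
	uint64_t next_lba;      /* LBA the next piece is expected to start at */
	uint32_t total_blocks;  /* blocks seen so far */
	uint32_t largest;       /* largest single piece seen so far */
};

static int
log_submit(uint64_t lba, uint32_t blocks, void *cookie)
{
	struct io_record *rec = (struct io_record *)cookie;
	if (lba != rec->next_lba)
		return 1;  /* pieces must be contiguous and in order */
	rec->next_lba = lba + blocks;
	rec->total_blocks += blocks;
	if (blocks > rec->largest)
		rec->largest = blocks;
	return 0;
}
```

With the assumed 256-block limit, the 3400-block read from the log above would be submitted as 14 pieces: 13 full pieces of 256 blocks and one final piece of 72 blocks.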

It does look like segmented I/O may invoke the callback more than once, though, which is worth investigating as a potential cause of the KDL.

comment:3 by waddlesplash, 4 years ago

Yes, it appears segmented I/O can result in the callback being invoked more than once, and there is no way to detect the last one. I've filed https://github.com/hgst/libnvme/issues/7 about this. It appears Intel controllers have "stripes" that other controllers do not, so it may be necessary to work around this.

comment:4 by waddlesplash, 4 years ago

Actually, never mind, I read through the code more and that should not be an issue here. The code in libnvme looks "bad" but the later call to nvme_request_add_child should take care of this.
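For readers unfamiliar with the parent/child pattern referred to here, the following is a minimal single-threaded sketch of how child accounting can guarantee that the caller's callback runs exactly once. It is not libnvme's implementation and does not reproduce the signature of nvme_request_add_child; parent_req, parent_init, parent_add_child, child_complete and count_done are hypothetical names. It also assumes every child is registered before the first one completes, and a real driver would need atomic or locked updates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical completion callback type. */
typedef void (*done_fn)(void *cookie, int status);

/* Hypothetical parent request that owns the caller's callback. */
struct parent_req {
	unsigned outstanding;  /* children that have not completed yet */
	int status;            /* first nonzero child status, else 0 */
	done_fn done;          /* caller's callback, run once at the end */
	void *cookie;
};

static void
parent_init(struct parent_req *p, done_fn done, void *cookie)
{
	p->outstanding = 0;
	p->status = 0;
	p->done = done;
	p->cookie = cookie;
}

/* Register one child. Assumption: all children are added before any
 * of them completes, otherwise the callback could fire too early. */
static void
parent_add_child(struct parent_req *p)
{
	p->outstanding++;
}

/* Called once per finished child. Only the last completion reaches
 * the caller's callback, carrying the first error seen (if any). */
static void
child_complete(struct parent_req *p, int status)
{
	assert(p->outstanding > 0);
	if (status != 0 && p->status == 0)
		p->status = status;
	if (--p->outstanding == 0 && p->done != NULL)
		p->done(p->cookie, p->status);
}

/* Example callback that counts how often it was invoked. */
struct done_count {
	int calls;
	int last_status;
};

static void
count_done(void *cookie, int status)
{
	struct done_count *seen = (struct done_count *)cookie;
	seen->calls++;
	seen->last_status = status;
}
```

In this sketch a request split into three children triggers count_done only when the third child completes, so the caller never has to work out which invocation is the last one.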

comment:5 by waddlesplash, 4 years ago

See if hrev54226 helps at all.

comment:6 by waddlesplash, 4 years ago

(If it doesn't, please run "syslog" at KDL prompt and get me a picture of the last page.)

comment:7 by waddlesplash, 4 years ago

Please retest after hrev54230.

comment:8 by KapiX, 4 years ago

Resolution: fixed
Status: new → closed

Updating from hrev54185 was very slow and riddled with I/O errors.

Since hrev54230 there are no I/O errors in the syslog and speed is back to normal, so I'm assuming this is fixed.

If I see this KDL again I will reopen the ticket.

comment:9 by nielx, 4 years ago

Milestone: Unscheduled → R1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone

(final time)
