Opened 3 months ago

Closed 5 weeks ago

#19122 closed bug (fixed)

PANIC: rw_lock_destroy(): read-locked and caller doesn't hold the write lock

Reported by: bbjimmy
Owned by: axeld
Priority: normal
Milestone: R1/beta6
Component: File Systems/BFS
Version: R1/beta5
Keywords:
Cc:
Blocked By:
Blocking: #8405, #19207
Platform: All

Description

hrev58163

browsing https://arstechnica.com/ in WebPositive

Attachments (1)

IMG_20240923_113442545.jpg (4.4 MB) - added by bbjimmy 3 months ago.
screenshot of KDL

Change History (14)

by bbjimmy, 3 months ago

Attachment: IMG_20240923_113442545.jpg added

screenshot of KDL

comment:1 by waddlesplash, 3 months ago

Blocking: 8405 added
Component: System/Kernel → File Systems/BFS
Owner: changed from nobody to axeld
Priority: high → normal

This is probably the real cause of #8405.

comment:2 by waddlesplash, 3 months ago

Summary: PANICK: rw_lock_destroy() → PANIC: rw_lock_destroy(): read-locked and caller doesn't hold the write lock

comment:3 by waddlesplash, 3 months ago

Actually, I think this is a VFS bug and not a BFS bug. The VFS should keep references to vnodes that are currently performing asynchronous I/O.
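
A rough sketch of that idea, assuming the kernel's private vfs_acquire_vnode()/vfs_put_vnode() calls are the right way to pin a vnode; the helper names here are hypothetical, and whether the VFS should do this at all is exactly what's being discussed:

{{{
#!cpp
// Sketch only: pin the vnode for the lifetime of an asynchronous
// request so it cannot be deleted while the file system is still
// working on it. start_async_io() and queue_request_to_fs() are
// hypothetical names; the acquire/put pairing is the point.
static status_t
start_async_io(struct vnode* vnode, io_request* request)
{
	vfs_acquire_vnode(vnode);
		// extra reference held until the request completes

	status_t status = queue_request_to_fs(vnode, request);
	if (status != B_OK)
		vfs_put_vnode(vnode);
			// on failure, drop the reference immediately; on
			// success, the completion path would drop it after
			// the file system's callback has run
	return status;
}
}}}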

comment:4 by waddlesplash, 3 months ago

Milestone: Unscheduled → R1/beta6

comment:5 by bbjimmy, 2 months ago

It only seems to occur while using WebPositive. It is easy to reproduce: open Web+, then browse to:

https://ArsTechnica.com

It takes about 5 minutes.

hrev58228

Last edited 2 months ago by bbjimmy (previous) (diff)

comment:6 by waddlesplash, 2 months ago

Please retest after hrev58258; this should hopefully fix the problem.

comment:7 by waddlesplash, 8 weeks ago

Blocking: 19207 added

comment:8 by bbjimmy, 6 weeks ago

hrev58309. The problem persists.

comment:9 by waddlesplash, 5 weeks ago

I still can't manage to reproduce this. It's probably because my VMs run on top of an SSD and so are too fast to hit the bug.

When the bug occurs, some other thread will be reading from or writing to the file. It would be good to check the other threads that are waiting on I/O in KDL; one of them will be the thread that holds this lock, and that's the one that should also hold a reference but doesn't.

If someone manages to reproduce and can ping me on IRC, I can walk through a KDL session.

comment:10 by waddlesplash, 5 weeks ago

OK, with some hacks (a snooze in the I/O callback) I managed to reproduce this myself. The I/O callback in question is PageWriteTransfer, which is odd because it's supposed to have references to all the vnodes it's doing I/O from.
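
For reference, the hack was along these lines; snooze() is the real kernel delay call, but the callback signature is simplified and where exactly the delay was inserted is a guess here:

{{{
#!cpp
// Reproduction hack (illustrative): delay the I/O finished callback so
// the synchronous waiter has time to wake up, drop its vnode reference,
// and delete the vnode before the file system releases its locks.
static void
io_finished_hook(void* cookie, io_request* request, status_t status)
{
	snooze(100000);
		// 100 ms: widens the race window enough to hit it reliably

	// ... original completion work (e.g. BFS releasing the inode's
	// read lock) follows here ...
}
}}}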

comment:11 by waddlesplash, 5 weeks ago

Ah, no actually it isn't, that was a different request. The real faulting request has no finished callback set.

comment:12 by waddlesplash, 5 weeks ago

Alright, I think I see the problem here.

The issue is that vfs_read/write_pages are synchronous and just wait for the I/O requests to complete before returning. bfs_io, meanwhile, doesn't keep references to the inodes; it assumes its caller will do that. But as it happens, vfs_read is notified of completion before bfs_io is, because IORequest notifies the finished condition before invoking the callbacks, so there's a small window in which the vnode can be un-referenced and deleted before BFS releases the read lock.

The solution here is probably just to move the finished condition notification after the callbacks are invoked.
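
In pseudo-form, the reordering looks like this; the method and member names are illustrative, not necessarily the actual IORequest identifiers:

{{{
#!cpp
// Illustrative sketch of the completion-ordering fix, not the actual
// Haiku IORequest code.
void
IORequest::NotifyFinished()
{
	// Before the fix: waiters (such as a synchronous vfs_read_pages())
	// were woken here, before the callback ran, so they could drop the
	// last vnode reference while bfs_io still held the inode's read
	// lock.
	//fFinishedCondition.NotifyAll();

	// After the fix: run the file system's completion callback first,
	// while the vnode is still guaranteed to be alive...
	if (fFinishedCallback != NULL)
		fFinishedCallback(fFinishedCookie, this, fStatus);

	// ...and only then wake threads blocked waiting on the request.
	fFinishedCondition.NotifyAll();
}
}}}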

comment:13 by waddlesplash, 5 weeks ago

Resolution: fixed
Status: new → closed

Fixed in hrev58351.
