Opened 20 months ago
Last modified 3 weeks ago
#18390 new bug
[BFS] Reader-writer deadlock between I/O and file-cache resize
Reported by:  waddlesplash          Owned by:   axeld
Priority:     normal                Milestone:  Unscheduled
Component:    File Systems/BFS      Version:    R1/beta4
Keywords:                           Cc:
Blocked By:                         Blocking:
Platform:     All
Description
I've seen this one before, intermittently; it seems to mostly occur on boot, but it was only today that I fully traced the problem's cause through KDL.
The basic symptom is that the whole system stalls on I/O due to the BFS journal lock being held. For example, Deskbar is hung waiting on _kern_create_dir:
stack trace for thread 1851 "menu_tracking"
    kernel stack: 0xffffffff92f77000 to 0xffffffff92f7c000
      user stack: 0x00007ffc443b8000 to 0x00007ffc443f8000
frame                caller             <image>:function + offset
 0 ffffffff92f7bba0 (+ 112) ffffffff8009ba44 <kernel_x86_64> reschedule(int) + 0x424
 1 ffffffff92f7bc10 (+  48) ffffffff8008b376 <kernel_x86_64> thread_block + 0xc6
 2 ffffffff92f7bc40 (+  80) ffffffff80097baa <kernel_x86_64> _mutex_lock + 0x21a
 3 ffffffff92f7bc90 (+  32) ffffffff80097c4e <kernel_x86_64> recursive_lock_lock + 0x3e
 4 ffffffff92f7bcb0 (+  64) ffffffff81b06926 <bfs> Journal::Lock[clone .localalias] (Transaction*, bool) + 0x26
 5 ffffffff92f7bcf0 (+  32) ffffffff81b079b5 <bfs> Transaction::Start(Volume*, long) + 0x35
 6 ffffffff92f7bd10 (+ 128) ffffffff81b0f03d <bfs> bfs_create_dir(fs_volume*, fs_vnode*, char const*, int) + 0x8d
 7 ffffffff92f7bd90 (+ 320) ffffffff80100ca6 <kernel_x86_64> dir_create(int, char*, int, bool) + 0x66
 8 ffffffff92f7bed0 (+  80) ffffffff8010738a <kernel_x86_64> _user_create_dir + 0x8a
 9 ffffffff92f7bf20 (+  16) ffffffff801493ef <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff92f7bf30 (end = 0xffffffff92f7bff8)
 rax 0x75            rbx 0x1133ceb0464d  rcx 0x15d9cb7870c
 rdx 0x1ed           rsi 0x1133ceb167e0  rdi 0xffffffff
 rbp 0x7ffc443f6fe0  r8  0x80            r9  0x0
 r10 0x15d9cb7891c   r11 0x206           r12 0x7ffc443f6f70
 r13 0x7ffc443f6f40  r14 0x7ffc443f6f34  r15 0x7ffc443f6f38
 rip 0x15d9cb7870c   rsp 0x7ffc443f6f18  rflags 0x206
 vector: 0x63, error code: 0x0
10 ffffffff92f7bf30 (+140723282751664) 0000015d9cb7870c <libroot.so> _kern_create_dir + 0x0c
...
The thread which currently holds the journal lock looks like this:
stack trace for thread 1769 "Tracker"
    kernel stack: 0xffffffff81c1d000 to 0xffffffff81c22000
      user stack: 0x00007ffc094f2000 to 0x00007ffc0a4f2000
frame                caller             <image>:function + offset
 0 ffffffff81c21a40 (+ 112) ffffffff8009ba44 <kernel_x86_64> reschedule(int) + 0x424
 1 ffffffff81c21ab0 (+  48) ffffffff8008b376 <kernel_x86_64> thread_block + 0xc6
 2 ffffffff81c21ae0 (+  64) ffffffff8014122c <kernel_x86_64> VMCache::WaitForPageEvents(vm_page*, unsigned int, bool) + 0x6c
 3 ffffffff81c21b20 (+  80) ffffffff8014138d <kernel_x86_64> VMCache::_FreePageRange(IteratableSplayTree<VMCachePagesTreeDefinition>::Iterator, unsigned long*) + 0x12d
 4 ffffffff81c21b70 (+  96) ffffffff80141549 <kernel_x86_64> VMCache::Resize(long, int) + 0x189
 5 ffffffff81c21bd0 (+  80) ffffffff8005566b <kernel_x86_64> file_cache_set_size + 0x4b
 6 ffffffff81c21c20 (+  64) ffffffff81b01fd9 <bfs> Inode::SetFileSize[clone .localalias] (Transaction&, long) + 0x49
 7 ffffffff81c21c60 (+  80) ffffffff81af953a <bfs> Attribute::_Truncate[clone .part.0] () + 0x6a
 8 ffffffff81c21cb0 (+  80) ffffffff81af9943 <bfs> Attribute::Create(char const*, unsigned int, int, attr_cookie**) + 0xd3
 9 ffffffff81c21d00 (+ 128) ffffffff81b0c46d <bfs> bfs_create_attr(fs_volume*, fs_vnode*, char const*, unsigned int, int, void**) + 0x4d
10 ffffffff81c21d80 (+  96) ffffffff800fee15 <kernel_x86_64> attr_create(int, char*, char const*, unsigned int, int, bool) + 0xa5
11 ffffffff81c21de0 (+ 320) ffffffff8010876f <kernel_x86_64> _user_write_attr + 0x8f
12 ffffffff81c21f20 (+  16) ffffffff801493ef <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff81c21f30 (end = 0xffffffff81c21ff8)
 rax 0x83            rbx 0x124864fa3b20  rcx 0x125b048e7ec
 rdx 0x5649434e      rsi 0x44272680c3    rdi 0x4
 rbp 0x7ffc0a4f0df0  r8  0x124865007fb0  r9  0x15c
 r10 0x0             r11 0x202           r12 0x7ffc0a4f0e50
 r13 0x7ffc0a4f0e20  r14 0x7ffc0a4f0e80  r15 0x44272680c3
 rip 0x125b048e7ec   rsp 0x7ffc0a4f0dd8  rflags 0x202
 vector: 0x63, error code: 0x0
13 ffffffff81c21f30 (+140722599423680) 00000125b048e7ec <libroot.so> _kern_write_attr + 0x0c
...
The last time I ran into this and debugged it, I could not figure out what the problem was, because I could not determine which page was being waited on. So last summer, in hrev56187, I modified VMCache to specify this in the wait information. It seems, however, that THREAD_BLOCK_TYPE_OTHER is always interpreted as a string, so to get at the pointer I had to manually dump the page structure as raw data in KDL and find it there, which was quite annoying. After having done that, the page is:
PAGE: 0xffffffff84189a60
queue_next,prev: 0xffffffff8418f3c0, 0xffffffff84187cb0
physical_number: 0x382ee
cache:           0xffffffff97e9f0e0
cache_offset:    0
cache_next:      0x0000000000000000
state:           cached
wired_count:     0
usage_count:     0
busy:            1
busy_writing:    0
accessed:        0
modified:        0
accessor:        1804
area mappings:
And thread 1804 is:
stack trace for thread 1804 "w>/boot/home/Desktop"
    kernel stack: 0xffffffff8216b000 to 0xffffffff82170000
      user stack: 0x00007f33782d7000 to 0x00007f3378317000
frame                caller             <image>:function + offset
 0 ffffffff8216f4e0 (+ 112) ffffffff8009ba44 <kernel_x86_64> reschedule(int) + 0x424
 1 ffffffff8216f550 (+  48) ffffffff8008b376 <kernel_x86_64> thread_block + 0xc6
 2 ffffffff8216f580 (+  80) ffffffff80096d2b <kernel_x86_64> rw_lock_wait(rw_lock*, bool, BPrivate::AutoLocker<spinlock, BPrivate::InterruptsSpinLocking>&) + 0x7b
 3 ffffffff8216f5d0 (+  64) ffffffff80097019 <kernel_x86_64> _rw_lock_read_lock + 0x89
 4 ffffffff8216f610 (+  48) ffffffff81b0d9b9 <bfs> bfs_io(fs_volume*, fs_vnode*, void*, IORequest*) + 0x79
 5 ffffffff8216f640 (+  80) ffffffff80109696 <kernel_x86_64> vfs_vnode_io + 0x36
 6 ffffffff8216f690 (+ 368) ffffffff80109ea6 <kernel_x86_64> vfs_read_pages + 0xa6
 7 ffffffff8216f800 (+ 944) ffffffff80053d7e <kernel_x86_64> read_into_cache(file_cache_ref*, void*, long, int, unsigned long, unsigned long, bool, vm_page_reservation*, unsigned long) + 0x1ce
 8 ffffffff8216fbb0 (+ 208) ffffffff80054866 <kernel_x86_64> cache_io(void*, void*, long, unsigned long, unsigned long*, bool) + 0x446
 9 ffffffff8216fc80 (+ 128) ffffffff81b0003a <bfs> Inode::ReadAttribute(char const*, int, long, unsigned char*, unsigned long*) + 0x18a
10 ffffffff8216fd00 (+ 128) ffffffff81b0c3a0 <bfs> bfs_read_attr(fs_volume*, fs_vnode*, void*, long, void*, unsigned long*) + 0x40
11 ffffffff8216fd80 (+  96) ffffffff800eab72 <kernel_x86_64> common_user_io(int, long, void*, unsigned long, bool) + 0x1b2
12 ffffffff8216fde0 (+ 320) ffffffff80108690 <kernel_x86_64> _user_read_attr + 0x90
13 ffffffff8216ff20 (+  16) ffffffff801493ef <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff8216ff30 (end = 0xffffffff8216fff8)
 rax 0x82            rbx 0x1248650b7e70  rcx 0x125b048e7dc
 rdx 0x0             rsi 0x1dad0f2f043   rdi 0x11
 rbp 0x7f3378315ed0  r8  0x15c           r9  0x15c
 r10 0x1248650be470  r11 0x206           r12 0x1dad0f2f043
 r13 0x1248650b7e70  r14 0x1248650b7f50  r15 0x1248650be470
 rip 0x125b048e7dc   rsp 0x7f3378315eb8  rflags 0x206
 vector: 0x63, error code: 0x0
14 ffffffff8216ff30 (+139861148983200) 00000125b048e7dc <libroot.so> _kern_read_attr + 0x0c
...
And the rwlock it is waiting on is:
rw lock 0xffffffff97e98c80:
  name:            bfs inode
  holder:          1769
  count:           0x10001
  active readers   0
  pending readers  0
  owner count:     0x10000
  flags:           0x0
  waiting threads: 1804/r
But 1769 is the thread which holds the journal lock and is waiting for this I/O to complete. So we have a deadlock.
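To make the cycle from the three traces above explicit, here is a small wait-for-graph sketch (the resource names are shorthand for the journal lock, the "bfs inode" rw_lock, and the busy page from the dumps; they are not kernel identifiers):

```python
# Each thread holds some resources and is blocked wanting one more.
holds = {
    1769: {"journal", "inode_rw_lock"},  # Tracker: journal + inode write lock
    1804: {"busy_page"},                 # reader: marked the page busy (accessor: 1804)
    1851: set(),                         # Deskbar: holds nothing yet
}
wants = {
    1769: "busy_page",       # VMCache::Resize waits for the page to become unbusy
    1804: "inode_rw_lock",   # bfs_io waits for a read lock 1769 write-holds
    1851: "journal",         # Transaction::Start waits on 1769's journal lock
}

def find_cycle(holds, wants):
    """Follow thread -> wanted resource -> holding thread until one repeats."""
    owner = {r: t for t, rs in holds.items() for r in rs}
    for start in wants:
        seen, t = [], start
        while t is not None and t not in seen:
            seen.append(t)
            t = owner.get(wants.get(t))
        if t is not None:
            return seen[seen.index(t):] + [t]
    return None

print(find_cycle(holds, wants))  # -> [1769, 1804, 1769]: they wait on each other
```

Thread 1851 is not part of the cycle itself; it is merely collateral damage, queued behind 1769 on the journal lock, which is why the whole system appears to stall.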
Change History (8)
comment:2 by , 20 months ago
Adjusted things in hrev56945 to make printing the page being waited on much easier.
comment:3 by , 20 months ago
Perhaps cache_io should also use busy_writing, so that VMCache::_FreePageRange() could skip this page.
comment:4 by , 20 months ago
busy_writing usually means the page itself is being written to disk, right?
That could get confusing if it also means the page is busy being written to. But yes, that looks like it is already most of the solution we need.
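As a toy model of that idea (plain Python, not the actual VMCache API; names like doomed are made up for illustration): instead of blocking on a page another thread has marked busy, the free path could tag the page and let whoever holds it busy complete the free:

```python
# Toy model: pages have a "busy" flag; free_range() must not block on them.
class Page:
    def __init__(self, offset):
        self.offset = offset
        self.busy = False    # set by a thread reading the page in
        self.doomed = False  # set by free_range(); the busy-setter frees it later
        self.freed = False

def free_range(pages, from_offset):
    """Free all pages at or past from_offset without waiting on busy ones."""
    for p in pages:
        if p.offset < from_offset:
            continue
        if p.busy:
            p.doomed = True  # defer to whoever holds the page busy
        else:
            p.freed = True

def finish_read_in(page):
    """Called by the reading thread when its I/O completes."""
    page.busy = False
    if page.doomed:
        page.freed = True    # complete the deferred free

pages = [Page(0), Page(4096)]
pages[1].busy = True         # a reader holds this page busy, like thread 1804
free_range(pages, 0)         # returns immediately instead of deadlocking
finish_read_in(pages[1])
print([p.freed for p in pages])  # -> [True, True]
```

In this model the resize path never sleeps on a busy page, which would break the cycle; the open question is whether the real VMCache can safely hand off ownership like this.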
comment:5 by , 4 months ago
Actually, I think this is a problem in BFS? Inode::ReadAt() should only be called if the inode is read-locked, right? So this looks like an oversight.
comment:6 by , 4 months ago
Hmm, maybe not; it seems we don't acquire any locks when invoking ReadAt() at present.
comment:7 by , 3 weeks ago
It appears a TODO regarding pretty much this exact scenario was added to BFS back in 2008, in hrev26715.
My initial question is: why does VMCache need to wait for these pages to become unbusy in _FreePageRange at all? Can't it instead just set some flag on the pages, so that whatever operation currently has them "busy" also gains the responsibility for freeing them when it completes?
That would solve this problem in the most general way possible. However, I recognize that may not really be feasible, and so we may have to solve this inside BFS somehow. Perhaps we can downgrade our write lock into a read lock before invoking the file cache's set-size?
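A sketch of the downgrade idea (a toy Python rw-lock, not Haiku's rw_lock; a real atomic write-to-read downgrade would need support in the kernel locking primitives): if the write lock is turned into a read lock before the resize waits, the reader in thread 1804's position is no longer blocked:

```python
import threading

class RWLock:
    """Minimal rw-lock supporting an atomic write -> read downgrade."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def read_lock(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def read_unlock(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def write_lock(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def downgrade(self):
        """Turn the held write lock into a read lock without a release window."""
        with self._cond:
            self._writer = False
            self._readers += 1
            self._cond.notify_all()

lock = RWLock()
lock.write_lock()   # BFS holds the inode write lock during the truncate
lock.downgrade()    # proposed: drop to a read lock before file_cache_set_size

got = []
t = threading.Thread(
    target=lambda: (lock.read_lock(), got.append(True), lock.read_unlock()))
t.start(); t.join(timeout=2)
print(got)          # -> [True]: a concurrent reader is no longer blocked
lock.read_unlock()
```

Whether a read lock is actually sufficient to protect the rest of Inode::SetFileSize() is a separate question that would need auditing in BFS.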