#18921 closed bug (fixed)
Random small system-wide stalls
| Reported by: | madmax | Owned by: | nobody |
|---|---|---|---|
| Priority: | critical | Milestone: | R1/beta5 |
| Component: | System/Kernel | Version: | R1/Development |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Platform: | All | | |
Description
Build haiku.hpkg. While the log speeds by, you may notice some split-second waits, but those could be subjective. When it gets to the "Removing and re-creating package contents dir" and "Collecting package contents" phases it's quite clear, with ProcessController showing no activity for some periods.
Reverting hrev57789 fixes it for me, but that revision is itself a fix.
During those stalls, other actions may also be affected. For example, closing a window won't take effect until the stall ends, which for me has never lasted longer than a second or two.
I also noticed more network errors in WebPositive, but again those may just be bad luck and unrelated.
Change History (7)
comment:1 by , 6 months ago
| Priority: | normal → critical |
|---|---|
comment:2 by , 6 months ago
comment:3 by , 6 months ago
I hadn't, yet. Here is something from `rm` and `copyattr`:
```
stack trace for thread 2268 "rm"
    kernel stack: 0xffffffff82cb5000 to 0xffffffff82cba000
      user stack: 0x00007f880f6e2000 to 0x00007f88106e2000
frame                       caller             <image>:function + offset
 0 ffffffff82cb94a0 (+ 128) ffffffff8009f9fd   <kernel_x86_64> reschedule(int) + 0x43d
 1 ffffffff82cb9520 (+  80) ffffffff8008a447   <kernel_x86_64> thread_block_with_timeout + 0xe7
 2 ffffffff82cb9570 (+ 112) ffffffff80058a2b   <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0x16b
 3 ffffffff82cb95e0 (+ 128) ffffffff821ce5a1   <nvme_disk> await_status(nvme_disk_driver_info*, nvme_qpair*, int&) + 0xc1
 4 ffffffff82cb9660 (+  48) ffffffff821ce777   <nvme_disk> do_nvme_io_request(nvme_disk_driver_info*, nvme_io_request*) + 0x117
 5 ffffffff82cb9690 (+ 336) ffffffff821cf976   <nvme_disk> nvme_disk_io(void*, IORequest*) + 0x496
 6 ffffffff82cb97e0 (+ 352) ffffffff821cfe90   <nvme_disk> nvme_disk_write(void*, long, void const*, unsigned long*) + 0x80
 7 ffffffff82cb9940 (+ 112) ffffffff800f029c   <kernel_x86_64> _kern_writev + 0x10c
 8 ffffffff82cb99b0 (+  32) ffffffff8016ef6a   <kernel_x86_64> writev_pos + 0x1a
 9 ffffffff82cb99d0 (+ 416) ffffffff8213892e   <bfs> Journal::_WriteTransactionToLog[clone .localalias] () + 0xa7e
10 ffffffff82cb9b70 (+  48) ffffffff82139305   <bfs> Journal::Unlock(Transaction*, bool) + 0x35
11 ffffffff82cb9ba0 (+  80) ffffffff8214078f   <bfs> bfs_remove_vnode(fs_volume*, fs_vnode*, bool) + 0xaf
12 ffffffff82cb9bf0 (+  64) ffffffff800fdb54   <kernel_x86_64> free_vnode(vnode*, bool) + 0xb4
13 ffffffff82cb9c30 (+  80) ffffffff800ff32b   <kernel_x86_64> dec_vnode_ref_count[clone .isra.0] (vnode*, bool, bool) + 0x33b
14 ffffffff82cb9c80 (+  32) ffffffff801068f7   <kernel_x86_64> put_vnode + 0x97
15 ffffffff82cb9ca0 (+  64) ffffffff82139101   <bfs> Transaction::NotifyListeners[clone .localalias] (bool) + 0x81
16 ffffffff82cb9ce0 (+  48) ffffffff82139323   <bfs> Journal::Unlock(Transaction*, bool) + 0x53
17 ffffffff82cb9d10 (+ 128) ffffffff82140078   <bfs> bfs_unlink(fs_volume*, fs_vnode*, char const*) + 0xe8
18 ffffffff82cb9d90 (+ 320) ffffffff80105a10   <kernel_x86_64> common_unlink(int, char*, bool) + 0x60
19 ffffffff82cb9ed0 (+  80) ffffffff8010c73c   <kernel_x86_64> _user_unlink + 0x7c
20 ffffffff82cb9f20 (+  16) ffffffff8014e6df   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff82cb9f30 (end = 0xffffffff82cb9ff8)
 rax 0x80              rbx 0x10c1c8282e00    rcx 0x1a1dbdd0a8c
 rdx 0x0               rsi 0x10c1c8282ef0    rdi 0x5
 rbp 0x7f88106e0fc0    r8  0x3               r9  0x7f88106e109c
 r10 0xff726462        r11 0x246             r12 0x0
 r13 0x10c1c80cf820    r14 0x7f88106e1170    r15 0x10c1c8282e00
 rip 0x1a1dbdd0a8c     rsp 0x7f88106e0fa8    rflags 0x246
 vector: 0x63, error code: 0x0
21 ffffffff82cb9f30 (+140224468512912) 000001a1dbdd0a8c <libroot.so> _kern_unlink + 0x0c
22 00007f88106e0fc0 (+ 176) 0000000a67887b2d  <rm> usage (nearest) + 0x44d
23 00007f88106e1070 (+ 224) 0000000a67888276  <rm> rm + 0x106
24 00007f88106e1150 (+ 240) 0000000a67887457  <rm> main + 0x3e7
25 00007f88106e1240 (+  48) 0000000a678876cf  <rm> _start + 0x3f
26 00007f88106e1270 (+  48) 00000080efe64c95  </boot/system/runtime_loader@0x00000080efe55000> <unknown> + 0xfc95
27 00007f88106e12a0 (+   0) 00007fe438e0e258  <commpage> commpage_thread_exit + 0x00
```
```
stack trace for thread 2657 "copyattr"
    kernel stack: 0xffffffff82dba000 to 0xffffffff82dbf000
      user stack: 0x00007fb7127da000 to 0x00007fb7137da000
frame                       caller             <image>:function + offset
 0 ffffffff82dbe720 (+ 128) ffffffff8009f9fd   <kernel_x86_64> reschedule(int) + 0x43d
 1 ffffffff82dbe7a0 (+  80) ffffffff8008a447   <kernel_x86_64> thread_block_with_timeout + 0xe7
 2 ffffffff82dbe7f0 (+ 112) ffffffff80058a2b   <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0x16b
 3 ffffffff82dbe860 (+ 128) ffffffff821ce5a1   <nvme_disk> await_status(nvme_disk_driver_info*, nvme_qpair*, int&) + 0xc1
 4 ffffffff82dbe8e0 (+  48) ffffffff821ce777   <nvme_disk> do_nvme_io_request(nvme_disk_driver_info*, nvme_io_request*) + 0x117
 5 ffffffff82dbe910 (+ 336) ffffffff821cf976   <nvme_disk> nvme_disk_io(void*, IORequest*) + 0x496
 6 ffffffff82dbea60 (+ 352) ffffffff821cfe90   <nvme_disk> nvme_disk_write(void*, long, void const*, unsigned long*) + 0x80
 7 ffffffff82dbebc0 (+ 112) ffffffff800f029c   <kernel_x86_64> _kern_writev + 0x10c
 8 ffffffff82dbec30 (+  32) ffffffff8016ef6a   <kernel_x86_64> writev_pos + 0x1a
 9 ffffffff82dbec50 (+ 416) ffffffff8213892e   <bfs> Journal::_WriteTransactionToLog[clone .localalias] () + 0xa7e
10 ffffffff82dbedf0 (+  48) ffffffff82139305   <bfs> Journal::Unlock(Transaction*, bool) + 0x35
11 ffffffff82dbee20 (+ 144) ffffffff821411bd   <bfs> bfs_free_cookie(fs_volume*, fs_vnode*, void*) + 0x1ad
12 ffffffff82dbeeb0 (+  32) ffffffff800ff983   <kernel_x86_64> file_free_fd(file_descriptor*) + 0x33
13 ffffffff82dbeed0 (+  32) ffffffff800ee6a4   <kernel_x86_64> put_fd + 0x84
14 ffffffff82dbeef0 (+  48) ffffffff800ee981   <kernel_x86_64> close_fd_index + 0xb1
15 ffffffff82dbef20 (+  16) ffffffff8014e6df   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff82dbef30 (end = 0xffffffff82dbeff8)
 rax 0x9e              rbx 0x10e74f7bbb60    rcx 0x10eb787ac69
 rdx 0x0               rsi 0x10e74f7bbbe8    rdi 0x5
 rbp 0x7fb7137d8f70    r8  0x80              r9  0x30
 r10 0x7fb7137d8dc0    r11 0x206             r12 0x80006000
 r13 0x10e74f7bfb60    r14 0x7fb7137d92e0    r15 0x7fb7137d9180
 rip 0x10eb787ac69     rsp 0x7fb7137d8f58    rflags 0x206
 vector: 0x63, error code: 0x0
16 ffffffff82dbef30 (+140426382254144) 0000010eb787ac69 <libroot.so> _kern_close + 0x09
17 00007fb7137d8f70 (+  16) 00000024efffc492  <libroot_build.so> BPrivate::FileDescriptor::Close() + 0x22
18 00007fb7137d8f80 (+  48) 00000024efffd44a  <libroot_build.so> BPrivate::delete_descriptor(int) + 0x6a
19 00007fb7137d8fb0 (+  32) 0000023cac29980b  <libbe_build.so> BNode::close_fd() + 0x2b
20 00007fb7137d8fd0 (+  32) 0000023cac2999ba  <libbe_build.so> BNode::Unset() + 0x7a
21 00007fb7137d8ff0 (+2512) 00000095218770c7  <copyattr> _start (nearest) + 0x767
22 00007fb7137d99c0 (+ 480) 00000095218761dd  <copyattr> main + 0x35d
23 00007fb7137d9ba0 (+  48) 000000952187699e  <copyattr> _start + 0x3e
24 00007fb7137d9bd0 (+  48) 000000c0c68f0c95  </boot/system/runtime_loader@0x000000c0c68e1000> <unknown> + 0xfc95
25 00007fb7137d9c00 (+   0) 00007fddcb203258  <commpage> commpage_thread_exit + 0x00
```
The other traces I got were also taken during writes.
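Both traces show the same pattern: the thread submits disk I/O and then blocks on a condition variable with a timeout (`ConditionVariableEntry::Wait` via `thread_block_with_timeout`) until the NVMe completion fires. As a rough user-space analogy of that waiter/completer handshake (this is a hypothetical sketch using `std::condition_variable`, not Haiku's actual kernel API), it looks like:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model of the wait seen in the traces: an I/O request
// object the submitter blocks on until the completion path signals it.
struct IoRequest {
    std::mutex lock;
    std::condition_variable done;
    bool completed = false;
};

// Waiter side, loosely analogous to await_status() in nvme_disk:
// block until completion, but give up after 'timeout' (a safety net,
// which is why thread_block_with_timeout appears in the stacks).
bool await_completion(IoRequest& req, std::chrono::milliseconds timeout)
{
    std::unique_lock<std::mutex> guard(req.lock);
    return req.done.wait_for(guard, timeout, [&] { return req.completed; });
}

// Completion side, which in the driver would run from the interrupt
// or polling path once the hardware finishes the request.
void complete(IoRequest& req)
{
    {
        std::lock_guard<std::mutex> guard(req.lock);
        req.completed = true;
    }
    req.done.notify_one();
}
```

If the completion never arrives (or arrives only after the timeout elapses and the waiter polls again), every caller stacked above the wait, here `rm`'s unlink and `copyattr`'s close, stalls with it, which matches the system-wide hiccups described in the report.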
comment:4 by , 6 months ago
And I could have looked into that condition variable, just in case. Given the result, I don't know if I'm doing it right:
```
kdebug> thread 4153
THREAD: 0xffffffff832c0000
        id:                 4153 (0x1039)
        serial_number:      4153
        name:               "rm"
        hash_next:          0xffffffff80915380
        team_next:          0x0000000000000000
        priority:           10 (I/O: -1)
        state:              waiting
        cpu:                0x0000000000000000
        cpumask:            0xffffffff
        sig_pending:        0x0 (blocked: 0x0, before sigsuspend(): 0x0)
        in_kernel:          1
        waiting for:        condition variable 0xffffffff8309eef8
        fault_handler:      0x0000000000000000
        team:               0xffffffffadb25a08, "rm"
        exit.sem:           171770
        exit.status:        0x0 (No error)
        exit.waiters:
        kernel_stack_area:  183139
        kernel_stack_base:  0xffffffff82fcd000
        user_stack_area:    183141
        user_stack_base:    0x00007fb8f0d7b000
        user_local_storage: 0x00007fb8f1d7b000
        user_thread:        0x00007ff5e66b6000
        kernel_errno:       0x0 (No error)
        kernel_time:        34160488
        user_time:          12484
        flags:              0x0
        architecture dependant section:
                rsp:         0xffffffff82fd15d8
                syscall_rsp: 0xffffffff82fd2000
                user_rsp:    0x00007fb8f1d7a878
                fpu_state at 0xffffffff832c0340
        scheduler data:
                priority_penalty:   0
                additional_penalty: 5 (38945)
                effective_priority: 5
                time_used:          623 us (quantum: 3500 us)
                stolen_time:        33523470 us
                quantum_start:      1594064948 us
                needed_load:        0%
                went_sleep:         1594064967
                went_sleep_active:  50377667
                core:               2
kdebug> cvar 0xffffffff8309eef8
condition variable 0xffffffff8309eef8
        object:  0x0000000000000000 (<NULL>)
        threads: 4153
kdebug> cvars
variable             object               (type)  waiting threads
------------------------------------------------------------
0xffffffffad97d770   0xffffffffad97d708   pipe    0
0xffffffffcdf463f0   0xffffffffcdf46388   pipe    0
```
comment:5 by , 6 months ago
I missed an important case in hrev57789; should be fixed for real in hrev57800.
However, if that does fix the problem (and I don't see how it wouldn't), it just raises more questions. For one, if this is how BFS writes its journal, why didn't I (or whoever else has been using nightlies the past few days) see disk corruption? For another, even if there was disk corruption, why did this change affect performance so significantly?
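For readers following along, the corruption question hinges on the standard write-ahead-logging invariant: the log record must be durable on disk before the in-place blocks it describes are overwritten. A tiny in-memory model of that ordering rule (a generic hypothetical sketch, not BFS's actual `Journal` code; `Disk` and `commit_transaction` are illustrative names):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory stand-in for the block device: 'history'
// records the order in which writes reach "disk".
struct Disk {
    std::vector<std::string> history;
    void write(const std::string& what) { history.push_back(what); }
};

// Commit a transaction under the write-ahead-logging rule:
// 1. append the log record describing the changes,
// 2. wait until that record is durable (flush/barrier),
// 3. only then overwrite the real metadata blocks in place.
// If steps 1 and 3 are issued without waiting in between, a crash in
// the window can leave metadata half-updated with no log to replay.
void commit_transaction(Disk& disk)
{
    disk.write("journal: log record");
    disk.write("journal: flush/barrier");
    disk.write("metadata: in-place");
}
```

Under that model, a journal write that completes out of order is only dangerous if the system actually goes down mid-transaction, which may be why no corruption was observed on running nightlies.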
comment:6 by , 6 months ago
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |
It indeed fixes my problem.
Now, talking out of my arse: maybe the journal isn't an issue unless there's another bug that crashes the system mid-transaction?
comment:7 by , 6 months ago
Not sure how that would happen. Ultimately this may be another instance of #15585, but how or why isn't clear to me.
Have you investigated any further than that? (What do KDL stacktraces show during a stall?)