Opened 5 months ago

Closed 5 months ago

Last modified 5 months ago

#18921 closed bug (fixed)

Random small system-wide stalls

Reported by: madmax Owned by: nobody
Priority: critical Milestone: R1/beta5
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

Build haiku.hpkg. While the log speeds by you may notice some split-second waits, but those could be subjective. When it gets to "Removing and re-creating package contents dir" and "Collecting package contents" phases its quite clear with ProcessController not showing any activity for some periods.

Reverting hrev57789 fixes it for me, but that one is a fix itself.

While in those stalls some other actions may also be affected. For example closing a window won't be effective until the stall ends, which for me I don't think has ever been longer than a second or two.

Also noticed more network errors in WebPositive, but again those may be just bad luck and not related.

Change History (7)

comment:1 by waddlesplash, 5 months ago

Priority: normalcritical

comment:2 by waddlesplash, 5 months ago

Have you investigated any further than that? (What do KDL stacktraces show during a stall?)

comment:3 by madmax, 5 months ago

I hadn't, yet. Here is something from ´rm´ and ´copyattr´:

stack trace for thread 2268 "rm"
    kernel stack: 0xffffffff82cb5000 to 0xffffffff82cba000
      user stack: 0x00007f880f6e2000 to 0x00007f88106e2000
frame                       caller             <image>:function + offset
 0 ffffffff82cb94a0 (+ 128) ffffffff8009f9fd   <kernel_x86_64> reschedule(int) + 0x43d
 1 ffffffff82cb9520 (+  80) ffffffff8008a447   <kernel_x86_64> thread_block_with_timeout + 0xe7
 2 ffffffff82cb9570 (+ 112) ffffffff80058a2b   <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0x16b
 3 ffffffff82cb95e0 (+ 128) ffffffff821ce5a1   <nvme_disk> await_status(nvme_disk_driver_info*, nvme_qpair*, int&) + 0xc1
 4 ffffffff82cb9660 (+  48) ffffffff821ce777   <nvme_disk> do_nvme_io_request(nvme_disk_driver_info*, nvme_io_request*) + 0x117
 5 ffffffff82cb9690 (+ 336) ffffffff821cf976   <nvme_disk> nvme_disk_io(void*, IORequest*) + 0x496
 6 ffffffff82cb97e0 (+ 352) ffffffff821cfe90   <nvme_disk> nvme_disk_write(void*, long, void const*, unsigned long*) + 0x80
 7 ffffffff82cb9940 (+ 112) ffffffff800f029c   <kernel_x86_64> _kern_writev + 0x10c
 8 ffffffff82cb99b0 (+  32) ffffffff8016ef6a   <kernel_x86_64> writev_pos + 0x1a
 9 ffffffff82cb99d0 (+ 416) ffffffff8213892e   <bfs> Journal::_WriteTransactionToLog[clone .localalias] () + 0xa7e
10 ffffffff82cb9b70 (+  48) ffffffff82139305   <bfs> Journal::Unlock(Transaction*, bool) + 0x35
11 ffffffff82cb9ba0 (+  80) ffffffff8214078f   <bfs> bfs_remove_vnode(fs_volume*, fs_vnode*, bool) + 0xaf
12 ffffffff82cb9bf0 (+  64) ffffffff800fdb54   <kernel_x86_64> free_vnode(vnode*, bool) + 0xb4
13 ffffffff82cb9c30 (+  80) ffffffff800ff32b   <kernel_x86_64> dec_vnode_ref_count[clone .isra.0] (vnode*, bool, bool) + 0x33b
14 ffffffff82cb9c80 (+  32) ffffffff801068f7   <kernel_x86_64> put_vnode + 0x97
15 ffffffff82cb9ca0 (+  64) ffffffff82139101   <bfs> Transaction::NotifyListeners[clone .localalias] (bool) + 0x81
16 ffffffff82cb9ce0 (+  48) ffffffff82139323   <bfs> Journal::Unlock(Transaction*, bool) + 0x53
17 ffffffff82cb9d10 (+ 128) ffffffff82140078   <bfs> bfs_unlink(fs_volume*, fs_vnode*, char const*) + 0xe8
18 ffffffff82cb9d90 (+ 320) ffffffff80105a10   <kernel_x86_64> common_unlink(int, char*, bool) + 0x60
19 ffffffff82cb9ed0 (+  80) ffffffff8010c73c   <kernel_x86_64> _user_unlink + 0x7c
20 ffffffff82cb9f20 (+  16) ffffffff8014e6df   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff82cb9f30 (end = 0xffffffff82cb9ff8)
 rax 0x80                  rbx 0x10c1c8282e00        rcx 0x1a1dbdd0a8c
 rdx 0x0                   rsi 0x10c1c8282ef0        rdi 0x5
 rbp 0x7f88106e0fc0         r8 0x3                    r9 0x7f88106e109c
 r10 0xff726462            r11 0x246                 r12 0x0
 r13 0x10c1c80cf820        r14 0x7f88106e1170        r15 0x10c1c8282e00
 rip 0x1a1dbdd0a8c         rsp 0x7f88106e0fa8     rflags 0x246
 vector: 0x63, error code: 0x0
21 ffffffff82cb9f30 (+140224468512912) 000001a1dbdd0a8c   <libroot.so> _kern_unlink + 0x0c
22 00007f88106e0fc0 (+ 176) 0000000a67887b2d   <rm> usage (nearest) + 0x44d
23 00007f88106e1070 (+ 224) 0000000a67888276   <rm> rm + 0x106
24 00007f88106e1150 (+ 240) 0000000a67887457   <rm> main + 0x3e7
25 00007f88106e1240 (+  48) 0000000a678876cf   <rm> _start + 0x3f
26 00007f88106e1270 (+  48) 00000080efe64c95   </boot/system/runtime_loader@0x00000080efe55000> <unknown> + 0xfc95
27 00007f88106e12a0 (+   0) 00007fe438e0e258   <commpage> commpage_thread_exit + 0x00
stack trace for thread 2657 "copyattr"
    kernel stack: 0xffffffff82dba000 to 0xffffffff82dbf000
      user stack: 0x00007fb7127da000 to 0x00007fb7137da000
frame                       caller             <image>:function + offset
 0 ffffffff82dbe720 (+ 128) ffffffff8009f9fd   <kernel_x86_64> reschedule(int) + 0x43d
 1 ffffffff82dbe7a0 (+  80) ffffffff8008a447   <kernel_x86_64> thread_block_with_timeout + 0xe7
 2 ffffffff82dbe7f0 (+ 112) ffffffff80058a2b   <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0x16b
 3 ffffffff82dbe860 (+ 128) ffffffff821ce5a1   <nvme_disk> await_status(nvme_disk_driver_info*, nvme_qpair*, int&) + 0xc1
 4 ffffffff82dbe8e0 (+  48) ffffffff821ce777   <nvme_disk> do_nvme_io_request(nvme_disk_driver_info*, nvme_io_request*) + 0x117
 5 ffffffff82dbe910 (+ 336) ffffffff821cf976   <nvme_disk> nvme_disk_io(void*, IORequest*) + 0x496
 6 ffffffff82dbea60 (+ 352) ffffffff821cfe90   <nvme_disk> nvme_disk_write(void*, long, void const*, unsigned long*) + 0x80
 7 ffffffff82dbebc0 (+ 112) ffffffff800f029c   <kernel_x86_64> _kern_writev + 0x10c
 8 ffffffff82dbec30 (+  32) ffffffff8016ef6a   <kernel_x86_64> writev_pos + 0x1a
 9 ffffffff82dbec50 (+ 416) ffffffff8213892e   <bfs> Journal::_WriteTransactionToLog[clone .localalias] () + 0xa7e
10 ffffffff82dbedf0 (+  48) ffffffff82139305   <bfs> Journal::Unlock(Transaction*, bool) + 0x35
11 ffffffff82dbee20 (+ 144) ffffffff821411bd   <bfs> bfs_free_cookie(fs_volume*, fs_vnode*, void*) + 0x1ad
12 ffffffff82dbeeb0 (+  32) ffffffff800ff983   <kernel_x86_64> file_free_fd(file_descriptor*) + 0x33
13 ffffffff82dbeed0 (+  32) ffffffff800ee6a4   <kernel_x86_64> put_fd + 0x84
14 ffffffff82dbeef0 (+  48) ffffffff800ee981   <kernel_x86_64> close_fd_index + 0xb1
15 ffffffff82dbef20 (+  16) ffffffff8014e6df   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff82dbef30 (end = 0xffffffff82dbeff8)
 rax 0x9e                  rbx 0x10e74f7bbb60        rcx 0x10eb787ac69
 rdx 0x0                   rsi 0x10e74f7bbbe8        rdi 0x5
 rbp 0x7fb7137d8f70         r8 0x80                   r9 0x30
 r10 0x7fb7137d8dc0        r11 0x206                 r12 0x80006000
 r13 0x10e74f7bfb60        r14 0x7fb7137d92e0        r15 0x7fb7137d9180
 rip 0x10eb787ac69         rsp 0x7fb7137d8f58     rflags 0x206
 vector: 0x63, error code: 0x0
16 ffffffff82dbef30 (+140426382254144) 0000010eb787ac69   <libroot.so> _kern_close + 0x09
17 00007fb7137d8f70 (+  16) 00000024efffc492   <libroot_build.so> BPrivate::FileDescriptor::Close() + 0x22
18 00007fb7137d8f80 (+  48) 00000024efffd44a   <libroot_build.so> BPrivate::delete_descriptor(int) + 0x6a
19 00007fb7137d8fb0 (+  32) 0000023cac29980b   <libbe_build.so> BNode::close_fd() + 0x2b
20 00007fb7137d8fd0 (+  32) 0000023cac2999ba   <libbe_build.so> BNode::Unset() + 0x7a
21 00007fb7137d8ff0 (+2512) 00000095218770c7   <copyattr> _start (nearest) + 0x767
22 00007fb7137d99c0 (+ 480) 00000095218761dd   <copyattr> main + 0x35d
23 00007fb7137d9ba0 (+  48) 000000952187699e   <copyattr> _start + 0x3e
24 00007fb7137d9bd0 (+  48) 000000c0c68f0c95   </boot/system/runtime_loader@0x000000c0c68e1000> <unknown> + 0xfc95
25 00007fb7137d9c00 (+   0) 00007fddcb203258   <commpage> commpage_thread_exit + 0x00

Others I got were also during write.

comment:4 by madmax, 5 months ago

And I could have looked into that conditiona variable, just in case. Given the result, I don't know if I'm doing it right:

kdebug> thread 4153
THREAD: 0xffffffff832c0000
id:                 4153 (0x1039)
serial_number:      4153
name:               "rm"
hash_next:          0xffffffff80915380
team_next:          0x0000000000000000
priority:           10 (I/O: -1)
state:              waiting
cpu:                0x0000000000000000 
cpumask:            0xffffffff
sig_pending:        0x0 (blocked: 0x0, before sigsuspend(): 0x0)
in_kernel:          1
waiting for:        condition variable 0xffffffff8309eef8
fault_handler:      0x0000000000000000
team:               0xffffffffadb25a08, "rm"
  exit.sem:         171770
  exit.status:      0x0 (No error)
  exit.waiters:
kernel_stack_area:  183139
kernel_stack_base:  0xffffffff82fcd000
user_stack_area:    183141
user_stack_base:    0x00007fb8f0d7b000
user_local_storage: 0x00007fb8f1d7b000
user_thread:        0x00007ff5e66b6000
kernel_errno:       0x0 (No error)
kernel_time:        34160488
user_time:          12484
flags:              0x0
architecture dependant section:
	rsp: 0xffffffff82fd15d8
	syscall_rsp: 0xffffffff82fd2000
	user_rsp: 0x00007fb8f1d7a878
	fpu_state at 0xffffffff832c0340
scheduler data:
	priority_penalty:	0
	additional_penalty:	5 (38945)
	effective_priority:	5
	time_used:		623 us (quantum: 3500 us)
	stolen_time:		33523470 us
	quantum_start:		1594064948 us
	needed_load:		0%
	went_sleep:		1594064967
	went_sleep_active:	50377667
	core:			2
kdebug> cvar 0xffffffff8309eef8
condition variable 0xffffffff8309eef8
  object:  0x0000000000000000 (<NULL>)
  threads: 4153
kdebug> cvars
  variable      object (type)                waiting threads
------------------------------------------------------------
0xffffffffad97d770  0xffffffffad97d708  pipe                               0
0xffffffffcdf463f0  0xffffffffcdf46388  pipe                               0

comment:5 by waddlesplash, 5 months ago

I missed an important case in hrev57789; should be fixed for real in hrev57800.

However, if that does fix the problem (and I don't see how it wouldn't), that just raises a lot more questions. For one, if this is how BFS writes its journal, why didn't I (or whoever else has been using nightlies the past few days) see disk corruption? For another, even if there was disk corruption, why did this change performance in such an important way?

comment:6 by madmax, 5 months ago

Resolution: fixed
Status: newclosed

It indeed fixes my problem.

Now, talking out of my arse: the journal may not be an issue unless there's another issue that crashes the system mid-transaction?

comment:7 by waddlesplash, 5 months ago

Not sure how that would happen. Ultimately this may be another instance of #15585, but how or why isn't clear to me.

Note: See TracTickets for help on using tickets.