Opened 20 months ago

Last modified 15 months ago

#18430 new bug

[kernel] System hangs or KDLs disabling CPUs

Reported by: waddlesplash Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System/Kernel Version: R1/beta4
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

This one is pretty easy for me to trigger in my VMware setup, and also on at least one of my bare-metal machines. We used to shut down on the boot CPU but disabled that because it hangs on some machines, I suspect this is the same bug.

Change History (6)

comment:1 by waddlesplash, 20 months ago

The mouse cursor moves initially, but eventually that too stops working.

The system is deadlocked on the kernel address space:

rw lock 0xffffffff8280ef08:
  name:            kernel address space
  holder:          3
  count:           0x30001
  active readers   0
  pending readers  0
  owner count:     0x10000
  flags:           0x0
  waiting threads: 1813/w 19/w 1678/r

And the holding thread is the undertaker, which is stopped for no apparent reason:

stack trace for thread 3 "undertaker"
    kernel stack: 0xffffffff82190000 to 0xffffffff82195000
frame                       caller             <image>:function + offset
 0 ffffffff821947a0 (+ 112) ffffffff8009bc34   <kernel_x86_64> reschedule(int) + 0x424
 1 ffffffff82194810 (+  32) ffffffff80152c4c   <kernel_x86_64> x86_hardware_interrupt + 0xdc
 2 ffffffff82194830 (+ 888) ffffffff8014869c   <kernel_x86_64> int_bottom + 0x80
kernel iframe at 0xffffffff82194ba8 (end = 0xffffffff82194c70)
 rax 0x1                   rbx 0x1                   rcx 0x830
 rdx 0x90                  rsi 0x1                   rdi 0xffffffff801cfcc0
 rbp 0xffffffff82194ca0     r8 0xffffffff82a700b4     r9 0x1
 r10 0x1                   r11 0xffffffff82ff4c20    r12 0xffffffff82a70078
 r13 0x1                   r14 0xffffffff82a700b0    r15 0xffffffff82a700b4
 rip 0xffffffff80079e4f    rsp 0xffffffff82194c78 rflags 0x10286
 vector: 0xfd, error code: 0x0
 3 ffffffff82194ba8 (+ 248) ffffffff80079e4f   <kernel_x86_64> smp_send_broadcast_ici.part.0 + 0x18f
 4 ffffffff82194ca0 (+  64) ffffffff8015b51d   <kernel_x86_64> X86VMTranslationMap::Flush() + 0xbd
 5 ffffffff82194ce0 (+ 144) ffffffff8014a80b   <kernel_x86_64> X86VMTranslationMap64Bit::UnmapPages[clone .part.0] (VMArea*, unsigned long, unsigned long, bool) + 0x31b
 6 ffffffff82194d70 (+  80) ffffffff80125338   <kernel_x86_64> delete_area(VMAddressSpace*, VMArea*, bool) + 0xf8
 7 ffffffff82194dc0 (+ 256) ffffffff80128c6e   <kernel_x86_64> vm_delete_area + 0x19e
 8 ffffffff82194ec0 (+  48) ffffffff800889b2   <kernel_x86_64> BKernel::Thread::~Thread() + 0x62
 9 ffffffff82194ef0 (+  32) ffffffff80088bb2   <kernel_x86_64> BKernel::Thread::~Thread() + 0x12
10 ffffffff82194f10 (+  32) ffffffff8016c8d2   <kernel_x86_64> BReferenceable::ReleaseReference() + 0x22
11 ffffffff82194f30 (+ 128) ffffffff8008849c   <kernel_x86_64> undertaker(void*) + 0x29c
12 ffffffff82194fb0 (+  32) ffffffff8008dc58   <kernel_x86_64> common_thread_entry(void*) + 0x38
13 ffffffff82194fd0 (+2112270384) ffffffff82194fe0   141:undertaker_3_kstack@0xffffffff82190000 + 0x4fe0
kdebug> 

Trying to print the thread information ends in a read fault:

THREAD: 0xffffffff82ace9c0
id:                 3 (0x3)
serial_number:      3
name:               "undertaker"
hash_next:          0x0000000000000000
team_next:          0xffffffff801d0d00
priority:           15 (I/O: -1)
state:              ready
cpu:                0x0000000000000000 
sig_pending:        0x0 (blocked: 0x0, before sigsuspend(): 0x0)
in_kernel:          1
fault_handler:      0x0000000000000000
team:               0xffffffff828ae600, "kernel_team"
  exit.sem:         6
  exit.status:      0x0 (No error)
  exit.waiters:
kernel_stack_area:  141
kernel_stack_base:  0xffffffff82190000
user_stack_area:    -1
user_stack_base:    0x0000000000000000
user_local_storage: 0x0000000000000000
user_thread:        0x0000000000000000
kernel_errno:       0x0 (No error)
kernel_time:        58010
user_time:          0
flags:              0x0
architecture dependant section:
	rsp: 0xffffffff82194758
	syscall_rsp: 0xffffffff82195000
	user_rsp: 0x0000000000000000
	fpu_state at 0xffffffff82aced00
scheduler data:
	priority_penalty:	0
	additional_penalty:	14 (2084)
	effective_priority:	1

[*** READ FAULT at 0x30, pc: 0xffffffff800a23d5 ***]

The next thing to be dumped in the scheduler data is the time-used and quantum. Computing the quantum tries to access the thread's fCore, which I guess must be NULL. So, somehow this thread got de-scheduled from the disabled CPU and never rescheduled anywhere else.

comment:2 by waddlesplash, 20 months ago

Note that smp_send_broadcast_ici works in interrupts-disabled until it finishes, when it reenables interrupts. So on re-enabling interrupts, it simply never got rescheduled again?

comment:3 by waddlesplash, 20 months ago

Summary: [kernel] System hangs disabling CPUs[kernel] System hangs or KDLs disabling CPUs

comment:4 by waddlesplash, 20 months ago

Here's a test program I made:

#include <OS.h>

extern status_t		_kern_set_cpu_enabled(int32 cpu, bool enabled);

int main()
{
	for (int32 i = 0; i < (INT32_MAX-1); i++) {
		_kern_set_cpu_enabled(1, i%2);
	}
}

Running this with the system under load produces KDLs. Example:

PANIC: ASSERT FAILED (../haiku-git/src/system/kernel/scheduler/scheduler.cpp:425): nextThreadData->Core() == core
Welcome to Kernel Debugging Land...
Thread 1726 "daemon" running on CPU 1
stack trace for thread 1726 "daemon"
    kernel stack: 0xffffffff82087000 to 0xffffffff8208c000
      user stack: 0x00007f9764bca000 to 0x00007f9764c0a000
frame                       caller             <image>:function + offset
 0 ffffffff8208bb08 (+  24) ffffffff80146eac   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
 1 ffffffff8208bb20 (+  80) ffffffff800b0518   <kernel_x86_64> debug_call_with_fault_handler + 0x78
 2 ffffffff8208bb70 (+  96) ffffffff800b1b83   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf3
 3 ffffffff8208bbd0 (+  80) ffffffff800b1f1e   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
 4 ffffffff8208bc20 (+ 240) ffffffff800b2277   <kernel_x86_64> panic + 0xb7
 5 ffffffff8208bd10 (+ 112) ffffffff8009baa6   <kernel_x86_64> reschedule(int) + 0x296
 6 ffffffff8208bd80 (+  48) ffffffff8008b566   <kernel_x86_64> thread_block + 0xc6
 7 ffffffff8208bdb0 (+ 112) ffffffff80057be5   <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0x155
 8 ffffffff8208be20 (+ 160) ffffffff8006f4b1   <kernel_x86_64> read_port_etc + 0x111
 9 ffffffff8208bec0 (+  96) ffffffff80070401   <kernel_x86_64> _user_read_port_etc + 0x91
10 ffffffff8208bf20 (+  16) ffffffff8014899f   <kernel_x86_64> x86_64_syscall_entry + 0xfb
user iframe at 0xffffffff8208bf30 (end = 0xffffffff8208bff8)
 rax 0xde                  rbx 0x7f9764c07c50        rcx 0x15748583d6c
 rdx 0x7f9764c07c50        rsi 0x7f9764c07c4c        rdi 0x17
 rbp 0x7f9764c09c90         r8 0x0                    r9 0x0
 r10 0x2001                r11 0x246                 r12 0x7f15f5b7e290
 r13 0x7f9764c07c4c        r14 0xffffffff            r15 0x7f15f5b7e430
 rip 0x15748583d6c         rsp 0x7f9764c07c28     rflags 0x246
 vector: 0x63, error code: 0x0
11 ffffffff8208bf30 (+140290320489824) 0000015748583d6c   <libroot.so> _kern_read_port_etc + 0x0c
12 00007f9764c09c90 (+  16) 000001d3cbb59b29   <_APP_> SyslogDaemon::_DaemonThread(void*) + 0x09
13 00007f9764c09ca0 (+  32) 0000015748582b49   <libroot.so> _thread_do_exit_work (nearest) + 0x89
14 00007f9764c09cc0 (+   0) 00007fab621e1258   <commpage> commpage_thread_exit + 0x00

continuing from this just ends in a NULL dereference.

comment:5 by waddlesplash, 16 months ago

hrev57315 should improve things here.

comment:6 by waddlesplash, 15 months ago

Nope, retried the test program on hrev57316 and got this under load:

PANIC: ASSERT FAILED (../haiku-git/src/system/kernel/scheduler/scheduler_cpu.cpp:566): newKey >= 0
Welcome to Kernel Debugging Land...
Thread 25971 "sed" running on CPU 1
stack trace for thread 25971 "sed"
    kernel stack: 0xffffffff8073c000 to 0xffffffff80741000
      user stack: 0x00007fc0b26b7000 to 0x00007fc0b36b7000
frame                       caller             <image>:function + offset
 0 ffffffff80740470 (+  32) ffffffff8014ade0   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x1a
 1 ffffffff80740490 (+  80) ffffffff800b3a98   <kernel_x86_64> debug_call_with_fault_handler + 0x78
 2 ffffffff807404e0 (+  96) ffffffff800b5144   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf4
 3 ffffffff80740540 (+  80) ffffffff800b54de   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
 4 ffffffff80740590 (+ 240) ffffffff800b5877   <kernel_x86_64> panic + 0xb7
 5 ffffffff80740680 (+  64) ffffffff800a35b1   <kernel_x86_64> Scheduler::CoreEntry::_UpdateLoad(bool) + 0x6c1
 6 ffffffff807406c0 (+ 112) ffffffff8009f727   <kernel_x86_64> reschedule(int) + 0x967
 7 ffffffff80740730 (+ 544) ffffffff8008bf2d   <kernel_x86_64> thread_exit + 0x84d
 8 ffffffff80740950 (+1440) ffffffff800752af   <kernel_x86_64> handle_signals + 0x99f
 9 ffffffff80740ef0 (+  48) ffffffff8008899e   <kernel_x86_64> thread_at_kernel_exit + 0x1e
10 ffffffff80740f20 (+  16) ffffffff8014cbe2   <kernel_x86_64> x86_64_syscall_entry + 0x31e
user iframe at 0xffffffff80740f30 (end = 0xffffffff80740ff8)
 rax 0x0                   rbx 0x0                   rcx 0x1373c0b8459
 rdx 0x198ba2a1b28         rsi 0x7fc0b36b6cb0        rdi 0x0
 rbp 0x7fc0b36b6db0         r8 0x0                    r9 0x1
 r10 0xf                   r11 0x202                 r12 0x4
 r13 0x7fc0b36b7670        r14 0x0                   r15 0x0
 rip 0x1373c0b8459         rsp 0x7fc0b36b6d98     rflags 0x202
 vector: 0x63, error code: 0x0
Note: See TracTickets for help on using tickets.