Opened 11 years ago

Closed 11 years ago

#2572 closed bug (fixed)

Haiku occasionally freezes with CPU halted

Reported by: andreasf Owned by: axeld
Priority: critical Milestone: R1/alpha1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: x86

Description

For some time, I am occasionally seeing Haiku freeze completely, i.e. it does not react to mouse movements or F12.

Serial output says

CPU 0 halted

(or CPU 1 halted).

There is however no indication what actually leads to this, i.e. the messages before differ, e.g., BFS warnings or "last message repeated two times" or runtime_loader unable to load _APP_ or some non-existing file.

In all instances this happened under load, running configure or make, last at hrev26810. (And it happened already before my local signal handler changes.)

Change History (29)

comment:1 by mmlr, 11 years ago

This is what you get when entering KDL. The usual procedure is that one CPU gets a panic, sends the other CPUs a message to stop them, these acknowledge and go into a loop (that's when the line "CPU x halted" is printed), and then the CPU where panic occured proceeds into the debugger. There are apparently still some race conditions in this, as it can happen that the original CPU doesn't wake up after stopping the others. I reworked those codepaths once, but I can confirm that it sometimes still happens. Though it could also be the code directly after that, as I've also seen a few times that the debugger starts, but then hangs after printing initial output when having the auto_stack_trace debugger module installed.

comment:2 by andreasf, 11 years ago

Thanks for these explanations. I believe I have no special settings or modules installed, so it would seem like the initial race condition.

Anything I should do to confirm? Is there debug output to enable somewhere?

comment:3 by mmlr, 11 years ago

You could take a look at src/system/kernel/debug/debug.cpp and there kernel_debugger() and kernel_debugger_loop(). The messaging and halting is done in src/system/kernel/smp.c. There's no specific debug output, but you can add to it. The problem is that this kind of stuff is hard to reproduce in general and adding debug output usually has the nasty side effect of hiding the problem because it delays/changes locking/scheduling and such. A good code review is probably the better approach here.

comment:4 by andreasf, 11 years ago

I can almost reliably reproduce it at hrev26820. It seems related to the new XSI semaphores.

At commit 5de9faef96e4d575a62a8525429c701eaf8cd7c6 of my Git branch of Mono, executing mono with either some copied-over hello.exe or the monolite mcs.exe (e.g., as part of bootstrapping with make) appears to make heavy use of them, according to serial output, and halts the CPU while aborting. If I configure with --disable-shared-handles, it shows no such XSI activity and aborts okay, without halting the CPU. Still not really suited for debugging though. ;-)

comment:5 by andreasf, 11 years ago

runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/mscorlib.dll.so                                 
xsi_semget: key = 3, numberOfSemaphores = 8, flags = 384                                                           
xsi_semget: key 3 does not exist, but the caller did not ask for creation                                          
xsi_semget: key = 1292045439, numberOfSemaphores = 8, flags = 1920                                                 
xsi_semctl: semaphoreID = 1, semaphoreNumber = 0, command = 9                                                      
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 0                                                  
Last message repeated 7 times.                                                                                     
xsi_semop: semaphoreID = 1, ops = 0x7ffedbd4, numOps = 1                                                           
xsi_semop: semaphoreNumber = 6, value = 1                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: new record added. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 6, value = -1      
xsi_semop: semaphoreID = 1, ops = 0x7ffedbd4, numOps = 1                                                           
xsi_semop: semaphoreNumber = 7, value = 0                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: new record added. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 7, value = 1       
xsi_semop: semaphoreID = 1, ops = 0x7ffedbc4, numOps = 1                                                           
xsi_semop: semaphoreNumber = 6, value = 0                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: found record. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 6, value = -1          
xsi_semop: semaphoreID = 1, ops = 0x7ffed844, numOps = 1                                                           
xsi_semop: semaphoreNumber = 3, value = 1                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: new record added. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 3, value = -1      
xsi_semop: semaphoreID = 1, ops = 0x7ffed844, numOps = 1                                                           
xsi_semop: semaphoreNumber = 3, value = 0                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: found record. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 3, value = -1          
runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/mcs.exe.so                                      
runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/mcs.exe.so                                      
xsi_semop: semaphoreID = 1, ops = 0x7ffed0c8, numOps = 1                                                           
xsi_semop: semaphoreNumber = 3, value = 1                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: found record. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 3, value = 0           
xsi_semop: semaphoreID = 1, ops = 0x7ffed0c8, numOps = 1                                                           
xsi_semop: semaphoreNumber = 3, value = 0                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: found record. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 3, value = -1          
xsi_semop: semaphoreID = 1, ops = 0x7ffecbc8, numOps = 1                                                           
xsi_semop: semaphoreNumber = 3, value = 1                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: found record. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 3, value = 0           
xsi_semop: semaphoreID = 1, ops = 0x7ffecbc8, numOps = 1                                                           
xsi_semop: semaphoreNumber = 3, value = 0                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: found record. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 3, value = -1          
xsi_semop: semaphoreID = 1, ops = 0x7ffed0e8, numOps = 1                                                           
xsi_semop: semaphoreNumber = 6, value = 1                                                                          
xsi_semop: semaphore acquired succesfully                                                                          
XsiSemaphore::RecordUndo: found record. Team = 86564, semaphoreSetID = 1, semaphoreNumber = 6, value = 0           
xsi_semctl: semaphoreID = 1, semaphoreNumber = 7, command = 4                                                      
xsi_semctl: semaphoreID = 1, semaphoreNumber = 0, command = 0                                                      
XsiSemaphoreSet::~XsiSemaphoreSet(): removing semaphore set                                                        
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 0                                                  
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 1                                                  
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 2                                                  
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 3                                                  
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 4                                                  
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 5                                                  
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 6                                                  
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 7                                                  
CPU 0 halted!

comment:6 by emitrax, 11 years ago

xsi semaphore debug output looks fine to me. I don't think it's related. Although they might lead to KDL, they have nothing to do with halting the CPU.

comment:7 by axeld, 11 years ago

I wouldn't be so sure about that, though :-)

Maybe you could add debug output at the beginning and end of every XSI sem function, and see if it's inside it or not.

comment:8 by axeld, 11 years ago

Component: - GeneralSystem/Kernel

comment:9 by axeld, 11 years ago

Priority: normalcritical

comment:10 by mmlr, 11 years ago

Entering KDL is what halts the CPUs, so when there is a panic from XSI sems then it indirectly does halt the CPUs. It would be interesting which panic is triggered as it might be in a situation that leads to this situation reliably. Maybe there is interrupt disabling or spinlocking involved? I could certainly imagine that a spinlock could lead to such a situation.

in reply to:  6 comment:11 by andreasf, 11 years ago

My wording was unfortunate, I regret; I'm not saying XSI semaphores are the cause of this, they just seem a way to trigger it.

As Michael said earlier, halting the CPU is part of KDL, so if the semaphores not only lead to KDL but also happen to lead to this deadlock, for whatever reason, we'd have a way to reproduce this bug. For me it is 100% reproducible this way (and apart from that, "occasionally" referred to multiple times a day).

comment:12 by emitrax, 11 years ago

Currently the only spinlock used is the global team one.

comment:13 by emitrax, 11 years ago

Then don't update, because I'm currently in the process of reworking the way undo requests are handled and I'm removing the global spinlock in favour of a private mutex. I'm glad my bugs help to find other bugs. :-).

Anyway, from the debug output it seems to be in the semctl function, and it seems that it happens when it tries to remove the last undo request. Functions involved is http://dev.haiku-os.org/browser/haiku/trunk/src/system/kernel/posix/xsi_semaphore.cpp#L513

That's what might trigger the panic, but not the cause of your problem as you said.

comment:14 by mmlr, 11 years ago

Andreas, you could disable SMP from the bootmenu and then retest that. Without SMP the original panic should run through so you can get a stacktrace. If you could attach that here that'd be most helpful.

in reply to:  14 comment:15 by andreasf, 11 years ago

I tried disabling CPU 1 in ProcessController (i.e., only CPU 0 left). In that case funnily it ends in

CPU 1 halted!

Will try the bootmenu next.

comment:16 by mmlr, 11 years ago

Yeah, that's expected. Disabling the CPU only instructs the scheduler not to schedule any threads on that CPU, but the CPU keeps running and entering KDL still requires halting it. That you get "CPU 1 halted" is therefore correct, as the thread that enters KDL must run on CPU 0 and therefore it's CPU 0 sending CPU 1 the halt message, hence the "CPU 1 halted" output (CPU 0 is not halted, it's running the kernel debugger loop).

comment:17 by andreasf, 11 years ago

Ha, no KDL without SMP! Instead this:

runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/mscorlib.dll.so
xsi_semget: key = 1176174592, numberOfSemaphores = 8, flags = 384
xsi_semget: key 1176174592 does not exist, but the caller did not ask for creation
xsi_semget: key = 1292045451, numberOfSemaphores = 8, flags = 1920
xsi_semctl: semaphoreID = 1, semaphoreNumber = 0, command = 9
XsiSemaphore::ClearUndos: semaphoreSetID = 1, semaphoreNumber = 0
Last message repeated 7 times.
xsi_semop: semaphoreID = 1, ops = 0x7ffedbd4, numOps = 1
xsi_semop: semaphoreNumber = 6, value = 1
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: new record added. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 6, value = -1
xsi_semop: semaphoreID = 1, ops = 0x7ffedbd4, numOps = 1
xsi_semop: semaphoreNumber = 7, value = 0
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: new record added. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 7, value = 1
xsi_semop: semaphoreID = 1, ops = 0x7ffedbc4, numOps = 1
xsi_semop: semaphoreNumber = 6, value = 0
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 6, value = -1
xsi_semop: semaphoreID = 1, ops = 0x7ffed844, numOps = 1
xsi_semop: semaphoreNumber = 3, value = 1
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: new record added. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 3, value = -1
xsi_semop: semaphoreID = 1, ops = 0x7ffed844, numOps = 1
xsi_semop: semaphoreNumber = 3, value = 0
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 3, value = -1
runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/mcs.exe.so
runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/mcs.exe.so
xsi_semop: semaphoreID = 1, ops = 0x7ffed1d4, numOps = 1
xsi_semop: semaphoreNumber = 2, value = 1
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: new record added. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 2, value = -1
xsi_semop: semaphoreID = 1, ops = 0x7ffed1c4, numOps = 1
xsi_semop: semaphoreNumber = 2, value = 0
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 2, value = -1
xsi_semop: semaphoreID = 1, ops = 0x7ffed1d4, numOps = 1
xsi_semop: semaphoreNumber = 2, value = 1
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 2, value = 0
xsi_semop: semaphoreID = 1, ops = 0x7ffed1c4, numOps = 1
xsi_semop: semaphoreNumber = 2, value = 0
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 2, value = -1
xsi_semop: semaphoreID = 1, ops = 0x7ffed1d4, numOps = 1
xsi_semop: semaphoreNumber = 2, value = 1
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 2, value = 0
xsi_semop: semaphoreID = 1, ops = 0x7ffed1c4, numOps = 1
xsi_semop: semaphoreNumber = 2, value = 0
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 2, value = -1
bfs: bfs_read_link:1368: Invalid Argument
runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/System.Xml.dll.so
bfs: bfs_read_link:1368: Invalid Argument
runtime_loader: cannot open file /ToBe/Mono/mcs/class/lib/monolite/System.dll.so
xsi_semop: semaphoreID = 1, ops = 0x7ffed8b0, numOps = 1
xsi_semop: semaphoreNumber = 3, value = 1
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 3, value = 0
xsi_semop: semaphoreID = 1, ops = 0x7ffed8b0, numOps = 1
xsi_semop: semaphoreNumber = 2, value = 1
xsi_semop: semaphore acquired succesfully
XsiSemaphore::RecordUndo: found record. Team = 48549, semaphoreSetID = 1, semaphoreNumber = 2, value = 0
xsi_semop: semaphoreID = 1, ops = 0x70013ed8, numOps = 1
xsi_semop: semaphoreNumber = 3, value = 0
XsiSemaphore::Add: potentially going to sleep
XsiSemaphore::Wait: going to sleep
vm_soft_fault: va 0xeea000 not covered by area in address space
vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xeea01c, ip 0x31bd15, write 0, user 1, thread 0xbda5
vm_page_fault: sending team "mono" 0xbda5 SIGSEGV, ip 0x31bd15 ("mono_seg0ro" +0x11bd15)
stack trace:
  0x0031d6ec (mono_seg0ro + 0x11d6ec)
  0x0031d8ae (mono_seg0ro + 0x11d8ae)
  0x0031da99 (mono_seg0ro + 0x11da99)
  0x00415d06 (mono_seg0ro + 0x215d06)
  0x180711b1 (heap + 0x711b1)
  0x180706c1 (heap + 0x706c1)
  0x18070478 (heap + 0x70478)
  0x18070432 (heap + 0x70432)
  0x180703cd (heap + 0x703cd)
  0x18070258 (heap + 0x70258)
  0x1806fabb (heap + 0x6fabb)
  0x1806ee3a (heap + 0x6ee3a)
  0x18067983 (heap + 0x67983)
  0x180640a5 (heap + 0x640a5)
  0x00398691 (mono_seg0ro + 0x198691)
  0x002d9f0c (mono_seg0ro + 0xd9f0c)
  0x002db3af (mono_seg0ro + 0xdb3af)
  0x002dae9a (mono_seg0ro + 0xdae9a)
  0x0022d725 (mono_seg0ro + 0x2d725)
  0x0022d7f0 (mono_seg0ro + 0x2d7f0)
  0x0022e9c3 (mono_seg0ro + 0x2e9c3)
  0x0022c9b7 (mono_seg0ro + 0x2c9b7)
  0x0022c86f (mono_seg0ro + 0x2c86f)
  0x001008ea (runtime_loader_seg0ro + 0x8ea)
  0x7ffedfec (mono_main_stack + 0xffffec)
vm_soft_fault: va 0x0 not covered by area in address space
vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x0, ip 0x800b6364, write 0, user 0, thread 0xbda5
xsi_semop: back to life
xsi_semop: semaphore set id 1 got destroyed
xsi_sem_undo: TeamID = 48549, SemaphoreSetID = 1, SemaphoreNumber = 6, undo value = 0
xsi_sem_undo: TeamID = 48549, SemaphoreSetID = 1, SemaphoreNumber = 7, undo value = 1
xsi_sem_undo: TeamID = 48549, SemaphoreSetID = 1, SemaphoreNumber = 3, undo value = -1
xsi_sem_undo: TeamID = 48549, SemaphoreSetID = 1, SemaphoreNumber = 2, undo value = -1

comment:18 by mmlr, 11 years ago

That output looks more like a "normal" application bug in mono though. A stack trace from the user debugger would be a bit more readable, but what you see there shouldn't be a kernel bug. Well, then it's "unfortunate" that you don't get KDL without SMP, as this makes it trickier to pinpoint. As suggested originally a code review of the SMP code that does the halting and the KDL entry codepath would likely be a good idea.

in reply to:  18 comment:19 by andreasf, 11 years ago

Possible that there's a bug in my port that's triggering other things, but I'm not complaining about Mono issues here. Those I'm tracking over at: http://ports.haiku-files.org/wiki/dev-lang/mono/2.x/1

Vanilla Mono does run fine on OpenSolaris and other platforms. The only vaguely related change of mine that comes to mind is that Haiku did not have madvise, mprotect and the POSIX sigaction with SA_SIGINFO yet. I just cross-checked with shared handles disabled and SMP disabled, aborts fine, no vm_soft_fault or freeze.

Anyhow, if it's a bug in a userland app such as mono, I'd expect an error (message / crash) in userland. If it's a bug in the kernel, I'd expect a KDL (to investigate), never just a freeze without info.

comment:20 by mmlr, 11 years ago

Sure, I'm not saying the original issue is not a bug in Haiku. Just the vm_soft_fault in the output you get without SMP is probably caused by mono and not the same issue seen with SMP enabled. Obviously something happens with SMP enabled that doesn't with SMP disabled that leads to a panic() in the kernel with some other factors that cause the hang.

comment:21 by andreasf, 11 years ago

I've found a hint ... when compiling with -D_REENTRANT -D_THREAD_SAFE I only get the vm_page_fault, no KDL or freeze, like with SMP disabled.

So I'd assume the problem leading to the freeze is triggered somewhere in libroot (or the headers) where a differentiation based on one of those defines is made. If that were true and there were no easy way to prevent the freeze, we could close this bug as invalid.

comment:22 by bonefish, 11 years ago

Might be fixed in hrev28223.

comment:23 by stippi, 11 years ago

I have tried building Haiku within Haiku after upgrading to hrev28224 twice and both times, the system completely freezed. I don't know if this is a new bug or this bug. I was running on real hardware both times. The second time, the build was probably running for less than 5 minutes until it happened. The command I have run was "time jam -q -j2 haiku-image". ActivityMonitor was the only other noteworthy program to be running.

in reply to:  23 comment:24 by bonefish, 11 years ago

Replying to stippi:

I have tried building Haiku within Haiku after upgrading to hrev28224 twice and both times, the system completely freezed. I don't know if this is a new bug or this bug. I was running on real hardware both times. The second time, the build was probably running for less than 5 minutes until it happened. The command I have run was "time jam -q -j2 haiku-image". ActivityMonitor was the only other noteworthy program to be running.

  • Does disabling KDEBUG (in kernel_debug_config.h) fix the problem? I have tried a Haiku build with a hrev28223 no-KDEBUG Haiku only once yet, and it went just fine.
  • Does disabling SMP fix the problem?

comment:25 by stippi, 11 years ago

It appears I was already running Haiku with KDEBUG=0. Will try disabling SMP next.

in reply to:  25 comment:26 by bonefish, 11 years ago

Replying to stippi:

It appears I was already running Haiku with KDEBUG=0. Will try disabling SMP next.

ATM, KDEBUG is inconsistently used with both "#if" and "#ifdef". So setting KDEBUG to 0 might actually cause the problem. Try commenting it out instead.

comment:27 by stippi, 11 years ago

Thanks a lot! I have commented out KDEBUG and now the build went through for the first time since I use your changes! (Will try a few more times later.) I am also happy to report that Haiku build itself in record time for me. The last measurement I took was 43:09min and now it's down to 35:14min! That's 18.3 or 22.5 % depending on the way you look at it. Very nice work, Ingo!

comment:28 by bonefish, 11 years ago

Milestone: R1R1/alpha1

It seems the hang is reproducible with KDEBUG=1 on an SMP machine (halfway through building an image).

comment:29 by bonefish, 11 years ago

Resolution: fixed
Status: newclosed

The bug introduced in hrev28223 is fixed in hrev28245. The original bug is probably fixed in hrev28223. Closing. Please reopen when encountered again.

Note: See TracTickets for help on using tickets.