Opened 4 years ago

Closed 13 months ago

#12306 closed bug (fixed)

Haiku intermittently won't shutdown

Reported by: vidrep Owned by: nobody
Priority: high Milestone: R1/beta1
Component: System Version: R1/Development
Keywords: shutdown detailed address_space Cc:
Blocked By: Blocking: #10883
Has a Patch: no Platform: All

Description

Recent builds of Haiku will often be very slow to shutdown. Today, it wouldn't shutdown at all. Syslogs attached.

Attachments (21)

syslog (4.4 KB) - added by vidrep 4 years ago.
previous_syslog (136.4 KB) - added by vidrep 4 years ago.
syslog.old (512.0 KB) - added by vidrep 4 years ago.
syslog.2 (137.6 KB) - added by vidrep 3 years ago.
IMG_0125.JPG (1.6 MB) - added by vidrep 3 years ago.
syslog_tail (1.4 MB) - added by vidrep 3 years ago.
teams (1.4 MB) - added by vidrep 3 years ago.
threads (1.8 MB) - added by vidrep 3 years ago.
rwlock (1.5 MB) - added by vidrep 3 years ago.
IMG_0136.JPG (1.4 MB) - added by vidrep 3 years ago.
IMG_0132.JPG (2.9 MB) - added by vidrep 3 years ago.
IMG_0133.JPG (1.7 MB) - added by vidrep 3 years ago.
IMG_0134.JPG (2.3 MB) - added by vidrep 3 years ago.
KDL_serial_log.txt (20.9 KB) - added by vidrep 3 years ago.
10032015_1.txt (19.2 KB) - added by vidrep 3 years ago.
10152015.txt (42.0 KB) - added by vidrep 3 years ago.
KDL_report (11.7 KB) - added by vidrep 3 years ago.
IMG_0150.JPG (1.5 MB) - added by vidrep 3 years ago.
listdev (3.4 KB) - added by bruno 2 years ago.
listdev_HP8100 (2.1 KB) - added by vidrep 2 years ago.
listdev_HP8300 (2.0 KB) - added by vidrep 2 years ago.

Change History (85)

Changed 4 years ago by vidrep

Attachment: syslog added

Changed 4 years ago by vidrep

Attachment: previous_syslog added

Changed 4 years ago by vidrep

Attachment: syslog.old added

comment:1 Changed 4 years ago by vidrep

hrev49548 x86_gcc2

comment:2 Changed 4 years ago by waddlesplash

Today, it wouldn't shutdown at all.

What does this mean? Does it close all apps/services, but never turns off? Or does it hang on trying to close apps?

comment:3 Changed 4 years ago by vidrep

Sorry for the poor description. Haiku never turns off, even after waiting more than 15 minutes. There is not any message about an application or service that needs to be terminated - only a blue screen.

comment:4 Changed 4 years ago by waddlesplash

Milestone: UnscheduledR1/beta1
Priority: normalblocker

I don't see anything in the syslogs, but I suspect it's (yet another) launch_daemon regression.

comment:5 Changed 4 years ago by vidrep

Would capturing serial data help at all? The problem is intermittent, but happens often enough that there is a chance of maybe getting something useful. Any suggestions of steps to take?

comment:6 Changed 4 years ago by axeld

The launch_daemon currently has nothing to do with the shutdown process at all; while it's planned to change this, it's unlikely to be the cause of any issues there at this time.

comment:7 Changed 3 years ago by vidrep

It's happening now with hrev49573 x86_gcc2. It wouldn't shut down. I had to power off and cold boot. Upon boot it immediately went into KDL. I am at the KDL screen now. Any steps to take?

comment:8 Changed 3 years ago by vidrep

Unfortunately I have to go now. I'll attach the syslogs and a photo of the KDL later.

comment:9 Changed 3 years ago by vidrep

Attached syslog and photo of KDL output. If this happens again, what steps can I take at the KDL screen to get useful information?

Changed 3 years ago by vidrep

Attachment: syslog.2 added

Changed 3 years ago by vidrep

Attachment: IMG_0125.JPG added

comment:10 Changed 3 years ago by vidrep

The KDL output appears the same as reported in tickets #11439 and #11952. Perhaps this is two unrelated issues and #11439 should be reopened?

comment:11 Changed 3 years ago by diver

The crash in ps2_dev_unpublish is already being tracked in another ticket but I was unable to find which one.

comment:12 Changed 3 years ago by diver

For the record - it was #11952.

comment:13 Changed 3 years ago by vidrep

This is till happening in hrev49627 x86_64. Is there anything I can do when the shutdown sequence is hanging? I am able to drop into KDL using Alt-SysRq-D. Are there any commands I can execute to get something useful for resolving this issue?

comment:14 Changed 3 years ago by diver

Can you capture the output of syslog | tail, teams and threads?

comment:15 Changed 3 years ago by vidrep

hrev49653 x86_gcc2 Shutdown -> Power Off Shutdown Status -> Shutting Down...

Attached are photos of syslog | tail, teams, and threads

Changed 3 years ago by vidrep

Attachment: syslog_tail added

Changed 3 years ago by vidrep

Attachment: teams added

Changed 3 years ago by vidrep

Attachment: threads added

comment:16 Changed 3 years ago by vidrep

Happened again Dropped into KDL and ran rwlock command on the suspect thread Photo attached (rwlock)

Changed 3 years ago by vidrep

Attachment: rwlock added

comment:17 Changed 3 years ago by vidrep

I have attached the photo again without the rename.

Last edited 2 years ago by vidrep (previous) (diff)

Changed 3 years ago by vidrep

Attachment: IMG_0136.JPG added

comment:18 Changed 3 years ago by ttcoder

Thanks, easier to read online if it has the .JPG extension ;-)

Seems the read-write lock at address 0x82809f10 is the culprit indeed since the same "shutdown worker" thread was stuck on that same lock in the previous instance of the bug (previous screenshot you uploaded).

What's needed now is to draw the attention of one of the kernel devs, drawing their attention to the rwlock named kernel address space not being relinquinshed by thread 5 (-> ?), hence thread 3034 (-> "shutdown worker") being unable to complete its work. Maybe they can interpret the "count" field also, I have no idea what that one means

comment:19 Changed 3 years ago by vidrep

I'm constantly switching between Alpha 4.1, x86_gcc2 and x86_64 builds while testing, so that's probably why I see this issue as much as I do - usually every few shutdown events. Somebody just needs to guide me on what steps to take to get useful information so the bug can be resolved. Thanks again Cedric!

comment:20 Changed 3 years ago by vidrep

It happened again today. This time I got a backtrace on thread 5 (rwlock) and 420 (shutdown worker). Photo attached (IMG_0132)

Changed 3 years ago by vidrep

Attachment: IMG_0132.JPG added

comment:21 Changed 3 years ago by vidrep

I may have backtraced the wrong thread. Fortunately it happened on the next shutdown attempt, and I was able to get the proper backtrace for the shutdown worker. Attached photos IMG_1033 and IMG_0134.

Changed 3 years ago by vidrep

Attachment: IMG_0133.JPG added

Changed 3 years ago by vidrep

Attachment: IMG_0134.JPG added

comment:22 Changed 3 years ago by ttcoder

shutdown-worker's backtrace locates it in rw_lock_wait(), sounds consistent with with the threads command listing is as being stuck waiting on an read-write lock

undertaker's backtrace is more interesting: it seems it acquired the same lock indeed (since its bt is deep in "areas" functions and the lock it presumably acquired does deal with areas) but never got around to releasing it since it's sitting idle in smp_send_broadcast_ici()

I can't find it now, but unless I'm mixing up tickets I seem to recall one of the screenshots you had posted showed you had 3 cores disabled (1, 2, and 3) and only one (0) remaining active.

Maybe smp_send_broadcast_ici() fails whenever it's called with the other cores disabled, which happens sometimes for you for some reason ?? Just guessing, not kernel savvy, you need to get on the horn on IRC ..etc and bribe/beg/steal a solution or at least a true analysis from a real kernel dev, giving them a pointer to this backtrace stuff :-)

comment:23 Changed 3 years ago by bonefish

If the undertaker (or whatever thread) doesn't return from smp_send_broadcast_ici, it's probably either a bug in the ICI mechanism or some other thread is stuck somewhere else with interrupts disabled. It would be good to get stack traces for all currently running threads (get the list via running).

comment:24 Changed 3 years ago by vidrep

Since this bug is reproducible on my system, can somebody give me a step-by-step on what to do next time it occurs? e.g.

  1. Invoke KDL > Alt-SysRq-D
  2. threads
  3. rwlock 0x82809f10
  4. bt 5
  5. bt 3514
  6. running???

comment:25 Changed 3 years ago by ttcoder

I guess that would be:

  1. Invoke KDL > Alt-SysRq-D
  2. running
  3. bt x
  4. bt y
  5. bt z

Where 'x' 'y', 'z' and so on, are to be replaced with the thread numbers returned in the output of the running command (based on your previous posts we can guess that the first number might be 514 or 516; you'll have to figure out the other numbers from the output on screen); there should not be many of them don't worry, depending on how many cores your computer CPUs have.

comment:26 Changed 3 years ago by vidrep

Is there any way to capture the output of the KDL session to a log file or output through a serial port?

comment:27 Changed 3 years ago by diver

IIRC, debug_server writes KDL session to the syslog a few moments after you escape KDL.

comment:28 Changed 3 years ago by vidrep

So, wouldn't the same data be output through the serial port, if the system was set up for serial debugging? It appears only shutdown is affected. If I type in the "reboot" command from KDL then it reboots.

Last edited 3 years ago by vidrep (previous) (diff)

comment:29 Changed 3 years ago by diver

Sure, serial output should be the same.

comment:30 Changed 3 years ago by vidrep

I managed to capture the KDL debug output (attached) using a serial connection to a Windows PC. I hope there is something useful in the information provided. Let me know if I can try any other steps.

Changed 3 years ago by vidrep

Attachment: KDL_serial_log.txt added

comment:31 Changed 3 years ago by ttcoder

@bonefish All three of his "idle thread" are listed as being inside smp_send_broadcast_ici(), with some sort of interrupt occuring therein (?)

There are several calls to that function, maybe this is the one that is referred to in the backtrace: http://code.metager.de/source/xref/haiku/src/system/kernel/arch/x86/paging/X86VMTranslationMap.cpp#108

And the function seems to be defined here: http://code.metager.de/source/xref/haiku/src/system/kernel/smp.cpp#1208

comment:32 Changed 3 years ago by bonefish

The stack traces look all pretty normal (@ttcoder: Not sure what you're seeing. The idle threads are in the idle loop and the interrupt they are handling is the ICI throwing them into KDL). I don't have the disassembly of the kernel in question, but I suppose the "undertaker" thread is pre-empted at the moment it re-enables interrupts at the end of smp_send_broadcast_ici().

I can imagine two possible error scenarios: 1. The thread is never continued (a scheduler issue). 2. The thread is in an infinite loop in X86VMTranslationMapPAE::UnmapPages(). Both can be tested fairly easily via the kernel debugger by setting a breakpoint at the respective return instruction and exiting KDL. If all is well, the thread should reach the breakpoint (immediately) which would trigger KDL. If not, KDL is not entered.

So the whole procedure is:

  • Enter KDL as usual, get the threads and the "undertaker" stack trace.
  • The stack trace contains two interesting lines. In the attached output those are frames 5 and 7 (first number in the line), listing X86VMTranslationMap<...>::Flush() and X86VMTranslationMapPAE<...>::UnmapArea() respectively. The last numbers in either line (before the "<kernel_x86>") are the respective return addresses where to set a breakpoint. In the attached output the addresses are 8014de51 and 8014069f respectively. Unless another kernel is installed, they should remain the same (but it doesn't harm to check).
  • Test the first hypothesis (note the "0x" to prepend to the address):
    breakpoint 0x8014de51
    co
    
    If KDL reappears right away, probably mentioning that a breakpoint or debug exception has been encountered, the hypothesis is refuted and the next one must be checked. Otherwise we have found the culprit. If KDL reappears, but only after a noticeable delay, that could point to an entirely different issue.
  • Test the second hypothesis:
    breakpoint 0x8014de51 clear
    breakpoint 0x8014069f
    co
    
    Again, a reappearing KDL refutes the hypothesis.

comment:33 Changed 3 years ago by vidrep

Bonefish,

Since Haiku refuses to shutdown 50% of the time, it should be easy for me to get the information as per your instructions. I should have a log for you later today.

comment:34 Changed 3 years ago by vidrep

Bonefish, As usual, it refused to shutdown on only the second attempt. I went through the steps as described in comment 32. The log is attached (10032015_1.txt) In both instances it returned to the KDL prompt immediately.

Changed 3 years ago by vidrep

Attachment: 10032015_1.txt added

comment:35 Changed 3 years ago by vidrep

Is there anything else I can try to help resolve this issue?

comment:36 Changed 3 years ago by bonefish

I just noticed I was looking at the wrong lock. The lock acquired in X86VMTranslationMapPAE::UnmapPages() is actually the translation map lock, not the address space lock. The address space lock is acquired in vm_delete_area() i.e. farther down the stack. Could you try the breakpoint procedure with the return address from vm_delete_area() to delete_area()? In your attached logs that would be stack frame number 10 or address 8011f91d.

In case the thread doesn't return, please go up the stack one by one repeating the procedure up to frame number 8 or until the thread returns to KDL. The full procedure would be:

breakpoint 0x8011f91d
co
...
breakpoint 0x8011f91d clear
breakpoint 0x801186ae
co
...
breakpoint 0x801186ae clear
breakpoint 0x8011839f
co

comment:37 in reply to:  36 Changed 3 years ago by vidrep

Replying to bonefish:

I just noticed I was looking at the wrong lock. The lock acquired in X86VMTranslationMapPAE::UnmapPages() is actually the translation map lock, not the address space lock. The address space lock is acquired in vm_delete_area() i.e. farther down the stack. Could you try the breakpoint procedure with the return address from vm_delete_area() to delete_area()? In your attached logs that would be stack frame number 10 or address 8011f91d.

In case the thread doesn't return, please go up the stack one by one repeating the procedure up to frame number 8 or until the thread returns to KDL. The full procedure would be:

breakpoint 0x8011f91d
co
...
breakpoint 0x8011f91d clear
breakpoint 0x801186ae
co
...
breakpoint 0x801186ae clear
breakpoint 0x8011839f
co

I did as you instructed. I have attached the resulting log file to the ticket. Like last time, the prompt returned without delay.

Changed 3 years ago by vidrep

Attachment: 10152015.txt added

comment:38 Changed 3 years ago by vidrep

Have not seen this problem in about a week. Usually I'd see it every 2nd or 3rd shutdown attempt. Today, I attempted at least 12 shutdowns without issue, as a final test, using both x86_gcc2 and x86_64 builds. Let's consider the bug fixed. I will reopen the ticket if I see it again.

comment:39 Changed 3 years ago by pulkomandy

Resolution: fixed
Status: newclosed

comment:40 Changed 3 years ago by vidrep

Please reopen this ticket. I spoke too soon about a fix :(. Last night the problem manifested itself twice. I was able to invoke Kernel Debugger and get a back trace on a rwlock. It appeared to be different than what I was getting before. Unfortunately, I was not able to save a log.

comment:41 Changed 3 years ago by diver

Resolution: fixed
Status: closedreopened

comment:42 Changed 3 years ago by vidrep

It happened tonight with hrev49857 x86_gcc2 shutdown->power off shutdown status->shutting down hangs at this message I have attached a KDL_report of the rwlock backtrace

Changed 3 years ago by vidrep

Attachment: KDL_report added

comment:43 Changed 3 years ago by vidrep

This is still happening with hrev49913 x86_64. However, it is much more intermittent than before. Last time was 4 weeks ago. Previously, it was happening on every 2nd or 3rd shutdown. So, if it's ever going to get resolved, we might have to fall back on those earlier builds or get lucky with recent builds.

comment:44 Changed 3 years ago by axeld

Priority: blockerhigh

Obviously only affects some systems - I haven't seen this yet.

comment:45 in reply to:  44 Changed 3 years ago by vidrep

Replying to axeld:

Obviously only affects some systems - I haven't seen this yet.

Would it be worth the effort to install an earlier build?

comment:46 Changed 3 years ago by tqh

From the KDL_report it looks like AddressSpaceReadLocker is already held by something when using area_for in ACPICAHaiku.cpp:

void
AcpiOsUnmapMemory(void *where, ACPI_SIZE length)
{
	DEBUG_FUNCTION_F("mapped: %p; length: %lu", where, (size_t)length);
	delete_area(area_for(where));
}

Not sure how this could be. Any ideas?

comment:47 Changed 3 years ago by tqh

Only thing ACPI related I can see is that we allow unloading ACPI module. Is that a bad thing?

comment:48 Changed 3 years ago by vidrep

It happened today on hrev50156 x86_gcc2. Attached is a photo of the back trace for the rwlock.

Changed 3 years ago by vidrep

Attachment: IMG_0150.JPG added

comment:49 Changed 3 years ago by vidrep

Had it happen yesterday on hrev50325 x86_gcc2. I initiated a shutdown and left the house. When I came back home 2 hours later Haiku was still in the process of shutting down.

comment:50 Changed 3 years ago by tqh

Maybe it is because we disable all but cpu 0 before shutdown:

		// Make sure we run on the boot CPU (apparently needed for some ACPI
		// implementations)
		_user_set_cpu_enabled(0, true);
		for (int32 cpu = 1; cpu < smp_get_num_cpus(); cpu++) {
			_user_set_cpu_enabled(cpu, false);
		}
		// TODO: must not be called from the idle thread!
		thread_yield();

		status = acpi->prepare_sleep_state(ACPI_POWER_STATE_OFF, NULL, 0);
		if (status == B_OK) {
			//cpu_status state = disable_interrupts();
			status = acpi->enter_sleep_state(ACPI_POWER_STATE_OFF);
			//restore_interrupts(state);
		}

Inside prepare_sleep_state we hang. From https://github.com/haiku/haiku/blob/master/src/system/kernel/arch/x86/arch_cpu.cpp#L112

comment:51 Changed 3 years ago by axeld

Blocking: 10883 added

(In #10883) Duplicate of #12306 which has a lot more info to work with.

comment:52 Changed 2 years ago by vidrep

This is still happening with hrev50738, using a different HP desktop PC.

comment:53 Changed 2 years ago by bruno

Haiku hrev 50849

still happens here too. sometimes if I close the wifi-connection... Haiku will shutdown... sometimes difficult to track down the problem! Very difficult to reproduce!

comment:54 Changed 2 years ago by vidrep

Still happening with hrev50861. It only appears to happen on my HP branded PC's (Elite 8100 & Elite 8300). Could this possibly be an issue with the Intel chipset that's used? Bruno, can you attach your listdev for comparison?

Changed 2 years ago by bruno

Attachment: listdev added

comment:55 Changed 2 years ago by vidrep

I had 2 HP branded PC's with the same shutdown problem. I have attached the listdev for each one of them. listdev_HP8100 listdev_HP8300

Changed 2 years ago by vidrep

Attachment: listdev_HP8100 added

Changed 2 years ago by vidrep

Attachment: listdev_HP8300 added

comment:56 Changed 19 months ago by vidrep

Summary: Haiku won't shutdownHaiku intermittently won't shutdown on HP brand PC's

comment:57 Changed 19 months ago by bruno

I use a samsung laptop with Haiku... problem is still the same for more than 2 years now... wont close! Have to hold the power button for 4 seconds! After some reboots it goes to kdl! Have to blank the hd and install Haiku from scratch! Strange Bug!

comment:58 Changed 19 months ago by phoudoin

It seems more related to AddressSpaceReadLocker still hold while we disabled all cpus but the one calling ACPI wich will call area_for() to deadlock there (as no other thread could now wake up the lock wait)

As some users has reported that disabling all but last cpu with Pulse occasionnaly freeze too, I guess investigating some hidden race condition behind _user_set_cpu_enabled() could make sense.

Last edited 19 months ago by phoudoin (previous) (diff)

comment:59 Changed 14 months ago by tqh

Keywords: shutdown detailed added

comment:60 Changed 14 months ago by tqh

Keywords: address_space added

comment:61 Changed 14 months ago by tqh

Summary: Haiku intermittently won't shutdown on HP brand PC'sHaiku intermittently won't shutdown

This looks like it is not hardware specific. I can reproduce it quite often by turning off the first CPU and then shutting down. Can't reproduce in QEMU.

comment:62 Changed 14 months ago by vidrep

The last commit by tqh appears to have fixed this issue. I have not had a no-shutdown event since. The only unusual thing I had happen, once, was when shutting down, the HDD sounded like it was still spinning for a couple of seconds after the PC had already shut off. Related to this perhaps?? https://dev.haiku-os.org/ticket/6847

comment:63 Changed 13 months ago by vidrep

Haiku is shutting down cleanly on both x86_gcc2h and x86_64 since the commit. The issue is fixed (in my case at least). Let's close.

comment:64 Changed 13 months ago by tqh

Resolution: fixed
Status: reopenedclosed

Closing, please comment if it is still a problem. Preferably with syslogs and hardware details to investigate.

Note: See TracTickets for help on using tickets.