Opened 3 years ago

Closed 2 years ago

#17573 closed bug (not reproducible)

Haiku QEMU guest hangs while compiling Haiku

Reported by: turbo Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: - General Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: x86-64

Description

Host: qemu 6.2.0-2 on Arch Linux
Guest: rev55860, 8 vCPUs, 4GB RAM, virtio disk/net

I'm running into trouble compiling Haiku x86_64 on Haiku x86_64. I wanted to get a build environment working inside Haiku so I could easily try to work on some of the built-in apps.

This is a fresh install of nightly, then updated via HaikuDepot. I ran pkgman install cmd:python3 cmd:xorriso devel:libzstd, cloned the git repos for buildtools and haiku, and ran:

mkdir generated.x86_64; cd generated.x86_64

../configure --cross-tools-source ../../buildtools --build-cross-tools x86_64

After about 4 or 5 minutes, the entire system starts having a strange issue. I can interact with windows and navigate Tracker, but anything I do that attempts to start a new process hangs forever. If I try to start enough new processes, everything else (including interacting with windows) stops responding.

Attached a screenshot and the kernel serial output. The screenshot was exactly what was on the screen for a few minutes before I took the capture. I tried to get a syslog, but after I force reset the guest, I don't see any syslog entries that align with the kernel serial output. It could be lost?

Filing this in case this is a bug or there are other troubleshooting steps I can follow.

Attachments (6)

haiku0.log (76.6 KB ) - added by turbo 3 years ago.
kernel serial output
haiku.png (159.4 KB ) - added by turbo 3 years ago.
screenshot
scsi bt.png (87.6 KB ) - added by turbo 3 years ago.
scsi thread backtrace
scsi sems.png (33.3 KB ) - added by turbo 3 years ago.
scsi sems
kernel threads.png (254.5 KB ) - added by turbo 3 years ago.
kernel threads
userland bt.png (249.4 KB ) - added by turbo 3 years ago.
userland bt


Change History (14)

by turbo, 3 years ago

Attachment: haiku0.log added

kernel serial output

by turbo, 3 years ago

Attachment: haiku.png added

screenshot

comment:1 by waddlesplash, 3 years ago

Something is locked up in the kernel. The way to diagnose this further is to drop into the kernel debugger and poke around to determine what is being blocked on.

If you are comfortable navigating such things yourself, you can drop to KDL via sendkey alt-sysrq-d in the QEMU compat monitor. From there, use the teams | grep ..., threads <team>, and bt <thread> commands to get backtraces of the hung threads, and the mutex, condition_variable, sem, and related commands to find out precisely which thread holds the lock object that is being blocked on.

(If you aren't comfortable with or don't manage to find your way around the kernel debugger, someone else will have to try and reproduce the problem instead, or I can try and write out more detailed instructions for how to use it.)

comment:2 by turbo, 3 years ago

Thanks for the help. I also went through the kernel debugger guide on the web site. There are quite a few teams and threads running, but many of them have reschedule, thread_block, and fifo::Inode::WaitForReadRequest at the top of the stack.

I poked through some threads in the kernel, and most are waiting on semaphores or cdevs. block notifier is waiting on a semaphore block cache event and bfs log flusher is waiting on a cdev of type 'I/O request finished'. Admittedly, I’m guessing a bit on where to look, but most userland processes are waiting on fifo and the few kernel threads seem to also be waiting on file I/O.

Is there an order to the file I/O events in the kernel that would help me work backwards to figure out where it’s stuck?

Thanks for the patience as I learn this on the fly :)

comment:3 by waddlesplash, 3 years ago

The first two functions you mention are the standard ones called whenever a thread is going to wait for something; what is more interesting is what they are waiting on. The "threads" command generally shows thread state and what it is blocked on (e.g. sem, mutex, etc.)

The FIFO one is more interesting, as is the BFS log flusher waiting for I/O. If you can, please paste a full backtrace of one of the userland process threads that is stuck.

It sounds most probable that disk I/O has gotten stuck for some reason. I know we had problems with Virtio-SCSI in the past hanging, but those were resolved a long time ago (in hrev53452), and I think our package builder VMs are using virtio-scsi by default these days. If you can find the appropriate SCSI threads, you may be able to see what they are currently waiting for, and whether that is where I/O has gotten backed up.

by turbo, 3 years ago

Attachment: scsi bt.png added

scsi thread backtrace

by turbo, 3 years ago

Attachment: scsi sems.png added

scsi sems

by turbo, 3 years ago

Attachment: kernel threads.png added

kernel threads

by turbo, 3 years ago

Attachment: userland bt.png added

userland bt

comment:4 by turbo, 3 years ago

Am I reading this right that scsi_bus_service thread 312 is waiting on sem 1244, which it currently has acquired?

comment:5 by waddlesplash, 3 years ago

If you are running in QEMU, it is nicer to use -serial stdio and copy KDL output from your terminal instead of taking screenshots.
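As a rough illustration of the -serial stdio suggestion, here is a small Python sketch that assembles a QEMU command line matching the guest described in the report (8 vCPUs, 4GB RAM, virtio disk/net). The image name haiku.img, the raw image format, the user-mode network backend, and -enable-kvm on a Linux host are all assumptions for the example, not details taken from the report.

```python
import shlex


def qemu_command(image="haiku.img"):
    """Build a QEMU argument list that routes the guest serial port to the terminal."""
    return [
        "qemu-system-x86_64",
        "-enable-kvm",                    # assumption: Linux host with KVM available
        "-smp", "8",                      # 8 vCPUs, as in the report
        "-m", "4G",                       # 4 GB RAM, as in the report
        # virtio disk; image path and raw format are assumptions
        "-drive", f"file={image},if=virtio,format=raw",
        # virtio network card on a user-mode backend (backend is an assumption)
        "-netdev", "user,id=net0",
        "-device", "virtio-net-pci,netdev=net0",
        # guest serial output, including KDL, goes to this terminal
        "-serial", "stdio",
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command line instead of launching QEMU directly.
    print(shlex.join(qemu_command()))
```

Running QEMU this way from a terminal lets the KDL output be selected and pasted as text, or captured with the terminal's own logging.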

No, you are not reading it quite correctly. Semaphores are used in both resource management and producer/consumer scenarios; this is likely the latter. In this case the same thread will always be the last acquirer, and the other thread will be the releaser. I'm not so familiar with the ins and outs of our SCSI stack, but the scsi_bus_service in particular looks like it is the management layer and only wakes up periodically, so what you have is not necessarily indicative of a problem.
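To make the producer/consumer pattern concrete, here is a minimal Python sketch. It is not Haiku's SCSI code: the thread names "service" and "requester" and the queue are hypothetical stand-ins. It only shows why, in this usage, one thread is always the acquirer of the semaphore and a different thread is always the releaser.

```python
import threading
from collections import deque


def run_demo(n_items=5):
    """Run one producer and one consumer; report who acquired and who released."""
    items_ready = threading.Semaphore(0)   # count = queued items, starts at zero
    queue = deque()
    acquirers, releasers, handled = [], [], []

    def service():
        # Consumer: the only thread that ever acquires the semaphore.
        for _ in range(n_items):
            items_ready.acquire()          # sleeps here while nothing is queued
            acquirers.append(threading.current_thread().name)
            handled.append(queue.popleft())

    def requester():
        # Producer: queues work, then releases the semaphore to wake the consumer.
        for i in range(n_items):
            queue.append(i)
            releasers.append(threading.current_thread().name)
            items_ready.release()

    consumer = threading.Thread(target=service, name="service")
    producer = threading.Thread(target=requester, name="requester")
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return set(acquirers), set(releasers), handled


if __name__ == "__main__":
    acquired_by, released_by, handled = run_demo()
    print("acquired by:", acquired_by)
    print("released by:", released_by)
    print("handled:", handled)
```

A debugger snapshot of such a semaphore would list the consumer both as the last acquirer and, when the queue is empty, as the thread currently waiting on it, which is normal for this pattern rather than a sign of a deadlock.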

The backtrace you have of grep looks like it is just waiting on a FIFO, i.e., a shell pipe, and not for disk I/O. So, none of these backtraces are the actual deadlock we are looking for, or even point to it. Finding something blocked on a "read" syscall that goes into the SCSI layer is what we are really looking for here (and then finding out what is supposed to wake that up and where it is blocked.)

Not sure what timezone you are in, but either I or someone else on IRC might be able to help you track this down in an "interactive" session (or perhaps I should just upgrade my QEMU and see if I can reproduce the problem directly.)

comment:6 by waddlesplash, 3 years ago

(Kudos for finding your way around with only moderate help so far, though! Most people who file tickets for the first time require much more assistance in order to collect data like this :)

comment:7 by turbo, 2 years ago

I think this can be closed. I ran out of time looking into this and I don't seem to be able to reproduce this anymore. If it comes up again, I'll reopen. Thank you for all the help!

comment:8 by nephele, 2 years ago

Resolution: not reproducible
Status: new → closed