Opened 2 years ago

Closed 15 months ago

#18109 closed bug (fixed)

UI freezes / becomes unusable

Reported by: outsidecontext
Owned by: nobody
Priority: normal
Milestone: R1/beta5
Component: - General
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description

I am using Haiku nightly builds (currently on hrev56609) in a qemu VM under Linux (currently using GNOME Boxes for managing this). Since a recent update the UI becomes unresponsive after a short while of usage. I can't pin down exactly when it started, but it was sometime in the last two weeks.

The symptoms are:

  • Initially the system works as normal
  • After a while I notice that the UI partially stops working: I can't launch applications and the desktop context menu does not work. Other parts still work, e.g. I can open the task menu and move windows
  • Shortly after, moving windows starts producing artifacts

At that point the system has become very much unusable to a degree that I can't really check anything (e.g. running processes or logs). That makes it difficult to figure out the root cause.

I also could not figure out when this will happen. It sometimes happens very early, or I can use Haiku for a while.

I'm happy for any pointers that might help track down the actual cause of this issue.

Attachments (8)

Bildschirmfoto vom 2022-11-29 08-35-57.png (497.2 KB ) - added by outsidecontext 2 years ago.
teams.png (91.9 KB ) - added by outsidecontext 2 years ago.
threads-1.png (584.0 KB ) - added by outsidecontext 2 years ago.
threads-2.png (647.9 KB ) - added by outsidecontext 2 years ago.
sc-git.png (198.7 KB ) - added by outsidecontext 2 years ago.
git process stack trace
sc-and-sem-tracker.png (128.9 KB ) - added by outsidecontext 2 years ago.
threads-after-tracker-start.png (71.5 KB ) - added by outsidecontext 2 years ago.
stacktraces-22-12-19.zip (2.1 MB ) - added by outsidecontext 2 years ago.

Change History (40)

by outsidecontext, 2 years ago

comment:1 by outsidecontext, 2 years ago

Ok, maybe I just got a clue: The issue seems to always start with some heavy disk IO.

I actively tried to reproduce the issue by starting and closing different installed applications, without it happening. I then did a git branch switch inside a terminal, which just hung and never completed, and then the issue started as described above. Notably, the CPU usage was near zero. I attached a screenshot.

I tried to kill the git task, but it would not close. Now even a simple "ls" inside a terminal hangs in a similar way.

comment:2 by outsidecontext, 2 years ago

I can very reliably reproduce this issue with "git checkout" in any git repository. Always same behavior. I could also narrow it down by bisecting with older versions from the boot menu. In between these two states the issue starts to happen:

2022-10-06 hrev56480: OK
2022-10-17 hrev56514: FAILED

No issue at all with hrev56480. When I boot hrev56514 I get this hanging every time I do a git checkout in any repository.

So the issue is older than two weeks; I just happened to notice it only now.

comment:3 by X512, 2 years ago

I can very reliably reproduce this issue with "git checkout" in any git repository.

May be related: #15585.

comment:4 by outsidecontext, 2 years ago

Not sure, maybe remotely related. But the linked issue is pretty old and is about IO blocking for a few seconds. In general the file system is not really fast in scenarios like git, handling writes to many small files. Branch switching on Haiku is notably slower than on other systems. That's especially the case with larger file trees, e.g. checked out haikuports. But that's not really the issue I'm having here.

In my case the system does not recover (at least not in around 10-20 minutes, haven't tried keeping it running longer). The differences between the branches are minor. Also the issue is not present in hrev56480 and got introduced a short time after that.

The similarity to the above issue is that it is some kind of IO blocking, though. Everything else I'm experiencing on the desktop is just a symptom of this then.

comment:5 by outsidecontext, 2 years ago

Ok, now I found something odd: In order to reproduce this I created a new VM, and was unable to reproduce the issue at first. But then I saw a difference in the configured CPU count for the VMs: The newly created one used all 8 cores, the older one was limited to 4.

I tested again with varying CPU core configuration in the VMs. Result:

With 4 CPUs I can reproduce the issue reliably. On every "git checkout" (no matter the differences between the branches) there is this IO freeze.

With any other CPU count (1-3, 5-8) everything works.

At least my findings give me an easy workaround, and maybe they provide some clue in finding the root cause (which might be threading related).

As noted above with hrev56480 this was working well no matter the CPU count.

comment:6 by waddlesplash, 2 years ago

Milestone: Unscheduled → R1/beta4

That's a rather large range. In it, there are a few minor app_server changes, the refactor of USB handling to at least partially use the new bus manager, disk system name refactors, dirent fixes, -fvisibility=hidden for kernel-mode static libraries. The other changes seem extremely unlikely to have caused this problem (and really, none of those seems especially obvious.)

Can you try to narrow things down further? See the section at the end of https://www.haiku-os.org/guides/daily-tasks/updating-system/ and then the currently available hrevs at https://eu.hpkg.haiku-os.org/haiku/master/x86_64/ to try and narrow this down to a more specific version.
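
As a rough aid for the narrowing-down requested here, the next revision to test can be picked by plain bisection between the last known-good and first known-bad hrev. This is only a sketch: `bisect_mid` and `bisect_rounds` are hypothetical helper names, and not every hrev necessarily has packages available, so the nearest available revision may have to be chosen by hand.

```shell
#!/bin/sh
# Hypothetical helpers for bisecting between two hrev numbers.
# Arguments are plain integers, without the "hrev" prefix.

# Print the midpoint between a known-good and a known-bad revision.
bisect_mid() {
    good=$1
    bad=$2
    echo $(( (good + bad) / 2 ))
}

# Print roughly how many test rounds remain for the range,
# i.e. how often the span can be halved until it reaches 1.
bisect_rounds() {
    span=$(( $2 - $1 ))
    rounds=0
    while [ "$span" -gt 1 ]; do
        span=$(( (span + 1) / 2 ))
        rounds=$(( rounds + 1 ))
    done
    echo "$rounds"
}
```

For the range reported in comment:2, `bisect_mid 56480 56514` suggests starting near hrev56497, and `bisect_rounds 56480 56514` estimates about six boots to isolate a single revision.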

comment:7 by outsidecontext, 2 years ago

I can't reproduce this anymore. Over the last week I had this frequently, and yesterday, as I wrote, I could reliably reproduce it (always doing the same: fresh boot, terminal, cd repo, git checkout somebranch -> freeze). Today I can't reproduce it at all, no matter the CPU settings of the VM or the Haiku revision I boot :( Same VM, same procedures as yesterday.

As far as I can tell I had no relevant updates on the host system. There was some update to perl, that's all.

I'll continue using Haiku and see if it happens again.

comment:8 by waddlesplash, 2 years ago

Milestone: R1/beta4 → Unscheduled

OK. If it happens again, save the VM state and ask on IRC if anyone is around to help you debug it. We can try poking around in KDL to see if it can be determined what has gone wrong.

comment:9 by kallisti5, 2 years ago

I've seen this randomly now after testing some images. I'm working to determine some circumstances around it showing up.

comment:10 by outsidecontext, 2 years ago

I started to have this more frequently again, after I could barely reproduce it before, as I wrote. Currently the CPU count does not make a difference. It all indicates some race condition leading to a deadlock, and previously I just shifted the likelihood of this happening.

It happens in all kinds of situations, but it always seems to be IO related. Doing git stuff is a good way to trigger it, probably because git can quickly do a lot of IO for all the small files it manages. But I also got it when just listing files in terminal, saving files in icon-o-matic etc. It's really super random.
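
The "many small files" pattern described here can be approximated without git. The following is a minimal sketch, assuming a POSIX shell; `small_file_burst` is a hypothetical helper, and nothing in this ticket confirms that it triggers the hang the way "git checkout" does.

```shell
#!/bin/sh
# Hypothetical stress helper: create COUNT small files under DIR,
# flush them to disk, then print how many files exist afterwards.
small_file_burst() {
    dir=$1
    count=$2
    mkdir -p "$dir" || return 1
    i=0
    while [ "$i" -lt "$count" ]; do
        # Each file gets a few bytes, similar to many tiny writes.
        printf 'payload %s\n' "$i" > "$dir/file_$i" || return 1
        i=$(( i + 1 ))
    done
    sync
    ls "$dir" | wc -l | tr -d ' '
}
```

A call such as `small_file_burst /tmp/burst 5000` either prints the file count or, if I/O stalls, never returns, which makes the stall easy to spot from a terminal.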

by outsidecontext, 2 years ago

Attachment: teams.png added

by outsidecontext, 2 years ago

Attachment: threads-1.png added

by outsidecontext, 2 years ago

Attachment: threads-2.png added

by outsidecontext, 2 years ago

Attachment: sc-git.png added

git process stack trace

comment:11 by outsidecontext, 2 years ago

I managed to reproduce this again (hrev56645). This case again is a fresh boot, where I opened a terminal, changed into my haikuports checkout and switched branches.

I opened KDL and collected some basic output, see the screenshots. I listed teams, all processes and, because it is in the odd "other" state and is the process currently hanging, also the stack trace for the git thread.

The git thread keeps hanging exactly there, with no change even after a while.

I'll keep this VM state for now, what else should I look for?

comment:12 by outsidecontext, 2 years ago

Another detail: In this system state I tried opening a Tracker window (by double clicking on the Home folder on my desktop).

I'll attach a screenshot of threads after this (threads-after-tracker-start.png), it shows the tracker thread is waiting for a semaphore. Also added sc and sem output for this thread (sc-and-sem-tracker.png).

by outsidecontext, 2 years ago

Attachment: sc-and-sem-tracker.png added

by outsidecontext, 2 years ago

comment:13 by waddlesplash, 2 years ago

WaitForPageEvents is likely waiting on disk I/O. The fact that the rest of the system hung, not just this one process, is a good indication that I/O has stalled altogether. So, the real problem is likely on some other thread stack trace.

comment:14 by kallisti5, 2 years ago

@outsidecontext you mentioned a VM. What are you using to manage your VMs?

I saw something similar in Gnome Boxes and on physical hardware, but qemu-system-x86_64 runs fine.

comment:15 by kallisti5, 2 years ago

aah. " (currently using GNOME Boxes for managing this) "

@waddlesplash I bet you could reproduce this one under Gnome boxes.

comment:16 by kallisti5, 2 years ago

Yeah. I briefly had this happen too just now.

  • Gnome Boxes
  • Install Haiku RC1 (Haiku nightly os, 4GiB of ram, 20 GiB disk)
  • Install to disk. MBR
  • Reboot to OS
  • git clone https://github.com/haiku/haiku.git

While working, the deskbar will randomly hang, pulse will stop updating, etc. Sometimes it's only brief... sometimes it is for a few minutes.

Drivers: virtio disk

I did a soft reboot, and it's hung "asking other processes to quit"

comment:17 by outsidecontext, 2 years ago

@kallisti5 Gnome boxes is just a qemu frontend. In the end it runs qemu-system-x86_64. Also, if it only hangs briefly, it seems to be a somewhat different issue. For me the system does not recover (or it takes significantly longer than 30-40 minutes).

by outsidecontext, 2 years ago

Attachment: stacktraces-22-12-19.zip added

comment:18 by outsidecontext, 2 years ago

@waddlesplash I tried to poke around more and created stack traces of all kinds of various threads, see stacktraces-22-12-19.zip

But I'm mostly poking around in the dark :(

in reply to:  17 comment:19 by kallisti5, 2 years ago

Replying to outsidecontext:

@kallisti5 Gnome boxes is just a qemu frontend. In the end it runs qemu-system-x86_64. Also, if it only hangs briefly, it seems to be a somewhat different issue. For me the system does not recover (or it takes significantly longer than 30-40 minutes).

I understand that... however I can reproduce the issue in Gnome Boxes, and cannot in qemu-system-x86_64. This makes me think the default configuration Gnome Boxes does for Haiku is related to the instability.

Mix in the I/O issues... I start thinking it's driver related. So network card driver (unknown, I need to check), disk driver (virtio), vesa driver, etc.

comment:20 by waddlesplash, 2 years ago

If GNOME Boxes is just running QEMU there should be a way to get the full command line.

I suspect the virtual disk is probably the problem. If someone can replicate this in vanilla QEMU with just command line flags I can try and reproduce and then debug it here. But I don't know much about virtio, so no promises there...
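
One way to obtain that command line on a Linux host is to read the running process's `/proc/<pid>/cmdline`, where the arguments are separated by NUL bytes. A minimal sketch, assuming a Linux host with `/proc` mounted; `split_argv` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: turn a NUL-separated argument list (as found in
# /proc/<pid>/cmdline on Linux) into one argument per line.
split_argv() {
    tr '\0' '\n'
}
```

With the VM running, something like `split_argv < "/proc/$(pgrep -f qemu-system-x86_64 | head -n 1)/cmdline"` should print the arguments one per line, assuming `pgrep` finds the intended process first.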

comment:21 by waddlesplash, 2 years ago

The additional stack traces you have posted indeed point to an I/O lockup. The I/O servicing thread is thread 332 in that trace, which shows that do_io is waiting on switch_sem_etc, which is probably the lock-up point, but then the question is what is supposed to release that semaphore.

I seem to recall there was some similar issue in virtio_scsi with lockups that I think mmlr fixed some time ago, perhaps it's related.

comment:22 by outsidecontext, 2 years ago

I did some more tests. I'm now rather sure the issue lies somewhere with virtio. If I launch the VM with qemu without all those virtio devices gnome-boxes uses, then I don't have the lockups. So it is probably some issue in the Haiku virtio drivers.

I'll try to get a clean reproducer that can be launched from the command line. The actual qemu-system-x86_64 command being called is rather huge and does not even run on its own. I'm not super familiar with virtio myself, but I'll try to put something together that makes it possible to reproduce the issue on other systems.

comment:23 by kallisti5, 2 years ago

If the cause is virtio, then this would likely be a blocker for R1/Beta4... we have a *lot* of vm users of Haiku

Here's the qemu raw command from gnome boxes:

usr/bin/qemu-system-x86_64 -name guest=haikunightly,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/home/kallisti5/.config/libvirt/qemu/lib/domain-1-haikunightly/master-key.aes"} -machine pc-i440fx-7.1,usb=off,dump-guest-core=off,memory-backend=pc.ram -accel kvm -cpu host,migratable=on -m 4096 -object {"qom-type":"memory-backend-ram","id":"pc.ram","size":4294967296} -overcommit mem-lock=off -smp 32,sockets=1,dies=1,cores=16,threads=2 -uuid 7ef4ef7c-daba-4ee9-b0a5-1b0498fc505d -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=24,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot menu=on,strict=on -device {"driver":"ich9-usb-ehci1","id":"usb","bus":"pci.0","addr":"0x5.0x7"} -device {"driver":"ich9-usb-uhci1","masterbus":"usb.0","firstport":0,"bus":"pci.0","multifunction":true,"addr":"0x5"} -device {"driver":"ich9-usb-uhci2","masterbus":"usb.0","firstport":2,"bus":"pci.0","addr":"0x5.0x1"} -device {"driver":"ich9-usb-uhci3","masterbus":"usb.0","firstport":4,"bus":"pci.0","addr":"0x5.0x2"} -device {"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.0","addr":"0x8"} -device {"driver":"usb-ccid","id":"ccid0","bus":"usb.0","port":"1"} -blockdev {"driver":"file","filename":"/home/kallisti5/.local/share/gnome-boxes/images/haikunightly","node-name":"libvirt-2-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":false,"discard":"unmap","cache":{"direct":false,"no-flush":false},"driver":"qcow2","file":"libvirt-2-storage","backing":null} -device {"driver":"virtio-blk-pci","bus":"pci.0","addr":"0x6","drive":"libvirt-2-format","id":"virtio-disk0","bootindex":1,"write-cache":"on"} -blockdev 
{"driver":"file","filename":"/home/kallisti5/Downloads/haiku-r1beta4-rc1-x86_64-anyboot.iso","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":true,"driver":"raw","file":"libvirt-1-storage"} -device {"driver":"ide-cd","bus":"ide.1","unit":0,"drive":"libvirt-1-format","id":"ide0-1-0"} -netdev user,id=hostnet0 -device {"driver":"rtl8139","netdev":"hostnet0","id":"net0","mac":"52:54:00:e8:9e:f7","bus":"pci.0","addr":"0x3"} -chardev spicevmc,id=charsmartcard0,name=smartcard -device {"driver":"ccid-card-passthru","chardev":"charsmartcard0","id":"smartcard0","bus":"ccid0.0"} -chardev pty,id=charserial0 -device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0} -chardev spiceport,id=charchannel0,name=org.spice-space.webdav.0 -device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.spice-space.webdav.0"} -chardev spicevmc,id=charchannel1,name=vdagent -device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":2,"chardev":"charchannel1","id":"channel1","name":"com.redhat.spice.0"} -device {"driver":"usb-tablet","id":"input0","bus":"usb.0","port":"2"} -audiodev {"id":"audio1","driver":"spice"} -spice port=0,disable-ticketing=on,image-compression=off,seamless-migration=on -device {"driver":"qxl-vga","id":"video0","max_outputs":1,"ram_size":67108864,"vram_size":67108864,"vram64_size_mb":0,"vgamem_mb":16,"bus":"pci.0","addr":"0x2"} -device {"driver":"intel-hda","id":"sound0","bus":"pci.0","addr":"0x4"} -device {"driver":"hda-duplex","id":"sound0-codec0","bus":"sound0.0","cad":0,"audiodev":"audio1"} -chardev spicevmc,id=charredir0,name=usbredir -device {"driver":"usb-redir","chardev":"charredir0","id":"redir0","bus":"usb.0","port":"3"} -chardev spicevmc,id=charredir1,name=usbredir -device {"driver":"usb-redir","chardev":"charredir1","id":"redir1","bus":"usb.0","port":"4"} -chardev 
spicevmc,id=charredir2,name=usbredir -device {"driver":"usb-redir","chardev":"charredir2","id":"redir2","bus":"usb.0","port":"5"} -chardev spicevmc,id=charredir3,name=usbredir -device {"driver":"usb-redir","chardev":"charredir3","id":"redir3","bus":"usb.0","port":"6"} -device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.0","addr":"0x7"} -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

comment:24 by kallisti5, 2 years ago

Things of note:

  • rtl8139 network
  • virtio-serial-pci
  • virtio-blk-pci
  • virtio-balloon-pci

Out of the above, virtio block seems most likely to be our cause.
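
A list like the one above can be pulled out of a libvirt-style command line mechanically. This is a sketch only: `list_virtio_devices` is a hypothetical helper that relies on the `{"driver":"..."}` JSON syntax seen in comment:23 and keeps only driver names starting with "virtio", so related devices with other names (for example virtserialport) are not listed.

```shell
#!/bin/sh
# Hypothetical helper: read a QEMU command line on stdin and print the
# distinct "driver" values that start with "virtio", one per line.
list_virtio_devices() {
    grep -o '"driver":"[^"]*"' |   # keep only the "driver":"name" pairs
        cut -d '"' -f 4 |          # extract the name between the quotes
        grep '^virtio' |           # drop file, qcow2, rtl8139, ...
        sort -u
}
```

Fed with the command from comment:23 saved to a file, e.g. `list_virtio_devices < boxes-cmdline.txt` (file name hypothetical), it narrows the long argument list down to the virtio candidates.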

comment:26 by kallisti5, 2 years ago

Unable to reproduce with this process:

  • qemu-img create data.qcow2 24G
  • qemu-system-x86_64 -boot d --enable-kvm -m 4G -drive id=data,file=data.qcow2,if=none -device virtio-blk-pci,drive=data -cdrom haiku-r1beta4-rc1-x86_64-anyboot.iso
  • Format virtio disk as BFS with MBR partition
  • git checkout haiku to virtio disk

comment:28 by kallisti5, 2 years ago

added -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 to qemu above with smp 16. No change
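
Since the CPU count seemed to matter earlier in this ticket, the reproduction attempt above can be repeated over several `-smp` values. The sketch below only prints the commands instead of running them; `smp_sweep` is a hypothetical helper, and the flags are taken from comment:26 with `-smp` added and the installer `-cdrom`/`-boot d` part left out, on the assumption that Haiku is already installed on the image.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one QEMU command per CPU count.
# Usage: smp_sweep IMAGE COUNT [COUNT...]
smp_sweep() {
    image=$1
    shift
    for n in "$@"; do
        printf 'qemu-system-x86_64 --enable-kvm -m 4G -smp %s -drive id=data,file=%s,if=none -device virtio-blk-pci,drive=data\n' "$n" "$image"
    done
}
```

For example, `smp_sweep data.qcow2 1 2 4 8 16` prints five command lines that can be started one after another while repeating the git checkout inside the guest.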

comment:29 by korli, 15 months ago

Please check with hrev57330 or newer. It seems virtio-block was in use.

comment:30 by outsidecontext, 15 months ago

This looks promising, thanks for the update. I don't have much access to my laptop on which the Haiku VM runs this week, but I'll make sure to test this.

Lately I could not reproduce this as reliably anymore, but it still happens. So I'm sure I can give some feedback. I just need to try the critical file system mass operations often enough :)

comment:31 by outsidecontext, 15 months ago

korli: Took me a while, but today I finally got back to Haiku. My machine was still running something before hrev57330 and I had the freezing multiple times before I upgraded (once it even froze right after booting when I launched the updater, the other times I was doing git checkouts).

I upgraded to today's latest hrev and could not reproduce the issue anymore. So I'd say this is fixed and can be closed.

Thanks a lot to all of you for your support and work.

comment:32 by waddlesplash, 15 months ago

Milestone: Unscheduled → R1/beta5
Resolution: → fixed
Status: new → closed

Thank you for reporting and testing!