Opened 12 months ago

Last modified 12 months ago

#14560 new bug

monitoring of maui builders

Reported by: korli
Owned by: haiku-web
Priority: normal
Milestone: Unscheduled
Component: Sys-Admin
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description

It would be nice to add some monitoring for the builders on maui: the x86_64 builder has been stuck for 24h now, and I am unable to tell what's happening. https://build.haiku-os.org/buildmaster/master/x86_64/logviewer.html?buildruns/2217/builders/maui_x86_64.log

I know builders are not critical, but polling them should be possible.
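A minimal sketch of what such a poll could look like, assuming the builder exposes some TCP service that can be connected to. The function name `builder_reachable` and the host and port passed to it are hypothetical, not taken from the actual maui setup. Note that a check like this only shows that the machine still answers on the network, not that a build is making progress.

```python
import socket


def builder_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    host and port are placeholders for whatever service the builder
    happens to expose; they are assumptions, not the real configuration.
    """
    try:
        # create_connection resolves the host and connects, raising OSError
        # (refused, unreachable, timed out) when the attempt fails.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run periodically (for example from cron on another machine), a False result could raise an alert for someone to look at the VM.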

Change History (2)

comment:1 by kallisti5, 12 months ago

The correct way to do this would be to implement a watchdog driver that supports qemu / libvirt.

https://rwmj.wordpress.com/2010/03/03/what-is-a-watchdog/

comment:2 by mmlr, 12 months ago

A watchdog wouldn't really help in most cases. These are the things that happen from time to time:

  • KDLs: Right now they don't trigger an automatic reboot, as I usually get around to checking them and it's nicer to be able to debug. When no one is able to spend the time, we can set bluescreen to false in the kernel settings to avoid this need. A watchdog would trigger in these cases and would do pretty much the same thing.
  • Stuck downloads: These seem to happen less often lately. Generally they are hard to handle automatically, as the download sizes of the ports and the speeds of their source servers vary wildly, which makes a simple timeout impractical. Instead, download progress would need to be measured, and a download that is actually stuck should be restarted until all retries are used up. This would mean either moving the download process into HaikuPorter, or checking whether the currently used wget can be configured to do the same and adding the appropriate parameters.
  • Stuck package activation: There is a race condition somewhere in package activation that irregularly leads to build packages not getting activated. Unfortunately I haven't been able to investigate this further. A timeout on the build package activation would be relatively simple though.
  • Stuck build process: A stuck build process is probably the most difficult case to handle automatically, as there is no real universal way to check whether a build is progressing. A simple timeout is again a rather poor fit considering how long some of our larger packages take to build. A timeout for phases without any log output might work, but it would have to be rather long to avoid false positives, which would be especially frustrating on such long-running builds.
  • Stuck virtio block: I've checked what's going on in the case that prompted this ticket, and it looks like the virtio block driver is stuck and has stopped processing requests. The system itself is still responsive, but all disk IO to the build volume is blocked. The boot volume seems fine and the logs are clean. It was not possible to attach the debugger to the running git process, so I entered the kernel debugger, which revealed that the git thread is waiting for the virtio block driver to finish an IO request.
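The progress-based checks described above for downloads and for phases without log output could be sketched roughly as follows. This is only an illustration of the idea, not anything HaikuPorter currently does: the class name `StallDetector`, its parameters, and the choice of progress value (bytes downloaded, or the size of the build log) are all assumptions.

```python
import time


class StallDetector:
    """Flags a task as stuck when its progress value stops changing.

    Hypothetical helper: the progress value could be the byte count of a
    download or the size of a build log; stall_timeout is in seconds.
    """

    def __init__(self, stall_timeout, clock=time.monotonic):
        self.stall_timeout = stall_timeout
        self._clock = clock  # injectable so the logic can be tested
        self._last_value = None
        self._last_change = clock()

    def update(self, value):
        """Record the latest progress value; return True if stalled."""
        now = self._clock()
        if value != self._last_value:
            # Progress was made, so restart the stall window.
            self._last_value = value
            self._last_change = now
        return now - self._last_change >= self.stall_timeout
```

A caller would poll the progress value at some interval and restart or abort the task once `update()` returns True. Unlike a fixed overall timeout, this tolerates slow servers and long builds as long as something keeps moving, although the stall window for builds would still have to be generous.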

The only case in which a watchdog or a network-based poll would trigger is a KDL, and the automatic reboot one would presumably want there can instead be configured via the kernel settings. The other cases are more difficult, and so far handling them case by case has worked out mostly OK. It may however make sense to widen the group of people who can access the VMs via libvirt, so that more people can handle these cases when I am unavailable for some time.
