#17233 closed bug (fixed)
Haiku issues with 12 or more CPUs
| Reported by: | Coldfirex | Owned by: | nobody |
|---|---|---|---|
| Priority: | normal | Milestone: | R1/beta5 |
| Component: | System | Version: | R1/beta3 |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Platform: | All | | |
Description
Nightly hrev55370 (I did not test earlier revisions) is not stable with 12 or more vCPUs (tested under ESXi 6.7, without VMware Tools).
12 and 32 vCPUs - I could get to the desktop but was unable to interact with it at all via mouse or keyboard. The system appeared frozen, as the time on the clock did not change.
64 vCPUs - Hang during boot on the lit-up red icon.
FYI, Haiku Beta 3 (not a nightly) would not boot at all with 10 or more vCPUs; 9 was unstable.
Logs attached for 12, 32, and 64 vCPUs from nightly.
Attachments (10)
Change History (35)
by , 3 years ago
Attachment: | syslog-12vCPU.txt added |
---|
by , 3 years ago
Attachment: | syslog-32vCPU.txt added |
---|
by , 3 years ago
Attachment: | syslog-64vCPU.txt added |
---|
comment:1 by , 3 years ago
Description: | modified (diff) |
---|
comment:2 by , 3 years ago
comment:3 by , 3 years ago
Not sure. I don't believe VMs "know" about logical vs physical cores. I can only test this with ESXi (this is at work).
by , 2 years ago
Attachment: | hrev56645-40vCPU-syslog.txt added |
---|
comment:4 by , 2 years ago
Syslog of recent hrev56645 with 40 vCPUs. Sequence observed: booted to desktop, then froze after a few seconds (mouse was still movable); reset VM, icons lit up but it never reached the desktop; reset VM, booted to desktop and froze again, with the mouse still movable.
comment:5 by , 15 months ago
Syslog from hrev57247 with 64 vCPUs. I was able to get to the desktop, but Tracker was very "jumpy", and then within a few minutes Tracker and the mouse froze.
by , 15 months ago
Attachment: | syslog-64cpu-57247.txt added |
---|
comment:6 by , 15 months ago
How many physical cores do you have?
Strange things have been known to happen when running Haiku on more vCPUs than there are physical CPUs. We probably shouldn't hang, at least, but correct behavior may be hard to get.
Can you drop to KDL?
comment:7 by , 15 months ago
There are 64 physical cores available on this server blade. How do I do that manually? I tried variations of F12 with no luck yet.
comment:9 by , 15 months ago
If the mouse can move then you should be able to enter via Alt+SysRq+D. (Can you check that you can use that shortcut to enter before the freeze?)
comment:10 by , 15 months ago
I can sort of pull up the KDL before it freezes. When it works to open I see nothing that I type or output from sc, bt, etc. If I type exit, though, it closes the KDL. I also had a hang on boot (red rocket lit up) and was able to open the KDL and issue a bt correctly. Syslog and screenshot attached.
by , 15 months ago
Attachment: | syslog-64cpu-57247-KDLAttempts.txt added |
---|
by , 15 months ago
Attachment: | hrev57234-BootHangKDL-SC.png added |
---|
by , 15 months ago
Attachment: | syslog-64cpu-57247-BootHang.txt added |
---|
comment:11 by , 15 months ago
> When it works to open I see nothing that I type or output
Are you using the VMware video driver? This is a known problem. Switch back to VESA/framebuffer and KDL refresh will work even after the desktop starts.
comment:12 by , 15 months ago
The KDL trace from the boot hang has nothing interesting in it. Probably the real problem is some deadlocked threads. You'll likely need to poke around and figure out what threads are deadlocked and why.
comment:14 by , 14 months ago
Find the thread that's responsible for continuing the boot process (likely a launch_daemon thread), figure out what it's blocked on, and then figure out why it's blocked on that. The "threads" command and then the mutex/lock/sem/etc. KDL commands are relevant. You can ping me on IRC/Matrix sometime next week and I should be able to help troubleshoot.
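The procedure described above (start at a stuck thread, find what it is blocked on, find who holds that, and repeat) amounts to walking a wait-for chain. A minimal sketch of that idea follows, assuming the thread and lock relationships have already been read out of the debugger by hand. All thread and lock names here are hypothetical, and nothing below is a KDL command or Haiku API.

```python
from typing import Dict, List


def trace_wait_chain(start: str,
                     blocked_on: Dict[str, str],
                     held_by: Dict[str, str]) -> List[str]:
    """Follow thread -> lock -> holder links from `start`.

    Stops at a thread that is not blocked on anything (the likely root
    blocker), at a lock with no known holder, or when a thread repeats
    (a cycle, i.e. a classic deadlock).
    """
    chain = [start]
    seen = {start}
    thread = start
    while thread in blocked_on:
        lock = blocked_on[thread]
        chain.append(lock)
        holder = held_by.get(lock)
        if holder is None:
            break  # holder unknown: needs more digging
        chain.append(holder)
        if holder in seen:
            break  # cycle detected
        seen.add(holder)
        thread = holder
    return chain


# Hypothetical data, as it might be noted down from a debugging session.
blocked_on = {"boot_thread": "lock_A", "worker_1": "lock_B"}
held_by = {"lock_A": "worker_1", "lock_B": "worker_2"}

chain = trace_wait_chain("boot_thread", blocked_on, held_by)
print(" -> ".join(chain))
```

The last element of the chain is the thread worth a backtrace: if it is not itself waiting on a lock, whatever it is doing (or spinning on) is what keeps everyone else blocked.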
by , 14 months ago
Attachment: | syslog-09-22-2023.txt added |
---|
comment:16 by , 14 months ago
The READ/WRITE FAULT occurring on any backtrace that goes into userland I can readily reproduce here, so that looks like a bad regression we need to fix. The "area contains" not working I can also reproduce, so that needs to be investigated also.
The one thing that can be gleaned from this session anyway is this:
0 ffffffff81f947c0 (+ 112) ffffffff8009e7e1 <kernel_x86_64> reschedule(int) + 0x431
1 ffffffff81f94830 (+ 48) ffffffff80089896 <kernel_x86_64> thread_block + 0xc6
2 ffffffff81f94860 (+ 96) ffffffff8009990e <kernel_x86_64> _mutex_lock + 0xce
3 ffffffff81f948c0 (+ 32) ffffffff80099ade <kernel_x86_64> recursive_lock_lock + 0x3e
4 ffffffff81f948e0 (+ 32) ffffffff8015dc16 <kernel_x86_64> X86VMTranslationMap::Lock() + 0x16
5 ffffffff81f94900 (+ 384) ffffffff80127a86 <kernel_x86_64> vm_soft_fault(VMAddressSpace*, unsigned long, bool, bool, bool, vm_page**) + 0x5d6
6 ffffffff81f94a80 (+ 240) ffffffff80132836 <kernel_x86_64> vm_page_fault + 0x176
Other threads appear to be waiting for this same mutex, so this points to what the problem is, but of course why this lock hasn't been released is the real question.
comment:17 by , 14 months ago
Hmm, the issue with KDL backtraces may be somehow specific to VMware. I booted up in QEMU and can use all these commands without a problem.
comment:18 by , 14 months ago
It seems that's related to the use of SSE2+ in the kernel. Building the kernel and drivers with -mno-sse2 fixes the READ/WRITE FAULTs.
comment:19 by , 14 months ago
Blocking: | 18593 added |
---|
by , 14 months ago
Attachment: | syslog-57287.txt added |
---|
comment:20 by , 14 months ago
Blocking: | 18593 removed |
---|
The system is blocked waiting for the X86VMTranslationMap mutex, which is held by the page daemon:
0 ffffffff81d49d60 (+ 64) ffffffff800783ba <kernel_x86_64> process_pending_ici(int) + 0x1ca
1 ffffffff81d49da0 (+ 112) ffffffff80079237 <kernel_x86_64> smp_send_multicast_ici + 0x1d7
2 ffffffff81d49e10 (+ 80) ffffffff8015f77c <kernel_x86_64> X86VMTranslationMap::Flush() + 0x1bc
3 ffffffff81d49e60 (+ 96) ffffffff8014f9f0 <kernel_x86_64> X86VMTranslationMap64Bit::ClearAccessedAndModified(VMArea*, unsigned long, bool, bool&) + 0x1f0
4 ffffffff81d49ec0 (+ 64) ffffffff8012ad44 <kernel_x86_64> vm_clear_page_mapping_accessed_flags + 0x54
How it could have gotten stuck here is not clear to me, but I don't know much about the ICI code.
comment:21 by , 14 months ago
Could you try out a build of Haiku with https://review.haiku-os.org/c/haiku/+/6919 applied, like https://haiku.movingborders.es/testbuild/Ifa84da51eccd85c1eff529749ffa00bc2159899e/2/hrev57305/x86_64/?
I believe this should improve high-core-count behavior; all your VMs have multiple different X2APIC cluster IDs, which likely caused ICIs to be sent to the wrong cores...
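To illustrate why multiple cluster IDs matter: in x2APIC cluster mode, a CPU's logical ID is built from its APIC ID as a 16-bit cluster number (APIC ID bits 19:4) in the upper half and a one-hot bit (1 << APIC ID bits 3:0) in the lower half, and a single logical-mode IPI can only address CPUs within one cluster. The sketch below shows one way a multicast can go wrong when logical IDs from different clusters are OR-ed into a single destination. This is an assumption about the general failure mode, not a reproduction of the Haiku code or of the linked patch. The APIC ID layout and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List


def logical_x2apic_id(apic_id: int) -> int:
    """Logical x2APIC ID: cluster in bits 31:16, one-hot CPU bit in 15:0."""
    cluster = (apic_id >> 4) & 0xFFFF
    return (cluster << 16) | (1 << (apic_id & 0xF))


def naive_destination(targets: Iterable[int]) -> int:
    """Buggy approach: OR all logical IDs together, ignoring clusters."""
    dest = 0
    for apic_id in targets:
        dest |= logical_x2apic_id(apic_id)
    return dest


def per_cluster_destinations(targets: Iterable[int]) -> List[int]:
    """Safer approach: one destination value (one IPI) per cluster."""
    masks = defaultdict(int)
    for apic_id in targets:
        masks[(apic_id >> 4) & 0xFFFF] |= 1 << (apic_id & 0xF)
    return [(cluster << 16) | mask for cluster, mask in sorted(masks.items())]


def cpus_addressed(dest: int, present: Iterable[int]) -> List[int]:
    """APIC IDs among `present` that a logical-mode destination matches."""
    cluster, mask = dest >> 16, dest & 0xFFFF
    return sorted(a for a in present
                  if ((a >> 4) & 0xFFFF) == cluster and mask & (1 << (a & 0xF)))


# Hypothetical sparse APIC ID layout: two CPUs in each of four clusters.
present = [0, 1, 16, 17, 32, 33, 48, 49]
targets = [1, 16, 33]  # one CPU in each of clusters 0, 1 and 2

bad = naive_destination(targets)
good = per_cluster_destinations(targets)

print(hex(bad), cpus_addressed(bad, present))
print([hex(d) for d in good],
      [cpus_addressed(d, present) for d in good])
```

In this made-up layout the OR-ed value selects cluster 3, so the interrupt reaches CPUs that were never targeted while the intended CPUs receive nothing. A sender that waits for acknowledgement from the intended CPUs would then wait indefinitely.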
comment:22 by , 14 months ago
A quick 10-minute or so test this morning with this build held up: it did not KDL or freeze. I can test more later this weekend if needed.
comment:23 by , 14 months ago
Coldfirex, please check the build with the updated patch: https://haiku.movingborders.es/testbuild/Ifa84da51eccd85c1eff529749ffa00bc2159899e/4/hrev57305/x86_64/
comment:24 by , 14 months ago
Old rev ran all afternoon w/o KDL/freeze/hang. Latest rev seems to be working the same as well.
comment:25 by , 14 months ago
Milestone: | Unscheduled → R1/beta5 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Patch applied in hrev57306, thanks to Puck Meerburg.
I have 12 threads (6 cores, with dual threading); it's not quite clear whether your virtual cores here use hyperthreading or not. Sounds like an issue specific to your hypervisor perhaps?