Opened 3 years ago

Closed 13 months ago

Last modified 13 months ago

#17233 closed bug (fixed)

Haiku issues with 12 or more CPUs

Reported by: Coldfirex
Owned by: nobody
Priority: normal
Milestone: R1/beta5
Component: System
Version: R1/beta3
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description (last modified by Coldfirex)

Nightly hrev55370 (I did not test earlier revisions) is not stable with 12 or more vCPUs (tested under ESXi 6.7; no VMware Tools).

12 and 32 vCPUs - I could get to the desktop but was unable to interact with it at all via mouse or keyboard. The system appeared frozen, as the time on the clock did not change.

64 vCPUs - Hangs during boot on the lit-up red icon.

FYI, Haiku Beta 3 (not the nightly) would not boot at all with 10 or more vCPUs. 9 was unstable.

Logs attached for 12, 32, and 64 vCPUs from nightly.

Attachments (10)

syslog-12vCPU.txt (111.4 KB) - added by Coldfirex 3 years ago.
syslog-32vCPU.txt (131.2 KB) - added by Coldfirex 3 years ago.
syslog-64vCPU.txt (141.4 KB) - added by Coldfirex 3 years ago.
hrev56645-40vCPU-syslog.txt (451.9 KB) - added by Coldfirex 23 months ago.
syslog-64cpu-57247.txt (147.9 KB) - added by Coldfirex 14 months ago.
syslog-64cpu-57247-KDLAttempts.txt (714.6 KB) - added by Coldfirex 14 months ago.
hrev57234-BootHangKDL-SC.png (41.4 KB) - added by Coldfirex 14 months ago.
syslog-64cpu-57247-BootHang.txt (369.8 KB) - added by Coldfirex 14 months ago.
syslog-09-22-2023.txt (196.5 KB) - added by Coldfirex 14 months ago.
syslog-57287.txt (321.3 KB) - added by Coldfirex 14 months ago.

Change History (35)

by Coldfirex, 3 years ago

Attachment: syslog-12vCPU.txt added

by Coldfirex, 3 years ago

Attachment: syslog-32vCPU.txt added

by Coldfirex, 3 years ago

Attachment: syslog-64vCPU.txt added

comment:1 by Coldfirex, 3 years ago

Description: modified (diff)

comment:2 by nephele, 3 years ago

I have 12 threads (6 cores, with dual threading). It's not quite clear whether your virtual cores here use hyperthreading or not. Perhaps this is an issue specific to your hypervisor?

comment:3 by Coldfirex, 3 years ago

Not sure. I don't believe VMs "know" about logical vs physical cores. I can only test this with ESXi (this is at work).

by Coldfirex, 23 months ago

Attachment: hrev56645-40vCPU-syslog.txt added

comment:4 by Coldfirex, 23 months ago

Syslog of recent hrev56645 with 40 vCPUs. Sequence observed: booted to desktop, froze after a few seconds (the mouse could still be moved); reset VM, icons lit up but it never booted to desktop; reset VM, booted to desktop and froze again, with the mouse still movable.

comment:5 by Coldfirex, 14 months ago

Syslog from hrev57247 with 64 vCPUs. I was able to get to the desktop, but Tracker was very "jumpy", and within a few minutes Tracker and the mouse froze.

by Coldfirex, 14 months ago

Attachment: syslog-64cpu-57247.txt added

comment:6 by waddlesplash, 14 months ago

How many physical cores do you have?

Strange things have been known to happen when running Haiku on more vCPUs than there are physical CPUs. We probably shouldn't hang, at least, but correct behavior may be hard to get.

Can you drop to KDL?

comment:7 by Coldfirex, 14 months ago

There are 64 physical cores available on this server blade. How do I do that manually? I tried variations of F12 with no luck yet.

comment:8 by Coldfirex, 14 months ago

No, I cannot enter KDL.

comment:9 by waddlesplash, 14 months ago

If the mouse can move then you should be able to enter via Alt+SysRq+D. (Can you check that you can use that shortcut to enter before the freeze?)

comment:10 by Coldfirex, 14 months ago

I can sort of pull up the KDL before it freezes. When it works to open I see nothing that I type or output from sc\bt\etc. If I type exit, though, it closes the KDL. I also had a hang on boot (red rocket lit up) and was able to open the KDL and issue a bt correctly. Syslog and screenshot attached.

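by Coldfirex, 14 months ago

Attachment: syslog-64cpu-57247-KDLAttempts.txt added

by Coldfirex, 14 months ago

Attachment: hrev57234-BootHangKDL-SC.png added

by Coldfirex, 14 months ago

Attachment: syslog-64cpu-57247-BootHang.txt added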

comment:11 by waddlesplash, 14 months ago

> When it works to open I see nothing that I type or output

Are you using the VMware video driver? This is a known problem. Switch back to VESA/framebuffer and KDL refresh will work even after the desktop starts.

comment:12 by waddlesplash, 14 months ago

The KDL trace from the boot hang has nothing interesting in it. Probably the real problem is some deadlocked threads. You'll likely need to poke around and figure out what threads are deadlocked and why.

comment:13 by Coldfirex, 14 months ago

How would I go about doing this?

comment:14 by waddlesplash, 14 months ago

Find the thread that's responsible for continuing the boot process (likely a launch_daemon thread), figure out what it's blocked on, and then figure out why it's blocked on that. The "threads" command and then the mutex/lock/sem/etc. KDL commands are relevant. You can ping me on IRC/Matrix sometime next week and I should be able to help troubleshoot.
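The procedure described here amounts to following a chain: blocked thread, the lock it waits on, the thread holding that lock, and so on, until a thread turns up that is not itself waiting on a lock. A minimal Python sketch of that bookkeeping follows. It assumes the information was copied down by hand from a KDL session; every thread and lock name in it is hypothetical, and it does not talk to KDL in any way.

```python
from typing import Dict, List, Optional

# Hypothetical notes taken from a KDL session (names invented for illustration).
# thread -> the lock it is blocked on (None: not blocked on any lock)
waiting_on: Dict[str, Optional[str]] = {
    "launch_daemon": "some_lock",
    "some_server": "some_lock",
    "holder_thread": None,
}

# lock -> thread currently holding it
held_by: Dict[str, str] = {
    "some_lock": "holder_thread",
}


def wait_chain(thread: str) -> List[str]:
    """Follow thread -> lock -> holder links starting at `thread`.

    Stops at a thread that is not blocked on a lock (the one to look at
    next), at a lock whose holder is unknown, or when a thread repeats
    (a cycle, i.e. a classic deadlock).
    """
    chain = [thread]
    seen = {thread}
    while True:
        lock = waiting_on.get(chain[-1])
        if lock is None:
            return chain
        holder = held_by.get(lock)
        if holder is None:
            chain.append(f"<unknown holder of {lock}>")
            return chain
        chain.append(holder)
        if holder in seen:
            return chain  # cycle: the last entry already appears earlier
        seen.add(holder)


print(" -> ".join(wait_chain("launch_daemon")))
```

The last entry of the chain is the thread worth a backtrace: if it is not blocked on a lock, the question becomes what it is busy doing while holding one.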

comment:15 by Coldfirex, 14 months ago

Syslog from our troubleshooting as requested.

by Coldfirex, 14 months ago

Attachment: syslog-09-22-2023.txt added

comment:16 by waddlesplash, 14 months ago

The READ/WRITE FAULT occurring on any backtrace that goes into userland I can readily reproduce here, so that looks like a bad regression we need to fix. The "area contains" not working I can also reproduce, so that needs to be investigated also.

The one thing that can be gleaned from this session anyway is this:

 0 ffffffff81f947c0 (+ 112) ffffffff8009e7e1   <kernel_x86_64> reschedule(int) + 0x431
 1 ffffffff81f94830 (+  48) ffffffff80089896   <kernel_x86_64> thread_block + 0xc6
 2 ffffffff81f94860 (+  96) ffffffff8009990e   <kernel_x86_64> _mutex_lock + 0xce
 3 ffffffff81f948c0 (+  32) ffffffff80099ade   <kernel_x86_64> recursive_lock_lock + 0x3e
 4 ffffffff81f948e0 (+  32) ffffffff8015dc16   <kernel_x86_64> X86VMTranslationMap::Lock() + 0x16
 5 ffffffff81f94900 (+ 384) ffffffff80127a86   <kernel_x86_64> vm_soft_fault(VMAddressSpace*, unsigned long, bool, bool, bool, vm_page**) + 0x5d6
 6 ffffffff81f94a80 (+ 240) ffffffff80132836   <kernel_x86_64> vm_page_fault + 0x176

Other threads appear to be waiting for this same mutex, so this points to what the problem is, but of course why this lock hasn't been released is the real question.
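For anyone going through the attached syslogs, checking whether several backtraces end up in the same lock can be scripted. The following is a rough Python sketch, not an existing tool: the regular expression is written against the frame layout shown in the trace above, the helper names (`parse_frames`, `lock_call_site`) are made up for this example, and the assumption that any line prefix in a real syslog can simply be skipped is untested.

```python
import re
from typing import List, NamedTuple, Optional

# Matches frames such as:
#   2 ffffffff81f94860 (+  96) ffffffff8009990e   <kernel_x86_64> _mutex_lock + 0xce
FRAME_RE = re.compile(
    r"(\d+)\s+([0-9a-f]+)\s+\(\+\s*(\d+)\)\s+([0-9a-f]+)\s+"
    r"<([^>]+)>\s+(.+?)\s+\+\s+0x([0-9a-f]+)\s*$"
)


class Frame(NamedTuple):
    index: int
    image: str
    symbol: str


def parse_frames(text: str) -> List[Frame]:
    """Extract (index, image, symbol) from every line that looks like a frame."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append(Frame(int(m.group(1)), m.group(5), m.group(6)))
    return frames


# Frames that are part of the locking machinery rather than the caller.
LOCK_PRIMITIVES = ("_mutex_lock", "recursive_lock_lock")


def lock_call_site(frames: List[Frame]) -> Optional[str]:
    """Return the first symbol above the lock primitives, or None if the
    backtrace does not pass through _mutex_lock at all."""
    for i, frame in enumerate(frames):
        if frame.symbol == "_mutex_lock":
            for caller in frames[i + 1:]:
                if caller.symbol not in LOCK_PRIMITIVES:
                    return caller.symbol
    return None


SAMPLE = """\
 0 ffffffff81f947c0 (+ 112) ffffffff8009e7e1   <kernel_x86_64> reschedule(int) + 0x431
 1 ffffffff81f94830 (+  48) ffffffff80089896   <kernel_x86_64> thread_block + 0xc6
 2 ffffffff81f94860 (+  96) ffffffff8009990e   <kernel_x86_64> _mutex_lock + 0xce
 3 ffffffff81f948c0 (+  32) ffffffff80099ade   <kernel_x86_64> recursive_lock_lock + 0x3e
 4 ffffffff81f948e0 (+  32) ffffffff8015dc16   <kernel_x86_64> X86VMTranslationMap::Lock() + 0x16
 5 ffffffff81f94900 (+ 384) ffffffff80127a86   <kernel_x86_64> vm_soft_fault(VMAddressSpace*, unsigned long, bool, bool, bool, vm_page**) + 0x5d6
 6 ffffffff81f94a80 (+ 240) ffffffff80132836   <kernel_x86_64> vm_page_fault + 0x176
"""

print(lock_call_site(parse_frames(SAMPLE)))
```

Run over each backtrace in a log, this gives one call site per blocked thread, and threads sharing a call site are candidates for waiting on the same lock. It cannot tell two instances of the same lock type apart; that still needs the lock address from KDL.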

comment:17 by waddlesplash, 14 months ago

Hmm, the issue with KDL backtraces may be somehow specific to VMware. I booted up in QEMU and can use all these commands without a problem.

comment:18 by waddlesplash, 14 months ago

It seems that's related to the use of SSE2+ in the kernel. Building the kernel and drivers with -mno-sse2 fixes the READ/WRITE FAULTs.

comment:19 by waddlesplash, 14 months ago

Blocking: 18593 added

by Coldfirex, 14 months ago

Attachment: syslog-57287.txt added

comment:20 by waddlesplash, 14 months ago

Blocking: 18593 removed

The system is blocked waiting for the X86VMTranslationMap mutex, which is held by the page daemon:

 0 ffffffff81d49d60 (+  64) ffffffff800783ba   <kernel_x86_64> process_pending_ici(int) + 0x1ca
 1 ffffffff81d49da0 (+ 112) ffffffff80079237   <kernel_x86_64> smp_send_multicast_ici + 0x1d7
 2 ffffffff81d49e10 (+  80) ffffffff8015f77c   <kernel_x86_64> X86VMTranslationMap::Flush() + 0x1bc
 3 ffffffff81d49e60 (+  96) ffffffff8014f9f0   <kernel_x86_64> X86VMTranslationMap64Bit::ClearAccessedAndModified(VMArea*, unsigned long, bool, bool&) + 0x1f0
 4 ffffffff81d49ec0 (+  64) ffffffff8012ad44   <kernel_x86_64> vm_clear_page_mapping_accessed_flags + 0x54

How it could have gotten stuck here is not clear to me, but I don't know much about the ICI code.

comment:21 by puckipedia, 13 months ago

Could you try out a build of Haiku with https://review.haiku-os.org/c/haiku/+/6919 applied, like https://haiku.movingborders.es/testbuild/Ifa84da51eccd85c1eff529749ffa00bc2159899e/2/hrev57305/x86_64/?

I believe this should improve high-core-count behavior; all your VMs have multiple different X2APIC cluster IDs, which likely caused ICIs to be sent to the wrong cores...
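To illustrate why multiple cluster IDs matter: in x2APIC cluster logical mode, a CPU's logical ID is derived from its x2APIC ID, with the upper 16 bits naming a cluster and the lower 16 bits holding one bit per CPU within that cluster. One destination value can therefore address CPUs in a single cluster only. The Python sketch below models just that arithmetic. It is not Haiku code, the function names are invented, contiguous x2APIC IDs are an assumption, and whether the naive OR shown here is the exact mistake addressed by the change above is not confirmed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def logical_id(x2apic_id: int) -> int:
    """Logical x2APIC ID: cluster in bits 31..16, one-hot CPU bit in bits 15..0."""
    cluster = x2apic_id >> 4
    bit = 1 << (x2apic_id & 0xF)
    return (cluster << 16) | bit


def naive_destination(targets: Iterable[int]) -> int:
    """OR all logical IDs together. Only valid if all targets share a cluster."""
    dest = 0
    for apic_id in targets:
        dest |= logical_id(apic_id)
    return dest


def per_cluster_destinations(targets: Iterable[int]) -> List[int]:
    """Build one destination value per cluster, to be sent as separate IPIs."""
    masks: Dict[int, int] = defaultdict(int)
    for apic_id in targets:
        masks[apic_id >> 4] |= 1 << (apic_id & 0xF)
    return [(cluster << 16) | mask for cluster, mask in sorted(masks.items())]


def receivers(dest: int, present: Iterable[int]) -> List[int]:
    """Which of the present CPUs would accept an IPI sent to `dest`."""
    cluster, mask = dest >> 16, dest & 0xFFFF
    return sorted(
        a for a in present if (a >> 4) == cluster and mask & (1 << (a & 0xF))
    )


cpus = range(32)   # assumed: 32 vCPUs with contiguous x2APIC IDs 0..31
targets = [3, 20]  # CPU 3 is in cluster 0, CPU 20 is in cluster 1

print(hex(naive_destination(targets)), receivers(naive_destination(targets), cpus))
for dest in per_cluster_destinations(targets):
    print(hex(dest), receivers(dest, cpus))
```

With these inputs the naive value is 0x10018, which selects CPUs 19 and 20: CPU 3 never receives the interrupt and CPU 19 gets one it was not meant to. A sender that waits for CPU 3 to acknowledge would then wait forever, which fits a hang inside smp_send_multicast_ici. Up to 16 vCPUs with contiguous IDs all sit in cluster 0, so the problem would not show on small machines.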

comment:22 by Coldfirex, 13 months ago

A quick 10 minute or so test this morning with this build held up and did not KDL or freeze. I can test more later this weekend if needed.

comment:24 by Coldfirex, 13 months ago

Old rev ran all afternoon w/o KDL/freeze/hang. Latest rev seems to be working the same as well.

comment:25 by korli, 13 months ago

Milestone: Unscheduled → R1/beta5
Resolution: fixed
Status: new → closed

Patch applied in hrev57306, thanks to Puck Meerburg.

Last edited 13 months ago by korli