Opened 13 months ago

Last modified 11 months ago

#14659 new bug

KGPE-D16 support, dual socket/NUMA issues

Reported by: cb88 Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System/Kernel Version: R1/Development
Keywords: NUMA Cc:
Blocked By: Blocking: #13991, #14774
Has a Patch: no Platform: x86-64

Description

I'm attaching syslogs grabbed from serial here.

Crashes if booted up with 2nd CPU without bootloader switches.

One 16 core CPU + 32GB ram boots.

Disabling SMP or LAPIC allows booting but no SMP of course.

Attachments (13)

KGPE-D16_default.txt (15.5 KB ) - added by cb88 13 months ago.
Booted with 2 Opteron 6386se engineering samples + 64GB
KGPE-D16_disableSMP.txt (104.0 KB ) - added by cb88 13 months ago.
Same config disabled SMP
KGPE-D16_disableLAPIC.txt (103.9 KB ) - added by cb88 13 months ago.
Same config disabled LAPIC
KGPE-D16_disablecCPU2inBIOS.txt (103.1 KB ) - added by cb88 13 months ago.
Disabled CPU2 in BIOS: 1 Opteron 6386se CPU + 32GB
KGPE-D16_disableACPI.txt (15.2 KB ) - added by cb88 13 months ago.
Same config as _default disabled ACPI
KGPE-D16_disableIOAPIC.txt (15.1 KB ) - added by cb88 13 months ago.
Same config as _default disabled IO APIC
KGPE-D16_disableCPU2bootlooping.txt (525.7 KB ) - added by cb88 13 months ago.
After failing to boot with CPU2 enabled I disabled it; it then kept bootlooping right after the bootloader. There are several hard power cycles in here, and eventually it booted completely at the end.
listdev.txt (5.9 KB ) - added by cb88 13 months ago.
Listdev and Listusb output
QEMU_KVM_2socket_16core.txt (20.3 KB ) - added by cb88 13 months ago.
sudo qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=16,sockets=2 -s --serial stdio
QEMU_KVM_2socket_8core.txt (152.6 KB ) - added by cb88 13 months ago.
Took several bootloops before loading: qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=8,sockets=2 -s --serial stdio
KGPE-D16_default_hang_after_bootlooping_alot_repeatable.txt (33.3 KB ) - added by cb88 11 months ago.
Hangs after about an hour or two like this.
KGPE-D16_default_bootloop_hrev52701.txt (16.1 KB ) - added by cb88 11 months ago.
No apparent change that I can see
KGPE-D16_default_crash_hrev52707.txt (7.1 KB ) - added by cb88 11 months ago.
Updated in qemu from a live USB, rebooted and got this.

Change History (32)

by cb88, 13 months ago

Attachment: KGPE-D16_default.txt added

Booted with 2 Opteron 6386se engineering samples + 64GB

by cb88, 13 months ago

Attachment: KGPE-D16_disableSMP.txt added

Same config disabled SMP

by cb88, 13 months ago

Attachment: KGPE-D16_disableLAPIC.txt added

Same config disabled LAPIC

by cb88, 13 months ago

Attachment: KGPE-D16_disablecCPU2inBIOS.txt added

Disabled CPU2 in BIOS: 1 Opteron 6386se CPU + 32GB

by cb88, 13 months ago

Attachment: KGPE-D16_disableACPI.txt added

Same config as _default disabled ACPI

by cb88, 13 months ago

Attachment: KGPE-D16_disableIOAPIC.txt added

Same config as _default disabled IO APIC

comment:1 by waddlesplash, 13 months ago

So, "done trampolining" means it got to http://xref.plausible.coop/source/xref/haiku/src/system/boot/platform/bios_ia32/smp.cpp#560 i.e. that all CPUs initialized successfully. Immediately after that should be http://xref.plausible.coop/source/xref/haiku/src/system/boot/platform/bios_ia32/long.cpp#345 i.e. kernel entry, but apparently that never happens. What's going on here?

comment:2 by waddlesplash, 13 months ago

FreeBSD's early-boot code seems to live here and here.

by cb88, 13 months ago

Attachment: KGPE-D16_disableCPU2bootlooping.txt added

After failing to boot with CPU2 enabled I disabled it; it then kept bootlooping right after the bootloader. There are several hard power cycles in here, and eventually it booted completely at the end.

by cb88, 13 months ago

Attachment: listdev.txt added

Listdev and Listusb output

by cb88, 13 months ago

Attachment: QEMU_KVM_2socket_16core.txt added

sudo qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=16,sockets=2 -s --serial stdio

by cb88, 13 months ago

Attachment: QEMU_KVM_2socket_8core.txt added

Took several bootloops before loading: qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=8,sockets=2 -s --serial stdio

comment:3 by cb88, 13 months ago

Note that QEMU gets reasonable CPU IDs, 0 through N-1, unlike when running directly on the hardware, where they were offset for whatever reason. It still fails to boot at the same place as the real hardware, though.

comment:4 by waddlesplash, 12 months ago

I can't get it to reliably reproduce with cores=8,sockets=2 here; with cores=16 it happens occasionally but eventually boots, with cores=32 it apparently always happens. (4 physical cores on this machine -- processor is an AMD Phenom II.)

However, this is also an experimental Clang build which is not quite right in the head, so perhaps there's some crossed wires.

comment:5 by waddlesplash, 12 months ago

So ... with -d int,cpu_reset we get:

...
wait for delivery
deassert INIT
wait for delivery
num startups = 2
send STARTUP
wait for delivery
send STARTUP
wait for delivery
done trampolining
CPU Reset (CPU 0)
RAX=0000000000000000 RBX=ffffffff801b3160 RCX=00000000c0000101 RDX=00000000ffffffff
RSI=0000000000000040 RDI=ffffffff801b3160 RBP=ffffffff81004fd0 RSP=ffffffff81004fb0
R8 =ffffffff801a2110 R9 =0000000000000000 R10=0000000000000028 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
...

In other words, there's no triple-fault or anything that I can see, the CPU just unceremoniously resets. I added -d guest_errors and that produced nothing else, either. What's going on here?
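For readers following the log lines above ("deassert INIT", "send STARTUP", "wait for delivery"), here is a minimal sketch of how the Interrupt Command Register values for an INIT/STARTUP sequence are conventionally encoded on an xAPIC. It is not copied from Haiku's smp.cpp; the function names are made up for illustration, and the trampoline address and APIC ID used in the usage note are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bit fields of the low dword of the xAPIC Interrupt Command Register (ICR).
constexpr uint32_t kDeliveryModeInit    = 0x5u << 8;  // bits 8-10 = 101b
constexpr uint32_t kDeliveryModeStartup = 0x6u << 8;  // bits 8-10 = 110b
constexpr uint32_t kDeliveryPending     = 1u << 12;   // read-only delivery status
constexpr uint32_t kLevelAssert         = 1u << 14;
constexpr uint32_t kTriggerLevel        = 1u << 15;

// High dword: destination APIC ID lives in bits 24-31 (physical destination mode).
constexpr uint32_t icr_high(uint8_t apicId)
{
    return uint32_t(apicId) << 24;
}

// "assert INIT": level-triggered, level = assert, delivery mode INIT.
constexpr uint32_t icr_init_assert()
{
    return kTriggerLevel | kLevelAssert | kDeliveryModeInit;
}

// "deassert INIT": same as above but with the level bit cleared.
constexpr uint32_t icr_init_deassert()
{
    return kTriggerLevel | kDeliveryModeInit;
}

// "send STARTUP": the vector field carries the physical page number of the
// real-mode trampoline, so the address must be 4 KiB aligned and below 1 MiB.
inline uint32_t icr_startup(uint32_t trampolinePhys)
{
    assert((trampolinePhys & 0xfffu) == 0 && trampolinePhys < 0x100000u);
    return kDeliveryModeStartup | (trampolinePhys >> 12);
}

// "wait for delivery": poll the ICR until the pending bit reads back as clear.
constexpr bool delivery_pending(uint32_t icrLow)
{
    return (icrLow & kDeliveryPending) != 0;
}
```

With a hypothetical trampoline at physical 0x9000, icr_startup(0x9000) yields 0x609, and icr_high(0x20) targets a CPU whose APIC ID is 0x20. Nothing in this encoding explains the reset by itself; it only shows what the boot loader has finished doing by the time "done trampolining" is printed.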

comment:6 by waddlesplash, 11 months ago

On a system without KVM, cores=2,sockets=2 is enough to trigger it. So it is apparently a race condition of some kind.

in reply to:  6 comment:7 by cb88, 11 months ago

Replying to waddlesplash:

On a system without KVM, cores=2,sockets=2 is enough to trigger it. So it is apparently a race condition of some kind.

That's very odd. QEMU's CPU emulation is single-threaded unless you explicitly turn on multithreaded TCG, so you wouldn't think it would occur there. From what I understand, IO and such is on a separate thread... perhaps that is it, since 2 sockets imply multiple APICs? Perhaps the APICs are getting set up incorrectly somehow?

comment:8 by waddlesplash, 11 months ago

Blocking: 14774 added

comment:9 by waddlesplash, 11 months ago

Vidrep in #14774 reports this "sometimes" occurs for him, but not always; and that if he just waits a while, eventually it boots; which is consistent with the behavior I saw in QEMU. Can you set up your system to auto-boot Haiku and see if it eventually boots with the dual-socket config after a while?

comment:10 by waddlesplash, 11 months ago

Blocking: 13991 added

in reply to:  9 comment:11 by cb88, 11 months ago

Replying to waddlesplash:

Vidrep in #14774 reports this "sometimes" occurs for him, but not always; and that if he just waits a while, eventually it boots; which is consistent with the behavior I saw in QEMU. Can you set up your system to auto-boot Haiku and see if it eventually boots with the dual-socket config after a while?

Sure, just note that the stock BIOS I still have on there boots pretty slowly so maybe 30-60 boot attempts per hour, maybe less just guessing? Regardless that would seem to imply something is being left to chance... in the boot process.

comment:12 by cb88, 11 months ago

hrev52698 is bootlooping almost exactly every 60 seconds right now... I'll update here if it boots to the desktop in the next few days. If it takes longer than a couple of days, I'll update here as well.

comment:13 by cb88, 11 months ago

It has hung 3 times since I started testing... I don't have a serial logger on it at the moment, though, so I may not be able to do as much testing as I'd hoped.

by cb88, 11 months ago

Attachment: KGPE-D16_default_hang_after_bootlooping_alot_repeatable.txt added

Hangs after about an hour or two like this.

comment:14 by waddlesplash, 11 months ago

Oho, now that's actually a panic; and my guess is it's probably related here. Tracking down the cause will be harder, probably...

comment:15 by mmlr, 11 months ago

I have investigated this and found a probable cause and submitted a couple of changes to Gerrit:

  • The intermittent failures with 16 cores, 2 sockets were caused by faults when loading the task register while another CPU already overwrote the entry for its TSS in the GDT: https://review.haiku-os.org/c/haiku/+/810 fixes this and probably also fixes this ticket on real hardware.

Can you try reproducing this on real hardware with change 810 applied? The other changes aren't really relevant and don't need to be applied.
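To make the failure mode described above more concrete, here is a small, purely illustrative model of one interleaving that could produce such a fault. It relies on one architectural fact: loading the task register marks the TSS descriptor busy, and loading from a descriptor that is not an available TSS faults. Everything else (the struct, the function names, the idea of a single shared slot, and the per-CPU-slot remedy) is an assumption made for the sketch. It is not Haiku's code and is not claimed to be what change 810 does.

```cpp
#include <array>
#include <cstdint>

// Simplified TSS descriptor: only the 4-bit type field matters for this model.
enum class TssType : uint8_t { Available = 0x9, Busy = 0xB };

struct GdtSlot {
    int owner = -1;                      // CPU whose TSS the slot describes
    TssType type = TssType::Available;
};

// Step 1 of a CPU's setup: write its TSS descriptor into a GDT slot.
inline void write_tss_descriptor(GdtSlot& slot, int cpu)
{
    slot.owner = cpu;
    slot.type = TssType::Available;
}

// Step 2: model of loading the task register. Returns false (a fault on real
// hardware) if the descriptor is not available; otherwise marks it busy.
inline bool load_task_register(GdtSlot& slot)
{
    if (slot.type != TssType::Available)
        return false;
    slot.type = TssType::Busy;
    return true;
}

// Hypothetical bad interleaving on ONE shared slot:
// CPU 0 writes, CPU 1 overwrites, CPU 1 loads, then CPU 0 loads and faults.
inline bool shared_slot_interleaving()
{
    GdtSlot slot;
    write_tss_descriptor(slot, 0);
    write_tss_descriptor(slot, 1);          // clobbers CPU 0's entry
    bool cpu1Ok = load_task_register(slot); // succeeds, marks the slot busy
    bool cpu0Ok = load_task_register(slot); // busy descriptor: fault
    return cpu0Ok && cpu1Ok;
}

// Same ordering of steps, but each CPU owns its own slot, so no step can
// disturb another CPU's descriptor.
inline bool per_cpu_slots_interleaving()
{
    std::array<GdtSlot, 2> slots;
    write_tss_descriptor(slots[0], 0);
    write_tss_descriptor(slots[1], 1);
    bool cpu1Ok = load_task_register(slots[1]);
    bool cpu0Ok = load_task_register(slots[0]);
    return cpu0Ok && cpu1Ok;
}
```

In this model shared_slot_interleaving() returns false and per_cpu_slots_interleaving() returns true. Because the outcome depends on how the CPUs' steps interleave, a bug of this shape would show up more often with more cores, which fits the intermittent behaviour reported earlier in the ticket.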

comment:16 by cb88, 11 months ago

I busted my Gentoo install on there, so I can't build anything at the moment. I'll test the rest, though.

by cb88, 11 months ago

Attachment: KGPE-D16_default_bootloop_hrev52701.txt added

No apparent change that I can see

comment:17 by waddlesplash, 11 months ago

hrev52701 will indeed behave identically. You will have to apply patch 810 manually and rebuild to test the fix.

by cb88, 11 months ago

Attachment: KGPE-D16_default_crash_hrev52707.txt added

Updated in qemu from a live USB, rebooted and got this.

comment:18 by cb88, 11 months ago

I was about to build the 810 patch and noticed that it had already been pushed, so I went ahead and tested.

Last edited 11 months ago by cb88

comment:19 by waddlesplash, 11 months ago

Excellent, so it was a panic that was turning into a triple fault, then!

That message comes from here. It looks like we may need to tweak what/how the bootloader allocates physical pages; it may be grabbing too few.
