Opened 6 years ago

Closed 2 weeks ago

#14659 closed bug (fixed)

PANIC: error allocating early page

Reported by: cb88
Owned by: nobody
Priority: normal
Milestone: R1/beta6
Component: System/Kernel
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking: #14774, #17232, #18140, #18755, #18949, #19009, #19088, #19117
Platform: x86-64

Description

I'm attaching syslogs grabbed from serial here.

Crashes when booted with the 2nd CPU installed unless bootloader switches are used.

One 16-core CPU + 32GB RAM boots.

Disabling SMP or the LAPIC allows booting, but without SMP, of course.

Attachments (13)

KGPE-D16_default.txt (15.5 KB ) - added by cb88 6 years ago.
Booted with 2 Opteron 6386se engineering samples + 64GB
KGPE-D16_disableSMP.txt (104.0 KB ) - added by cb88 6 years ago.
Same config disabled SMP
KGPE-D16_disableLAPIC.txt (103.9 KB ) - added by cb88 6 years ago.
Same config disabled LAPIC
KGPE-D16_disablecCPU2inBIOS.txt (103.1 KB ) - added by cb88 6 years ago.
Disabled CPU2 in BIOS: 1 Opteron 6386se CPU + 32GB
KGPE-D16_disableACPI.txt (15.2 KB ) - added by cb88 6 years ago.
Same config as _default disabled ACPI
KGPE-D16_disableIOAPIC.txt (15.1 KB ) - added by cb88 6 years ago.
Same config as _default disabled IO APIC
KGPE-D16_disableCPU2bootlooping.txt (525.7 KB ) - added by cb88 6 years ago.
After failing to boot with CPU2 enabled, I disabled it; it then kept bootlooping right after the bootloader. There are several hard power cycles in here, and eventually it booted completely at the end.
listdev.txt (5.9 KB ) - added by cb88 6 years ago.
Listdev and Listusb output
QEMU_KVM_2socket_16core.txt (20.3 KB ) - added by cb88 6 years ago.
sudo qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=16,sockets=2 -s --serial stdio
QEMU_KVM_2socket_8core.txt (152.6 KB ) - added by cb88 6 years ago.
Took several bootloops before loading: qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=8,sockets=2 -s --serial stdio
KGPE-D16_default_hang_after_bootlooping_alot_repeatable.txt (33.3 KB ) - added by cb88 6 years ago.
Hangs after about an hour or two like this.
KGPE-D16_default_bootloop_hrev52701.txt (16.1 KB ) - added by cb88 6 years ago.
No apparent change that I can see
KGPE-D16_default_crash_hrev52707.txt (7.1 KB ) - added by cb88 6 years ago.
Updated in qemu from a live USB, rebooted and got this.

Change History (45)

by cb88, 6 years ago

Attachment: KGPE-D16_default.txt added

Booted with 2 Opteron 6386se engineering samples + 64GB

by cb88, 6 years ago

Attachment: KGPE-D16_disableSMP.txt added

Same config disabled SMP

by cb88, 6 years ago

Attachment: KGPE-D16_disableLAPIC.txt added

Same config disabled LAPIC

by cb88, 6 years ago

Attachment: KGPE-D16_disablecCPU2inBIOS.txt added

Disabled CPU2 in BIOS: 1 Opteron 6386se CPU + 32GB

by cb88, 6 years ago

Attachment: KGPE-D16_disableACPI.txt added

Same config as _default disabled ACPI

by cb88, 6 years ago

Attachment: KGPE-D16_disableIOAPIC.txt added

Same config as _default disabled IO APIC

comment:1 by waddlesplash, 6 years ago

So, "done trampolining" means it got to http://xref.plausible.coop/source/xref/haiku/src/system/boot/platform/bios_ia32/smp.cpp#560 i.e. that all CPUs initialized successfully. Immediately after that should be http://xref.plausible.coop/source/xref/haiku/src/system/boot/platform/bios_ia32/long.cpp#345 i.e. kernel entry, but apparently that never happens. What's going on here?

comment:2 by waddlesplash, 6 years ago

FreeBSD's early-boot code seems to live here and here.

by cb88, 6 years ago

Attachment: KGPE-D16_disableCPU2bootlooping.txt added

After failing to boot with CPU2 enabled, I disabled it; it then kept bootlooping right after the bootloader. There are several hard power cycles in here, and eventually it booted completely at the end.

by cb88, 6 years ago

Attachment: listdev.txt added

Listdev and Listusb output

by cb88, 6 years ago

Attachment: QEMU_KVM_2socket_16core.txt added

sudo qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=16,sockets=2 -s --serial stdio

by cb88, 6 years ago

Attachment: QEMU_KVM_2socket_8core.txt added

Took several bootloops before loading: qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=8,sockets=2 -s --serial stdio

comment:3 by cb88, 6 years ago

Note that QEMU gets reasonable CPU IDs, 0 through N-1, unlike when running directly on the hardware, where they were offset for whatever reason. It still fails to boot at the same place as real hardware, though.

comment:4 by waddlesplash, 6 years ago

I can't get it to reliably reproduce with cores=8,sockets=2 here; with cores=16 it happens occasionally but eventually boots, with cores=32 it apparently always happens. (4 physical cores on this machine -- processor is an AMD Phenom II.)

However, this is also an experimental Clang build which is not quite right in the head, so perhaps there's some crossed wires.

comment:5 by waddlesplash, 6 years ago

So ... with -d int,cpu_reset we get:

...
wait for delivery
deassert INIT
wait for delivery
num startups = 2
send STARTUP
wait for delivery
send STARTUP
wait for delivery
done trampolining
CPU Reset (CPU 0)
RAX=0000000000000000 RBX=ffffffff801b3160 RCX=00000000c0000101 RDX=00000000ffffffff
RSI=0000000000000040 RDI=ffffffff801b3160 RBP=ffffffff81004fd0 RSP=ffffffff81004fb0
R8 =ffffffff801a2110 R9 =0000000000000000 R10=0000000000000028 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
...

In other words, there's no triple-fault or anything that I can see, the CPU just unceremoniously resets. I added -d guest_errors and that produced nothing else, either. What's going on here?

comment:6 by waddlesplash, 6 years ago

On a system without KVM, cores=2,sockets=2 is enough to trigger it. So it is apparently a race condition of some kind.

in reply to:  6 comment:7 by cb88, 6 years ago

Replying to waddlesplash:

> On a system without KVM, cores=2,sockets=2 is enough to trigger it. So it is apparently a race condition of some kind.

That's very odd. Since QEMU's CPU emulation is single-threaded unless you explicitly turn on multithreaded TCG, you wouldn't think it would occur there. From what I understand, IO and such run on a separate thread... perhaps that is it, since 2 sockets imply multiple APICs? Perhaps the APICs are getting set up incorrectly somehow?

comment:8 by waddlesplash, 6 years ago

Blocking: 14774 added

comment:9 by waddlesplash, 6 years ago

Vidrep in #14774 reports this "sometimes" occurs for him, but not always; and that if he just waits a while, eventually it boots; which is consistent with the behavior I saw in QEMU. Can you set up your system to auto-boot Haiku and see if it eventually boots with the dual-socket config after a while?

comment:10 by waddlesplash, 6 years ago

Blocking: 13991 added

in reply to:  9 comment:11 by cb88, 6 years ago

Replying to waddlesplash:

> Vidrep in #14774 reports this "sometimes" occurs for him, but not always; and that if he just waits a while, eventually it boots; which is consistent with the behavior I saw in QEMU. Can you set up your system to auto-boot Haiku and see if it eventually boots with the dual-socket config after a while?

Sure, just note that the stock BIOS I still have on there boots pretty slowly, so expect maybe 30-60 boot attempts per hour, possibly fewer; I'm just guessing. Regardless, that would seem to imply something in the boot process is being left to chance.

comment:12 by cb88, 6 years ago

hrev52698 is bootlooping almost exactly every 60 seconds right now. I'll update here if it boots to the desktop in the next few days; if it takes longer than a couple of days, I'll update here as well.

comment:13 by cb88, 6 years ago

It has hung 3 times since I started testing. I don't have a serial logger on it at the moment, though, so I may not be able to do as much testing as I'd hoped.

by cb88, 6 years ago

Attachment: KGPE-D16_default_hang_after_bootlooping_alot_repeatable.txt added

Hangs after about an hour or two like this.

comment:14 by waddlesplash, 6 years ago

Oho, now that's actually a panic; and my guess is it's probably related here. Tracking down the cause will be harder, probably...

comment:15 by mmlr, 6 years ago

I have investigated this and found a probable cause and submitted a couple of changes to Gerrit:

  • The intermittent failures with 16 cores, 2 sockets were caused by faults when a CPU loaded its task register after another CPU had already overwritten the entry for its TSS in the GDT: https://review.haiku-os.org/c/haiku/+/810 fixes this and probably also fixes this ticket on real hardware.

Can you try reproducing this on real hardware with change 810 applied? The other changes aren't really relevant and don't need to be applied.
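
To make the failure mode described above concrete, here is a minimal model of it. It is only a sketch under stated assumptions: the names (`SharedSlot`, `PerCPUSlot`, `Cpu0LoadsOwnTSS`), the slot index, and the TSS addresses are hypothetical, and this is neither Haiku's real GDT layout nor the code from change 810. It replays one bad interleaving, where CPU 1 writes its TSS descriptor before CPU 0 loads its task register, and compares a shared descriptor slot against one slot per CPU.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model; not Haiku's real GDT layout or the code from change 810.
constexpr int kCPUCount = 2;
constexpr int kFirstTSSSlot = 5;  // assumed index of the first TSS slot
constexpr int kGDTSize = kFirstTSSSlot + kCPUCount;

struct Descriptor {
    uint64_t tssBase = 0;  // address of the TSS this entry describes
};

using GDT = std::array<Descriptor, kGDTSize>;

// Racy layout: every CPU installs its TSS descriptor in the same slot.
int SharedSlot(int) { return kFirstTSSSlot; }

// Safe layout: each CPU owns a slot of its own.
int PerCPUSlot(int cpu) { return kFirstTSSSlot + cpu; }

// Replays one bad interleaving: CPU 0 writes its descriptor, CPU 1 writes
// its descriptor, and only then does CPU 0 load its task register.
// Returns true if CPU 0 still finds its own TSS in the slot it loads from.
bool Cpu0LoadsOwnTSS(int (*slotFor)(int))
{
    GDT gdt{};
    const uint64_t tssBase[kCPUCount] = {0x1000, 0x2000};

    for (int cpu = 0; cpu < kCPUCount; cpu++) {
        int slot = slotFor(cpu);
        assert(slot >= 0 && slot < kGDTSize);
        gdt[slot].tssBase = tssBase[cpu];
    }

    return gdt[slotFor(0)].tssBase == tssBase[0];
}
```

In this model, `Cpu0LoadsOwnTSS(SharedSlot)` reports that CPU 0's entry was clobbered, while `Cpu0LoadsOwnTSS(PerCPUSlot)` reports that it survived. Giving each CPU its own slot is one common way to avoid this kind of clobbering; serializing the write and the load is another. Which approach change 810 takes is not stated in this ticket.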

comment:16 by cb88, 6 years ago

I busted my Gentoo install on there, so I can't build anything at the moment. I'll test the rest, though.

by cb88, 6 years ago

Attachment: KGPE-D16_default_bootloop_hrev52701.txt added

No apparent change that I can see

comment:17 by waddlesplash, 6 years ago

hrev52701 will indeed behave identically. You will have to apply patch 810 manually and rebuild to test the fix.

by cb88, 6 years ago

Attachment: KGPE-D16_default_crash_hrev52707.txt added

Updated in qemu from a live USB, rebooted and got this.

comment:18 by cb88, 6 years ago

I was about to build the 810 patch and noticed that it had been pushed so went ahead and tested.

comment:19 by waddlesplash, 6 years ago

Excellent, so it was a panic that was turning into a triple fault, then!

That message comes from here. It looks like we may need to tweak what/how the bootloader allocates physical pages; it may be grabbing too few.
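
As an illustration of the "grabbing too few" suspicion, here is a minimal sketch of an early allocator that hands out pages from one fixed range reserved ahead of time. Everything in it is an assumption made for illustration: `EarlyPageAllocator`, `CountSuccessful`, and `kPageSize` are hypothetical names, and this is not the Haiku bootloader or kernel code. The point is only that once demand exceeds the reservation, the next request has nothing to return, which is the kind of situation an "error allocating early page" panic would report.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical model of an early page allocator; not Haiku's implementation.
constexpr uint64_t kPageSize = 4096;

class EarlyPageAllocator {
public:
    // base: first physical address of the reserved range (page aligned).
    // pageCount: how many pages were reserved in advance.
    EarlyPageAllocator(uint64_t base, size_t pageCount)
        : fNext(base), fEnd(base + pageCount * kPageSize)
    {
        assert(base % kPageSize == 0);
    }

    // Returns the address of the next free page, or nothing once the
    // reserved range is used up.
    std::optional<uint64_t> Allocate()
    {
        if (fNext >= fEnd)
            return std::nullopt;
        uint64_t page = fNext;
        fNext += kPageSize;
        return page;
    }

private:
    uint64_t fNext;
    uint64_t fEnd;
};

// Issues `requests` allocations and counts how many were satisfied.
size_t CountSuccessful(EarlyPageAllocator& allocator, size_t requests)
{
    size_t granted = 0;
    for (size_t i = 0; i < requests; i++) {
        if (allocator.Allocate().has_value())
            granted++;
    }
    return granted;
}
```

With a reservation of 4 pages and 6 requests, only 4 are granted and every later `Allocate()` call comes back empty. If early-boot demand grows with the machine (for example with CPU count, which is an assumption here and not something this ticket confirms), a reservation sized for small systems could run out in exactly this way.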

comment:20 by waddlesplash, 5 years ago

Summary: KGPE-D16 support, dual socket/NUMA issues → PANIC: error allocating early page

comment:21 by kallisti5, 5 years ago

This was seen on a Ryzen 7 1800x in #13370. The triple-fault has morphed into "PANIC: error allocating early page" as of hrev53673

comment:22 by cb88, 5 years ago

Yes, but I'm not sure these are completely related, since this was on dual-socket Piledriver and that was on single-socket Zen 1... though the 1800X does have 2 CCXes, hmm.

Very weird that it would span so many generations of AMD CPUs with so many changes. Granted Zen 1 and Piledriver share the same front end...

comment:23 by diver, 3 years ago

Blocking: 17232 added

comment:24 by waddlesplash, 10 months ago

Blocking: 18755 added

comment:25 by waddlesplash, 4 weeks ago

Blocking: 19088 added

comment:26 by waddlesplash, 3 weeks ago

Blocking: 18140 added

comment:27 by waddlesplash, 3 weeks ago

Blocking: 19117 added

comment:28 by waddlesplash, 3 weeks ago

Blocking: 19009 added

comment:29 by waddlesplash, 3 weeks ago

Blocking: 18949 added

comment:30 by waddlesplash, 3 weeks ago

Keywords: NUMA removed

This should be fixed in hrev58212. Waiting to hear back from some of the "blocking" tickets before closing it for good, though.

comment:31 by waddlesplash, 3 weeks ago

Blocking: 13991 removed

comment:32 by waddlesplash, 2 weeks ago

Milestone: Unscheduled → R1/beta6
Resolution: → fixed
Status: new → closed

Indeed this seems fixed after hrev58243. There is a remaining problem on 32-bit with PAE but that's a separate issue.
