Opened 6 years ago
Closed 3 months ago
#14659 closed bug (fixed)
PANIC: error allocating early page
| Reported by: | cb88 | Owned by: | nobody |
|---|---|---|---|
| Priority: | normal | Milestone: | R1/beta6 |
| Component: | System/Kernel | Version: | R1/Development |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | #14774, #17232, #18140, #18755, #18949, #19009, #19088, #19117 |
| Platform: | x86-64 | | |
Description
I'm attaching syslogs grabbed from serial here.
Crashes when booted with the 2nd CPU installed and no bootloader switches set.
With one 16-core CPU + 32GB RAM it boots.
Disabling SMP or LAPIC allows booting, but without SMP of course.
Attachments (13)
Change History (45)
by , 6 years ago
Attachment: | KGPE-D16_default.txt added |
---|
by , 6 years ago
Attachment: | KGPE-D16_disablecCPU2inBIOS.txt added |
---|
Disabled CPU2 in BIOS: 1 Opteron 6386se CPU + 32GB
by , 6 years ago
Attachment: | KGPE-D16_disableIOAPIC.txt added |
---|
Same config as _default, but with the IO APIC disabled.
comment:1 by , 6 years ago
So, "done trampolining" means it got to http://xref.plausible.coop/source/xref/haiku/src/system/boot/platform/bios_ia32/smp.cpp#560 i.e. that all CPUs initialized successfully. Immediately after that should be http://xref.plausible.coop/source/xref/haiku/src/system/boot/platform/bios_ia32/long.cpp#345 i.e. kernel entry, but apparently that never happens. What's going on here?
by , 6 years ago
Attachment: | KGPE-D16_disableCPU2bootlooping.txt added |
---|
After failing to boot with CPU2 enabled I disabled it; it then kept bootlooping right after the bootloader. There are several hard power cycles in here, and eventually it booted completely at the end.
by , 6 years ago
Attachment: | QEMU_KVM_2socket_16core.txt added |
---|
sudo qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=16,sockets=2 -s --serial stdio
by , 6 years ago
Attachment: | QEMU_KVM_2socket_8core.txt added |
---|
Took several bootloops before loading: qemu-system-x86_64 -hda /dev/sda -m 12G -enable-kvm -smp cores=8,sockets=2 -s --serial stdio
comment:3 by , 6 years ago
Note that QEMU gets reasonable CPU IDs, 0 through N-1, unlike when running directly on the hardware, where they were offset for whatever reason. It still fails to boot at the same place as real hardware, though.
comment:4 by , 6 years ago
I can't get it to reliably reproduce with cores=8,sockets=2 here; with cores=16 it happens occasionally but eventually boots, with cores=32 it apparently always happens. (4 physical cores on this machine -- processor is an AMD Phenom II.)
However, this is also an experimental Clang build which is not quite right in the head, so perhaps there's some crossed wires.
comment:5 by , 6 years ago
So ... with -d int,cpu_reset we get:
...
wait for delivery
deassert INIT
wait for delivery
num startups = 2
send STARTUP
wait for delivery
send STARTUP
wait for delivery
done trampolining
CPU Reset (CPU 0)
RAX=0000000000000000 RBX=ffffffff801b3160 RCX=00000000c0000101 RDX=00000000ffffffff
RSI=0000000000000040 RDI=ffffffff801b3160 RBP=ffffffff81004fd0 RSP=ffffffff81004fb0
R8 =ffffffff801a2110 R9 =0000000000000000 R10=0000000000000028 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
...
In other words, there's no triple-fault or anything that I can see, the CPU just unceremoniously resets. I added -d guest_errors and that produced nothing else, either. What's going on here?
follow-up: 7 comment:6 by , 6 years ago
On a system without KVM, cores=2,sockets=2 is enough to trigger it. So it is apparently a race condition of some kind.
comment:7 by , 6 years ago
Replying to waddlesplash:
On a system without KVM, cores=2,sockets=2 is enough to trigger it. So it is apparently a race condition of some kind.
That's very odd. QEMU's CPU emulation is single-threaded unless you explicitly turn on multithreaded TCG, so you wouldn't think it would occur there. From what I understand, IO and such is on a separate thread... perhaps that is it, since 2 sockets imply multiple APICs? Perhaps the APICs are getting set up incorrectly somehow?
comment:8 by , 6 years ago
Blocking: | 14774 added |
---|
follow-up: 11 comment:9 by , 6 years ago
Vidrep in #14774 reports this "sometimes" occurs for him, but not always; and that if he just waits a while, eventually it boots; which is consistent with the behavior I saw in QEMU. Can you set up your system to auto-boot Haiku and see if it eventually boots with the dual-socket config after a while?
comment:10 by , 6 years ago
Blocking: | 13991 added |
---|
comment:11 by , 6 years ago
Replying to waddlesplash:
Vidrep in #14774 reports this "sometimes" occurs for him, but not always; and that if he just waits a while, eventually it boots; which is consistent with the behavior I saw in QEMU. Can you set up your system to auto-boot Haiku and see if it eventually boots with the dual-socket config after a while?
Sure, just note that the stock BIOS I still have on there boots pretty slowly, so maybe 30-60 boot attempts per hour, maybe less (just guessing). Regardless, that would seem to imply something in the boot process is being left to chance.
comment:12 by , 6 years ago
hrev52698 is bootlooping almost exactly every 60 seconds right now... I'll update here if it boots to the desktop in the next few days; if it takes longer than a couple of days I'll update here as well.
comment:13 by , 6 years ago
It has hung 3 times since I started testing... I don't have a serial logger on it at the moment, though, so I may not be able to do as much testing as I'd hoped.
by , 6 years ago
Attachment: | KGPE-D16_default_hang_after_bootlooping_alot_repeatable.txt added |
---|
Hangs after about an hour or two like this.
comment:14 by , 6 years ago
Oho, now that's actually a panic; and my guess is it's probably related here. Tracking down the cause will be harder, probably...
comment:15 by , 6 years ago
I have investigated this and found a probable cause and submitted a couple of changes to Gerrit:
- Early debug output was broken, complicating the investigation of this ticket: https://review.haiku-os.org/#/c/haiku/+/808
- The guaranteed reset with the 32 cores, 2 sockets configuration of qemu was caused by an off-by-one in an assert: https://review.haiku-os.org/c/haiku/+/809
- The intermittent failures with 16 cores, 2 sockets were caused by faults when loading the task register while another CPU already overwrote the entry for its TSS in the GDT: https://review.haiku-os.org/c/haiku/+/810 fixes this and probably also fixes this ticket on real hardware.
Can you try reproducing this on real hardware with change 810 applied? The other changes aren't really relevant and don't need to be applied.
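To make the third point easier to picture, here is a minimal user-space model of the failure mode, not the actual kernel code and not what change 810 itself contains. It assumes a simplified descriptor with only a base and a type, and every name in it (`TssDescriptor`, `set_tss_descriptor`, `load_task_register`, `boot_all_cpus`, `kMaxCpus`) is hypothetical. The one hardware fact it relies on is that loading the task register marks the TSS descriptor busy, and that loading from a descriptor that is not "available" faults.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Simplified TSS descriptor: only the two fields this sketch needs.
enum class TssType : std::uint8_t { Available = 0x9, Busy = 0xB };

struct TssDescriptor {
    std::uint64_t base;  // address of the owning CPU's TSS
    TssType type;
};

constexpr int kMaxCpus = 8;  // hypothetical limit for this sketch

struct Gdt {
    std::array<TssDescriptor, kMaxCpus> tss_slots;
};

// Hypothetical TSS address for a given CPU, distinct per CPU.
std::uint64_t tss_base_of(int cpu) {
    return 0x1000u * static_cast<std::uint64_t>(cpu + 1);
}

// Installs a fresh "available" descriptor into the given slot.
void set_tss_descriptor(Gdt& gdt, int slot, std::uint64_t tss_base) {
    assert(slot >= 0 && slot < kMaxCpus);
    gdt.tss_slots[slot].base = tss_base;
    gdt.tss_slots[slot].type = TssType::Available;
}

// Models the task register load: returns false (a fault on real hardware)
// if the descriptor is not available, otherwise marks it busy.
bool load_task_register(Gdt& gdt, int slot, std::uint64_t& loaded_base) {
    TssDescriptor& d = gdt.tss_slots[slot];
    if (d.type != TssType::Available)
        return false;
    d.type = TssType::Busy;
    loaded_base = d.base;
    return true;
}

struct BootResult {
    int faults;     // CPUs whose task register load faulted
    int wrong_tss;  // CPUs that loaded another CPU's TSS
};

// Replays one fixed bad interleaving: all CPUs write their descriptor
// first, then all CPUs load. With per_cpu_slots == false every CPU uses
// slot 0, which is the shared-entry situation described above.
BootResult boot_all_cpus(int cpu_count, bool per_cpu_slots) {
    assert(cpu_count >= 1 && cpu_count <= kMaxCpus);
    Gdt gdt{};
    BootResult result{0, 0};
    for (int cpu = 0; cpu < cpu_count; ++cpu)
        set_tss_descriptor(gdt, per_cpu_slots ? cpu : 0, tss_base_of(cpu));
    for (int cpu = 0; cpu < cpu_count; ++cpu) {
        std::uint64_t loaded = 0;
        if (!load_task_register(gdt, per_cpu_slots ? cpu : 0, loaded))
            ++result.faults;
        else if (loaded != tss_base_of(cpu))
            ++result.wrong_tss;
    }
    return result;
}
```

In this model, `boot_all_cpus(2, false)` ends with one CPU running on the other's TSS and one faulting load, while `boot_all_cpus(2, true)` completes cleanly. Whether a real boot hits the bad interleaving depends on timing, which would match the intermittent behaviour reported above.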
comment:16 by , 6 years ago
I busted my Gentoo install on there, so I can't build anything at the moment. I'll test the rest, though.
by , 6 years ago
Attachment: | KGPE-D16_default_bootloop_hrev52701.txt added |
---|
No apparent change that I can see
comment:17 by , 6 years ago
hrev52701 will indeed behave identically. You will have to apply patch 810 manually and rebuild to test the fix.
by , 6 years ago
Attachment: | KGPE-D16_default_crash_hrev52707.txt added |
---|
Updated in qemu from a live USB, rebooted and got this.
comment:18 by , 6 years ago
I was about to build the 810 patch and noticed that it had been pushed so went ahead and tested.
comment:19 by , 6 years ago
Excellent, so it was a panic that was turning into a triple fault, then!
That message comes from here. It looks like we may need to tweak what/how the bootloader allocates physical pages; it may be grabbing too few.
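As a rough illustration of "grabbing too few", here is a sketch of a bump allocator that hands out pages from ranges reserved in advance and fails once they run out. This is not Haiku's code: `PhysicalRange`, `EarlyAllocator`, `allocate_early_page` and `pages_until_exhausted` are hypothetical names, and the sketch assumes that no range starts at address 0, so that 0 can serve as the failure value where the kernel would panic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kPageSize = 4096;

// Hypothetical stand-in for a physical range reserved by the bootloader.
struct PhysicalRange {
    std::uint64_t start;  // assumed non-zero in this sketch
    std::uint64_t size;   // in bytes
};

struct EarlyAllocator {
    std::vector<PhysicalRange> ranges;
    std::size_t current = 0;            // index of the range being consumed
    std::uint64_t used_in_current = 0;  // bytes already handed out from it
};

// Returns the physical address of a fresh page, or 0 once every reserved
// range is used up (the point where a kernel would have to panic).
std::uint64_t allocate_early_page(EarlyAllocator& a) {
    while (a.current < a.ranges.size()) {
        const PhysicalRange& r = a.ranges[a.current];
        if (a.used_in_current + kPageSize <= r.size) {
            std::uint64_t page = r.start + a.used_in_current;
            a.used_in_current += kPageSize;
            return page;
        }
        // Current range is full: move on to the next one.
        ++a.current;
        a.used_in_current = 0;
    }
    return 0;
}

// Counts how many pages can be served before allocation fails.
// Takes the allocator by value so the caller's state is untouched.
std::size_t pages_until_exhausted(EarlyAllocator a) {
    std::size_t count = 0;
    while (allocate_early_page(a) != 0)
        ++count;
    return count;
}
```

If early boot needs more pages than were reserved, for example because per-CPU setup grows with the number of cores, an allocator of this shape fails at a point that depends on the machine configuration rather than on any single code path.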
comment:20 by , 5 years ago
Summary: | KGPE-D16 support, dual socket/NUMA issues → PANIC: error allocating early page |
---|
comment:21 by , 5 years ago
comment:22 by , 5 years ago
Yes, but I'm not sure these are completely related, since this was on dual-socket Piledriver and that was on single-socket Zen 1... though the 1800X does have 2 CCXes, hmm.
Very weird that it would span so many generations of AMD CPUs with so many changes. Granted Zen 1 and Piledriver share the same front end...
comment:23 by , 3 years ago
Blocking: | 17232 added |
---|
comment:24 by , 12 months ago
Blocking: | 18755 added |
---|
comment:25 by , 3 months ago
Blocking: | 19088 added |
---|
comment:26 by , 3 months ago
Blocking: | 18140 added |
---|
comment:27 by , 3 months ago
Blocking: | 19117 added |
---|
comment:28 by , 3 months ago
Blocking: | 19009 added |
---|
comment:29 by , 3 months ago
Blocking: | 18949 added |
---|
comment:30 by , 3 months ago
Keywords: | NUMA removed |
---|
This should be fixed in hrev58212. Waiting to hear back from some of the "blocking" tickets before closing it for good, though.
comment:31 by , 3 months ago
Blocking: | 13991 removed |
---|
comment:32 by , 3 months ago
Milestone: | Unscheduled → R1/beta6 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Indeed this seems fixed after hrev58243. There is a remaining problem on 32-bit with PAE but that's a separate issue.
Booted with 2 Opteron 6386se engineering samples + 64GB