Opened 7 years ago

Last modified 6 days ago

#14082 new bug

PAE paging leads to random reboots/freezing using AMD-FX 6300 6-core processor

Reported by: nzimmermann
Owned by: nobody
Priority: normal
Milestone: Unscheduled
Component: System/Kernel
Version: R1/Development
Keywords: vm
Cc:
Blocked By:
Blocking: #14770, #14887, #15988, #17216
Platform: x86

Description

My system consists of an AMD-FX 6300 6-core machine with 16GB DDR3 RAM, and Haiku is installed as the primary OS. The machine randomly reboots and/or freezes (the latter less often) after a few minutes when the system is under heavy load (compiling a large software project, using make -j 6).

I replaced the motherboard, CPU, RAM, power supply etc., but the system remained unstable. Finally, after playing with the safe mode settings, I found a workaround: enabling 4gb_memory_limit. (This was a suggestion from korli on ticket #10279.)

In src/system/kernel/arch/x86/arch_vm_translation_map.cpp a decision is made based on this setting. PAE paging is disabled, and 32 bit paging is enabled. The paging method is switched from X86PagingMethodPAE to X86PagingMethod32Bit.
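
As a rough illustration of that decision (a simplified sketch, not the actual code from arch_vm_translation_map.cpp; the enum, the function name and its two boolean parameters are made up here):

```cpp
#include <cassert>

// Hypothetical stand-ins for X86PagingMethod32Bit and X86PagingMethodPAE.
enum class PagingMethod { k32Bit, kPAE };

// Simplified model: PAE is only chosen when the CPU supports it and the
// 4gb_memory_limit safe mode option is not set. The real kernel may take
// further conditions into account.
inline PagingMethod
select_paging_method(bool cpuSupportsPAE, bool fourGBMemoryLimit)
{
	if (cpuSupportsPAE && !fourGBMemoryLimit)
		return PagingMethod::kPAE;
	return PagingMethod::k32Bit;
}

// Usage: with the safe mode option enabled, 32 bit paging is used even on a
// PAE-capable CPU.
inline void
paging_method_example()
{
	assert(select_paging_method(true, true) == PagingMethod::k32Bit);
	assert(select_paging_method(true, false) == PagingMethod::kPAE);
}
```

The point of the sketch is only that the option acts as a switch between the two paging implementations, not as a pure memory cap.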

Using the latter paging method, my system is stable: no random reboots, no random binary crashes, nothing. I'm confident that the actual memory limit of 4 GB is not what cures the problem, as I can also reproduce the reboots with only a single 4 GB RAM stick installed --> it must be the paging.

NOTE: Other people reported that the 4gb_memory_limit does not help with the random binary crashes (as stated on ticket #10279, by kallisti5), but in my case it does make the difference. Apparently in earlier days, the 4gb_memory_limit did NOT deactivate PAE paging, but nowadays it does.

Attachments (2)

0000-Test-Apply-Microcode-Update.diff (44.4 KB ) - added by nzimmermann 7 years ago.
WIP patch implementing microcode updates
syslog_hrev58259.txt (74.2 KB ) - added by dovsienko 2 months ago.
/var/log/syslog for hrev58259 (sanitised)


Change History (72)

comment:1 by korli, 7 years ago

One big difference between 32bit and PAE is the size of page_directory_entry and pae_page_table_entry (32 vs 64 bits). Now, these entries on x86 are not updated atomically, which might or might not be an issue (I heard that PAE usually requires cmpxchg64). It's only a hunch.
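
To illustrate the concern, here is a small user-space sketch (not Haiku code; all names are made up). It models what could be observed if a 64-bit entry is written as two independent 32-bit stores, and contrasts it with a single atomic exchange. On 32-bit x86, compilers usually implement such a 64-bit exchange with a cmpxchg8b loop, which is presumably what the cmpxchg64 remark is about.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef uint64_t pae_entry;	// a PAE paging entry is 64 bits wide

// Value an observer could see if the entry is written as two separate
// 32-bit stores and only the low half has been written so far.
inline pae_entry
torn_value(pae_entry oldEntry, pae_entry newEntry)
{
	const pae_entry kLowMask = 0xffffffffull;
	return (oldEntry & ~kLowMask) | (newEntry & kLowMask);
}

// Replaces the entry in one indivisible step and returns the old value.
inline pae_entry
set_entry_atomically(std::atomic<pae_entry>& entry, pae_entry newEntry)
{
	return entry.exchange(newEntry);
}

// Usage: the old entry points at 0x100001000, the new one at 0x200002000
// (flags 0x7 in both). The torn value points at 0x100002000, which is
// neither of the two.
inline void
torn_write_example()
{
	assert(torn_value(0x0000000100001007ull, 0x0000000200002007ull)
		== 0x0000000100002007ull);
}
```

Whether such an intermediate state can actually be hit depends on whether the entry is live while it is being written, so this only shows why the question is worth asking.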

in reply to:  1 ; comment:2 by nzimmermann, 7 years ago

Replying to korli:

One big difference between 32bit and PAE is the size of page_directory_entry and pae_page_table_entry (32 vs 64 bits). Now, these entries on x86 are not updated atomically, which might or might not be an issue (I heard that PAE usually requires cmpxchg64). It's only a hunch.

Thanks korli for commenting. Forgive my ignorance, this topic is new to me, but isn't setting page table entries done atomically?

/*static*/ inline pae_page_table_entry
X86PagingMethodPAE::SetPageTableEntry(pae_page_table_entry* entry,
        pae_page_table_entry newEntry)
{
        return atomic_get_and_set64((int64*)entry, newEntry);
}

in reply to:  2 ; comment:3 by korli, 7 years ago

Replying to Nikolas Zimmermann:

Thanks korli for commenting. Forgive my ignorance, this topic is new to me, but isn't setting page table entries done atomically?

/*static*/ inline pae_page_table_entry
X86PagingMethodPAE::SetPageTableEntry(pae_page_table_entry* entry,
        pae_page_table_entry newEntry)
{
        return atomic_get_and_set64((int64*)entry, newEntry);
}

This one is used only when unmapping. For the mapping, see here

/*static*/ void
X86PagingMethodPAE::PutPageTableInPageDir(pae_page_directory_entry* entry,
	phys_addr_t physicalTable, uint32 attributes)
{
	*entry = (physicalTable & X86_PAE_PDE_ADDRESS_MASK)
		| X86_PAE_PDE_PRESENT
		| X86_PAE_PDE_WRITABLE
		| X86_PAE_PDE_USER;
		// TODO: We ignore the attributes of the page table -- for compatibility
		// with BeOS we allow having user accessible areas in the kernel address
		// space. This is currently being used by some drivers, mainly for the
		// frame buffer. Our current real time data implementation makes use of
		// this fact, too.
		// We might want to get rid of this possibility one day, especially if
		// we intend to port it to a platform that does not support this.
}

It's used here. Then compare with 64bit.

comment:4 by korli, 7 years ago

Cc: ingo_weinhold@… axeld@… kallisti5@… removed

comment:5 by nzimmermann, 7 years ago

NOTE: The random binary crashes are still present, even though the likelihood is decreased when disabling PAE paging. The link to bug #10279 should be removed.


in reply to:  3 comment:6 by nzimmermann, 7 years ago

Replying to korli:

This one is used only when unmapping. For the mapping, see here

Okay, I'll give it a try, thanks for the suggestions.

comment:7 by kallisti5, 7 years ago

What's odd is this issue seems completely limited to AMD Bulldozer systems (MBR/BIOS booted)

I upgraded my desktop to Ryzen, and the few times I've successfully booted with EFI I haven't seen the issue. *however*... MBR/BIOS boot doesn't work on my Ryzen (#13370), so maybe related?

comment:8 by korli, 7 years ago

My suggestions as a patch https://review.haiku-os.org/#/c/120/

in reply to:  8 comment:9 by nzimmermann, 7 years ago

Replying to korli:

My suggestions as a patch https://review.haiku-os.org/#/c/120/

Thanks korli, you were quicker than me :-(

My attempt was to simply call atomic_get_and_set64 in PutPageTableInPageDir(), but your cleanup is indeed nicer. Unfortunately I couldn't test it yet - due to #13980, which currently makes it impossible for me to compile Haiku.

I'm trying waddlesplash's attempt to fix it, together with your patch - hopefully I have a new build available in the next few hours, and can confirm whether the PAE problem is fixed or not.
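
For reference, the "simply store the entry atomically" idea can be sketched in user space roughly like this. It is neither the patch from the review nor Haiku code: std::atomic stands in for atomic_get_and_set64, and the constants are local definitions following the architectural PAE layout (bit 0 present, bit 1 writable, bit 2 user, bits 12-51 physical address).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef uint64_t pae_page_directory_entry;

// Local definitions, only here to keep the sketch self-contained.
const uint64_t kPdePresent = 1ull << 0;
const uint64_t kPdeWritable = 1ull << 1;
const uint64_t kPdeUser = 1ull << 2;
const uint64_t kPdeAddressMask = 0x000ffffffffff000ull;

// Builds the complete 64-bit value first and publishes it with a single
// atomic store, so no half-written entry is ever visible.
inline void
put_page_table_in_page_dir(std::atomic<pae_page_directory_entry>& entry,
	uint64_t physicalTable)
{
	assert((physicalTable & 0xfff) == 0);	// page tables are page aligned
	entry.store((physicalTable & kPdeAddressMask)
		| kPdePresent | kPdeWritable | kPdeUser);
}
```

For a page table at physical address 0x123456000 the resulting entry is 0x123456007.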

comment:10 by nzimmermann, 7 years ago

New kernel is compiled, and running, let's see if I can still reproduce problems with PAE enabled.

in reply to:  10 comment:11 by nzimmermann, 7 years ago

Replying to Nikolas Zimmermann:

New kernel is compiled, and running, let's see if I can still reproduce problems with PAE enabled.

Unfortunately it crashed again, in the same way, during make -j6. I've captured the syslog using a serial cable, but it's not informative I fear:

...
bfs: bfs_rename:1145: No such file or directory
Last message repeated 8 times.
slab memory manager: created area 0x93801000 (124296)
slab memory manager: created area 0x94001000 (124297)
bfs: bfs_rename:1145: No such file or directory
hda: Unsolicited response: 00000080/00000010
hda: sensed pin widget 27, 1
hda: Unsolicited response: 00000000/00000010
hda: sensed pin widget 27, 0
slab memory manager: created area 0x94801000 (167682)
slab memory manager: created area 0x95001000 (1169484)
hda: Unsolicited response: 00000080/00000010
hda: sensed pin widget 27, 1
hda: Unsolicited response: 00000000/00000010
hda: sensed pin widget 27, 0
slab memory manager: created area 0x95801000 (1365358)

comment:12 by korli, 7 years ago

Thanks for the test, it would have been surprising given the changes involved. Is it with USB booting?

in reply to:  12 comment:13 by nzimmermann, 7 years ago

Replying to korli:

Thanks for the test, it would have been surprising given the changes involved. Is it with USB booting?

No, Haiku is installed on a SATA HDD on my machine.

comment:14 by nzimmermann, 7 years ago

I've further analyzed this, and checked that all locks / CPU pinning are done correctly, by comparing to the 64bit implementation. There is only one really interesting difference: X86VMTranslationMapPAE::QueryInterrupt(). In the 64bit paging implementation, QueryInterrupt() simply calls Query(), whereas in the PAE code a specific implementation exists. The Query() code for both 64bit and PAE enforces thread CPU pinning, but the QueryInterrupt() code for PAE doesn't do that.

As I said before, I'm no expert in this area, and can't judge whether this is problematic or not, but just wanted to mention this difference. I've observed that I always see "hda: Unsolicited response" in the syslog before the machine crashes, and that message comes from the HDA audio driver IRQ handling.

What do the experts think? Could a missing ThreadCPUPinner in the QueryInterrupt() function cause such problems?
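
To make the question concrete, this is roughly what "pinning" means, as a toy user-space sketch. FakeThread, CPUPinGuard and query_while_pinned are invented for illustration; the real ThreadCPUPinner and thread structure look different.

```cpp
#include <cassert>

// Hypothetical, heavily simplified thread structure.
struct FakeThread {
	int pinned_to_cpu = 0;	// > 0 means the scheduler must not migrate it
};

// RAII guard in the spirit of ThreadCPUPinner: the thread stays on its
// current CPU for as long as the guard object is alive.
class CPUPinGuard {
public:
	explicit CPUPinGuard(FakeThread& thread) : fThread(thread)
	{
		fThread.pinned_to_cpu++;
	}

	~CPUPinGuard()
	{
		assert(fThread.pinned_to_cpu > 0);
		fThread.pinned_to_cpu--;
	}

	CPUPinGuard(const CPUPinGuard&) = delete;
	CPUPinGuard& operator=(const CPUPinGuard&) = delete;

private:
	FakeThread& fThread;
};

// A lookup that depends on per-CPU state would hold the guard for its
// whole duration. Returns whether the thread was pinned during the lookup.
inline bool
query_while_pinned(FakeThread& thread)
{
	CPUPinGuard pinner(thread);
	return thread.pinned_to_cpu > 0;
}
```

If QueryInterrupt() is only ever called with interrupts disabled, the thread could not be rescheduled during the call anyway, which might explain the missing pinner; that is an assumption on my side, not something I verified.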

comment:15 by korli, 7 years ago

No idea about QueryInterrupt(). About the HDA audio driver, you could blacklist it to check whether it could be a source of problems. This can actually be done with a few drivers. A warning though, audio or network drivers run continuously and tend to exercise the system, that doesn't mean that they are at fault (like for random crashes).

comment:16 by nzimmermann, 7 years ago

My QueryInterrupt changes didn't help - I got on the wrong track, as disabling the HDA driver doesn't help either. Still random hangs w/o any indication in the syslog of what went wrong.

I tried to enable the PAE tracing, but it takes hours to boot into Haiku with that. I wish I could only enable it after a certain point in time.

comment:17 by nzimmermann, 7 years ago

I've read through all related closed and opened PAE tickets, and came across the memeater tool:

Using several memeater processes at the same time, each allocating 1000MB 100 times, I can trigger the following page fault: (Log obtained from minicom on a second machine via serial cable)

[2018-04-18 10:32:07] vm_soft_fault: va 0x1b900000 not covered by area in address space
[2018-04-18 10:32:07] vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x1b900100, ip 0x801470b3, write 1, user 0, thref
[2018-04-18 10:32:07] PANIC: vm_page_fault: unhandled page fault in kernel space at 0x1b900100, ip 0x801470b3
[2018-04-18 10:32:07] 
[2018-04-18 10:32:07] Welcome to Kernel Debugging Land...
[2018-04-18 10:32:07] Thread 831 "memeater.c" running on CPU 4
[2018-04-18 10:32:07] stack trace for thread 831 "memeater.c"
[2018-04-18 10:32:07]     kernel stack: 0x822c9000 to 0x822cd000
[2018-04-18 10:32:07]       user stack: 0x70363000 to 0x71363000
[2018-04-18 10:32:07] frame               caller     <image>:function + offset
[2018-04-18 10:32:07]  0 822ccbc4 (+  32) 8014bfde   <kernel_x86> arch_debug_stack_trace + 0x12
[2018-04-18 10:32:07]  1 822ccbe4 (+  16) 800a9183   <kernel_x86> stack_trace_trampoline(NULL) + 0x0b
[2018-04-18 10:32:07]  2 822ccbf4 (+  12) 8013db72   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
[2018-04-18 10:32:07]  3 822ccc00 (+  48) 800aac0c   <kernel_x86> debug_call_with_fault_handler + 0x60
[2018-04-18 10:32:07]  4 822ccc30 (+  64) 800a9397   <kernel_x86> kernel_debugger_loop(0x801904b7 "PANIC: ", 0x801a71a0 "vm_page_fault: unhax
[2018-04-18 10:32:07] ", 0x822cccdc "", int32: 4) + 0x20f
[2018-04-18 10:32:07]  5 822ccc70 (+  48) 800a973b   <kernel_x86> kernel_debugger_internal(0x801904b7 "PANIC: ", 0x801a71a0 "vm_page_fault: x
[2018-04-18 10:32:07] ", 0x822cccdc "", int32: 4) + 0x77
[2018-04-18 10:32:07]  6 822ccca0 (+  48) 800aaf8e   <kernel_x86> panic + 0x3a
[2018-04-18 10:32:07]  7 822cccd0 (+ 144) 80122ac9   <kernel_x86> vm_page_fault + 0x13d
[2018-04-18 10:32:07]  8 822ccd60 (+  80) 8014d850   <kernel_x86> x86_page_fault_exception + 0x1ec
[2018-04-18 10:32:07]  9 822ccdb0 (+  12) 8014046c   <kernel_x86> int_bottom + 0x3c
[2018-04-18 10:32:07] kernel iframe at 0x822ccdbc (end = 0x822cce0c)
[2018-04-18 10:32:07]  eax 0x4060008     ebx 0x88bd1afc     ecx 0x1b900100  edx 0xf1759018
[2018-04-18 10:32:07]  esi 0x80000001    edi 0x4b1b5000     ebp 0x822cce54  esp 0x822ccdf0
[2018-04-18 10:32:07]  eip 0x801470b3 eflags 0x10206   
[2018-04-18 10:32:07]  vector: 0xe, error code: 0x2
[2018-04-18 10:32:07] 10 822ccdbc (+ 152) 801470b3   <kernel_x86> X86VMTranslationMapPAE<0xdff84818>::UnmapArea(VMArea*: 0xdfc67270, false) 7
[2018-04-18 10:32:07] 11 822cce54 (+  64) 8011f2db   <kernel_x86> delete_area(VMAddressSpace*: 0xdffad360, VMArea*: 0xdfc67270, false) + 0xff
[2018-04-18 10:32:07] 12 822cce94 (+ 144) 8011f5cf   <kernel_x86> vm_delete_area + 0x203
[2018-04-18 10:32:07] 13 822ccf24 (+  32) 80126d8a   <kernel_x86> _user_delete_area + 0x1e
[2018-04-18 10:32:07] 14 822ccf44 (+ 100) 8014066f   <kernel_x86> handle_syscall + 0xdc
[2018-04-18 10:32:07] user iframe at 0x822ccfa8 (end = 0x822cd000)
[2018-04-18 10:32:07]  eax 0xc1          ebx 0x115d138      ecx 0x713622ac  edx 0x620b7114
[2018-04-18 10:32:07]  esi 0x7136353c    edi 0x7136354c     ebp 0x713622d8  esp 0x822ccfdc
[2018-04-18 10:32:07]  eip 0x620b7114 eflags 0x3202    user esp 0x713622ac
[2018-04-18 10:32:07]  vector: 0x63, error code: 0x0
[2018-04-18 10:32:07] 15 822ccfa8 (+   0) 620b7114   <commpage> commpage_syscall + 0x04
[2018-04-18 10:32:07] 16 713622d8 (+  80) 00788be4   <memeater.c> main + 0x214
[2018-04-18 10:32:07] 17 71362328 (+  48) 007888b7   <memeater.c> _start + 0x5b
[2018-04-18 10:32:07] 18 71362358 (+  64) 00bfb000   </boot/system/runtime_loader@0x00be9000> <unknown> + 0x12000
[2018-04-18 10:32:07] 19 71362398 (+   0) 620b7250   <commpage> commpage_thread_exit + 0x00

Not sure if it's at all related to my random system freezes, but at least there's a hint something is still wrong in the PAE code.
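
The stress pattern used above (allocate a large block, touch it, release it, repeat) can be sketched as follows. This is not the memeater source: the backtrace suggests the real tool works with areas (it ends up in delete_area), while this sketch uses malloc to stay portable. The function name and the 4 KiB page size are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Allocates `bytes` of memory, writes to every page so the kernel actually
// has to map it, and frees it again. Repeats this `rounds` times and returns
// the number of completed rounds (a failed allocation ends the loop early).
inline int
eat_memory(size_t bytes, int rounds)
{
	assert(bytes > 0 && rounds >= 0);
	const size_t kPageSize = 4096;	// assumption: 4 KiB pages

	int completed = 0;
	for (int i = 0; i < rounds; i++) {
		void* buffer = std::malloc(bytes);
		if (buffer == NULL)
			break;

		// volatile keeps the compiler from optimizing the writes away
		volatile unsigned char* bytesOut
			= static_cast<volatile unsigned char*>(buffer);
		for (size_t offset = 0; offset < bytes; offset += kPageSize)
			bytesOut[offset] = 1;

		std::free(buffer);
		completed++;
	}
	return completed;
}
```

Running several such processes in parallel with something like eat_memory(1000u * 1024 * 1024, 100) corresponds to the 1000MB x 100 setup described above.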

comment:18 by nzimmermann, 7 years ago

Some news on this item: The system is completely unstable (random reboots) with PAE paging enabled, and stable with 32bit paging. By now I've replaced the RAM and played with memory settings in BIOS, but none of it helps.

I went ahead and installed an x86_64 Haiku version using 64bit paging, and this is even more unstable. The system randomly reboots after a few minutes, even when not doing heavy compilation work. Copying a file via scp, or browsing with Web+ is already sufficient to trigger a reboot -- as always, there is nothing in the syslog that gives any hint. Sometimes I saw a corrupted image on the display, e.g. a checkerboard pattern, or just a plain color, which made me initially think it could be related to the graphics card.

Maybe a stupid question, but still: My machine uses an onboard radeon GPU, and the RAM is shared according to the BIOS settings (e.g. 256MB is mapped for the GPU). Does Haiku know about that? Do we have any means to detect that a certain portion of the RAM is dedicated for the GPU?

comment:19 by pulkomandy, 7 years ago

You can check the RAM size reported by the OS (for example in AboutSystem). It should either reduce the total size there or report some "inaccessible" RAM. If it doesn't, indeed we have a problem.


in reply to:  19 comment:20 by nzimmermann, 7 years ago

Replying to PulkoMandy:

You can check the RAM size reported by the OS (for example in AboutSystem). It should either reduce the total size there or report some "inaccessible" RAM. If it doesn't, indeed we have a problem.

Thanks for the suggestion. The inaccessible RAM stays at 1 MiB independent of the frame buffer size I chose in the BIOS, but the actual available RAM reflects the BIOS settings --- also the syslog tells me that the radeon_hd driver found the correct frame buffer size - corresponding to the BIOS settings.

comment:21 by nzimmermann, 7 years ago

New theory: I'm starting to believe that this is an intrinsic AMD-FX instability that can be cured with microcode updates, as the system is stable under Linux - which includes ucode update facilities. Haiku lacks the ability to update the CPU microcode.

From https://launchpad.net/debian/+source/amd64-microcode/+changelog:

    + This update fixes important processor bugs that cause data corruption
      or unpredictable system behaviour.  It also fixes a performance issue
      and several issues that cause system lockup.

Some of these apply to AMD-FX 6300, the CPU I'm using - according to the AMD 15h documentation.

comment:22 by korli, 7 years ago

Does a BIOS update include these microcode updates?

in reply to:  22 comment:23 by nzimmermann, 7 years ago

Replying to korli:

Does a BIOS update include these microcode updates?

Nope, the last BIOS update for my ASUS is from 2014.

by nzimmermann, 7 years ago

WIP patch implementing microcode updates

in reply to:  22 ; comment:24 by nzimmermann, 7 years ago

Replying to korli:

Does a BIOS update include these microcode updates?

Dear korli,

I've started to port the Linux microcode updating facilities to Haiku, with partial success. According to the syslog the microcode is now properly updated, to the latest version that's available for my AMD Bulldozer (0x15) -- but the crashes that I observe on x86_64 remain.

There are a few possibilities:

  • my limited knowledge of this topic introduced a subtle bug, and the microcode is not correctly updated on all CPUs (e.g. perhaps I've not properly protected/isolated the individual CPU updates from each other; not sure if that's actually needed, or whether x86_write_msr can be safely done as-is)
  • I update the ucode too late/early in the booting process
  • Outdated microcode is not the issue for my random-reboot-issue

TODO:

  • Microcode updating only works for AMD 0x15 CPUs, simply because I didn't bother to extend to anything else for now.
  • Not sure how to "package" the ucode blobs so that they're available early in the boot process -- for now I've hexdumped the amd-ucode.bin file, and hardcoded it into the kernel - an ugly workaround for now.

comment:25 by nzimmermann, 7 years ago

Syslog excerpt:

2018-05-11 18:05:17 KERN: Welcome to kernel debugger output!
2018-05-11 18:05:17 KERN: Haiku revision: hrev51927-3-g645dc27 [master]
2018-05-11 18:05:17 KERN: CPU 0: type 0 family 15 extended_family 6 model 2 extended_model 0 stepping 0, string 'AuthenticAMD'
2018-05-11 18:05:17 KERN: CPU 0: vendor 'AMD' model name 'AMD FX(tm)-6300 Six-Core Processor             '
2018-05-11 18:05:17 KERN: CPU 0: apic id 0, package 0, core 0, smt 0
2018-05-11 18:05:17 KERN: CPU 0: cache sharing: L1 id 0, L2 id 0, L3 id 0
2018-05-11 18:05:17 KERN: CPU 0: features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clfsh mmx fxsr sse sse2 htt sse3 pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c syscall nx mmxext ffxsr long aperfmperf bmi1 
2018-05-11 18:05:17 KERN: microcode: Obtaining AMD microcode update
2018-05-11 18:05:17 KERN: microcode: Using microcode binary 'amd_fam15h.bin' blob starting at 0xffffffff801aad38, size 7877 bytes.
2018-05-11 18:05:17 KERN: microcode: Successfully extracted AMD microcode firmware: patch_id: 100665423, processor_rev_id 24608
2018-05-11 18:05:17 KERN: microcode: Currently loaded revision 100665378
2018-05-11 18:05:17 KERN: microcode: Successfully patched microcode to revision 100665423
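
The ids in the excerpt are printed in decimal. In hex, the previously loaded revision 100665378 is 0x06000822, the new patch_id 100665423 is 0x0600084f, and processor_rev_id 24608 is 0x6020. A small helper for that conversion, plus the "is the patch newer" check, could look like this (a sketch with invented names, not code from the attached patch; the "only apply if newer" rule is an assumption about the intended policy):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a 32-bit id as an 8-digit hex number, the way microcode
// revisions are usually written.
inline std::string
to_hex(uint32_t value)
{
	char buffer[16];
	std::snprintf(buffer, sizeof(buffer), "0x%08x",
		static_cast<unsigned int>(value));
	return buffer;
}

// Assumption: a patch is only worth applying if it is newer than the
// revision the CPU currently runs.
inline bool
should_apply_patch(uint32_t currentRevision, uint32_t patchId)
{
	return patchId > currentRevision;
}

// Usage with the values from the syslog excerpt above.
inline void
microcode_example()
{
	assert(to_hex(100665378u) == "0x06000822");
	assert(to_hex(100665423u) == "0x0600084f");
	assert(should_apply_patch(100665378u, 100665423u));
}
```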

in reply to:  24 ; comment:26 by korli, 7 years ago

Replying to Nikolas Zimmermann:

I've started to port the Linux microcode updating facilities to Haiku, with partial success. According to the syslog the microcode is now properly updated, to the latest version that's available for my AMD Bulldozer (0x15) -- but the crashes that I observe on x86_64 remain.

Sorry to read that. Is there maybe a pattern when the crashes don't happen? Did you try to blacklist all your unneeded drivers (for instance USB ehci,xhci)? I know the crashes also exist on Intel.

There are a few possibilities:

  • my limited knowledge of this topic introduced a subtle bug, and the microcode is not correctly updated on all CPUs (e.g. perhaps I've not properly protected/isolated the individual CPU updates from each other; not sure if that's actually needed, or whether x86_write_msr can be safely done as-is)

One thing missing is re-reading the CPU features after the microcode update.

  • I update the ucode too late/early in the booting process
  • Outdated microcode is not the issue for my random-reboot-issue

TODO:

  • Microcode updating only works for AMD 0x15 CPUs, simply because I didn't bother to extend to anything else for now.
  • Not sure how to "package" the ucode blobs so that they're available early in the boot process -- for now I've hexdumped the amd-ucode.bin file, and hardcoded it into the kernel - an ugly workaround for now.

I imagine the vendor specific code would be better placed in src/add-ons/kernel/cpu/x86/amd.cpp (for AMD). Better confirm that before going forward with ucode update.

in reply to:  26 ; comment:27 by nzimmermann, 7 years ago

Replying to korli:

Replying to Nikolas Zimmermann:

I've started to port the Linux microcode updating facilities to Haiku, with partial success. According to the syslog the microcode is now properly updated, to the latest version that's available for my AMD Bulldozer (0x15) -- but the crashes that I observe on x86_64 remain.

Sorry to read that. Is there maybe a pattern when the crashes don't happen? Did you try to blacklist all your unneeded drivers (for instance USB ehci,xhci)? I know the crashes also exist on Intel.

I already disabled all USB busses, the HDA driver, radeon_hd, etc. doesn't help. Also the 4GB RAM limit doesn't help (whereas on x86-gcc2, this allows me to have a stable system). Disabling SMP helps in both cases to get a stable system.

I'm lost what else I could try...

in reply to:  27 ; comment:28 by nzimmermann, 7 years ago

Replying to Nikolas Zimmermann:

I'm lost what else I could try...

I just came across an interesting article: http://blog.stuffedcow.net/2015/08/pagewalk-coherence/ Apparently AMD Bulldozer series is special in the sense that the pagewalk behavior is "non-coherent", compared to most other AMD/Intel CPUs.

Both Intel and AMD manuals state that without invalidation, even if the old entry is not cached in the TLB, the pagewalk may still see the old entry. That is, in the following pseudocode, the second instruction (load) can non-deterministically use either the old mapping or new mapping, and that there must be an invalidation or TLB flush in between to guarantee the new page table entry is visible by the second instruction.

mov [page table], new_mapping
mov eax, [linear address using updated mapping]

I wonder if the Haiku paging code guarantees this. Apparently we only call InvalidatePage(..) when the accessed bit is set.

        if ((oldEntry & X86_64_PTE_ACCESSED) != 0) {
                // Note, that we only need to invalidate the address, if the
                // accessed flags was set, since only then the entry could have been
                // in any TLB.
                InvalidatePage(address);
..

X86VMTranslationMap::InvalidatePage() only records that a page was invalid, and upon the next Flush() the TLB is invalidated, when fInvalidPagesCount > 0.

I'm currently traveling; when I'm back I will try to always call InvalidatePage() in any case, just to see if it helps with my issue. Obviously I'm still 'fishing in the blind' - trying to understand what makes AMD Bulldozer special.
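
The experiment boils down to the following toy model (user-space C++, invented names; only fInvalidPagesCount mirrors the member mentioned above). Invalidations are merely recorded when an entry is cleared and carried out later in Flush(); the alwaysInvalidate switch is the change I want to try.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const uint64_t kPteAccessed = 1ull << 5;	// architectural "accessed" bit

// Toy model of a translation map with deferred TLB invalidation.
class ToyTranslationMap {
public:
	explicit ToyTranslationMap(bool alwaysInvalidate)
		: fAlwaysInvalidate(alwaysInvalidate), fInvalidPagesCount(0),
		  fFlushedPages(0)
	{
	}

	// Called after a page table entry was cleared; oldEntry is the value
	// the entry had before.
	void EntryCleared(uint64_t oldEntry)
	{
		if (fAlwaysInvalidate || (oldEntry & kPteAccessed) != 0)
			fInvalidPagesCount++;	// only recorded, nothing flushed yet
	}

	// Carries out the recorded invalidations. A real kernel would issue
	// invlpg or reload CR3 here and notify the other CPUs.
	void Flush()
	{
		fFlushedPages += fInvalidPagesCount;
		fInvalidPagesCount = 0;
	}

	size_t PendingCount() const { return fInvalidPagesCount; }
	size_t FlushedPages() const { return fFlushedPages; }

private:
	bool fAlwaysInvalidate;
	size_t fInvalidPagesCount;
	size_t fFlushedPages;
};
```

With alwaysInvalidate set to false, an entry whose accessed bit was never set produces no invalidation at all; with true, every cleared entry is flushed, at the cost of more TLB shootdowns.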

in reply to:  28 ; comment:29 by nzimmermann, 7 years ago

Replying to Nikolas Zimmermann:

Replying to Nikolas Zimmermann:

I'm lost what else I could try...

I just came across an interesting article: http://blog.stuffedcow.net/2015/08/pagewalk-coherence/ Apparently AMD Bulldozer series is special in the sense that the pagewalk behavior is "non-coherent", compared to most other AMD/Intel CPUs.

@korli, what do you think about this?

in reply to:  29 comment:30 by korli, 7 years ago

Replying to Nikolas Zimmermann:

@korli, what do you think about this?

I looked at some other implementations, and they don't seem to work differently from ours. This article speaks of page table mapping update. The AMD PDF states that the flag X86_64_PTE_ACCESSED is set when speculation occurs. Interesting information, anyway I suppose you can try whatever would make sense and prove a theory by testing.

comment:31 by waddlesplash, 6 years ago

Milestone: R1/beta1 → Unscheduled

comment:32 by nzimmermann, 6 years ago

I found out a super easy way to trigger the reboots: find / | xargs md5sum (repeat < 5 times)

Out of curiosity I went back to old hrevs. To make a long story short: hrev45681 reproducibly reboots spontaneously, and hrev45225 does not. Unfortunately there are no nightlies in-between, so I will have to check all patches between 45225 .. 45681 that touch the x86_64 kernel and see if I can find the culprit: fun ahead of me.

comment:33 by waddlesplash, 6 years ago

Potentially relevant, from that range: hrev45564, hrev45566, hrev45681, hrev45518 (big), hrev45397 (more use of largepages). (not an exhaustive list.)

As bisecting is probably too difficult here, reverting patches individually may be of more use.

in reply to:  33 comment:34 by nzimmermann, 6 years ago

Replying to waddlesplash:

As bisecting is probably too difficult here, reverting patches individually may be of more use.

I started with the same approach, inspecting the git log between hrev45225 ... hrev45681, identifying problematic patches, etc. - my list is even more complete than yours - but reverting them individually is in some cases problematic :-( I ended up with a kernel that didn't even boot, entering KDL early in the process. Reverting some patches created a lot of subtle fixup work in various places (e.g. removing B_RANDOMIZED_ANY_ADDRESS support..) which took an hour, and once it compiled and booted properly, I could still easily trigger the reboot.

I decided to give up on this, and started again with reproducing the nightly builds from 5 years ago, from an older Linux host system, by utilizing vanilla RHEL6 images in a singularity container, mimicking a standard host system for the cross-compilation from 2013.

I now have the btrev<XY> from 2013 running, and the old hrevs as well. By now, I successfully reproduced the random reboot with a self-compiled hrev45681. My bisect is currently at hrev45560, and this also does not produce random reboots.

--> Narrowed down to hrev45560 .. hrev45681 (from hrev45225 ... hrev45681). Currently I'm building hrev45620, let's see ;_0
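
For the record, the bisection step itself is nothing more than this (a sketch; the function names are made up). hrev45620 is exactly the midpoint of the remaining range hrev45560 .. hrev45681.

```cpp
#include <cassert>

// One bisection step over revision numbers. `good` is the newest revision
// known to be stable, `bad` the oldest revision known to reboot.
inline int
next_revision_to_test(int good, int bad)
{
	assert(good < bad);
	return good + (bad - good) / 2;
}

// Narrows the range according to the result of testing `tested`.
inline void
narrow_range(int& good, int& bad, int tested, bool testedIsStable)
{
	assert(good < tested && tested < bad);
	if (testedIsStable)
		good = tested;
	else
		bad = tested;
}
```

This only converges on the right revision if a "stable" verdict can be trusted, so every candidate needs to be stressed long enough before it is marked as good.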


comment:35 by nzimmermann, 6 years ago

Bah, all unreliable: some hrevs work just fine, then I reboot, and can immediately reproduce the crash :-( Still no conclusion from my side.

comment:36 by waddlesplash, 6 years ago

It would be interesting to see if you can get any of these issues to reproduce on KVM with AMD-V enabled. That way the page tables should be mostly going through the real MMU and not QEMU's MMU emulation, and then if you can trigger triple faults using this it may be much easier to debug.

comment:37 by mmlr, 6 years ago

The issue in #14659 might be relevant here. The instant resets hint at a triple fault, i.e. the double fault handler failed. Due to the bug fixed in https://review.haiku-os.org/#/c/haiku/+/810 the double fault IST would be cleared for most CPUs. The reason for the double fault itself would not be explained by that bug, but if the fix makes the double fault handler work again it might shed some light on the actual cause. It is therefore worth a try to retest this with change 810 applied.

in reply to:  37 comment:38 by korli, 6 years ago

Replying to mmlr:

It is therefore worth a try to retest this with change 810 applied.

About retesting: The 810 change would only help on x86_64, though hrev52701 might help on x86 too.

comment:39 by alpopa, 6 years ago

(Also added this comment to #14887)

Retested hrev53092 on AMD Phenom II x4 955. I can compile from Terminal large projects, and compilation takes several (tens of) minutes with no problem. However, when trying to grep something simple, the system restarts instantly.

Edit: grep worked after that without problem. It may be a sporadic reboot, or some problem in grep (the first time I used its argument without quotes, later I used them) or in the file system.


comment:40 by waddlesplash, 6 years ago

Blocking: 10279 removed

comment:41 by nephele, 5 years ago

It was mentioned to me that this issue might be the same that occurs on my machine. My processor (according to 'About this system') is 4 Processors: AMD FX-Series 3.81 GHz. I have 4GB of RAM ('4079 MiB total, 1 MiB inaccessible').

My Haiku revision is hrev53242.

My system crashed completely twice when running configure in the Haiku tree, as well as once when running gcc (after uninstalling gcc, configure no longer crashed the machine), and also once when running jam -j4 @anyboot-image.

comment:42 by waddlesplash, 5 years ago

While poking at this again, I just ran into this FreeBSD commit: https://github.com/freebsd/freebsd/commit/6cd4f250111b610bf48172c9845d9bf88ea97fca

"Family 10h" is Bulldozer. So ... perhaps this is the cause of our problems here?

comment:43 by waddlesplash, 5 years ago

Blocking: 14770 added

comment:44 by waddlesplash, 5 years ago

Haiku now has an "errata patcher" that does what the above commit from FreeBSD does, fwiw.

comment:45 by waddlesplash, 5 years ago

Blocking: 15988 added

comment:46 by waddlesplash, 4 years ago

Please retest after hrev54510.

comment:47 by waddlesplash, 3 years ago

Blocking: 17216 added

comment:48 by waddlesplash, 3 years ago

Blocking: 14887 added

comment:49 by nipos, 3 years ago

Any progress on this? I've moved from an Intel i7 to an AMD Athlon II x3 machine recently and have been hit by exactly the same error. I had those problems on an AMD A4, too, and hoped that it wouldn't affect the AMD Athlon II as it's an older series, but it unfortunately does. I can reliably reproduce the issue by running "make" without any arguments. When I run it with -j3 (my CPU has three cores) or even only -j1, the chance of getting a crash decreases very much, but I still get them every few hours when working on my application and trying to rebuild it often. I understand that this may not be a priority anymore as everyone is using Ryzen nowadays, but those old AMD CPUs are quite cheap and run Haiku perfectly fast and smooth, so it would be nice to have that fixed.

comment:50 by waddlesplash, 22 months ago

Has anyone here tried booting with ACPI disabled (and none of the other safemode options, except maybe "failsafe video", enabled), and seeing how that affects behavior?

comment:51 by Habbie, 21 months ago

tl;dr: disabling ACPI did not help. disabling SMP did help. Failed to reproduce the issue in AMD-V/KVM for unrelated reasons.

I have two machines that suffer from random reboots. Running -j20 builds makes this happen more often.

The machines are https://www.parkytowers.me.uk/thin/hp/t520/ and https://www.parkytowers.me.uk/thin/hp/t620/ (the quadcore model), so AMD GX-212JC and AMD GX-415GA.

I have tried the following (all with failsafe video on, because otherwise Haiku does not boot at all):

  • disabling ACPI - depending on reboot/powercycle sequence, this either does not boot at all (with weird memory corruption it appears), or when it does boot, the reboots still happen
  • disabling SMP - this works, systems are stable
  • running in KVM with AMD-V on Linux - within an hour, graphics stop working, so I don't have any useful data from this experiment. I might redo it later with a shell on serial, or via sshd.

This is on beta4, with and without a kernel update shipped somewhere in the last two weeks.

I hope to connect serial soon, but I understand we're not optimistic about learning anything there.

comment:52 by Habbie, 21 months ago

(running in 64 bit mode, by the way)

comment:53 by Habbie, 21 months ago

Limiting memory to 4GB did not help. Physically removing 2 of my 6GB did not help.

comment:54 by waddlesplash, 15 months ago

The "4GB Memory Limit" option is currently only effective on 32-bit x86. On 64-bit it does nothing. And what it really does is disable PAE, not merely limit the memory. Possibly we should rename the option and remove it from the bootloader settings for 64-bit systems?

comment:55 by bipolar, 15 months ago

Noticing the symptoms (hangs/freezes, or instant-reset due to triple faults as mmlr mentioned)...

Could errata 670 be at play here?

Quoting from "44739_12h_Rev_Gd.pdf" for easier reference:

Potential Effect on System

For affected instructions that have an implicit or explicit LOCK prefix, a system hang occurs.

For affected instructions that do not have an implicit or explicit LOCK prefix, the processor may present a #PF exception after some of the instruction effects have been applied to the processor state.

No system effect is observed unless the operating system’s page fault handler has some dependency on this interim processor state, which is not the case in any known operating system software. The interim state does not impact program behavior if the operating system resolves the #PF and resumes the instruction. However, this interim state may be observed by a debugger or if the operating system changes the #PF to a program error (for example, a segmentation fault).

Suggested Workaround

System software should set MSRC001_1020[8] = 1b.

This workaround ensures that instructions with an implicit or explicit LOCK prefix do not cause a system hang due to this erratum. However, instructions may still present a #PF after altering architectural state.

The "which is not the case in any known operating system software" made me wonder if they tried on Haiku :-P

Might be worth a try adding:

x86_write_msr(0xc0011020, x86_read_msr(0xc0011020) | ((uint64)1 << 8));

under https://cgit.haiku-os.org/haiku/tree/src/system/kernel/arch/x86/64/errata.cpp#n59 to see if it helps with the freezes, at least?

(or I'm, as always, waaay out of my league here and should just STFU :-D)
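For anyone who wants to check the bit arithmetic of the suggested workaround outside the kernel, here is a minimal userland sketch in C. The MSR number and bit come from the erratum text quoted above; `fake_read_msr()` and `fake_write_msr()` are hypothetical stand-ins for the kernel's `x86_read_msr()` / `x86_write_msr()` and only operate on a plain variable. It shows the read-modify-write only and says nothing about whether the erratum applies to a given CPU.

```c
#include <assert.h>
#include <stdint.h>

/* MSR number and bit position as given in the quoted erratum text. */
#define MSR_C001_1020     0xc0011020u
#define ERRATUM_670_BIT   8

/* Hypothetical stand-ins for the kernel MSR accessors. They act on an
   ordinary variable so the logic can be exercised in userland. */
static uint64_t fake_msr_value;

static uint64_t fake_read_msr(uint32_t msr)
{
    (void)msr;
    return fake_msr_value;
}

static void fake_write_msr(uint32_t msr, uint64_t value)
{
    (void)msr;
    fake_msr_value = value;
}

/* Return `value` with bit 8 set and every other bit left untouched. */
static uint64_t with_erratum_670_bit(uint64_t value)
{
    return value | ((uint64_t)1 << ERRATUM_670_BIT);
}

/* Read-modify-write: only write back if the bit was not already set,
   then check that the bit reads back as set. */
static void apply_erratum_670_workaround(void)
{
    uint64_t old_value = fake_read_msr(MSR_C001_1020);
    uint64_t updated = with_erratum_670_bit(old_value);

    if (updated != old_value)
        fake_write_msr(MSR_C001_1020, updated);

    assert(fake_read_msr(MSR_C001_1020) & ((uint64_t)1 << ERRATUM_670_BIT));
}
```

In a real kernel patch the two stand-ins would be replaced by the actual accessors, and the call would presumably be guarded by a CPU family check.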

comment:56 by waddlesplash, 15 months ago

I think Bulldozer is family 15h, not 12h? But those sorts of errata documents are where such problems might be noted, indeed.

comment:57 by bipolar, 15 months ago

I think Bulldozer is family 15h, not 12h?

Indeed. Erratum 670 applies to both 10h and 12h. I guess the mentions here of issues with Athlon/Phenom II CPUs were what stuck in my mind (as that's the hardware I have, though I have not been affected by this so far).

Off to read the one for Bulldozer then :-D

Last edited 15 months ago by bipolar

comment:58 by waddlesplash, 7 months ago

So, as it turns out, the allocation of double-fault stacks had been completely broken since 2014, so any double-fault would just become a triple-fault. This is now fixed in hrev57747, so please retest after that, and let's see if we at last get an intelligible panic message.

comment:59 by nipos, 6 months ago

I'm currently running hrev57755 and can confirm that for me the problem is completely fixed. I'm running Haiku on an AMD Athlon II CPU, and Haiku always crashed when I put much load on it. I always had to run compile tasks with -j1 to prevent crashes, and with bad luck it sometimes happened anyway. Today I compiled major projects on all cores and have yet to experience any crash. Thank you very much for the fix, that makes Haiku a lot more enjoyable here!

Edit: Forget what I said. I don't know why it worked so well for a long time yesterday, but today I already had 3 crashes again without doing anything different.

Last edited 6 months ago by nipos

comment:60 by waddlesplash, 5 months ago

Habbie reports on IRC that hrev57820 survived for about 15 hours of continuous -j4 compile jobs on his Jaguar-based machine before it spontaneously rebooted (his previous record was only about 3.5 hours, it appears.) So, it's possible that the APIC changes in recent hrevs improved things here.

comment:61 by nipos, 5 months ago

I just updated to hrev57820, which seems to be the latest nightly. Then I ran a rather large compile task on all three cores again, and after about half an hour I got the first crash once again. For me there doesn't seem to be an improvement here.

comment:62 by Habbie, 5 months ago

On a second attempt with hrev57820 (I'm doing -j40, by the way, although the GX-415GA has 4 cores) it lasted around 32 hours. So I do see the improvement, possibly from unrelated changes that just reduced the likelihood of the triple fault being triggered?

comment:63 by waddlesplash, 5 months ago

The write coalescing changes significantly reduce the number of I/O interrupts, so it's possible that's also a factor here.

comment:64 by dovsienko, 3 months ago

For reference, here is one more way to reproduce this problem.

The hardware is an old HP ProLiant MicroServer N40L with AMD Turion II Neo N40L (1.5GHz, 2 CPU cores, https://github.com/OscarL/amd_temp reads below 55degC under full load) and 4GB ECC RAM (memtest86 passes). The software is R1/beta5 with all current updates (hrev57937+114).

The problem reliably reproduces when running a build matrix of tcpdump. This usually requires between 10 and 20 minutes. It does not reproduce when both CPU cores are loaded 100% using stress-ng only. It reproduces both with and without active use of RAMFS, both on the local console and using an SSH session. The steps to reproduce are as follows:

pkgman install cmake llvm18_clang
git clone https://github.com/the-tcpdump-group/libpcap/
git clone https://github.com/the-tcpdump-group/tcpdump/
export MAKEFLAGS="-j $(nproc)"
export TMPDIR=/boot/system/var/shared_memory
cd tcpdump
./build_matrix.sh

Changing BIOS settings to enable one CPU core only seems to resolve the problem: the script has been running in a loop for hours without a problem.

comment:65 by dovsienko, 3 months ago

Using hrev58146 on the same AMD PC with SMP enabled, the problem still reproduces as before: all 5 attempts to run the build matrix ended in a reboot, after 1, 30, 10, 35 and 21 minutes respectively. In the first case the reboot occurred one minute into the first build while running Autoconf, so it looks like parallel make is the most common, but not the only, way to reproduce the problem.

comment:66 by dovsienko, 3 months ago

Another occurrence on hrev58149 whilst running Autoconf in a different project:

checking for strchr... yes
checking for strdup... yes
checking for strstr... yes
checking for shmget... no
checking for atexit... yes
checking for sendto... no
checking for sendto in -lnetwork... yes
checking size of unsigned short int... 2
checking size of unsigned long int... 8
unsigned long is NOT 4 bytes... hmmm...
checking size of unsigned int... 4
checking for a BSD-compatible install... /bin/install -c
checking whether build environment is sane... yes
client_loop: send disconnect: Broken pipe

The only activity on the host after a normal boot a couple hours earlier was installing a couple packages and trying to build a tiny project 20 or 30 times. It looks like the problem is not that difficult to trigger, but I have not found a way to do it using stress-ng. Perhaps the root cause has more to do with some state that accumulates in the OS after a specific action rather than a load-related race condition.

comment:67 by waddlesplash, 2 months ago

It would be interesting to test if anything is different after the bootloader/kernel early memory allocation changes here. If not, a syslog capture (containing the memory ranges information) from x86_64 would be useful.

comment:68 by dovsienko, 2 months ago

Using hrev58259, the problem reproduced in 5 minutes after booting the PC and starting the software build.

by dovsienko, 2 months ago

Attachment: syslog_hrev58259.txt added

/var/log/syslog for hrev58259 (sanitised)

comment:69 by waddlesplash, 6 days ago

Since the problems seem to happen most frequently in shells and other things that run fork() a lot, perhaps that's somehow related. I've uploaded https://review.haiku-os.org/c/haiku/+/8684 which may affect things here.
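To probe the fork() hypothesis in isolation, a fork-heavy load without any compiler or shell involved could look like the minimal C sketch below. The function name `fork_loop()` and the scratch buffer are made up for illustration, and it is an untested assumption that a loop like this triggers the reboots at all; it only uses standard POSIX calls (`fork`, `waitpid`, `_exit`).

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Page-sized scratch area. The parent touches it first so that the
   child's write typically causes a copy-on-write fault. */
static volatile char scratch[4096];

/* Fork `count` short-lived children one after another and wait for
   each of them. Returns the number of children that exited with
   status 0, or -1 if fork() itself failed. */
static int fork_loop(int count)
{
    int succeeded = 0;

    assert(count >= 0);
    scratch[0] = 1;

    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;

        if (pid == 0) {
            /* Child: write to the shared page, then exit at once. */
            scratch[0] = (char)i;
            _exit(0);
        }

        /* Parent: reap the child before forking the next one. */
        int status = 0;
        if (waitpid(pid, &status, 0) == pid
            && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            succeeded++;
    }

    return succeeded;
}
```

Calling `fork_loop()` with a large count from a small `main()`, with one instance per CPU core, would approximate what a busy configure script does in terms of process creation, while leaving out the disk and compiler activity.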
