Opened 11 days ago

Last modified 5 days ago

#14082 new bug

PAE paging leads to random reboots/freezing using AMD-FX 6300 6-core processor

Reported by: Nikolas Zimmermann
Owned by: Nobody
Priority: normal
Milestone: R1/beta1
Component: System/Kernel
Version: R1/Development
Keywords: vm
Cc:
Blocked By:
Blocking: #10279
Has a Patch: no
Platform: x86

Description

My system consists of an AMD-FX 6300 6-core machine with 16 GB DDR3 RAM, and Haiku is installed as the primary OS. The machine randomly reboots or, more rarely, freezes after a few minutes when the system is under heavy load (compiling a large software project using make -j 6).

I replaced the motherboard, CPU, RAM, power supply, etc., but the system remained unstable. Finally, after playing with the safe mode settings, I found a workaround: enabling 4gb_memory_limit. (This was a suggestion from korli on ticket #10279.)

In src/system/kernel/arch/x86/arch_vm_translation_map.cpp a decision is made based on this setting. PAE paging is disabled, and 32 bit paging is enabled. The paging method is switched from X86PagingMethodPAE to X86PagingMethod32Bit.
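To make the decision described above concrete, here is a minimal sketch. It is not the code from arch_vm_translation_map.cpp: the enum, the function name and both parameters are hypothetical, and the real kernel presumably takes more inputs into account than these two flags.

```cpp
#include <cassert>

// Hypothetical stand-ins for the two paging implementations named in this
// ticket (X86PagingMethod32Bit and X86PagingMethodPAE).
enum class PagingMethod { k32Bit, kPAE };

// Hypothetical helper: PAE is only chosen when the CPU supports it and the
// 4gb_memory_limit safe mode setting is NOT enabled. The real condition in
// the kernel may involve further checks.
constexpr PagingMethod
ChoosePagingMethod(bool cpuSupportsPAE, bool fourGBMemoryLimit)
{
	return (cpuSupportsPAE && !fourGBMemoryLimit)
		? PagingMethod::kPAE
		: PagingMethod::k32Bit;
}

// Compile-time check of the case from this ticket: a PAE-capable CPU with
// the safe mode setting enabled falls back to 32 bit paging.
static_assert(ChoosePagingMethod(true, true) == PagingMethod::k32Bit,
	"4gb_memory_limit should select 32 bit paging");
```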

Using the latter paging method, my system is stable: no random reboots, no random binary crashes, nothing. I'm confident that the 4 GB memory limit itself is not what cures the problem, as I can also reproduce the reboots with only a single 4 GB RAM stick installed --> it must be the paging.

NOTE: Other people reported that the 4gb_memory_limit does not help with the random binary crashes (as stated on ticket #10279, by kallisti5), but in my case it does make the difference. Apparently in earlier days, the 4gb_memory_limit did NOT deactivate PAE paging, but nowadays it does.

Change History (17)

comment:1 Changed 10 days ago by korli

One big difference between 32bit and PAE is the size of page_directory_entry and pae_page_table_entry (32 vs 64 bits). Now, these entries on x86 are not updated atomically, which might or might not be an issue (I heard that PAE usually requires cmpxchg64). It's only a hunch.
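To illustrate the hunch, here is a small sketch of what a non-atomic update of a 64 bit entry could look like to an observer on a 32 bit CPU, where a plain 64 bit store may be carried out as two 32 bit stores. All names here are made up for illustration and none of this is kernel code; std::atomic stands in for whatever atomic primitive the kernel would use.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A 64 bit entry viewed as the two 32 bit halves that a 32 bit CPU may
// write with two separate store instructions.
struct SplitEntry {
	uint32_t low;	// bits 0-31: flags and low address bits
	uint32_t high;	// bits 32-63: high address bits
};

constexpr uint64_t
Combine(SplitEntry entry)
{
	return (uint64_t(entry.high) << 32) | entry.low;
}

// Value an observer could see if it read the entry after only the low half
// of newEntry had been stored: new flags and low bits, stale high bits.
constexpr uint64_t
TornValue(uint64_t oldEntry, uint64_t newEntry)
{
	return Combine(SplitEntry{uint32_t(newEntry), uint32_t(oldEntry >> 32)});
}

// Atomic alternative: the whole 64 bit value is replaced in one step and
// the previous value is returned, so no mixed state is ever visible.
inline uint64_t
SetEntryAtomically(std::atomic<uint64_t>& entry, uint64_t newEntry)
{
	return entry.exchange(newEntry);
}
```

The torn value is neither the old nor the new entry, which is why a 64 bit atomic exchange or compare-and-swap is commonly suggested for PAE entries.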

comment:2 in reply to:  1 ; Changed 10 days ago by Nikolas Zimmermann

Replying to korli:

One big difference between 32bit and PAE is the size of page_directory_entry and pae_page_table_entry (32 vs 64 bits). Now, these entries on x86 are not updated atomically, which might or might not be an issue (I heard that PAE usually requires cmpxchg64). It's only a hunch.

Thanks korli for commenting. Forgive my ignorance, this topic is new to me, but isn't setting page table entries done atomically?

/*static*/ inline pae_page_table_entry
X86PagingMethodPAE::SetPageTableEntry(pae_page_table_entry* entry,
        pae_page_table_entry newEntry)
{
        return atomic_get_and_set64((int64*)entry, newEntry);
}

comment:3 in reply to:  2 ; Changed 10 days ago by korli

Replying to Nikolas Zimmermann:

Thanks korli for commenting. Forgive my ignorance, this topic is new to me, but isn't setting page table entries done atomically?

/*static*/ inline pae_page_table_entry
X86PagingMethodPAE::SetPageTableEntry(pae_page_table_entry* entry,
        pae_page_table_entry newEntry)
{
        return atomic_get_and_set64((int64*)entry, newEntry);
}

This one is used only when unmapping. For the mapping, see here

/*static*/ void
X86PagingMethodPAE::PutPageTableInPageDir(pae_page_directory_entry* entry,
	phys_addr_t physicalTable, uint32 attributes)
{
	*entry = (physicalTable & X86_PAE_PDE_ADDRESS_MASK)
		| X86_PAE_PDE_PRESENT
		| X86_PAE_PDE_WRITABLE
		| X86_PAE_PDE_USER;
		// TODO: We ignore the attributes of the page table -- for compatibility
		// with BeOS we allow having user accessible areas in the kernel address
		// space. This is currently being used by some drivers, mainly for the
		// frame buffer. Our current real time data implementation makes use of
		// this fact, too.
		// We might want to get rid of this possibility one day, especially if
		// we intend to port it to a platform that does not support this.
}

It's used here. Then compare with 64bit.

comment:4 Changed 10 days ago by korli

Cc: ingo_weinhold@… axeld@… kallisti5@… removed

comment:5 Changed 10 days ago by Nikolas Zimmermann

NOTE: The random binary crashes are still present, even though the likelihood is decreased when disabling PAE paging. The link to bug #10279 should be removed.

Last edited 10 days ago by Nikolas Zimmermann

comment:6 in reply to:  3 Changed 10 days ago by Nikolas Zimmermann

Replying to korli:

This one is used only when unmapping. For the mapping, see here

Okay, I'll give it a try, thanks for the suggestions.

comment:7 Changed 9 days ago by Alexander von Gluck

What's odd is that this issue seems completely limited to AMD Bulldozer systems (MBR/BIOS booted).

I upgraded my desktop to Ryzen, and the few times I've successfully booted with EFI I haven't seen the issue. *however*... MBR/BIOS boot doesn't work on my Ryzen (#13370), so maybe related?

comment:8 Changed 8 days ago by korli

My suggestions as a patch https://review.haiku-os.org/#/c/120/

comment:9 in reply to:  8 Changed 8 days ago by Nikolas Zimmermann

Replying to korli:

My suggestions as a patch https://review.haiku-os.org/#/c/120/

Thanks korli, you were quicker than me :-(

My attempt was to simply call atomic_get_and_set64 in PutPageTableInPageDir(), but your cleanup is indeed nicer. Unfortunately I couldn't test it yet - due to #13980, which currently makes it impossible for me to compile Haiku.
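For reference, the idea of building the page directory entry first and then publishing it with one atomic exchange could be sketched as follows. This is a standalone illustration, not the patch under review: std::atomic stands in for the kernel's atomic_get_and_set64, the function name is hypothetical, and the constants are stand-ins that follow the architectural PAE layout (present, writable and user in bits 0 to 2) rather than the kernel's own X86_PAE_PDE_* definitions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Stand-in constants (assumptions); the kernel headers define the real ones.
const uint64_t kPdePresent = 1ULL << 0;
const uint64_t kPdeWritable = 1ULL << 1;
const uint64_t kPdeUser = 1ULL << 2;
const uint64_t kPdeAddressMask = 0x000ffffffffff000ULL;

// Hypothetical atomic variant of a PutPageTableInPageDir()-style function:
// compute the complete entry, then replace the old one in a single step.
// Returns the previous entry.
inline uint64_t
PutPageTableInPageDirAtomic(std::atomic<uint64_t>& entry,
	uint64_t physicalTable)
{
	// A page table has to start on a 4 KiB page boundary.
	assert((physicalTable & 0xfff) == 0);

	const uint64_t newEntry = (physicalTable & kPdeAddressMask)
		| kPdePresent
		| kPdeWritable
		| kPdeUser;
	return entry.exchange(newEntry);
}
```

Whether the non-atomic store is actually behind the crashes is a separate question; the sketch only shows what the atomic variant looks like.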

I'm trying waddlesplash's attempt to fix it, together with your patch - hopefully I'll have a new build available in the next few hours and can confirm whether the PAE problem is fixed or not.

comment:10 Changed 7 days ago by Nikolas Zimmermann

New kernel is compiled, and running, let's see if I can still reproduce problems with PAE enabled.

comment:11 in reply to:  10 Changed 7 days ago by Nikolas Zimmermann

Replying to Nikolas Zimmermann:

New kernel is compiled, and running, let's see if I can still reproduce problems with PAE enabled.

Unfortunately it crashed again, in the same way, during make -j6. I've captured the syslog using a serial cable, but it's not informative I fear:

...
bfs: bfs_rename:1145: No such file or directory
Last message repeated 8 times.
slab memory manager: created area 0x93801000 (124296)
slab memory manager: created area 0x94001000 (124297)
bfs: bfs_rename:1145: No such file or directory
hda: Unsolicited response: 00000080/00000010
hda: sensed pin widget 27, 1
hda: Unsolicited response: 00000000/00000010
hda: sensed pin widget 27, 0
slab memory manager: created area 0x94801000 (167682)
slab memory manager: created area 0x95001000 (1169484)
hda: Unsolicited response: 00000080/00000010
hda: sensed pin widget 27, 1
hda: Unsolicited response: 00000000/00000010
hda: sensed pin widget 27, 0
slab memory manager: created area 0x95801000 (1365358)

comment:12 Changed 7 days ago by korli

Thanks for the test, it would have been surprising given the changes involved. Is it with USB booting?

comment:13 in reply to:  12 Changed 7 days ago by Nikolas Zimmermann

Replying to korli:

Thanks for the test, it would have been surprising given the changes involved. Is it with USB booting?

No, Haiku is installed on a SATA HDD on my machine.

comment:14 Changed 6 days ago by Nikolas Zimmermann

I've analyzed this further and checked that all locking and CPU pinning is done correctly, by comparing with the 64bit implementation. There is only one really interesting difference: X86VMTranslationMapPAE::QueryInterrupt(). In the 64bit paging implementation, QueryInterrupt() simply calls Query(), whereas the PAE code has a dedicated implementation. The Query() code for both 64bit and PAE enforces thread CPU pinning, but the QueryInterrupt() code for PAE doesn't.

As I said before, I'm no expert in this area and can't judge whether this is problematic or not, but I wanted to mention this difference. I've observed that I always see "hda: Unsolicited response" in the syslog before the machine crashes, and that message comes from the HDA audio driver's IRQ handling.

What do the experts think? Could a missing ThreadCPUPinner in the QueryInterrupt() function cause such problems?
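For readers unfamiliar with the pattern, a scoped pinner is an RAII guard: it pins the thread in its constructor and unpins it in its destructor, so the thread cannot migrate to another CPU while the guard is alive. The sketch below is a mock with hypothetical names (MockThread, ScopedCPUPinner, QueryWhilePinned) and is not the kernel's ThreadCPUPinner; it only shows the shape of the pattern.

```cpp
#include <cassert>

// Hypothetical stand-in for a kernel thread structure. The counter models
// "how many active guards currently pin this thread".
struct MockThread {
	int pinnedToCPU = 0;
};

// RAII guard: pins on construction, unpins on destruction.
class ScopedCPUPinner {
public:
	explicit ScopedCPUPinner(MockThread& thread)
		: fThread(thread)
	{
		++fThread.pinnedToCPU;
	}

	~ScopedCPUPinner()
	{
		--fThread.pinnedToCPU;
	}

	ScopedCPUPinner(const ScopedCPUPinner&) = delete;
	ScopedCPUPinner& operator=(const ScopedCPUPinner&) = delete;

private:
	MockThread& fThread;
};

// Shape of a Query()-style function: the guard covers the whole lookup, so
// any per-CPU state used inside stays valid until the function returns.
inline bool
QueryWhilePinned(MockThread& thread)
{
	ScopedCPUPinner pinner(thread);
	// ... the page table lookup would happen here ...
	return thread.pinnedToCPU > 0;
}
```

A function without such a guard is only safe if something else already keeps the thread on its CPU for the duration of the lookup; whether that holds for QueryInterrupt() is exactly the open question.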

comment:15 Changed 6 days ago by korli

No idea about QueryInterrupt(). About the HDA audio driver, you could blacklist it to check whether it could be a source of problems. This can actually be done with a few drivers. A warning though: audio or network drivers run continuously and tend to exercise the system; that doesn't mean that they are at fault (as with the random crashes).

comment:16 Changed 5 days ago by Nikolas Zimmermann

My QueryInterrupt changes didn't help - I was on the wrong track, as disabling the HDA driver doesn't help either. There are still random hangs without any indication in the syslog of what went wrong.

I tried to enable the PAE tracing, but it takes hours to boot into Haiku with that. I wish I could enable it only after a certain point in time.

comment:17 Changed 5 days ago by Nikolas Zimmermann

I've read through all related closed and open PAE tickets, and came across the memeater tool.

Using several memeater processes at the same time, each allocating 1000MB 100 times, I can trigger the following page fault (log obtained from minicom on a second machine via serial cable):

[2018-04-18 10:32:07] vm_soft_fault: va 0x1b900000 not covered by area in address space
[2018-04-18 10:32:07] vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x1b900100, ip 0x801470b3, write 1, user 0, thref
[2018-04-18 10:32:07] PANIC: vm_page_fault: unhandled page fault in kernel space at 0x1b900100, ip 0x801470b3
[2018-04-18 10:32:07] 
[2018-04-18 10:32:07] Welcome to Kernel Debugging Land...
[2018-04-18 10:32:07] Thread 831 "memeater.c" running on CPU 4
[2018-04-18 10:32:07] stack trace for thread 831 "memeater.c"
[2018-04-18 10:32:07]     kernel stack: 0x822c9000 to 0x822cd000
[2018-04-18 10:32:07]       user stack: 0x70363000 to 0x71363000
[2018-04-18 10:32:07] frame               caller     <image>:function + offset
[2018-04-18 10:32:07]  0 822ccbc4 (+  32) 8014bfde   <kernel_x86> arch_debug_stack_trace + 0x12
[2018-04-18 10:32:07]  1 822ccbe4 (+  16) 800a9183   <kernel_x86> stack_trace_trampoline(NULL) + 0x0b
[2018-04-18 10:32:07]  2 822ccbf4 (+  12) 8013db72   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
[2018-04-18 10:32:07]  3 822ccc00 (+  48) 800aac0c   <kernel_x86> debug_call_with_fault_handler + 0x60
[2018-04-18 10:32:07]  4 822ccc30 (+  64) 800a9397   <kernel_x86> kernel_debugger_loop(0x801904b7 "PANIC: ", 0x801a71a0 "vm_page_fault: unhax
[2018-04-18 10:32:07] ", 0x822cccdc "", int32: 4) + 0x20f
[2018-04-18 10:32:07]  5 822ccc70 (+  48) 800a973b   <kernel_x86> kernel_debugger_internal(0x801904b7 "PANIC: ", 0x801a71a0 "vm_page_fault: x
[2018-04-18 10:32:07] ", 0x822cccdc "", int32: 4) + 0x77
[2018-04-18 10:32:07]  6 822ccca0 (+  48) 800aaf8e   <kernel_x86> panic + 0x3a
[2018-04-18 10:32:07]  7 822cccd0 (+ 144) 80122ac9   <kernel_x86> vm_page_fault + 0x13d
[2018-04-18 10:32:07]  8 822ccd60 (+  80) 8014d850   <kernel_x86> x86_page_fault_exception + 0x1ec
[2018-04-18 10:32:07]  9 822ccdb0 (+  12) 8014046c   <kernel_x86> int_bottom + 0x3c
[2018-04-18 10:32:07] kernel iframe at 0x822ccdbc (end = 0x822cce0c)
[2018-04-18 10:32:07]  eax 0x4060008     ebx 0x88bd1afc     ecx 0x1b900100  edx 0xf1759018
[2018-04-18 10:32:07]  esi 0x80000001    edi 0x4b1b5000     ebp 0x822cce54  esp 0x822ccdf0
[2018-04-18 10:32:07]  eip 0x801470b3 eflags 0x10206   
[2018-04-18 10:32:07]  vector: 0xe, error code: 0x2
[2018-04-18 10:32:07] 10 822ccdbc (+ 152) 801470b3   <kernel_x86> X86VMTranslationMapPAE<0xdff84818>::UnmapArea(VMArea*: 0xdfc67270, false) 7
[2018-04-18 10:32:07] 11 822cce54 (+  64) 8011f2db   <kernel_x86> delete_area(VMAddressSpace*: 0xdffad360, VMArea*: 0xdfc67270, false) + 0xff
[2018-04-18 10:32:07] 12 822cce94 (+ 144) 8011f5cf   <kernel_x86> vm_delete_area + 0x203
[2018-04-18 10:32:07] 13 822ccf24 (+  32) 80126d8a   <kernel_x86> _user_delete_area + 0x1e
[2018-04-18 10:32:07] 14 822ccf44 (+ 100) 8014066f   <kernel_x86> handle_syscall + 0xdc
[2018-04-18 10:32:07] user iframe at 0x822ccfa8 (end = 0x822cd000)
[2018-04-18 10:32:07]  eax 0xc1          ebx 0x115d138      ecx 0x713622ac  edx 0x620b7114
[2018-04-18 10:32:07]  esi 0x7136353c    edi 0x7136354c     ebp 0x713622d8  esp 0x822ccfdc
[2018-04-18 10:32:07]  eip 0x620b7114 eflags 0x3202    user esp 0x713622ac
[2018-04-18 10:32:07]  vector: 0x63, error code: 0x0
[2018-04-18 10:32:07] 15 822ccfa8 (+   0) 620b7114   <commpage> commpage_syscall + 0x04
[2018-04-18 10:32:07] 16 713622d8 (+  80) 00788be4   <memeater.c> main + 0x214
[2018-04-18 10:32:07] 17 71362328 (+  48) 007888b7   <memeater.c> _start + 0x5b
[2018-04-18 10:32:07] 18 71362358 (+  64) 00bfb000   </boot/system/runtime_loader@0x00be9000> <unknown> + 0x12000
[2018-04-18 10:32:07] 19 71362398 (+   0) 620b7250   <commpage> commpage_thread_exit + 0x00

Not sure if it's at all related to my random system freezes, but at least there's a hint something is still wrong in the PAE code.
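As a reading aid for the kernel iframe in the log ("vector: 0xe, error code: 0x2"), the low bits of the x86 page fault error code can be decoded as sketched below. The struct and function names are made up for illustration; the bit meanings are the architectural ones (bit 0: protection violation vs. page not present, bit 1: write access, bit 2: user mode).

```cpp
#include <cassert>
#include <cstdint>

// Decoded view of the three low bits of an x86 page fault error code.
struct PageFaultInfo {
	bool protectionViolation;	// bit 0: false means the page was not present
	bool write;					// bit 1: true means the access was a write
	bool user;					// bit 2: true means the fault came from user mode
};

constexpr PageFaultInfo
DecodePageFaultError(uint32_t errorCode)
{
	return PageFaultInfo{
		(errorCode & 0x1) != 0,
		(errorCode & 0x2) != 0,
		(errorCode & 0x4) != 0
	};
}
```

For the error code 0x2 seen here this gives a write to a non-present page from kernel mode, which matches the "write 1, user 0" in the vm_page_fault line.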
