#5956 closed bug (fixed)
PPC boot fails on double lock of mutex
Reported by: | stevenh | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | Cc: | planche2k@… | |
Blocked By: | Blocking: | #1048 | |
Platform: | PowerPC |
Description
On revision hrev36760, after the fix by Ingo in hrev36748 to disable the debug syslog feature for openfirmware, the booting now fails when creating a VMNullCache for 'iospace'.
Loading ELF checking for memory... 0: base = 0x00000000, size = 268435456 1: empty region total physical memory = 256 MB suggested page table size = 2097152 need new page table, size = 2097152! new table at: 0x00200000 MSR: 0x00003030 found 25 translations found exception handlers! found page table! virt_allocated: 20 phys_allocated: 5 phys_memory: 1 found CPU: /cpus/PowerPC,G4 CPU clock frequency: 1249999995 bus clock frequency: 166483989 time base frequency: 41620997 mmu_alloc: va 0x80084000, pa 0x00003000, size 16384 platform_init_heap() mmu_alloc: va 0x80088000, pa 0x00007000, size 65536 heap base = 0x80088000 heap top = 0x80098000 adding args: '' Welcome to the Haiku boot loader! boot path = "/pci@f4000000/ethernet@f:192.168.1.35,boot_loader_openfirmware" boot type = network add_partitions_for(0x80088234, mountFS = no) add_partitions_for(fd = 0, mountFS = no) 0x80088280 Partition::Partition 0x80088280 Partition::Scan() check for partitioning_system: Amiga Partition Map check for partitioning_system: Intel Partition Map check for partitioning_system: Intel Extended Partition check for partitioning_system: Apple Partition Map 0x80088280 Partition::_Mount check for file_system: BFS Filesystem load kernel... mmu_alloc: va 0x80098000, pa 0x00400000, size 1863680 mmu_alloc: va 0x8025f000, pa 0x00017000, size 102400 mmu_alloc: va 0x80278000, pa 0x00030000, size 192512 mmu_alloc: va 0x802a7000, pa 0x0005f000, size 32768 mmu_alloc: va 0x802af000, pa 0x00067000, size 45056 mmu_alloc: va 0x802ba000, pa 0x00072000, size 45056 mmu_alloc: va 0x802c5000, pa 0x0007d000, size 16384 mmu_alloc: va 0x802c9000, pa 0x00081000, size 229376 mmu_alloc: va 0x80301000, pa 0x000b9000, size 32768 mmu_alloc: va 0x80309000, pa 0x000c1000, size 28672 mmu_alloc: va 0x80310000, pa 0x000c8000, size 8192 mmu_alloc: va 0x80312000, pa 0x000ca000, size 8192 mmu_alloc: va 0x80314000, pa 0x000cc000, size 102400 mmu_alloc: va 0x8032d000, pa 0x000e5000, size 32768 mmu_alloc: va 0x80335000, pa 0x000ed000, size 12288 mmu_alloc: va 0x80338000, pa 0x000f0000, size 61440 mmu_alloc: va 0x80347000, pa 0x000ff000, size 8192 mmu_alloc: va 0x80349000, pa 0x00140000, size 12288 mmu_alloc: va 0x8034c000, pa 0x00143000, size 12288 mmu_alloc: va 0x8034f000, pa 0x00146000, size 32768 mmu_alloc: va 0x80357000, pa 0x0014e000, size 102400 mmu_alloc: va 0x80370000, pa 0x00167000, size 8192 mmu_alloc: va 0x80372000, pa 0x005c7000, size 962560 mmu_alloc: va 0x8045d000, pa 0x00169000, size 40960 mmu_alloc: va 0x80467000, pa 0x00173000, size 32768 mmu_alloc: va 0x8046f000, pa 0x0017b000, size 16384 mmu_alloc: va 0x80473000, pa 0x0017f000, size 16384 mmu_alloc: va 0x80477000, pa 0x00183000, size 20480 mmu_alloc: va 0x8047c000, pa 0x00188000, size 16384 mmu_alloc: va 0x80480000, pa 0x0018c000, size 110592 mmu_alloc: va 0x8049b000, pa 0x001a7000, size 32768 mmu_alloc: va 0x804a3000, pa 0x001af000, size 86016 mmu_alloc: va 0x804b8000, pa 0x001c4000, size 24576 kernel entry at 0x8010e2dc kernel stack top: 0x80088000 Welcome to kernel debugger output! Haiku revision: 36760 heap_add_area: area -1 added to small heap 0x85000000 - usable range 0x8500d000 - 0x85800000 heap_add_area: area -1 added to medium heap 0x85800000 - usable range 0x85801000 - 0x85ccc000 heap_add_area: area -1 added to large heap 0x85cccccc - usable range 0x85cce000 - 0x86000000 reserve_boot_loader_ranges(): Skipping range: 0x00000000, 12288 reserve_boot_loader_ranges(): Skipping range: 0x00102000, 180224 reserve_boot_loader_ranges(): Skipping range: 0x0012f000, 45056 reserve_boot_loader_ranges(): Skipping range: 0x00200000, 2097152 PANIC: _mutex_lock(): double lock of 0x0000001c by thread 0 Welcome to Kernel Debugging Land... Running on CPU 0 kdebug >
I'm keen to see Haiku booting on PPC and would love any help regarding this issue and a bit of background as to what might be going wrong.
Is '1c' a valid mutex ID? Does that mean there are 28 mutexs already in operation? Does openfirmware/ppc allow the creation of null caches? Am I running out of RAM? (my eMac 1.25ghz only had 256mb right now)
Meanwhile, I cannot use the debugger as I have no keyboard at the kdebug prompt. I'll work on this later... but if I could get an idea as to what commands to run, then I'll hard code them to execute after the debugger prompt appears.
Attachments (1)
Change History (12)
comment:1 by , 14 years ago
comment:2 by , 14 years ago
Cc: | added |
---|---|
Version: | R1/alpha1 → R1/Development |
comment:3 by , 14 years ago
Right, I've been trying to make sense of this... and I miss my push/pop stack calls! I've commented out the code from arch_debug.cpp and moved to arch_asm.S; since it's mentioned that this should be implemented in assembler.
I've tried to understand the basic idea from x86 (http://dev.haiku-os.org/browser/haiku/trunk/src/system/kernel/arch/x86/arch_x86.S) and came up with the following. My goal has just been to call the stack_trace method... and implement the fault handling logic once i've conquered the call. It seems that either the null pointer is killing me, or I'm not passing in parameters properly:
//call a debugging command but make sure that we have a fault handler handy? // \param cpu details on the current cpu this is on (smp? maybe?) [%r3?] // \param jumpBuffer Buffer to be used for setjmp()/longjmp(). [%r4?] // \param function The function to be called. [%r5?] // \param parameter The parameter to be passed to the [%r6?] // function to be called. FUNCTION(arch_debug_call_with_fault_handler): mlr %r3 //move the jump target to LR mtctr %r5 //r5 contains our debug call li %r3, 0x1 // argc = 1? mr %r4, %r6 // argv = r6? .. should contain the params to pass to it bctr blr
I attempted two blr calls, instead of bctr and blr respectively, but the latter seems to work a little better... unfortunately it ends up with a 'circular stack' error.
The output is:
stack trace for thread 0x0 --- kernel stack: 0x00000000 to 0x00000000 frame caller <image>:function + offset 80087a90 (+1682632) 00000002 80222758 (+ 0) 0000002 circular stack frame: x80222758
Meanwhile: kernel entry: 0x8010dfd0 kernel stack top at: 0x80088000
Apart from an easy answer, I'd really love to get my head around this.. I'm continuing reading on PPC ASM howtos/manuals/tutorials, but not many people are as low level as this.
Any help is appreciated.
comment:4 by , 14 years ago
Although it's futile.. my code snippet above should read:
... FUNCTION(arch_debug_call_with_fault_handler): mtlr %r4 //move the jump target to LR mtctr %r5 //r5 contains our debug call ...
comment:5 by , 14 years ago
To give a bit of background what this is all about: In the kernel debugger we have the problem that many commands have to examine/use/print kernel data structures and that those structures can be corrupt or that e.g. a pointer specified by the user is not correct. So it's not unlikely that invalid memory is accessed, triggering a page fault. The page fault handler normally checks whether the accessed address is OK and, if so, maps the underlying page respectively. If the access is invalid and happened in userland, the respective team is crashed (or gets a SIGSEGV, if a handler is installed for the signal). If the access happened in the kernel, the kernel panics.
Since sometimes the kernel has to access potentially unsafe user address, there are special functions that do that (user_memcpy(), user_strlcpy()). Before accessing the memory, they set a fault handler program address for the current thread (thread::fault_handler), which the page fault handler returns to when the memory access was illegal. This way the functions can simply return an error code in such a case. This only really works because the functions are fully written in assembly (cf. e.g. arch_cpu_user_memcpy() for x86 -- for PPC it is written in C and probably doesn't work correctly) and thus have full control over all registers.
In the kernel debugger things are a bit more complicated. Using something like user_memcpy() for every single memory access would be very tedious. The alternative used is debug_call_with_fault_handler() which calls the given function after installing a fault handler similar to arch_cpu_user_memcpy() (just using the per-cpu instead of the thread structure). Unlike arch_cpu_user_memcpy() it does not have full control over the registers though, since the page fault can happen anywhere in the called function. To work around that problem setjmp() is used to store the register context. If the fault handler is called, it is restored via longjmp().
So more concretely, this is how things work:
- debug_call_with_fault_handler() (architecture independent):
- Calls setjmp() to store the register state.
- Then calls arch_debug_call_with_fault_handler().
- arch_debug_call_with_fault_handler():
- Stores the jump buffer pointer (which is possibly needed later) on the stack.
- Sets the CPU's fault handler.
- Also sets the CPU's fault_handler_stack_pointer. On x86 this is what the page fault handler sets register ebp to when the fault handler is called.
- Calls the given function.
- If the function caused a page fault, the control flow continues at the fault handler code. The only reliable register at this point is ebp. The function retrieves the jump buffer pointer from the stack invokes longjmp() to restore all registers and return to debug_call_with_fault_handler().
The PPC version of arch_debug_call_with_fault_handler() has to do the same, just in a PPC specific manner, obviously. The stack pointer is r1, so that one has to be saved in cpu_ent::fault_handler_stack_pointer. Function parameter passing happens through r3-r10, for function return values r3 (and r4 for a 64 bit value) is used.
You'll also have to adjust the page fault part of ppc_exception_entry(), since it doesn't support the cpu_ent::fault_handler[_stack_pointer] for the kernel debugger yet. Cf. the x86 page fault handler.
Your function snippet looks quite incomplete yet. I also don't see where you've got the argc/argv part from -- neither the called function nor longjmp() take parameters like this.
Also note, that CPU_ENT_fault_handler and CPU_ENT_fault_handler_stack_pointer (the relative offsets of the respective cpu_ent members) are not yet available for PPC yet, since there's no asm_offsets.cpp file for PPC yet. Have a look at the x86 counterpart and see how it is handled in the respective Jamfile.
BTW, according to our coding style there should be a space between the //
and the comment text.
Good luck!
follow-up: 7 comment:6 by , 14 years ago
Here's some more info on the original problem.
Turning on TRACE_HEAP
in src/system/kernel/heap.cpp shows that it goes through the new
operator (headers/private/kernel/heap.h) to memalign_etc
and through heap_memalign
to heap_raw_alloc
with a NULL
heap_allocator
:
memalign(alignment = 0, size = 112) heap 0x00000000: allocate 112 bytes from raw pages with alignment 0 PANIC: _mutex_lock(): double lock of 0x0000001c by thread 0
This would seem to indicate that sVIPHeap
(cf. memalign_etc
) is NULL
.
comment:7 by , 14 years ago
sVIPHeap
gets initialized in heap_init_post_thread
, called from vm_init_post_thread
much later in _start
(src/system/kernel/main.cpp).
All that is platform-independent code, and the only recent change in order there was hrev36625, but the problem is in vm_init
already.
comment:8 by , 14 years ago
The attached patch avoids the use of VIP priority for the iospace null cache, working around the issue. I then get to:
INIT: init VM semaphores [...] free boot ranges: get rid of 0x86d80000 - 0xfe010000
Apparently x86 does not call the generic code from its arch_vm_translation_map_init_post_area
(arch_vm_translation_map.cpp), so is not affected.
comment:9 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
I moved the VIP heap initialization to an earlier point in hrev36855, which should fix the issue. The comments regarding the to-be-implemented PPC specific debug code still apply, of course.
follow-up: 11 comment:10 by , 14 years ago
...The comments regarding the to-be-implemented PPC specific debug code still apply, of course...
Duly noted and time-dependent. If anyone gets a head-start on me, then please open another ticket on this sooner.
Mutexes don't have IDs, they are just structures in memory. And 0x1c suggests a NULL pointer. Apparently it's so early in the boot process that this address is still accessible, otherwise it would be an unhandled page fault. I'd recommend to get at least the stack traces going. On a panic a stack trace is printed automatically, but that requires a working arch_debug_call_with_fault_handler() implementation (http://dev.haiku-os.org/browser/haiku/trunk/src/system/kernel/arch/ppc/arch_debug.cpp#L311).