Opened 14 years ago
Last modified 8 years ago
#6129 assigned bug
Kernel hang after allocate_commpage_entry
Reported by: | andreasf | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | #5193 | Blocking: | #1048, #6164 |
Platform: | PowerPC |
Description
At hrev37001, vanilla Haiku hangs after
allocate_commpage_entry(2, 24) -> 0xffff0100
With TRACE_VM
activated, I see it hangs after
vm_free_unused_boot_loader_range(): asked to free 0x00000000 - 0xffffefff [...] free boot range: get rid of 0x86d80000 - 0xfe010000
Attachments (1)
Change History (17)
comment:1 by , 14 years ago
comment:2 by , 14 years ago
Blocked By: | 5193 added |
---|
by , 14 years ago
Attachment: | 0002-Don-t-unmap-0x90000000-0xa0000000.patch added |
---|
workaround: don't unmap the frame buffer - depends on #6128
comment:3 by , 14 years ago
With the attached workaround on my PowerMac G3 I get to INIT: init Device Mapper
(main.cpp).
Unfortunately I then get
PANIC: remove page 0x8681c098 from cache 0x[...]: page still has mappings Welcome to Kernel Debugging Land...
which I guess is the direct consequence of my workaround.
It gets into an endless loop of PANIC: ASSERT FAILED
s afterwards, which might be related to #6032.
comment:4 by , 14 years ago
I've tried installing an ATI Rage 128 as secondary graphics card. It then sits behind the PCI bridge; the Radeon 9200 moves to a new 0xa00000000 translation, while the 0x90000000 translation remains.
comment:5 by , 14 years ago
In #5193 I've attached patches that no longer require the above workaround.
At hrev37025 with those patches I now get:
INIT: Init Device Mapper PANIC: remove page 0x8681c098 from cache 0x85825c30: page still has mappings! Welcome to Kernel Debugging Land... Running on CPU 0 Current thread pointer is 0x86051000, which is an address we can't read from. stack trace for thread 0x2 "undertaker" kernel stack: 0xffff8000 to 0xffffc000 frame caller <image>:function + offset ffffbb70 (+ 240) 80164394 <kernel_ppc>:_ZL11stack_traceiPPc + 0x0350 ffffbc60 (+ 16) 8016450c <kernel_ppc>:arch_debug_stack_trace + 0x0018 ffffbc70 (+ 16) 800b8208 <kernel_ppc>:_ZL22stack_trace_trampolinePv + 0x0010 ffffbc80 (+ 16) 8016a520 <kernel_ppc>:arch_debug_call_with_fault_handler + 0x002c (nearest) ffffbc90 (+ 48) 800b89dc <kernel_ppc>:debug_call_with_fault_handler + 0x0078 ffffbcc0 (+ 112) 800b9f3c <kernel_ppc>:_ZL20kernel_debugger_loopPKcS0_P13__va_list_tagl + 0x0264 ffffbd30 (+ 80) 800ba300 <kernel_ppc>:_ZL20kernel_debugger_internalPKcS0_P13__va_list_tagl + 0x01c8 ffffbd80 (+ 144) 800ba68c <kernel_ppc>:panic + 0x00ac ffffbe10 (+ 64) 8015c3d4 <kernel_ppc>:_ZN7VMCache6DeleteEv + 0x00d0 ffffbe50 (+ 48) 8015b144 <kernel_ppc>:_ZN7VMCache6UnlockEb + 0x0188 ffffbe80 (+ 48) 801434b8 <kernel_ppc>:_ZL11delete_areaP14VMAddressSpaceP6VMAreab + 0x027c ffffbeb0 (+ 160) 801434b8 <kernel_ppc>:vm_delete_area + 0x0224 ffffbf50 (+ 16) 80143940 <kernel_ppc>:delete_area + 0x0038 ffffbf60 (+ 128) 800a87dc <kernel_ppc>:_ZL10undertakerPv + 0x01c8 ffffbfe0 (+ 16) 800a484c <kernel_ppc>:_ZL28_create_kernel_thread_kentryv + 0x0020 ffffbff0 (+16400) 8016a4e4 <kernel_ppc>:ppc_kernel_thread_root + 0x0010 (nearest) kdebug> info kernel build: Jun 3 2010 12:49:23 SVN revision: 37025 cpu count: 1, active times: [1] 2002226 pages: 10333 (131072 max) sems: 42 (65536 max) ports: 0 (4096 max) threads: 12 (4096 max) teams: 1 (2048 max) kdebug> page 0x8681c098 queue_next,prev: 0x8686c06c, 0x00000000 physical_number: 0xa32 cache: 0x85824c30 cache_next: 0x00000000 state: wired wired_count: 1 usage_count: 0 busy: 0 busy_writing: 0 accessed: 0 modified: 0 accessor: -1 area mappings: kdebug> cache 0x85824c30 CACHE 0x85824c30: ref_count: 0 source: 0x00000000 type: RAM virtual_base: 0x0 virtual_end: 0x4000 temporary: 1 scan_skip: 1 lock: 0x85824c84 lock.holder: 2 areas: consumers: pages: 3 in cache kdebug>
Note that the kernel debugger is now in fact working interactively''
I see no direct relation to #5193, since the virtual_base
is below the kernel area and thus no area would be created for such a range...
comment:7 by , 14 years ago
Thanks a bunch, Ingo!
At hrev37031 (with #6139 patches) I get:
INIT: Add preloaded old-style drivers ppc_exception_entry: time 81057527 vector 0x300, iframe 0x808b6b28, srr0: 0x8013bb10 vm_soft_fault: va 0xdfffd000 not covered by area in address space vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xdfffdb5d, ip 0x8013bb10, write 0, user 0, thread 0xc PANIC: vm_page_fault: unhandled page fault in kernel space at 0xdfffdb5d, ip 0x8013bb10 Welcome to Kernel Debugging Land... Running on CPU 0 Current thread pointer is 0x86053800, which is an address we can't read from. iframe 0x808b6b28 (end = 0x808b6cd0) stack trace for thread 0xc "main2" kernel stack: 0x808b3000 to 0x808b7000 frame caller <image>:function + offset 808b6900 (+ 16) 80164514 <kernel_ppc>:arch_debug_stack_trace + 0x0018 808b6910 (+ 16) 800b8208 <kernel_ppc>:_ZL22stack_trace_trampolinePv + 0x0010 808b6920 (+ 16) 8016a528 <kernel_ppc>:arch_debug_call_with_fault_handler + 0x002c (nearest) 808b6930 (+ 48) 800b89dc <kernel_ppc>:debug_call_with_fault_handler + 0x0078 808b6960 (+ 112) 800b9f3c <kernel_ppc>:_ZL20kernel_debugger_loopPKcS0_P13__va_list_tagl + 0x0264 808b69d0 (+ 80) 800ba300 <kernel_ppc>:_ZL20kernel_debugger_internalPKcS0_P13__va_list_tagl + 0x01c8 808b6a20 (+ 144) 800ba68c <kernel_ppc>:panic + 0x00ac 808b6ab0 (+ 64) 8014a0a4 <kernel_ppc>:vm_page_fault + 0x0128 808b6af0 (+ 48) 80166d04 <kernel_ppc>:ppc_exception_entry + 0x028c iframe at 0x808b6b28 [...] fpscr 0xffc00000 srr0 0x8013bb10 srr1 0x0000b030 dar 0xdfffdf5d dsisr 0x40000000 vector: 0x300 808b6b20 (+ 432) 8013bb10 <kernel_ppc>:hash_hash_string + 0x0010 808b6cd0 (+ 16) 8013e794 <kernel_ppc>:get_memory_map_etc + 0x0248 808b6ce0 (+ 16) 8008b5dc <kernel_ppc>:_ZL11module_hashPvPKvm + 0x0048 808b6cf0 (+ 32) 8013b9cc <kernel_ppc>:hash_lookup + 0x0038 808b6d10 (+ 144) 8008e494 <kernel_ppc>:get_module + 0x0094 808b6da0 (+ 64) 8042dbfc <usb>:_ZN5StackC1Ev + 0x0184 808b6de0 (+ 16) 8042b6ac <usb>:_ZL11bus_std_opslz + 0x0068 808b6df0 (+ 144) 8008e808 <kernel_ppc>:get_module + 0x0408 808b6e80 (+ 32) 8043977c <usb_disk>:init_driver + 0x004c 808b6ea0 (+ 48) 800de6b0 <kernel_ppc>:_ZL11load_driverP13legacy_driver + 0x0150 808b6ed0 (+ 144) 800df7d8 <kernel_ppc>:_ZL10add_driverPKcl + 0x01f8 808b6f60 (+ 64) 800df98c <kernel_ppc>:legacy_driver_add_preloaded + 0x0144 808b6fa0 (+ 64) 8008a8e4 <kernel_ppc>:_ZL5main2Pv + 0x00e8 808b6fe0 (+ 16) 800a484c <kernel_ppc>:_ZL28_create_kernel_thread_kentryv + 0x0020 808b6ff0 (+2138345488) 8016a4ec <kernel_ppc>:ppc_kernel_thread_root + 0x0010 (nearest) kdebug>
follow-up: 9 comment:8 by , 14 years ago
The stack trace doesn't look quite right. The get_memory_map_etc()
definitely doesn't belong there. module_hash()
directly calls hash_hash_string()
. Though, as the objdump shows, apparently through the PLT, which I find weird. I'm not too familiar with the ELF PPC specifics anymore, but (1) the kernel shouldn't have a PLT in the first place (since it should be relocatable, but not position independent) and (2) I cannot see a reason why a function in the same shared object should be called through the PLT at all. Might be some kind of optimization. At any rate the address really seems to be in get_memory_map_etc() (at least in my objdump the function is longer than 0x248 bytes). You might want to check that against the kernel you've built.
Other than that -- which might turn out to be harmless -- either the string pointer passed to get_module() or some module name seems to be broken. I guess debug output is the way to go. A call
command for PPC would be nice to have here, but due to the parameter passing via registers that is probably quite complicated to do.
follow-up: 10 comment:9 by , 14 years ago
Interestingly with TRACE_MODULE
turned on, I get this instead:
INIT: Add preloaded old-style drivers get_module(bus_managers/usb/v3) initializing module bus_managers/usb/v3 (at 0x8042c644)... ...done (No such device) usb_disk: getting module failed 0x80007007 devfs_add_preloaded_drivers: Failed to add "usb_disk" get_module(interrupt_controllers/openpic/device_v1) resolving module dependencies... get_module(system/device_manager/v1) get_module(bus_managers/pci/v1) initializing module interrupt_controllers/openpic/device_v1 (at 0x802ed654)... ...done (No error) get_module(system/device_manager/v1) PANIC: arch_int_init_post_device_manager(): Found no supported PIC! Welcome to Kernel Debugging Land... Running on CPU 0 [...] <kernel_ppc>:panic <kernel_ppc>:arch_int_init_post_device_manager + 0x0324 <kernel_ppc>:int_init_post_device_mapper + 0x0024 <kernel_ppc>:_ZL5main2Pv + 0x00f0
This stems from arch_int.cpp. The code looks stubbed out, and I vaguely remember there was a discussion about it when the device manager was somehow reorganized. Will check the list archives.
Without TRACE_MODULE
I reproducibly get the page fault but couldn't spot an obvious cause. The last successful invocation of hash_hash_string
is "bus_managers/usb/v3", the next string is damaged. Adding a NULL check for module->name doesn't help.
follow-up: 11 comment:10 by , 14 years ago
Replying to andreasf:
Without
TRACE_MODULE
I reproducibly get the page fault but couldn't spot an obvious cause. The last successful invocation ofhash_hash_string
is "bus_managers/usb/v3", the next string is damaged. Adding a NULL check for module->name doesn't help.
No, the fault is at 0xdfffdb5d, so I guess that's the string pointer. It just has to be tracked down where it comes from and why.
follow-ups: 12 14 comment:11 by , 14 years ago
It comes from USB bus manager's new(std::nothrow) Stack()
.
Before that call I can dprintf()
a test message just fine, while as first instruction of the Stack()
constructor I get garbage printed (same for TRACE_USB
).
Seeing nothing fundamentally wrong with the constructor itself, I could think of two possible reasons for this failure:
new(std::nothrow)
might be broken in this context (it looks like a simplemalloc
though if I read kernel_cpp.h right), or- something could be going wrong with the ELF sections (ctors or data). There is a special add-on linker script but it seems it is not used.
Concerning your PLT observations, the kernel linker script does include multiple mentions of plt
sections. And then there's the removal of ppc linkhack.so -shared flag that might have undesired effects?
comment:12 by , 14 years ago
I've been unable to pinpoint the cause of the USB failure. The operator is okay, and the ELF section loading in the boot loader is shared with x86 and does clear BSS beyond filesz
. Possibly stack corruption or some alignment issue?
It is no longer reliably reproducible.
comment:13 by , 14 years ago
Blocking: | 6164 added |
---|
comment:14 by , 14 years ago
The removal of linkhack.so
's -shared
flag has been ruled out as possible cause with #6210.
And the faulting address does not seem to change at runtime, only at compilation time.
comment:15 by , 10 years ago
Milestone: | R1 → Unscheduled |
---|
Moving non-x86 related tickets out of R1 milestone.
comment:16 by , 8 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
The problem appears to be in the combination of OpenFirmware and frame buffer getting unmapped. Either one is not sufficient to remedy the hang.