Opened 9 years ago

Last modified 2 years ago

#6129 assigned bug

Kernel hang after allocate_commpage_entry

Reported by: andreasf Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: #5193 Blocking: #1048, #6164
Has a Patch: no Platform: PowerPC

Description

At hrev37001, vanilla Haiku hangs after

allocate_commpage_entry(2, 24) -> 0xffff0100

With TRACE_VM activated, I see it hangs after

vm_free_unused_boot_loader_range(): asked to free 0x00000000 - 0xffffefff
[...]
free boot range: get rid of 0x86d80000 - 0xfe010000

Attachments (1)

0002-Don-t-unmap-0x90000000-0xa0000000.patch (958 bytes) - added by andreasf 9 years ago.
workaround: don't unmap the frame buffer - depends on #6128

Download all attachments as: .zip

Change History (17)

comment:1 Changed 9 years ago by andreasf

The problem appears to be in the combination of OpenFirmware and frame buffer getting unmapped. Either one is not sufficient to remedy the hang.

comment:2 Changed 9 years ago by andreasf

Blocked By: 5193 added

(In #5193) hrev36886 was too strict apparently: OpenFirmware itself is unmapped that way.

Changed 9 years ago by andreasf

workaround: don't unmap the frame buffer - depends on #6128

comment:3 Changed 9 years ago by andreasf

With the attached workaround on my PowerMac G3 I get to INIT: init Device Mapper (main.cpp).

Unfortunately I then get

PANIC: remove page 0x8681c098 from cache 0x[...]: page still has mappings

Welcome to Kernel Debugging Land...

which I guess is the direct consequence of my workaround.

It gets into an endless loop of PANIC: ASSERT FAILEDs afterwards, which might be related to #6032.

comment:4 Changed 9 years ago by andreasf

I've tried installing an ATI Rage 128 as secondary graphics card. It then sits behind the PCI bridge; the Radeon 9200 moves to a new 0xa00000000 translation, while the 0x90000000 translation remains.

comment:5 Changed 9 years ago by andreasf

In #5193 I've attached patches that no longer require the above workaround.

At hrev37025 with those patches I now get:

INIT: Init Device Mapper
PANIC: remove page 0x8681c098 from cache 0x85825c30: page still has mappings!

Welcome to Kernel Debugging Land...
Running on CPU 0
Current thread pointer is 0x86051000, which is an address we can't read from.
stack trace for thread 0x2 "undertaker"
    kernel stack: 0xffff8000 to 0xffffc000
frame            caller     <image>:function + offset
ffffbb70 (+ 240) 80164394   <kernel_ppc>:_ZL11stack_traceiPPc + 0x0350
ffffbc60 (+  16) 8016450c   <kernel_ppc>:arch_debug_stack_trace + 0x0018
ffffbc70 (+  16) 800b8208   <kernel_ppc>:_ZL22stack_trace_trampolinePv + 0x0010
ffffbc80 (+  16) 8016a520   <kernel_ppc>:arch_debug_call_with_fault_handler + 0x002c (nearest)
ffffbc90 (+  48) 800b89dc   <kernel_ppc>:debug_call_with_fault_handler + 0x0078
ffffbcc0 (+ 112) 800b9f3c   <kernel_ppc>:_ZL20kernel_debugger_loopPKcS0_P13__va_list_tagl + 0x0264
ffffbd30 (+  80) 800ba300   <kernel_ppc>:_ZL20kernel_debugger_internalPKcS0_P13__va_list_tagl + 0x01c8
ffffbd80 (+ 144) 800ba68c   <kernel_ppc>:panic + 0x00ac
ffffbe10 (+  64) 8015c3d4   <kernel_ppc>:_ZN7VMCache6DeleteEv + 0x00d0
ffffbe50 (+  48) 8015b144   <kernel_ppc>:_ZN7VMCache6UnlockEb + 0x0188
ffffbe80 (+  48) 801434b8   <kernel_ppc>:_ZL11delete_areaP14VMAddressSpaceP6VMAreab + 0x027c
ffffbeb0 (+ 160) 801434b8   <kernel_ppc>:vm_delete_area + 0x0224
ffffbf50 (+  16) 80143940   <kernel_ppc>:delete_area + 0x0038
ffffbf60 (+ 128) 800a87dc   <kernel_ppc>:_ZL10undertakerPv + 0x01c8
ffffbfe0 (+  16) 800a484c   <kernel_ppc>:_ZL28_create_kernel_thread_kentryv + 0x0020
ffffbff0 (+16400) 8016a4e4   <kernel_ppc>:ppc_kernel_thread_root + 0x0010 (nearest)
kdebug> info
kernel build: Jun  3 2010 12:49:23
SVN revision: 37025

cpu count: 1, active times:
  [1] 2002226
pages:          10333 (131072 max)
sems:           42 (65536 max)
ports:          0 (4096 max)
threads:        12 (4096 max)
teams:          1 (2048 max)
kdebug> page 0x8681c098
queue_next,prev: 0x8686c06c, 0x00000000
physical_number: 0xa32
cache:           0x85824c30
cache_next:      0x00000000
state:           wired
wired_count:     1
usage_count:     0
busy:            0
busy_writing:    0
accessed:        0
modified:        0
accessor:        -1
area mappings:
kdebug> cache 0x85824c30
CACHE 0x85824c30:
  ref_count:    0
  source:       0x00000000
  type:         RAM
  virtual_base: 0x0
  virtual_end:  0x4000
  temporary:    1
  scan_skip:    1
  lock:         0x85824c84
  lock.holder:  2
  areas:
  consumers:
  pages:
        3 in cache
kdebug>

Note that the kernel debugger is now in fact working interactively''

I see no direct relation to #5193, since the virtual_base is below the kernel area and thus no area would be created for such a range...

comment:6 Changed 9 years ago by bonefish

hrev37031 should fix the cause of the panic.

comment:7 in reply to:  6 Changed 9 years ago by andreasf

Thanks a bunch, Ingo!

At hrev37031 (with #6139 patches) I get:

INIT: Add preloaded old-style drivers
ppc_exception_entry: time 81057527 vector 0x300, iframe 0x808b6b28, srr0: 0x8013bb10
vm_soft_fault: va 0xdfffd000 not covered by area in address space
vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xdfffdb5d, ip 0x8013bb10, write 0, user 0, thread 0xc
PANIC: vm_page_fault: unhandled page fault in kernel space at 0xdfffdb5d, ip 0x8013bb10

Welcome to Kernel Debugging Land...
Running on CPU 0
Current thread pointer is 0x86053800, which is an address we can't read from.
iframe 0x808b6b28 (end = 0x808b6cd0)
stack trace for thread 0xc "main2"
    kernel stack: 0x808b3000 to 0x808b7000
frame            caller     <image>:function + offset
808b6900 (+  16) 80164514   <kernel_ppc>:arch_debug_stack_trace + 0x0018
808b6910 (+  16) 800b8208   <kernel_ppc>:_ZL22stack_trace_trampolinePv + 0x0010
808b6920 (+  16) 8016a528   <kernel_ppc>:arch_debug_call_with_fault_handler + 0x002c (nearest)
808b6930 (+  48) 800b89dc   <kernel_ppc>:debug_call_with_fault_handler + 0x0078
808b6960 (+ 112) 800b9f3c   <kernel_ppc>:_ZL20kernel_debugger_loopPKcS0_P13__va_list_tagl + 0x0264
808b69d0 (+  80) 800ba300   <kernel_ppc>:_ZL20kernel_debugger_internalPKcS0_P13__va_list_tagl + 0x01c8
808b6a20 (+ 144) 800ba68c   <kernel_ppc>:panic + 0x00ac
808b6ab0 (+  64) 8014a0a4   <kernel_ppc>:vm_page_fault + 0x0128
808b6af0 (+  48) 80166d04   <kernel_ppc>:ppc_exception_entry + 0x028c
iframe at 0x808b6b28
[...]
fpscr 0xffc00000
 srr0 0x8013bb10  srr1 0x0000b030   dar 0xdfffdf5d dsisr 0x40000000
 vector: 0x300
808b6b20 (+ 432) 8013bb10   <kernel_ppc>:hash_hash_string + 0x0010
808b6cd0 (+  16) 8013e794   <kernel_ppc>:get_memory_map_etc + 0x0248
808b6ce0 (+  16) 8008b5dc   <kernel_ppc>:_ZL11module_hashPvPKvm + 0x0048
808b6cf0 (+  32) 8013b9cc   <kernel_ppc>:hash_lookup + 0x0038
808b6d10 (+ 144) 8008e494   <kernel_ppc>:get_module + 0x0094
808b6da0 (+  64) 8042dbfc   <usb>:_ZN5StackC1Ev + 0x0184
808b6de0 (+  16) 8042b6ac   <usb>:_ZL11bus_std_opslz + 0x0068
808b6df0 (+ 144) 8008e808   <kernel_ppc>:get_module + 0x0408
808b6e80 (+  32) 8043977c   <usb_disk>:init_driver + 0x004c
808b6ea0 (+  48) 800de6b0   <kernel_ppc>:_ZL11load_driverP13legacy_driver + 0x0150
808b6ed0 (+ 144) 800df7d8   <kernel_ppc>:_ZL10add_driverPKcl + 0x01f8
808b6f60 (+  64) 800df98c   <kernel_ppc>:legacy_driver_add_preloaded + 0x0144
808b6fa0 (+  64) 8008a8e4   <kernel_ppc>:_ZL5main2Pv + 0x00e8
808b6fe0 (+  16) 800a484c   <kernel_ppc>:_ZL28_create_kernel_thread_kentryv + 0x0020
808b6ff0 (+2138345488) 8016a4ec   <kernel_ppc>:ppc_kernel_thread_root + 0x0010 (nearest)
kdebug>

comment:8 Changed 9 years ago by bonefish

The stack trace doesn't look quite right. The get_memory_map_etc() definitely doesn't belong there. module_hash() directly calls hash_hash_string(). Though, as the objdump shows, apparently through the PLT, which I find weird. I'm not too familiar with the ELF PPC specifics anymore, but (1) the kernel shouldn't have a PLT in the first place (since it should be relocatable, but not position independent) and (2) I cannot see a reason why a function in the same shared object should be called through the PLT at all. Might be some kind of optimization. At any rate the address really seems to be in get_memory_map_etc() (at least in my objdump the function is longer than 0x248 bytes). You might want to check that against the kernel you've built.

Other than that -- which might turn out to be harmless -- either the string pointer passed to get_module() or some module name seems to be broken. I guess debug output is the way to go. A call command for PPC would be nice to have here, but due to the parameter passing via registers that is probably quite complicated to do.

comment:9 in reply to:  8 ; Changed 9 years ago by andreasf

Interestingly with TRACE_MODULE turned on, I get this instead:

INIT: Add preloaded old-style drivers
get_module(bus_managers/usb/v3)
initializing module bus_managers/usb/v3 (at 0x8042c644)...
...done (No such device)
usb_disk: getting module failed 0x80007007
devfs_add_preloaded_drivers: Failed to add "usb_disk"
get_module(interrupt_controllers/openpic/device_v1)
resolving module dependencies...
get_module(system/device_manager/v1)
get_module(bus_managers/pci/v1)
initializing module interrupt_controllers/openpic/device_v1 (at 0x802ed654)...
...done (No error)
get_module(system/device_manager/v1)
PANIC: arch_int_init_post_device_manager(): Found no supported PIC!
Welcome to Kernel Debugging Land...
Running on CPU 0
[...]
<kernel_ppc>:panic
<kernel_ppc>:arch_int_init_post_device_manager + 0x0324
<kernel_ppc>:int_init_post_device_mapper + 0x0024
<kernel_ppc>:_ZL5main2Pv + 0x00f0

This stems from arch_int.cpp. The code looks stubbed out, and I vaguely remember there was a discussion about it when the device manager was somehow reorganized. Will check the list archives.

Without TRACE_MODULE I reproducibly get the page fault but couldn't spot an obvious cause. The last successful invocation of hash_hash_string is "bus_managers/usb/v3", the next string is damaged. Adding a NULL check for module->name doesn't help.

comment:10 in reply to:  9 ; Changed 9 years ago by bonefish

Replying to andreasf:

Without TRACE_MODULE I reproducibly get the page fault but couldn't spot an obvious cause. The last successful invocation of hash_hash_string is "bus_managers/usb/v3", the next string is damaged. Adding a NULL check for module->name doesn't help.

No, the fault is at 0xdfffdb5d, so I guess that's the string pointer. It just has to be tracked down where it comes from and why.

comment:11 in reply to:  10 ; Changed 9 years ago by andreasf

It comes from USB bus manager's new(std::nothrow) Stack().

Before that call I can dprintf() a test message just fine, while as first instruction of the Stack() constructor I get garbage printed (same for TRACE_USB).

Seeing nothing fundamentally wrong with the constructor itself, I could think of two possible reasons for this failure:

  • new(std::nothrow) might be broken in this context (it looks like a simple malloc though if I read kernel_cpp.h right), or
  • something could be going wrong with the ELF sections (ctors or data). There is a special add-on linker script but it seems it is not used.

Concerning your PLT observations, the kernel linker script does include multiple mentions of plt sections. And then there's the removal of ppc linkhack.so -shared flag that might have undesired effects?

comment:12 in reply to:  11 Changed 9 years ago by andreasf

I've been unable to pinpoint the cause of the USB failure. The operator is okay, and the ELF section loading in the boot loader is shared with x86 and does clear BSS beyond filesz. Possibly stack corruption or some alignment issue?

It is no longer reliably reproducible.

comment:13 Changed 9 years ago by andreasf

Blocking: 6164 added

comment:14 in reply to:  11 Changed 9 years ago by andreasf

The removal of linkhack.so's -shared flag has been ruled out as possible cause with #6210.

And the faulting address does not seem to change at runtime, only at compilation time.

comment:15 Changed 4 years ago by luroh

Milestone: R1Unscheduled

Moving non-x86 related tickets out of R1 milestone.

comment:16 Changed 2 years ago by axeld

Owner: changed from axeld to nobody
Status: newassigned
Note: See TracTickets for help on using tickets.