Opened 11 years ago

Closed 11 years ago

#3110 closed bug (fixed)

[kernel] PANIC: page still has mappings

Reported by: diver Owned by: bonefish
Priority: high Milestone: R1/alpha1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

Happend while I opened Haiku volume in TextSearch and searched for "beos".

Attachments (6)

kdl.png (60.2 KB ) - added by diver 11 years ago.
kdl_new.png (68.8 KB ) - added by diver 11 years ago.
wacom.png (41.6 KB ) - added by diver 11 years ago.
another crash with above steps
page still has mappings.jpg (68.6 KB ) - added by Adek336 11 years ago.
page still has mappings.2.jpg (42.9 KB ) - added by Adek336 11 years ago.
testproject.zip (73.3 KB ) - added by darkwyrm 11 years ago.
Test project for Paladin to reproduce bug

Download all attachments as: .zip

Change History (29)

by diver, 11 years ago

Attachment: kdl.png added

comment:1 by diver, 11 years ago

I reproduced it again after opening and closing random results of that search in Pe about 10-20 times.

by diver, 11 years ago

Attachment: kdl_new.png added

comment:2 by diver, 11 years ago

While trying to reproduce I get another kdl with strangly auich driver. Maybe another bug?

comment:3 by diver, 11 years ago

This is probably another bug, I'll open new ticket.

by diver, 11 years ago

Attachment: wacom.png added

another crash with above steps

comment:4 by anevilyak, 11 years ago

I just ran into this one as well while closing Firefox after it segfaulted...interestingly it was unable to backtrace into userspace, stopping with a read fault after the iframe. trace looked identical to kdl.png otherwise though.

comment:5 by diver, 11 years ago

Since then I've updated VirtualBox from 2.0.2 to 2.1.2 and now when I try to reproduce this issue VirtualBox with haiku just close itself.

comment:6 by Adek336, 11 years ago

I've encountered these a few times with hybrid hrev29195 on vmware when doing

nc -l -p 2222 >filename

Sometimes in the thread of nc: with Cache::Delete, vm_delete_area (one area), kernel_exit_handle_signals in the backtrace and an kdl error message about a circular stack.

Sometimes in the thread of sh: with Cache::Delete, vm_delete_areas (many areas), _user_exec and a proper backtrace into userland (but without symbol names).

comment:7 by Adek336, 11 years ago

The panic message was "remove page (stuff) from cache (stuff): page still has mappings!"

by Adek336, 11 years ago

by Adek336, 11 years ago

comment:8 by Adek336, 11 years ago

page still has mappings.jpg: in nc thread

page still has mappings.2.jpg: in sh thread

comment:9 by anevilyak, 11 years ago

Just ran into this one while compiling here....specifically:

PANIC: remove page 0x8236b0c8 from cache 0x812f1a2c: page still has mappings!

Welcome to Kernel Debugging Land...
Thread 10652 "cc1plus" running on CPU 0
kdebug> bt
stack trace for thread 10652 "cc1plus"
    kernel stack: 0xa13f6000 to 0xa13fa000
      user stack: 0x7efee000 to 0x7ffee000
frame               caller     <image>:function + offset
 0 a13f9914 (+  48) 8005f1b9   <kernel_x86>:invoke_debugger_command + 0x00f5
 1 a13f9944 (+  64) 8005efa9   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x80127060, int32: 0, 0x0 "<NULL>") + 0x0079
 2 a13f9984 (+  64) 8005f330   <kernel_x86>:invoke_debugger_command_pipe + 0x009c
 3 a13f99c4 (+  48) 800608e0   <kernel_x86> ExpressionParser<0xa13f9a78>::_ParseCommandPipe(0xa13f9a74) + 0x0234
 4 a13f99f4 (+  64) 8005fd1a   <kernel_x86> ExpressionParser<0xa13f9a78>::EvaluateCommand(0x80117aa0 "bt", 0xa13f9a74) + 0x02ba
 5 a13f9a34 (+ 224) 80061d08   <kernel_x86>:evaluate_debug_command + 0x0088
 6 a13f9b14 (+  64) 8005d0aa   <kernel_x86> kernel_debugger_loop() + 0x01ae
 7 a13f9b54 (+  32) 8005df39   <kernel_x86>:kernel_debugger + 0x004d
 8 a13f9b74 (+ 192) 8005dee1   <kernel_x86>:panic + 0x0029
 9 a13f9c34 (+  96) 800c283f   <kernel_x86> VMCache<0x812f1a2c>::Delete(0x812f1a2c, 0x812f6780, 0x282) + 0x0077
10 a13f9c94 (+  64) 800c2d00   <kernel_x86> VMCache<0x812f1a2c>::Unlock(0x812f0000, 0x812ff0cc, 0x0) + 0x0124
11 a13f9cd4 (+  48) 800c2e07   <kernel_x86> VMCache<0x812f1a2c>::ReleaseRef(0x812f6788, 0xa13f9d54, 0x246, 0x812ff0a0) + 0x002b
12 a13f9d04 (+  48) 800bbb62   <kernel_x86> delete_area(vm_address_space*: 0x812f6780, vm_area*: 0x812ff0a0) + 0x00aa
13 a13f9d34 (+  48) 800be22d   <kernel_x86>:vm_delete_areas + 0x0079
14 a13f9d64 (+  48) 800c23a7   <kernel_x86>:vm_delete_address_space + 0x0033
15 a13f9d94 (+ 112) 80051e56   <kernel_x86>:team_delete_team + 0x0276
16 a13f9e04 (+ 192) 800558f7   <kernel_x86>:thread_exit + 0x03af
17 a13f9ec4 (+  64) 8004a94d   <kernel_x86>:handle_signals + 0x03d5
18 a13f9f04 (+  64) 80055d74   <kernel_x86>:thread_at_kernel_exit + 0x0090
19 a13f9f44 (+ 100) 800d281b   <kernel_x86>:kernel_exit_handle_signals + 0x0006
user iframe at 0xa13f9fa8 (end = 0xa13fa000)
 eax 0x0            ebx 0x532ad4        ecx 0x7ffedde0   edx 0x246
 esi 0x0            edi 0x415007        ebp 0x7ffede0c   esp 0xa13f9fdc
 eip 0xffff0104  eflags 0x207      user esp 0x7ffedde0
 vector: 0x63, error code: 0x0
20 a13f9fa8 (+   0) ffff0104   <commpage>:commpage_syscall + 0x0004
7ffede0c -- read fault
kdebug> vm
no completions
kdebug> cache
  cache       cache_info  cache_tree  caches
kdebug> cache 0x812f1a2c
CACHE 0x812f1a2c:
  ref_count:    0
  source:       0x00000000
  type:         RAM
  virtual_base: 0x0
  virtual_end:  0x1720000
  temporary:    1
  scan_skip:    0
  lock:         0x812f1a7c
  lock.holder:  10652
  areas:
  consumers:
  pages:
        5880 in cache
kdebug> cache_tree 0x812f1a2c
0x812f1a2c <--
kdebug> cache_info 0x812f1a2c
name: ÌÌÌÌ8/8/
lock: 0x812f1a54
object_size: 0
cache_color_cycle: 5880
used_count: 3435973632
empty_count: 2148653760
pressure: 3422552064
slab_size: 24248320
usage: 0
maximum: 0
flags: 0x0
cookie: 0xdeadbeef
kdebug> thread 10652
THREAD: 0x84199000
id:                 10652 (0x299c)
name:               "cc1plus"
all_next:           0x841c6800
team_next:          0x8418f800
q_next:             0x841ba800
priority:           20 (next 20, I/O: -1)
state:              running
next_state:         ready
cpu:                0x801374c0 (0)
sig_pending:        0x100000 (blocked: 0x0)
in_kernel:          1
fault_handler:      0x8005f1d4
args:               0x812f6ca8 0x00000000
entry:              0x80050430
team:               0x8119e198, "kernel_team"
  exit.sem:         623256
  exit.status:      0x0 (No error)
  exit.reason:      0x1
  exit.signal:      0x0
  exit.waiters:
kernel_stack_area:  307181
kernel_stack_base:  0xa13f6000
user_stack_area:    -1
user_stack_base:    0x7efee000
user_local_storage: 0x7ffee000
kernel_errno:       0x0 (No error)
kernel_time:        125913
user_time:          5972668
flags:              0x1
architecture dependant section:
        esp: 0xa13f9df8
        ss: 0x00000010
        fpu_state at 0x841993e0
kdebug> mutex 0x812f1a7c
mutex 0x812f1a7c:
  name:            vm_cache
  flags:           0x0
  holder:          10652
  waiting threads:
kdebug>

Leaving this box in KDL for now in case there's any other useful information I can pull out of this.

comment:10 by anevilyak, 11 years ago

Further info from debug session with Ingo for reference:

kdebug> page 0x8236b0c8
PAGE: 0x8236b0c8
queue_next,prev: 0x8285e468, 0x823856d0
physical_number: c205
cache:           0x812f1a2c
cache_offset:    3921
cache_next:      0x00000000
type:            0
state:           active
wired_count:     0
usage_count:     2
busy_writing:    0
area mappings:
  0x812afe60 (0x1697)
kdebug> area 0x1697
AREA: 0x812afe60
name:           'libtracker.so_seg0ro'
owner:          0xb2
id:             0x1697
base:           0x595000
size:           0x195000
protection:     0x15
wiring:         0x0
memory_type:    0x0
cache:          0x81272d14
cache_type:     vnode
cache_offset:   0x0
cache_next:     0x00000000
cache_prev:     0x00000000
page mappings:  285
kdebug> cache_tree 0x81272d14
0x8127f348
  0x81281364
  0x8128b554
  0x812a65d0
  0x812db554
  0x81272d14 <--
  0x81272b24
kdebug> area address 0x812ff0a0
AREA: 0x812ff0a0
name:           'heap'
owner:          0x299c
id:             0x4b015
base:           0x18000000
size:           0x1720000
protection:     0x33
wiring:         0x0
memory_type:    0x0
cache:          0x812f1a2c
cache_type:     RAM
cache_offset:   0x0
cache_next:     0x00000000
cache_prev:     0x00000000
page mappings:  0

comment:11 by bonefish, 11 years ago

Might be fixed in hrev29992. Since I couldn't reproduce it, I'm leaving the ticket open for the time being. If anyone still encounters the bug, please add a comment.

comment:12 by bonefish, 11 years ago

Milestone: R1R1/alpha1
Priority: normalhigh

Just ran into it with hrev30345. Bumping priority and milestone.

comment:13 by bonefish, 11 years ago

Eliminated a possible cause in hrev30605. When a mapped file was shrunk, pages could be freed without removing their mappings. That would fit the reported case where the mapping was in a "libtracker.so_seg0ro" area, *if* libtracker.so was re-compiled and a program using it was still running. The other reported cases (TextSearch, nc) don't seem to fit, though. So I'm leaving the ticket open. I've introduced a new panic()ing check in hrev30606 which will hopefully trigger earlier.

comment:14 by mmlr, 11 years ago

I can reproduce this with kqemu. I've re-ported the current qemu version (0.10.4, using native APIs) and the corresponding kqemu. QEMU and kqemu work fine as long as they are running, but as soon as I close QEMU this panic happens. This package is hacked up a bit (QEMU requiring a more current GCC I compiled everything of QEMU with GCC4 but then used GCC2 for the native stuff, since QEMU is pure C however this does work). It's of course also possible that my kqemu driver is bogus, but I don't exactly understand how a driver would provoke this panic. The driver does use create/delete_area, lock/unlock_memory and get_memory_map. Could a lock_memory without a matching unlock_memory trigger this? You can get the package at http://haiku.mlotz.ch/qemu-0.10.4-haiku.zip for testing. You can view the driver source at http://haiku.mlotz.ch/kqemu-haiku.c . It is based on the previous BeOS version and doesn't make use of any Haiku specifics except for user_memcpy(). Just tell me if you want me to test anything in that regard.

comment:15 by mmlr, 11 years ago

Never mind, it's of course a bug in my kqemu. The initial lock_memory() runs in the context of the QEMU team, but later when the driver is closed the unlock_memory() is triggered from the kernel. Therefore the team_id has to be stored and unlock_memory_etc() has to be used.

The strange thing is that this exact same (broken) method worked before with the old kqemu of the old QEMU. My theory is that previously the driver was properly closed by QEMU when shutting down the emulation (so again inside the QEMU team context). This would also explain why this panic would occure when the emulation crashed (I think I remember to have seen it in that context). Right now it seems that the driver is not properly closed but instead the kernel cleans it up when the team goes down. This is probably related to some changes I need to adapt my port to again. Sorry for the noise.

The same kind of error could be responsible for this bug though. Seeing that it seems to always be triggered on program termination where the kernel would potentially clean up open handles. In case this is really the case debug output in unlock_memory() indicating the failed unlock should reveal that.

in reply to:  15 comment:16 by bonefish, 11 years ago

Replying to mmlr:

The same kind of error could be responsible for this bug though. Seeing that it seems to always be triggered on program termination where the kernel would potentially clean up open handles. In case this is really the case debug output in unlock_memory() indicating the failed unlock should reveal that.

lock_memory() increments the page's wired_count. In the instances of this bug where we've had a closer look at the page, it still had a mapping in a completely unrelated area, though, so they can't be lock_memory() related.

That aside, there are also still TODO's in lock_memory_etc() which should be tackled eventually. E.g. currently it is possible to delete or shrink an area with locked memory, which will create wired free pages. Unless caused on purpose, it's rather unlikely that this happens in practice, though.

comment:17 by darkwyrm, 11 years ago

I can confirm that this still happens as of hrev30852 in the gcc2 build and can consistently reproduce the trip to KDL by compiling a project with a build of Paladin compiled from within Haiku.

in reply to:  17 comment:18 by darkwyrm, 11 years ago

Replying to darkwyrm:

I can confirm that this still happens as of hrev30852 in the gcc2 build and can consistently reproduce the trip to KDL by compiling a project with a build of Paladin compiled from within Haiku.

I forgot -- I'm running this on real hardware and not in a VM.

in reply to:  17 comment:19 by bonefish, 11 years ago

Replying to darkwyrm:

I can confirm that this still happens as of hrev30852 in the gcc2 build and can consistently reproduce the trip to KDL by compiling a project with a build of Paladin compiled from within Haiku.

Great! Can you give detailed instructions how to reproduce it, including all the necessary files, please.

comment:20 by darkwyrm, 11 years ago

Absolutely. Tested with hrev30907:

1) Download and install Paladin RC6 -- http://www.bebits.com/bob/22146/Paladin1.0rc6.pkg.zip 2) Create /boot/home/projects folder 3) Open a Terminal and go to ~/projects. 4) $ unzip testproject.zip; cd BePhotoMagic 5) $ /boot/apps/Paladin/Paladin BePhotoMagic.pld 6) Hit Alt-M to run a build.

Sometimes it will KDL on the first build attempt after it has progressed most of the way through the build. Sometimes Paladin craps out on an error. In such cases, restarting the build causes it to KDL almost immediately after the restart.

The test rig is a P4 Xeon 3Ghz with Hyperthreading and 1GB RAM in the event SMP is a factor.

by darkwyrm, 11 years ago

Attachment: testproject.zip added

Test project for Paladin to reproduce bug

comment:21 by bonefish, 11 years ago

Owner: changed from axeld to bonefish
Status: newassigned

Sweet, could just reproduce it in VMware. Looking into it.

comment:22 by bonefish, 11 years ago

Can't reproduce in hrev30911 anymore. Might be related to the more defensive mapping in vm_soft_fault(). Trying to understand the original problem...

comment:23 by bonefish, 11 years ago

Resolution: fixed
Status: assignedclosed

Analyzed the problem with Paladin: Two threads of the same team simultaneously fault on the same virtual page, one write the other read. The write fault is handled first, the read fault maps the same page again, adding a second page mapping. Then do a fork(), which moves the cache one level lower and changes the page mappings to read-only. Another write fault in the original team unmaps the page -- removing only one mapping! -- and adds a fresh page in the top-most cache. When the child team exec()s, the top-most cache is merge with its source (the one with the "bad" page). Since the page is shadowed it is ignored while merging and removed when the cache is deleted, triggering the panic, since it still has the invalid mapping.

Long story short: Simultaneous page faults are handled gracefully in hrev30911, so this problem is indeed fixed. It also seems like a good enough explanation for the other reported situations, so I'm closing this bug. Since it's already the third fixed bug which triggered the panic(), please create a new ticket, if encountered again.

Note: See TracTickets for help on using tickets.