Opened 11 years ago

Last modified 2 weeks ago

#10312 new bug

Kernel runs out of address space in checkfs

Reported by: kallisti5 Owned by: axeld
Priority: high Milestone: R1
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #12118
Platform: All

Description

If a large number of errors exist on a bfs filesystem, the kernel can run out of address space for checkfs (screenshot attached of kdl)

tested in hrev46530 gcc2h

Attachments (1)

IMG_20131213_231308.jpg (3.2 MB ) - added by kallisti5 11 years ago.

Change History (11)

by kallisti5, 11 years ago

Attachment: IMG_20131213_231308.jpg added

comment:1 by kallisti5, 11 years ago

still exists as of hrev46796

comment:2 by pulkomandy, 10 years ago

Priority: normalhigh

I'm hitting something similar here. I didn't wait until KDL, however.

  • Run checkfs on a small partition (2GB): kernel memory use increases, not freed after checkfs is done
  • Run checkfs on a bigger partition (100GB): kernel memory reaches about 1.7GB, then system becomes very unresponsive (app server struggles to redraw the screen or even keep the mouse cursor moving)

comment:3 by waddlesplash, 10 years ago

Milestone: R1R1/alpha5

Moving to A5 as this sounds like a serious resource leak.

comment:4 by axeld, 10 years ago

It's not a resource leak AFAICT, just a problem of how the block cache maintains its memory.

comment:5 by pulkomandy, 10 years ago

Milestone: R1/alpha5R1/beta1

comment:6 by pulkomandy, 10 years ago

Panic is easily reproductible here with an identical stack trace. I had a look at the block_cache KDL command. There are 3 block caches (1 per mounted volume I guess?).

BLOCK CACHE: 0x8280aa80
 fd:           0
 max_blocks:   1572864
 block_size:   2048
 next_transaction_id: 41
 buffer_cache: 0x82c6ba18
 busy_reading: 0, no waiters
 busy_writing: 0, no waiters
 8018 blocks total, 1 dirty, 0 discarded, 0 referenced, 0 busy, 8017 in unused.


BLOCK CACHE: 0xd300d248
 fd:           34
 max_blocks:   1572864
 block_size:   2048
 next_transaction_id: 226
 buffer_cache: 0xd40cb688
 busy_reading: 0, no waiters
 busy_writing: 0, no waiters
 807 blocks total, 0 dirty, 0 discarded, 0 referenced, 0 busy, 807 in unused.


BLOCK CACHE: 0xd4076190
 fd:           33
 max_blocks:   24588800
 block_size:   2048
 next_transaction_id: 64
 buffer_cache: 0xd40cb548
 busy_reading: 1, no waiters
 busy_writing: 0, no waiters
 311080 blocks total, 20 dirty, 0 discarded, 0 referenced, 1 busy, 311060 in unused.

I notice the first one has fd 0, but I assume this is not a problem (probably the root FS is mounted first and gets the first FD number?)

The third one is relevant. While it is not too busy at this point (only 20 dirty blocks), there is a big number of unused blocks and a very high max_blocks. Still, 311060 * 2048 is only 622MB, not enough to run out of memory. The max_blocks appears to allow allocating 24 million blocks, which at 2K by block would be enough to run out of 32-bit address space (4.8GB). Shouldn't we limit this value somehow?

avail lists about 6.25GB RAM free on this 8GB system.

I'm not sure what else I should be looking for?

comment:7 by pulkomandy, 9 years ago

Milestone: R1/beta1R1

comment:8 by axeld, 8 years ago

Component: File Systems/BFSSystem/Kernel

max_blocks is not the maximum number of blocks to cache, but refers to the highest accessible block number for that device (so 24588800 * 2K = 46,9 GB which should correspond to your partition size).

The root FS does not have any block cache, this should be your boot disk.

I only find the huge number of unused blocks a bit worrying -- that accounts for 640 MB alone. Those should obviously be freed with a bit more enthusiasm.

comment:9 by axeld, 8 years ago

Blocking: 12118 added

(In #12118) Indeed.

comment:10 by waddlesplash, 2 weeks ago

Is this fixed with the adjustments to block cache's memory pressure handling, and then the memory leak fixes?

Note: See TracTickets for help on using tickets.