Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#12962 closed bug (fixed)

fflush Crash in BFS

Reported by: AGMS Owned by: axeld
Priority: normal Milestone: Unscheduled
Component: File Systems/BFS Version: R1/Development
Keywords: fflush Cc: agmsmith@…
Blocked By: Blocking:
Platform: All

Description

We're having trouble with a KDL crash in fflush().

Unfortunately it's hard to reproduce without waiting days. On that time scale, though, it happens often, and it is the same crash again and again. It's in Inode::TransactionDone() / Transaction::NotifyListeners(), reached when an fflush() seems to trigger a write (at least it's calling through the layers of file system write functions).

The application code logs to text files through a redirected stdout and stderr (two separate files), calling fflush(stdout) once a second even if no new data has been output. Seen somewhere around hrev50163.
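
For reference, the logging pattern described above might look roughly like the following C sketch. The helper names redirect_to_log() and flush_tick() are hypothetical, not taken from the application; only freopen() and fflush() are standard C calls.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: redirect a standard stream to a log file, appending.
 * Returns 0 on success, or an errno value on failure. */
int redirect_to_log(FILE *stream, const char *path)
{
    assert(stream != NULL && path != NULL);
    if (freopen(path, "a", stream) == NULL)
        return errno;
    return 0;
}

/* Hypothetical helper: one iteration of the once-per-second loop. Flushes
 * the stream whether or not anything new was written since the last call.
 * Returns 0 on success, or an errno value on failure. */
int flush_tick(FILE *stream)
{
    assert(stream != NULL);
    if (fflush(stream) == EOF)
        return errno;
    return 0;
}
```

An application following the report would presumably call redirect_to_log() once for stdout and once for stderr with two different paths, then call flush_tick(stdout) from a loop that sleeps one second per iteration.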

Attachments (1)

FFlushKDL.jpg (1.3 MB) - added by AGMS 8 years ago.
Stack backtrace of fflush crash.


Change History (8)

by AGMS, 8 years ago

Attachment: FFlushKDL.jpg added

Stack backtrace of fflush crash.

comment:1 by waddlesplash, 8 years ago

If anyone else has ever seen this, please comment here so we can keep track of how much of an issue it is.

comment:2 by axeld, 8 years ago

Is there any way you could retrieve a syslog of when that happens? Inode::UpdateNodeFromDisk() does not check the return code of trying to get the node data (which is a bug, but an easily fixed one).

However, the underlying get_cached_block() has only a few reasons to return NULL (as indicated by your screenshot):

  1. Allocating the memory for that block failed (line 1879).
  2. Reading the block from disk failed (line 1913).
  3. An obviously invalid block number had been requested (line 1865).

The last item ends in a panic, so you probably would have noticed. This leaves 1. and 2. -- the second item leaves a note in the syslog, while the first one doesn't, so that would help to differentiate between the two. Also, if the first one happens, the syslog should mention that it's low on memory.

Item two would either hint towards an interrupt or driver issue, or even broken hardware.

So even if we fix the bug in BFS (which we should), the problem might just choose a different outcome for you.
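
The kind of fix described in this comment, checking the result of the block lookup instead of using it blindly, can be illustrated with a small self-contained C sketch. This is not the BFS code: fetch_block(), update_node_from_block(), struct node and the status values are hypothetical stand-ins, and the stand-in simply returns NULL for a bad block number where the real function is said to panic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { BLOCK_SIZE = 16, BLOCK_COUNT = 4 };
enum { STATUS_OK = 0, STATUS_IO_ERROR = -1 };

struct node {
    unsigned char data[BLOCK_SIZE];
};

/* Hypothetical stand-in for a block cache lookup. Returns NULL when the
 * block number is out of range or when the allocation fails. */
void *fetch_block(long block_number)
{
    if (block_number < 0 || block_number >= BLOCK_COUNT)
        return NULL;
    return calloc(1, BLOCK_SIZE);
}

/* Copy a block into the node, reporting a failed lookup to the caller
 * instead of dereferencing a NULL pointer. */
int update_node_from_block(struct node *node, long block_number)
{
    assert(node != NULL);
    void *block = fetch_block(block_number);
    if (block == NULL)
        return STATUS_IO_ERROR;
    memcpy(node->data, block, BLOCK_SIZE);
    free(block);
    return STATUS_OK;
}
```

With the check in place the caller sees an error code rather than a crash, which matches the point above: the underlying failure (low memory or a failed read) still happens and simply surfaces differently.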


comment:3 by AGMS, 8 years ago

It's quite likely that it's running out of memory, since we often see other programs (like SoundPlay) consuming hundreds of megabytes (there's a logging program monitoring the other ones, once per second, kind of annoying that it triggers the crash when writing the log). I'll see if I can get a syslog the next time it crashes, though if the file system crashes while writing, that may not work for detecting the error :-)

Is there an API for getting the available and free memory totals? It would be useful to see if it's getting close shortly before the crash. The kernel system info just has a page count, which isn't actually memory used, I assume. I have been counting up the sizes of the areas belonging to particular teams to find the memory used by a program; I guess I could do that for all teams. I wonder if that's what ProcessController does...
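
The area-counting approach mentioned here can be sketched in portable C as follows. struct area_record and both helpers are hypothetical; on Haiku the records would presumably be filled from the kernel's area enumeration (get_next_area_info()) and the page counts from get_system_info(), but that wiring is an assumption and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one memory area; not a real Haiku struct. */
struct area_record {
    int32_t team;   /* id of the owning team */
    uint64_t size;  /* size of the area in bytes */
};

/* Sum the sizes of all areas owned by the given team. */
uint64_t team_memory_bytes(const struct area_record *areas, size_t count,
                           int32_t team)
{
    assert(areas != NULL || count == 0);
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        if (areas[i].team == team)
            total += areas[i].size;
    }
    return total;
}

/* Convert a page count into bytes for a given page size. */
uint64_t pages_to_bytes(uint64_t pages, uint64_t page_size)
{
    return pages * page_size;
}
```

Note that summing area sizes is only an estimate: an area's size is its address range, so shared or not fully committed areas may make the total larger than the memory actually in use.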

comment:4 by axeld, 7 years ago

Resolution: fixed
Status: new → closed

BFS bug fixed in hrev50820 -- let's see what comes next :-)

comment:5 by AGMS, 7 years ago (in reply to: 4)

Replying to axeld:

BFS bug fixed in hrev50820 -- let's see what comes next :-)

Thanks Axel. That could indeed be related to the kernel team running out of memory. With further system monitoring (now periodically listing the top memory users to a log file) we're seeing exhaustion of kernel memory as a major cause of crashes in long-duration runs (the kernel team grows to 1.5GB then bam!).
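
A monitor that periodically lists the top memory users to a log file, as described above, might be structured like this C sketch. struct team_usage and log_top_users() are hypothetical names, and how the per-team byte counts are collected is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-team usage sample. */
struct team_usage {
    const char *name;
    uint64_t bytes;
};

/* qsort() comparator: largest consumer first. */
static int by_bytes_descending(const void *a, const void *b)
{
    const struct team_usage *x = a;
    const struct team_usage *y = b;
    if (x->bytes < y->bytes)
        return 1;
    if (x->bytes > y->bytes)
        return -1;
    return 0;
}

/* Sort the samples in place, write up to top_n of the largest to the log,
 * flush it, and return the number of lines written. */
size_t log_top_users(FILE *log, struct team_usage *teams, size_t count,
                     size_t top_n)
{
    assert(log != NULL && teams != NULL);
    qsort(teams, count, sizeof(teams[0]), by_bytes_descending);
    size_t shown = count < top_n ? count : top_n;
    for (size_t i = 0; i < shown; i++) {
        fprintf(log, "%s %llu\n", teams[i].name,
                (unsigned long long)teams[i].bytes);
    }
    fflush(log);
    return shown;
}
```

Logging a timestamp with each batch would make it easier to see how quickly one team's total grows between samples.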

comment:6 by axeld, 7 years ago

That sounds like there is a leak to be found. It would be helpful to know what that memory is being wasted on.

comment:7 by AGMS, 7 years ago

Leak finding is on the to-do list. The plan is to make a dummy device driver that just iterates over the memory areas and dumps them to a file. Then see what's in those gigabytes (audio, bitmaps, disk sectors?).
