#12962 closed bug (fixed)
fflush Crash in BFS
Reported by: | AGMS | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | File Systems/BFS | Version: | R1/Development |
Keywords: | fflush | Cc: | agmsmith@… |
Blocked By: | | Blocking: | |
Platform: | All |
Description
We're having trouble with a KDL crash in fflush().
Unfortunately it's hard to reproduce without waiting days. On that time scale, though, it happens often, and it is the same crash again and again. It's in Inode::TransactionDone() / Transaction::NotifyListeners(), when an fflush() seems to trigger a write (at least it's calling down through the layers of file system write functions).
The application code logs to text files using a redirected stdout and stderr (2 separate files), with fflush(stdout) called once a second, even if no new data has been output. Seen somewhere around hrev50163.
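For anyone trying to reproduce this, the logging pattern described above could be sketched roughly as follows. This is not the actual application code; the function names `redirect_to_log` and `flush_periodically` are made up for illustration, and only standard C/C++ library calls are used.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical helper: redirect a standard stream (stdout or stderr) to a
// log file opened in append mode. Returns false if freopen() fails.
bool redirect_to_log(std::FILE* stream, const char* path)
{
    return std::freopen(path, "a", stream) != nullptr;
}

// Hypothetical helper: flush `stream` once per interval for `ticks`
// iterations, whether or not anything new was written in between.
// Returns the number of fflush() calls that reported an error.
int flush_periodically(std::FILE* stream, int ticks,
    std::chrono::milliseconds interval)
{
    int failures = 0;
    for (int i = 0; i < ticks; ++i) {
        if (std::fflush(stream) != 0)
            ++failures;
        std::this_thread::sleep_for(interval);
    }
    return failures;
}
```

A reproduction attempt would call `redirect_to_log(stdout, ...)` and `redirect_to_log(stderr, ...)` with two different paths on a BFS volume, then run `flush_periodically(stdout, ticks, std::chrono::milliseconds(1000))` for a few days' worth of ticks.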
Attachments (1)
Change History (8)
by , 8 years ago
Attachment: | FFlushKDL.jpg added |
---|
comment:1 by , 8 years ago
If anyone else has seen this ever, please do comment here so we can keep track of how much of an issue this is.
comment:2 by , 8 years ago
Is there any way you could retrieve a syslog from when that happens? Inode::UpdateNodeFromDisk() does not check the result of trying to get the node data (which is a bug, but an easily fixed one).
However, the underlying get_cached_block() only has a few reasons it would return NULL (as indicated by your screenshot):
- Allocating the memory for that block failed (line 1879).
- Reading the block from disk failed (line 1913).
- An obviously invalid block number had been requested (line 1865).
The last item ends in a panic, so you probably would have noticed. This leaves items 1 and 2 -- the second leaves a note in the syslog, while the first doesn't, so the syslog would help to differentiate between the two. Also, if the first one happens, the syslog should mention that the system is low on memory.
Item 2 would hint at either an interrupt or driver issue, or even broken hardware.
So even if we fix the bug in BFS (which we should), the problem might just show up in a different way for you.
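To illustrate the kind of missing check meant here, a self-contained sketch follows. It is not the BFS code and not the eventual fix: `FakeBlockCache`, `FakeInode`, and the `Status` values are hypothetical stand-ins, and the invalid block number case simply returns an error here rather than panicking.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical status codes; not the Haiku status_t constants.
enum Status { kOk = 0, kNoMemory, kIoError, kBadValue };

// Hypothetical stand-in for a block cache that can fail in three ways.
struct FakeBlockCache {
    std::vector<std::array<std::uint8_t, 16>> blocks;
    bool simulateAllocFailure = false;
    bool simulateReadFailure = false;

    // Returns nullptr on failure and stores the reason in `error`.
    const std::uint8_t* Get(std::uint64_t blockNumber, Status& error) const
    {
        if (blockNumber >= blocks.size()) {
            error = kBadValue;      // obviously invalid block number
            return nullptr;
        }
        if (simulateAllocFailure) {
            error = kNoMemory;      // allocating memory for the block failed
            return nullptr;
        }
        if (simulateReadFailure) {
            error = kIoError;       // reading the block from disk failed
            return nullptr;
        }
        error = kOk;
        return blocks[blockNumber].data();
    }
};

// Hypothetical stand-in for an inode that caches a copy of its on-disk node.
struct FakeInode {
    std::uint64_t blockNumber = 0;
    std::uint8_t node[16] = {};

    Status UpdateNodeFromDisk(const FakeBlockCache& cache)
    {
        Status error = kOk;
        const std::uint8_t* data = cache.Get(blockNumber, error);
        if (data == nullptr)
            return error;           // propagate instead of dereferencing NULL
        std::memcpy(node, data, sizeof(node));
        return kOk;
    }
};
```

With a check like this the caller gets an error back and can abort the operation, which is why the underlying cause (low memory or a failed read) would then show up some other way.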
comment:3 by , 8 years ago
It's quite likely that it's running out of memory, since we often see other programs (like SoundPlay) consuming hundreds of megabytes. There's a logging program that monitors the other ones once per second; it's kind of annoying that it is the one triggering the crash when writing its log. I'll see if I can get a syslog the next time it crashes, though if the file system crashes while writing, that may not work for detecting the error :-)
Is there an API for getting the total and free memory? It would be useful to see if free memory is getting low shortly before the crash. The kernel system info just has a page count, which isn't actually memory used, I assume. I have previously summed the sizes of the areas belonging to a particular team to find the memory used by a program; I guess I could do that for all teams. I wonder if that's what ProcessController does...
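The summing-areas approach could look roughly like the sketch below. The `AreaRecord` struct and `memory_by_team` function are hypothetical names for illustration; on Haiku the records would presumably be gathered by walking each team's areas (for example with get_next_area_info()), which is left out here so the sketch stays portable.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical record describing one memory area: owning team and size.
struct AreaRecord {
    std::int32_t team;
    std::size_t size;   // bytes
};

typedef std::pair<std::int32_t, std::size_t> TeamTotal;

// Sum the area sizes per team and return (team, bytes) pairs sorted with
// the largest memory user first.
std::vector<TeamTotal> memory_by_team(const std::vector<AreaRecord>& areas)
{
    std::map<std::int32_t, std::size_t> totals;
    for (const AreaRecord& area : areas)
        totals[area.team] += area.size;

    std::vector<TeamTotal> sorted(totals.begin(), totals.end());
    std::stable_sort(sorted.begin(), sorted.end(),
        [](const TeamTotal& a, const TeamTotal& b) {
            return a.second > b.second;
        });
    return sorted;
}
```

Note that area sizes count reserved address space, so the totals are an upper bound on what a team really uses rather than an exact figure.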
follow-up: 5 comment:4 by , 8 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
BFS bug fixed in hrev50820 -- let's see what comes next :-)
comment:5 by , 8 years ago
Replying to axeld:
BFS bug fixed in hrev50820 -- let's see what comes next :-)
Thanks Axel. That could indeed be related to the kernel team running out of memory. With further system monitoring (now periodically listing the top memory users to a log file) we're seeing the kernel running out of memory as a major cause of crashes after long uptimes (the kernel team grows to 1.5GB, then bam!).
comment:6 by , 8 years ago
That sounds like there is a leak to be found. It would be helpful to know where that memory is being wasted.
comment:7 by , 8 years ago
Leak finding is on the to-do list. The plan is to make a dummy device driver that just iterates over the memory areas and dumps them to a file. Then see what's in those gigabytes (audio, bitmaps, disk sectors?).
Stack backtrace of fflush crash.