Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#3010 closed bug (fixed)

A file is routinely corrupt after shutdown

Reported by: donn Owned by: axeld
Priority: blocker Milestone: R1/alpha1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: x86

Description

I have a file that will be corrupt 9 times out of 10, after reboot.

It's a 200 KB application cache that may be written several times during a session. I copy the file before I shut down. On the next boot, the copy is fine, but the original is usually corrupt. The invalid data consists of random blocks of data from other applications, commonly Firefox, and the boundary between valid and invalid data always falls on a multiple of 4096 bytes.
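
To pin down which parts of the file go bad, the saved copy can be compared with the original one 4096-byte block at a time. The following is only a minimal sketch of such a check; the function names, the file names in `demo()`, and the synthetic data are hypothetical and not taken from the report.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # corruption boundaries reported in this ticket


def differing_blocks(good_path, suspect_path, block_size=BLOCK_SIZE):
    """Return the indices of block_size-sized blocks that differ between two files."""
    bad = []
    with open(good_path, "rb") as good, open(suspect_path, "rb") as suspect:
        index = 0
        while True:
            a = good.read(block_size)
            b = suspect.read(block_size)
            if not a and not b:
                break  # both files exhausted
            if a != b:
                bad.append(index)
            index += 1
    return bad


def demo():
    """Build a clean file and a copy with blocks 1 and 3 overwritten, then compare."""
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "INBOX.copy")   # hypothetical names
        suspect = os.path.join(tmp, "INBOX")
        data = bytearray(b"A" * (5 * BLOCK_SIZE))
        with open(good, "wb") as fp:
            fp.write(data)
        # Simulate foreign data landing on whole blocks.
        for block in (1, 3):
            start = block * BLOCK_SIZE
            data[start:start + BLOCK_SIZE] = b"Z" * BLOCK_SIZE
        with open(suspect, "wb") as fp:
            fp.write(data)
        return differing_blocks(good, suspect)
```

If the damage always covers whole blocks, that suggests the problem lies below the application, at the level where data is handled in 4 KB units.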

I'm at hrev28263, but it has been going on for a while. I thought this might be interesting because it's so repeatable (on my computer), and so weirdly specific to this file. I would be happy to add logging or gather other information that might be useful.

Change History (13)

comment:1 Changed 10 years ago by axeld

Component: System → System/Kernel
Milestone: R1 → R1/alpha1
Priority: normal → blocker

Is it on emulation or real hardware? What emulator/hardware do you have (i.e. SATA/PATA/USB/amount of RAM/etc.)?

What application and file are you talking about? Maybe it's reproducible somewhere else, too.

The 4K blocks would point to a problem in the I/O scheduler or file cache.

comment:2 in reply to:  1 Changed 10 years ago by donn

Replying to axeld:

Is it on emulation or real hardware? What emulator/hardware do you have (i.e. SATA/PATA/USB/amount of RAM/etc.)?

Hardware: http://dev.haiku-os.org/attachment/ticket/2600/listdev. NForce4 controllers, PATA & SATA drives; Haiku is on PATA, as it doesn't work with SATA. 2 GB RAM.

What application and file are you talking about? Maybe it's reproducible somewhere else, too.

It's my email - a Python demo that comes with Bethon, that does IMAP. The file is a folder ID and a dict of IMAP 'UID' vs. RFC822 header values, written with open('/boot/home/config/settings/pynr/rc/isp/INBOX', 'w'), cPickle.dump, cPickle.dump, fp.close(). cPickle.dump is C fwrite.
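
The write pattern described above (open, two pickle dumps, close) can be sketched roughly as follows. This is an approximation, not the Bethon demo's actual code: it uses Python 3's `pickle` in place of `cPickle`, opens the file in binary mode, and the names `write_cache`, `read_cache`, `folder_id` and `headers` as well as the sample data are made up for illustration.

```python
import os
import pickle
import tempfile


def write_cache(path, folder_id, headers):
    """Write the folder ID and the UID -> header dict as two consecutive pickles."""
    fp = open(path, "wb")
    try:
        pickle.dump(folder_id, fp)
        pickle.dump(headers, fp)
    finally:
        fp.close()  # no explicit flush or fsync, as in the description


def read_cache(path):
    """Read the two pickles back in the order they were written."""
    with open(path, "rb") as fp:
        folder_id = pickle.load(fp)
        headers = pickle.load(fp)
    return folder_id, headers


def demo():
    """Round-trip some invented sample data through a temporary cache file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "INBOX")
        write_cache(path, 42, {1001: {"Subject": "hello"}})
        return read_cache(path)
```

Since the file is only closed and never explicitly synced, its contents presumably sit in the file cache until the system writes them back, which would fit a problem that only shows up after shutdown.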

comment:3 Changed 10 years ago by anevilyak

Interestingly, I appear to have a very similar motherboard to yours (CK804/Athlon64), same SATA controller vendor / device ID, and booting via SATA works here. Different audio and graphics chipset though.

comment:4 Changed 10 years ago by donn

Thanks for mentioning, that's interesting - SATA was unusable this summer, but I guess that's why we pay you guys the big bucks, it's working now. I copied all files over and booted from the SATA partition, and ... half a dozen reboots, no corrupt file.

So it could be in some way related to the ide bus, though not proven. I should dd the old filesystem over to the SATA drive intact, if we're interested in this hypothesis.

comment:5 Changed 10 years ago by axeld

Can you please do a "checkfs" (available with current revisions) on your old drive? If no fatal errors show up, it's likely the bug that I'm currently looking into. If that is the case, please keep your reproducible test case somewhere, so that you can confirm it being fixed once I have found it :-)

comment:6 Changed 10 years ago by donn

Well ... chkbfs -n reported a lot of errors, many of them "blocks already set" involving that cache file. But they were old cache files, from before the directory hierarchy rename and copy just before I filed this problem report. I deleted as many of those files as I could (some directories were too ruined to delete), ran chkbfs -e which also fixed a block bitmap error, and now there are only a couple of errors: a directory that used to be in the path to the cache file weeks ago, which has blocks already set and a file in that directory that can't be opened. The summary says

Sorry, can't fix the "blocks already set" error - two files are claiming the same space
Errors have been fixed
(and exits 0 == success.)

So, you will have to decide how to interpret that, but

  • the errors aren't related to the cache file, and
  • I still get the same file corruption on reboot. I tried it after all this chkbfs work, and got a corrupt file on the second reboot cycle.

comment:7 Changed 10 years ago by anevilyak

While not consistent, it seems this behavior manifests itself on SATA too. I just noticed a few of the .svn files in my haiku checkout on haiku are corrupt, also in 4KB increments. Unfortunately I don't seem to have as reliable a testcase as Donn though, this is actually the first time I've noticed this happen here.

comment:8 Changed 10 years ago by anevilyak

checkfs -c indicates the filesystem itself isn't corrupted by the way, so it seems that at least (so far) this has only affected actual file data, and not the inodes.

comment:9 Changed 10 years ago by axeld

I've tracked down and fixed a pretty bad interaction between the file and the block cache in hrev28517 - freed blocks could be written back after they were claimed by the file cache.

This should also fix this bug, please retry.
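
The kind of interaction described above can be illustrated with a toy model. This is not Haiku kernel code: the two caches are reduced to plain dictionaries, the block number is arbitrary, and the `discard_on_free` switch is just one plausible remedy within this model, not necessarily what hrev28517 does.

```python
def simulate(discard_on_free):
    """Toy model: a stale dirty block is flushed over freshly written file data."""
    disk = {}               # block number -> bytes on "disk"
    block_cache_dirty = {}  # block cache: writes not yet flushed

    # 1. The old owner of block 7 modifies it through the block cache.
    block_cache_dirty[7] = b"stale"

    # 2. Block 7 is freed. Without the remedy, the dirty copy stays queued.
    if discard_on_free:
        block_cache_dirty.pop(7, None)

    # 3. The file cache claims block 7 for a new file and writes its data.
    disk[7] = b"fresh"

    # 4. The block cache flushes its dirty blocks later on.
    for number, data in block_cache_dirty.items():
        disk[number] = data
    block_cache_dirty.clear()

    return disk[7]
```

In this model, `simulate(False)` leaves the stale data on disk in a block that now belongs to the new file, while `simulate(True)` keeps the file's own data. That matches the symptom of whole 4 KB blocks containing data from other applications.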

BTW: the problem with "blocks already set" is that you don't know the second file when that happens (as two files are likely corrupted due to this). You can use "bfswhich" of the BFS-tools to find out about that, though.

In any case, it shouldn't affect new files.

comment:10 in reply to:  9 Changed 10 years ago by donn

Good news: it does fix the bug!

Bad news: now vim causes a "vnode exists" error and kernel debugger visit, when it tries to save ".viminfo.tmp". Highlights of the trace:

panic
new_vnode(...)
...
bfs_create(...)
create_vnode(... ".viminfo.tmp"...)
file_create(-1, "/boot", ...)

comment:11 Changed 10 years ago by axeld

Resolution: fixed
Status: new → closed

Okay, thanks for the update! Can you please open a new bug report (or reopen an older one ;-)) for the other error, and explain how to reproduce this, and if it's always reproducible?

comment:12 Changed 10 years ago by donn

OK. The problem seems to be peculiar to that file, and now that it's gone I haven't seen any other problems.

comment:13 Changed 10 years ago by donn

I'm sorry to report that I have had a couple more of these corrupted cache files - two several days ago, none since. Like the last time, they happened after the parent directory was damaged, and the corrupted file has "blocks already set".

The directory damage shows up as "reading directory xxx: Bad data", and in checkfs as blocks already set and "names don't match", and "could not be opened" for contents.

This is all on the old IDE filesystem.
