Opened 8 years ago

Closed 7 months ago

Last modified 6 months ago

#5262 closed bug (fixed)

PANIC: vnode 3:7895866 already exists (node = 0xe0044000, vnode->node = 0xe008bbf4)!

Reported by: Tim Howe Owned by: axeld
Priority: critical Milestone: R1/beta1
Component: File Systems/BFS Version: R1/Development
Keywords: Cc: degea@…
Blocked By: Blocking: #3685, #5753, #7865, #8989, #9123, #9730, #13051
Has a Patch: no Platform: x86

Description

KDL while running svn checkout on hrev35025

Picture of bt is attached. Looks like a possible BFS bug(?)

Attachments (4)

kdl1_r35025.jpg (486.4 KB) - added by Tim Howe 8 years ago.
KDL of crash running svn
vnode_already_exist_r35293.txt (6.2 KB) - added by Michael Franz 8 years ago.
stack trace from hrev35293
vfs.patch (889 bytes) - added by Alastair Poole 5 years ago.
DUPLICATE, Attempt Fix 5753.
duplicate_removed_busy_freed.jpeg (454.6 KB) - added by Michael Lotz 4 years ago.
KDL screenshot showing the existing vnode to be removed and busy and the bfs_inode behind it to be freed.


Change History (41)

Changed 8 years ago by Tim Howe

Attachment: kdl1_r35025.jpg added

KDL of crash running svn

Changed 8 years ago by Michael Franz

Attachment: vnode_already_exist_r35293.txt added

stack trace from hrev35293

comment:1 Changed 8 years ago by axeld

Component: File Systems → File Systems/BFS
Owner: changed from Nobody to axeld

Could be either BFS or kernel, I'll set it to BFS for now.

comment:2 Changed 7 years ago by X512

Still present in hrev39156. KDL while WebPositive was loading a page, with the same backtrace.

Last edited 7 years ago by X512

comment:3 Changed 6 years ago by ttcoder

Still present in hrev41843 as well. Got a KDL while browsing in W+ and launching BeShare. Tried to 'continue' out of KDL (sometimes works) but it just escalated into more panics/failed asserts. So I did a 'reboot', then did the exact same thing (launching W+/BeShare) with the same result. At the next reboot, the situation was different: the panic would occur just as the app_server started, making the system unbootable. After a few minutes of experimenting, I found that the "Safe Mode" checkbox in the boot menu would get me back into Haiku again. At this point I opened a Terminal and did this:

~> checkfs /boot/
MediaFiles (inode = 2639785), some blocks weren't allocated
multi_audio_settings (inode = 2639784), some blocks weren't allocated
syslog (inode = 2097298), some blocks weren't allocated
syslog.old (inode = 2097362), some blocks weren't allocated, has blocks already set
sshd.pid (inode = 2097299), some blocks weren't allocated
        42173 nodes checked,
        46 blocks not allocated,
        29 blocks already set,
        17 blocks could be freed

        files           38907
        directories     2756
        attributes      329
        attr. dirs      155
        indices         26
~> 

Then I rebooted sans Safe Mode, and I did get the full-fledged Haiku this time, including network, allowing me to post this.

BFS is still cranky though:

~> checkfs /boot/
DriveSetup (inode = 2107402), some blocks weren't allocated
syslog.old (inode = 2097298), some blocks weren't allocated
        42174 nodes checked,
        29 blocks not allocated,
        0 blocks already set,
        0 blocks could be freed

        files           38908
        directories     2756
        attributes      329
        attr. dirs      155
        indices         26
~>

I never get to the point of having a "100% clean" status...

I expect this will happen again, since I've had it for the last few days. So now I use everything in a more 'relaxed' way (no multiple pages in W+, etc.) so as to steer clear of this panic as much as possible.

comment:4 Changed 6 years ago by diver

Try this: checkfs -c /boot/.
Ouch, it seems that -c stands for "check only" and not for "correct" as I thought. Never mind then.

Last edited 6 years ago by diver

comment:5 Changed 6 years ago by ttcoder

Cc: degea@… added

comment:6 in reply to:  4 Changed 6 years ago by Nickos V

Still present in hrev41843 as well. Got a KDL while browsing in W+ and launching BeShare.

There were BFS issues a while back, but I think they got worked out. Are you only upgrading that partition? Please re-initialize it with DriveSetup and see if that fixes things. Make sure to boot off a CD or another Haiku partition.

checkfs does a good job at repair but cannot always fix everything. You still have filesystem issues that will only get fixed with a re-initialize (DriveSetup).

DriveSetup (inode = 2107402), some blocks weren't allocated
syslog.old (inode = 2097298), some blocks weren't allocated
29 blocks not allocated,

If you continue to use that partition it will stay corrupt, because checkfs is unable to fully fix it, and it will only continue to cause filesystem errors. Re-initialize the partition to fix it, then test with checkfs. When checkfs says OK you can test with BeShare & WebPositive to see what happens.

comment:7 Changed 6 years ago by ttcoder

Thanks for the feedback; some more input from me, in order of relevance: 1) I've found a sure-fire way to corrupt the FS! 2) checkfs seems to have restored stuff, oddly.

And 3) I'm definitely looking forward to upgrading and doing an Initialize on that partition :-) .. Waiting for RAW images to be available again, so that I can upgrade to my spare Haiku partition "for free" (ISOs are for burning to a CD but I want to do this on the cheap *g*) (and VMWare and anyboot images apparently cannot be double-clicked and mounted).

Here's the odd "clean bill of health" I got after a few iterations over the last few days (I have a very solid and enjoyable system these days, so long as I remember to skip 1).. ):

~> checkfs /boot
        44547 nodes checked,
        0 blocks not allocated,
        0 blocks already set,
        0 blocks could be freed

        files           40738
        directories     3261
        attributes      349
        attr. dirs      173
        indices         26

Not sure if this is reliable or if there could be nasties hidden in an inode somewhere though.

As to 1), I'll file a proper ticket after confirming (or maybe update this ticket, not sure whether the symptoms warrant a new ticket or not), but I'm 99% sure of the following:

  • if I use the deskbar's Shutdown/Power off menu item, Haiku shuts down without doing a BFS 'sync' (or whatever it is called), resulting in trouble at next cold start: at a minimum I get checkfs shouting about TrackerSettings, tracker_shelf ..etc (i.e. the kind of things that are written to disk when you shutdown) but most times I get dropped to KDL with the 'vnode PANIC', and it takes 2 "continue" to resume bootup.
  • if I use "Restart System" and THEN press the off switch on the tower (at e.g. BIOS POST time) then at next boot Haiku is 100% good.

I guess this occurs only on my PC; others would have caught this thing before if their systems failed to "sync" at shutdown. So gotta confirm first.

Also noted that several fixes to e.g. the VM system have been done since 41843, I'm eager to upgrade to them... (will keep watching haiku-files...)

P.S. This Haiku partition (and the spare one) were initialized before installing Haiku. I did play with older builds of Haiku (396xx), but those were on a different hard drive, not this one. This one is a fresh 41843 install and started misbehaving very quickly. So confirming my theory about shutdown/sync is the priority, I guess.

Last edited 6 years ago by ttcoder

comment:8 in reply to:  7 Changed 6 years ago by taos

Replying to ttcoder:

[...] Waiting for RAW images to be available again [...]

Alpha 3 raw images might not be advertised on Get Haiku! but can usually be found on the download servers (as haiku-r1alpha3-image.zip or haiku-r1alpha3-image.tar.xz):-)

comment:9 Changed 6 years ago by Rene Gollent

Replying to ttcoder:

I guess this occurs only on my PC; others would have caught this thing before if their systems failed to "sync" at shutdown. So gotta confirm first.

I shut down via that power off item all the time and I haven't had any issues on either of my Haiku partitions (and they see a considerable amount of filesystem activity since they're often doing full OS builds). One possibility that comes to mind is that there are some hard disks out there which lie about when they've completed operations in order to look better on benchmarks, but those are thankfully uncommon.

comment:10 Changed 6 years ago by Nickos V

if I use the deskbar's Shutdown/Power off menu item, Haiku shuts down without doing a BFS 'sync' (or whatever it is called), resulting in trouble at next cold start

Not a BFS issue for you. A shutdown disk-sync issue, which should go into another ticket. Initializing should fix the filesystem for you and is worth trying, but if it is a sync issue then there's a good chance you will have corruption happening all the time. If you make another ticket then maybe it will get fixed, but in this ticket that's not likely to happen.

Something else that may be affecting you: Seagate drives had faulty firmware in certain models. A firmware upgrade for your drive is a very good idea whether you have a Seagate or another brand.

http://en.wikipedia.org/wiki/Seagate_Barracuda#7200.11

Try a disk firmware upgrade & re-initializing the partition, and if those do not work then look at filing a disk-shutdown syncing ticket.

comment:11 in reply to:  7 Changed 6 years ago by Michael Lotz

Replying to ttcoder:

As to 1), I'll file a proper ticket after confirming (or maybe update this ticket, not sure whether the symptoms warrant a new ticket or not), but I'm 99% sure of the following:

  • if I use the deskbar's Shutdown/Power off menu item, Haiku shuts down without doing a BFS 'sync' (or whatever it is called), resulting in trouble at next cold start: at a minimum I get checkfs shouting about TrackerSettings, tracker_shelf ..etc (i.e. the kind of things that are written to disk when you shutdown) but most times I get dropped to KDL with the 'vnode PANIC', and it takes 2 "continue" to resume bootup.
  • if I use "Restart System" and THEN press the off switch on the tower (at e.g. BIOS POST time) then at next boot Haiku is 100% good.

As mentioned above this is likely related to the specific hardware. It sounds a lot like a disk write-cache that is not flushed and lost when the power turns off, while it still manages to fully flush when rebooting and then turning off a bit later in the BIOS screen. I'm not sure how this flush can be enforced, but I faintly remember seeing some SCSI/ATA flags that can be set to enforce write synchronization that could help here (and might even already be employed when doing a sync).

Another thing that probably doesn't help in this case is that the sync operation that takes place on shutdown (the "Tidying things up a bit" part in the shutdown status window) is run at the very beginning, i.e. while all applications are still running. So apps that store their settings on quit (like Tracker) don't really get their data into this sync. Unmounting the FS still triggers syncing of course; it just makes it more likely that the sync happens "too late" in this case.

comment:12 Changed 5 years ago by Alastair Poole

Duplicate of #5753 bug. Issue is in vfs.cpp.

Last edited 5 years ago by axeld

Changed 5 years ago by Alastair Poole

Attachment: vfs.patch added

DUPLICATE, Attempt Fix 5753.

comment:13 Changed 5 years ago by Alastair Poole

Has a Patch: set

comment:14 Changed 5 years ago by axeld

Blocking: 5753 added

(In #5753) Obviously a duplicate of #5262 indeed.

comment:15 Changed 5 years ago by axeld

The only reason for the panic is that this is a situation that should not be able to occur at all. I don't see what your patch accomplishes besides hiding the actual bug.

comment:16 Changed 5 years ago by axeld

Has a Patch: unset

comment:17 Changed 5 years ago by axeld

Blocking: 8989 added

(In #8989) It's the ever reoccurring "vnode already exists" bug, like #5262.

comment:18 Changed 5 years ago by Rene Gollent

Blocking: 9123 added

comment:19 Changed 4 years ago by Rene Gollent

Blocking: 9730 added

(In #9730) Duplicate of #5262.

comment:20 Changed 4 years ago by Mike Chambers

I opened ticket 9730, which was closed as a duplicate of this one. This is consistently reproducible on a Pentium 3 box I put R1 alpha 4.1 on, by running "installoptionalpackage sdllibs" in the terminal. It happens after the download, during the unzip operation.

It only has a 10 GB drive, so I imaged it over my LAN from a Linux live CD to my Win 7 system. When I boot the image in QEMU, it is still consistently reproducible. If it would help, I will put the image on my web server for devs to download.

There is less than 1 GB of space used in the file system, so 7zip's ultra setting should get it down to a reasonable size. Let me know if it would be useful. I'll keep the image around for a while.

Screenshot from QEMU: http://i.imgur.com/F0z0GoF.png

Last edited 4 years ago by Mike Chambers

comment:21 Changed 4 years ago by Urias McCullough

Does the issue tracked down by bonefish in ticket:3685#comment:3 sound related to this problem?

comment:22 Changed 4 years ago by Ingo Weinhold

Blocking: 3685 added

(In #3685) Closing this ticket as duplicate of #5262. The situation analyzed by me is tracked in #9839. Whether they are related has not been understood yet.

comment:23 Changed 4 years ago by Michael Lotz

This one happened to me again last weekend while compiling GCC4 with -j8 as a dependency of a port. I've taken a couple of pictures of the KDL session, not sure if they are really relevant. I'm attaching the one where it can be seen that the vnode that already existed has the "removed" and "busy" flags set and that the bfs_inode it references has been freed. I also have pictures with backtraces of the other running threads and their io_contexts and the bfs_inode output for the duplicate vnode, just in case. I tried to follow the code somewhat but as of yet haven't found anything, pointers welcome.

Changed 4 years ago by Michael Lotz

KDL screenshot showing the existing vnode to be removed and busy and the bfs_inode behind it to be freed.

comment:24 Changed 4 years ago by Ingo Weinhold

AFAICT a BFS Inode is deleted only in bfs_get_vnode() (error case and irrelevant here), in bfs_put_vnode(), and in bfs_remove_vnode(). From the VFS point of view which of the latter two is called doesn't really matter. Given that the "removed" flag is set it should be bfs_remove_vnode(). Both are called only from free_vnode(), and bfs_get_vnode() potentially also in get_vnode() in case of an error creating a special subnode. Since I doubt that GCC's build system creates a FIFO or socket file, I think it is safe to assume that free_vnode() has been called. Given that the vnode structure itself has not been overwritten with 0xdeadbeef yet, I would assume that free_vnode() has not finished yet.

The problem is that when create_new_vnode_and_lock() finds a preexisting vnode, it doesn't check its "busy" flag. So it may return a vnode that is doomed and no longer exists in the FS. Just like get_vnode(), it should check the busy flag and wait for the vnode to become unbusy or disappear.

That change would also fix #9839 (new_vnode() and publish_vnode() use create_new_vnode_and_lock()), but as mentioned there, it would also introduce a potential deadlock.

comment:25 Changed 2 years ago by waddlesplash

Daoick (from IRC) has this bug, and says it happens about twice per day for him. (He can't comment here due to #12030). He's got a Haiku install on real hardware, a Lenovo laptop with 8GB of RAM and a SSD. He says Linux works very reliably on it.

He says that this happens when the system is under heavy load (e.g. compiling GCC), and afterwards, the partition is corrupted and Haiku just endlessly reboots trying to boot from it, and that checkfs is totally useless in attempting to repair the partition.

My vote is to elevate this to a blocker -- at least for R1 anyway.

comment:26 Changed 2 years ago by Urias McCullough

I dunno, that sounds like our usual troll to me.

comment:27 Changed 2 years ago by axeld

At least he hasn't changed his story much. I'd say just leave the ticket here. I haven't seen this bug for years now, and I do quite some stuff with Haiku.

comment:28 Changed 2 years ago by Rene Gollent

Just for the record, I did see it myself once a few weeks ago, though it was the first time in years.

comment:29 Changed 2 years ago by axeld

The bug is still there, no question about that; bonefish investigated it nicely. However, I question that he manages to reproduce it that often, even twice a day. Ever asked yourself what he could be doing so important that he goes through all the hassle of installing Haiku twice a day? :-)

comment:30 Changed 2 years ago by Rene Gollent

Not disagreeing about the troll, just misinterpreted your comment about never seeing it as possibly implying it had already been fixed.

comment:31 Changed 12 months ago by korli

Blocking: 13051 added

comment:32 Changed 10 months ago by PulkoMandy

Blocking: 7865 added

comment:33 Changed 10 months ago by PulkoMandy

Milestone: R1 → R1/beta1

I'm bumping this to beta1 because it is rather easy to trigger on the package build slaves, often in the mktemp family of APIs. Either bash or gcc creating a temp file while building something will panic the machine. This is a problem when we need reliable package-builder machines; it is hard to recover them from KDL remotely.

comment:34 Changed 7 months ago by axeld

Status: new → in-progress

comment:35 Changed 7 months ago by axeld

Please try again with hrev51061.

comment:36 Changed 7 months ago by PulkoMandy

Resolution: fixed
Status: in-progress → closed

For the first time tonight, my build slave got through a complete build of the release branch without hitting this even once (there were other problems, but it's getting better). Thanks for working on this!

comment:37 Changed 6 months ago by axeld

Great, thanks for the update!

Note: See TracTickets for help on using tickets.