Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#8408 closed bug (fixed)

Invalid b+tree after installing

Reported by: humdinger Owned by: axeld
Priority: blocker Milestone: R1/alpha4
Component: File Systems/BFS Version: R1/Development
Keywords: invalid b+tree Cc: art@…
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

This is hrev43873 gcc2hybrid.

Checkfs is revealing "invalid b+tree" failures. Even a completely new installation has them. I put a new nightly on the stick and used Installer to put it all on freshly initialized partition. Checkfs says:

size (inode = 4296), invalid b+tree

In the syslog appears:

KERN: inode 4296: duplicate at 318464 has invalid array size 0!
KERN: inode 4296: visited 415 from 416 nodes.

I then initialized the partition again and copied one folder at a time from the USB stick onto it. Now checkfs showed no invalid b+trees. I redid that copying all folders at once via Tracker and still no invalids. Redid the Installer way, and got the same invalid b+tree.

My /boot drive BTW now has:

last_modified (inode = 4261), invalid b+tree
name (inode = 4195), invalid b+tree

I attach the rather longer syslog for that checkfs run with "checkfs-boot-syslog.txt".

My "Generator" partition, where I put the "generated" folders for building Haiku, does one better and says:

last_modified (inode = 4424), invalid b+tree
name (inode = 4358), invalid b+tree
size (inode = 4457), invalid b+tree

I attach the even longer syslog for that checkfs run with "checkfs-Generator-syslog.txt".

The Generator might be corrupted from KDLs I currently experience every time I try to build Haiku.
BTW, I have other partitions that don't show invalid b+trees (yet?).

I'm just putting this ticket out there in case sombody sees a pattern here...

Attachments (10)

checkfs-boot-syslog.txt (11.2 KB ) - added by humdinger 8 years ago.
syslog: checkfs /boot
checkfs-Generator-syslog.txt (76.9 KB ) - added by humdinger 8 years ago.
syslog: checkfs /Generator
Haiku btree failure.png (428.9 KB ) - added by umccullough 8 years ago.
btree failure after svn co
Haiku btree syslog.png (460.5 KB ) - added by umccullough 8 years ago.
syslog output from failure
bfsinfo_failure.txt (874.1 KB ) - added by umccullough 8 years ago.
bfsinfo output from inode 2084
IMG_1511.JPG (4.9 MB ) - added by SeanCollins 8 years ago.
Haiku btree failure with cp and rm.png (397.2 KB ) - added by umccullough 8 years ago.
cp/rm b+tree failure
bfserrors.txt (50.6 KB ) - added by kallisti5 8 years ago.
syslog bfs errors, gcc4 image
bfsfail1.png (307.4 KB ) - added by umccullough 8 years ago.
still failing on hrev43918
bfstest_serial.txt (41.9 KB ) - added by umccullough 8 years ago.
serial output during testing - hit error in #3159 at the end

Change History (44)

by humdinger, 8 years ago

Attachment: checkfs-boot-syslog.txt added

syslog: checkfs /boot

by humdinger, 8 years ago

syslog: checkfs /Generator

comment:1 by anevilyak, 8 years ago

An observation that may be of interest: the B+ tree corruption reported here all appears to involve the index b-trees. On my development partition, which I intentionally initialized without indices for performance reasons, I've never seen this kind of issue, and that partition is months/years old (and has seen many, many full builds and other gruelling disk activity).

Last edited 8 years ago by anevilyak (previous) (diff)

comment:2 by axeld, 8 years ago

As I said in another ticket, I think it's unlikely that BFS itself is to blame here, I would rather start searching in the block cache. I have no idea why the Installer creates errors while a manual copy does not, but it could just be a coincidence on the way how the transactions are combined.

With respect to Renes reply: I only recently added the B+tree check in checkfs, and it's unlikely that any other directory is put under as much stress than the name, last_modified, and size indices. On my laptop, they have over half a million entries each, and are constantly changing. They are likely to be part of any transaction in the system.

comment:3 by anevilyak, 8 years ago

This is true, but at the same time, it seems there's a sudden increase in incidence of these problems lately. Is it not possible that the recent changes with respect to the duplicate arrays might have a subtle issue in them? Those AFAICT only come into play for index trees, which are the only place I've seen this problem so far. From a cursory look through that changeset I don't see an obvious issue though.

comment:4 by axeld, 8 years ago

Sure it's possible that I introduced a new bug with that code. However, the invalid B+trees just look like they always did; if something would be wrong with the array, it should possibly have different effects.

comment:5 by umccullough, 8 years ago

I started searching for a simple case which causes a b+tree error from checkfs, and I've located one that I think is reasonably simple. It's not the same kind of error as mentioned in this ticket, but perhaps it can help track down the cause of such issues?

Environment: fresh gcc2 build on Linux64 host using hrev43884. Creating a vmware image and booting it in VirtualBox 4.0.6

Using the following UserBuildConfig:

DefineBuildProfile bfstest-vmware : vmware-image : "haiku-bfstest.vmdk" ;
HAIKU_IMAGE_SIZE = 500 ;
AddOptionalHaikuImagePackages Development ;
AddOptionalHaikuImagePackages Subversion ;

Subversion is required for this particular test, and without the Development package also installed, it doesn't seem to occur (probably has something to do with number of files, or disk space used on the disk I guess)

Once booted up, wait for the first install tasks to complete and then launch a terminal and type the following:

checkfs -c /boot

to verify a clean install.

after this, I executed the following:

svn co http://ports.haiku-files.org/svn/haikuporter/trunk haikuporter

followed by another:

checkfs -c /boot

which yields the output in the attached screenshots (uploaded separately to the ticket).

I can also supply the before-and-after vmdk if that helps - but I would hope others can repeat this scenario.

by umccullough, 8 years ago

Attachment: Haiku btree failure.png added

btree failure after svn co

by umccullough, 8 years ago

Attachment: Haiku btree syslog.png added

syslog output from failure

comment:6 by SeanCollins, 8 years ago

This is so easy to duplicate, just load up a bunch of programs, wait for the cache to get full and start a heavy I/O operation. Bam, you'll get b+tree error in a hurry, in fact just copy a large directory with a large number of files and folder from one disk to another, and you can generate btree errors.

Really easy to duplicate. I have a verified clean install, it lasted 8 minutes before generating b+tree error. Just download nightly 43886 gcc2 hybrid,

comment:7 by axeld, 8 years ago

If it's so easy to duplicate, chances are good that I messed something up with the test itself :-)

umccullough: could you run bfsinfo -ib <path-to-device-with-the-error> 2068 from the BFS tools, and attach its output? That could help finding the issue.

in reply to:  7 ; comment:8 by SeanCollins, 8 years ago

Replying to axeld:

If it's so easy to duplicate, chances are good that I messed something up with the test itself :-)

umccullough: could you run bfsinfo -ib <path-to-device-with-the-error> 2068 from the BFS tools, and attach its output? That could help finding the issue.

Would you like me to run the same test, since I have the same problem ?

in reply to:  7 comment:9 by umccullough, 8 years ago

Replying to axeld:

umccullough: could you run bfsinfo -ib <path-to-device-with-the-error> 2068 from the BFS tools, and attach its output? That could help finding the issue.

I downloaded http://www.pinc-software.de/files/BFS-tools-0.7.6.zip and unzipped it onto the image.

I then ran bfsinfo -ib /boot 2084 (because I had to re-run the test and this time it was inode 2084) but got the following (transcribed from the screen output):

Copyright (c) 2001-2004 pinc Software.
Raw mode not selected, but raw device specified.
Could not open device or file: Initialization failed

I got the same thing using the path from df which was /dev/disk/ata/0/master/raw - keeping in mind that our build system still doesn't create partition tables on raw/vmware images.

comment:10 by axeld, 8 years ago

Hm, maybe it's an old version of it... I will plan to add the tools to Haiku anyway, I just didn't find the time to do so yet. You didn't need to redo the test, or are you doing so with a clean image every time?

Thanks anyway!

I will try to add a more recent version of bfsinfo ASAP.

in reply to:  8 comment:11 by axeld, 8 years ago

Replying to SeanCollins:

Would you like me to run the same test, since I have the same problem ?

Dunno, I haven't seen a syslog from you, so I cannot tell -- Urias problem is nice and short, so it's a good candidate to look into the issue.

in reply to:  10 ; comment:12 by umccullough, 8 years ago

Replying to axeld:

You didn't need to redo the test, or are you doing so with a clean image every time?

Yep, it is so repeatable for me (pretty much 100% with the scenario I have outlined), I wasn't that worried about losing it, so I just create a new test image for each subsequent test.

FWIW, I am also able to create a last_modified b+tree issue pretty consistently with this:

~> cp -a /boot/common .

And even once when I ran that, I got an ASSERTION FAILED panic in vfs.cpp which appears to be #3159

I'll keep trying to reproduce that particular failure as well, since it seems pretty important.

by umccullough, 8 years ago

Attachment: bfsinfo_failure.txt added

bfsinfo output from inode 2084

in reply to:  10 comment:13 by umccullough, 8 years ago

Replying to axeld:

I will try to add a more recent version of bfsinfo ASAP.

Added bfsinfo -ib /boot 2084 output text file using a newer gcc2 build.

comment:14 by SeanCollins, 8 years ago

From my syslog. Seemed frivilous to attach the entire log. Is this the same issue ? and would my testing assist in debuging ?

KERN: inode 4815: visited 425 from 426 nodes.
KERN: inode 4815: visited 433 from 434 nodes.
KERN: slab memory manager: created area 0xd8801000 (161335)
KERN: bfs: mounted "HaikuStorage" (root node at 524288, device = /dev/disk/ata/2/slave/raw)
KERN: slab memory manager: created area 0xd9001000 (162002)
KERN: slab memory manager: created area 0xd9801000 (162679)
KERN: slab memory manager: created area 0xda001000 (164368)
KERN: slab memory manager: created area 0xda801000 (164975)
KERN: inode 4815: visited 567 from 568 nodes.
KERN: bfs: bfs_access:1486: Operation not allowed
KERN: runtime_loader: Cannot open file crypto: No such file or directory
KERN: runtime_loader: Cannot open file crypto.so.1.0.0: No such file or directory
KERN: runtime_loader: Cannot open file libcrypto: No such file or directory
KERN: runtime_loader: Cannot open file ssl: No such file or directory
KERN: runtime_loader: Cannot open file ssl.so.1.0.0: No such file or directory
KERN: runtime_loader: Cannot open file libssl: No such file or directory
KERN: bfs: bfs_access:1486: Operation not allowed
KERN: runtime_loader: Cannot open file resolv: No such file or directory
KERN: runtime_loader: Cannot open file resolv.so: No such file or directory
KERN: runtime_loader: Cannot open file libresolv: No such file or directory
KERN: runtime_loader: Cannot open file libresolv.so: No such file or directory
KERN: bfs: bfs_read_link:1501: Invalid Argument
KERN: Last message repeated 4 times.
KERN: inode 4815: visited 575 from 576 nodes.
KERN: slab memory manager: created area 0xdb001000 (167907)
KERN: inode 4815: duplicate at 590848 has invalid array size 0!
KERN: inode 4815: visited 581 from 582 nodes.
KERN: inode 4815: duplicate at 590848 has invalid array size 0!
KERN: inode 4815: visited 581 from 582 nodes.
KERN: inode 4815: duplicate at 590848 has invalid array size 0!
KERN: inode 4815: visited 584 from 585 nodes.
KERN: bfs: bfs_read_link:1501: Invalid Argument}}}

in reply to:  12 ; comment:15 by SeanCollins, 8 years ago

Replying to umccullough:

Replying to axeld:

You didn't need to redo the test, or are you doing so with a clean image every time?

Yep, it is so repeatable for me (pretty much 100% with the scenario I have outlined), I wasn't that worried about losing it, so I just create a new test image for each subsequent test.

FWIW, I am also able to create a last_modified b+tree issue pretty consistently with this:

~> cp -a /boot/common .

And even once when I ran that, I got an ASSERTION FAILED panic in vfs.cpp which appears to be #3159

I'll keep trying to reproduce that particular failure as well, since it seems pretty important.

Try unzipping something with allot of folders to the disk with the bfs error. Thats seems to do the trick for reproducing that vfs issue. Works for me at least. Expander and unzip do a handy job of triggering that issue. See it all the time when the build completes and starts making the image.

by SeanCollins, 8 years ago

Attachment: IMG_1511.JPG added

in reply to:  15 comment:16 by umccullough, 8 years ago

Replying to SeanCollins:

Try unzipping something with allot of folders to the disk with the bfs error. Thats seems to do the trick for reproducing that vfs issue. Works for me at least. Expander and unzip do a handy job of triggering that issue. See it all the time when the build completes and starts making the image.

There is a separate ticket for that issue that you should post additional details in: #3159

Also, simple test cases that reliably cause the problem are what is most valuable for these types of issues.

comment:17 by umccullough, 8 years ago

Testing further, I've noticed that the corruption only seems to occur repeatedly when i've run checkfs *FIRST*. I'm guessing this immediately caches all the structures, which somehow increases the likelihood of failure.

I can repeat the process now without the Development or Subversion packages (although, I installed OpenSSH to help me copy the files to another machine).

~> checkfs -c /boot
~> cp -a /boot/system . && rm -rf system
~> checkfs -c /boot

Yields the new screenshot attached.

by umccullough, 8 years ago

cp/rm b+tree failure

in reply to:  17 comment:18 by SeanCollins, 8 years ago

Replying to umccullough:

Testing further, I've noticed that the corruption only seems to occur repeatedly when i've run checkfs *FIRST*. I'm guessing this immediately caches all the structures, which somehow increases the likelihood of failure.

I can repeat the process now without the Development or Subversion packages (although, I installed OpenSSH to help me copy the files to another machine).

~> checkfs -c /boot
~> cp -a /boot/system . && rm -rf system
~> checkfs -c /boot

Yields the new screenshot attached.

Check your syslog after a cleanboot,typically won't see a b+tree error. Copy a directory with a large number of folders, like source. Or git clone the haiku repo, that seems to be enough to trigger it. It happens to me regardless of the use of checkfs or not.

comment:19 by SeanCollins, 8 years ago

I've got this nailed down to sometime between 43769 and 43810. I have a 2 week old 43769, ran all the tests for invalid b tree, I come up with nothing. Drives seem fine no detecable corruption. I start getting file corruption and data loss around 43810. Rev 43810 , I experienced data loss on after a KDL. I copied a bunch of files got thrown into kdl during a unzip and literally nothing I had copied from my data drive to the main install drive, including a few apps actually stayed on the drive when the kdl happened, and it was a good 45 minutes from the time I unzipped the folders to the kdl occuring.

Best info I can give you as I am reverting all my systems back to rev 43769

Last edited 8 years ago by SeanCollins (previous) (diff)

in reply to:  19 ; comment:20 by anevilyak, 8 years ago

Replying to SeanCollins:

I've got this nailed down to sometime between 43769 and 43810. I have a 2 week old 43769, ran all the tests for invalid b tree, I come up with nothing. Drives seem fine no detecable corruption. I start getting file corruption and data loss around 43810. Rev 43810 , I experienced data loss on after a KDL. I copied a bunch of files got thrown into kdl during a unzip and literally nothing I had copied from my data drive to the main install drive, including a few apps actually stayed on the drive when the kdl happened, and it was a good 45 minutes from the time I unzipped the folders to the kdl occuring.

That's more or less completely meaningless since that's precisely the range where B+ tree validation was actually added. hrev43769's checkfs wouldn't report it because it doesn't check for it.

in reply to:  20 ; comment:21 by SeanCollins, 8 years ago

Replying to anevilyak:

That's more or less completely meaningless since that's precisely the range where B+ tree validation was actually added. hrev43769's checkfs wouldn't report it because it doesn't check for it.

I have a 43769, that after 2 weeks of contious use does not have a b+tree error. Thats why I tested this 43769 this morning. I didn't realize the machine I am on right now had the rev, but it does, it also does not show signs of b+tree corruption.

Conversly rev 43810 does corrupt within minutes of use. In particular if you unzip lots of directorys or mkdir gets called when downloading the repo from git.

There is also a data loss issue that crop up in the same time frame.I can only report the info I have on hand.

in reply to:  21 comment:22 by anevilyak, 8 years ago

Replying to SeanCollins:

I have a 43769, that after 2 weeks of contious use does not have a b+tree error. Thats why I tested this 43769 this morning. I didn't realize the machine I am on right now had the rev, but it does, it also does not show signs of b+tree corruption.

You missed the point. 43769 will not tell you if you have tree corruption because it can't. Its checkfs doesn't know how to check for it, ergo it will never report it, whether it's there or not. The whole reason people have suddenly been noticing it is precisely because checkfs can now detect it, and that ability was introduced in the range that you're claiming to be the "problem".

comment:23 by SeanCollins, 8 years ago

Checkfs from rev 43894 " last nights nightly" on hrev 43769 produce this result.

~> checkfs
Usage: checkfs <options> <device> <volume name>

Options:
  -h, --help        - print this help text
  -c, --check-only  - do not make any changes to the file system
~> checkfs  /boot
ru (inode = 3840218), has blocks already set
        210995 nodes checked,
        0 blocks not allocated,
        1 blocks already set,
        4 blocks could be freed

        files           173210
        directories     36870
        attributes      577
        attr. dirs      311
        indices         27

        direct block runs               211293 (11.04 GiB)
        indirect block runs             2964 (in 138 array blocks, 6.37 GiB)
        double indirect block runs      0 (in 0 array blocks, 0 bytes)
~> 

comment:24 by arfonzo, 8 years ago

Cc: art@… added

comment:25 by umccullough, 8 years ago

@SeanCollins

I'm not sure what you're implying. The BPlusTree::Validate() logic wasn't added until hrev43829 - so no amount of testing on revisions prior to that will ever yield the issue we're discussing in this ticket. You can't just copy a new checkfs to an old rev because the validation logic was added in the filesystem code itself, not the checkfs tool.

Furthermore, as Axel implied, it's entirely possible that his "Validate()" check code added in that rev is problematic, and that it's detecting a false positive.

We should probably wait for Axel to evaluate the situation (assuming he can repro it on his end). The problem seems to be easy to reproduce here, and that should be enough to start on for a developer who knows the internals well enough.

comment:26 by kallisti5, 8 years ago

Milestone: R1R1/alpha4
Priority: normalblocker

as umccullough found a way to reproduce this easily, I am going to put this as an Alpha 4 blocker.

This will raise visibility to bug fix weekend as well :)

comment:27 by kallisti5, 8 years ago

copying /boot/system to ~ and erasing it several times running checkfs inbetween each copy / erase resulted in the attached b+tree issues.

by kallisti5, 8 years ago

Attachment: bfserrors.txt added

syslog bfs errors, gcc4 image

comment:28 by axeld, 8 years ago

Thanks Urias! Unfortunately, the /system copy results in much more errors than the neat thing from before, but I was able to reproduce the issue, and it'll do :-).

From the looks of it, I'd say I messed up the duplicate array somehow. I'll have a look over the next few days.

comment:29 by axeld, 8 years ago

Resolution: fixed
Status: newclosed

Should be fixed with hrev43915. Please note that checkfs can now repair broken index B+trees as well, so you can at least fix those issues now :-)

comment:30 by umccullough, 8 years ago

Hmm, I'm still seeing this problem on hrev43918 using the "svn co" method outlined above.

I'll have to delete and re-create all my generated/objects to be sure it's not some weird leftover.

comment:31 by umccullough, 8 years ago

Resolution: fixed
Status: closedreopened

I definitely still see the first failure test I reported back in comment:5 after building a fresh image in a clean generated directory.

Attached is the new screenshot and serial debug output.

Note I also recreated #3159 immediately afterward, which is what you see in the serial debug output.

by umccullough, 8 years ago

Attachment: bfsfail1.png added

still failing on hrev43918

by umccullough, 8 years ago

Attachment: bfstest_serial.txt added

serial output during testing - hit error in #3159 at the end

comment:32 by axeld, 8 years ago

Resolution: fixed
Status: reopenedclosed

Let's try again, should be fixed in hrev43924 for real this time; apparently I was thorough and made the same bug twice.

comment:33 by umccullough, 8 years ago

So far, hrev43924 appears to fix my test cases :)

comment:34 by SeanCollins, 8 years ago

pulling the git repo isn't corrupting b+tree on my end, should be good to go. That was cuasing my failures.

Note: See TracTickets for help on using tickets.