#8408 closed bug (fixed)
Invalid b+tree after installing
| Reported by: | humdinger | Owned by: | axeld |
|---|---|---|---|
| Priority: | blocker | Milestone: | R1/alpha4 |
| Component: | File Systems/BFS | Version: | R1/Development |
| Keywords: | invalid b+tree | Cc: | art@… |
| Blocked By: | | Blocking: | |
| Platform: | All | | |
Description
This is hrev43873 gcc2hybrid.
Checkfs is revealing "invalid b+tree" failures. Even a completely new installation has them. I put a new nightly on the stick and used Installer to put it all on a freshly initialized partition. Checkfs says:
size (inode = 4296), invalid b+tree
In the syslog appears:
KERN: inode 4296: duplicate at 318464 has invalid array size 0!
KERN: inode 4296: visited 415 from 416 nodes.
I then initialized the partition again and copied one folder at a time from the USB stick onto it. Now checkfs showed no invalid b+trees. I redid that, copying all folders at once via Tracker, and still no invalids. I redid it the Installer way and got the same invalid b+tree.
My /boot drive BTW now has:
last_modified (inode = 4261), invalid b+tree
name (inode = 4195), invalid b+tree
I attach the rather longer syslog for that checkfs run with "checkfs-boot-syslog.txt".
My "Generator" partition, where I put the "generated" folders for building Haiku, does one better and says:
last_modified (inode = 4424), invalid b+tree
name (inode = 4358), invalid b+tree
size (inode = 4457), invalid b+tree
I attach the even longer syslog for that checkfs run with "checkfs-Generator-syslog.txt".
The Generator might be corrupted from KDLs I currently experience every time I try to build Haiku.
BTW, I have other partitions that don't show invalid b+trees (yet?).
I'm just putting this ticket out there in case somebody sees a pattern here...
Attachments (10)
Change History (44)
by , 13 years ago
Attachment: checkfs-boot-syslog.txt added
comment:1 by , 13 years ago
An observation that may be of interest: the B+ tree corruption reported here all appears to involve the index b-trees. On my development partition, which I intentionally initialized without indices for performance reasons, I've never seen this kind of issue, and that partition is months/years old (and has seen many, many full builds and other gruelling disk activity).
comment:2 by , 13 years ago
As I said in another ticket, I think it's unlikely that BFS itself is to blame here; I would rather start searching in the block cache. I have no idea why the Installer creates errors while a manual copy does not, but it could just be a coincidence of how the transactions are combined.
With respect to Rene's reply: I only recently added the B+tree check to checkfs, and it's unlikely that any other directory is put under as much stress as the name, last_modified, and size indices. On my laptop, they have over half a million entries each, and are constantly changing. They are likely to be part of any transaction in the system.
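For readers following along, those per-volume attribute indices are ordinary B+trees that can be inspected from a shell. A minimal sketch (my addition, not part of the original comment; it assumes the stock lsindex tool accepts a volume path, as on current Haiku):

```sh
# List the attribute indices kept on the boot volume; on a default install
# this typically includes name, size and last_modified, the trees that the
# new validation keeps flagging.
lsindex /boot

# Re-run the check in read-only mode to see which of those trees it complains about.
checkfs -c /boot
```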
comment:3 by , 13 years ago
This is true, but at the same time, it seems there's a sudden increase in incidence of these problems lately. Is it not possible that the recent changes with respect to the duplicate arrays might have a subtle issue in them? Those AFAICT only come into play for index trees, which are the only place I've seen this problem so far. From a cursory look through that changeset I don't see an obvious issue though.
comment:4 by , 13 years ago
Sure, it's possible that I introduced a new bug with that code. However, the invalid B+trees look just like they always did; if something were wrong with the array, it would probably have different effects.
comment:5 by , 13 years ago
I started searching for a simple case which causes a b+tree error from checkfs, and I've located one that I think is reasonably simple. It's not the same kind of error as mentioned in this ticket, but perhaps it can help track down the cause of such issues?
Environment: fresh gcc2 build on a Linux64 host using hrev43884, creating a vmware image and booting it in VirtualBox 4.0.6.
Using the following UserBuildConfig:
DefineBuildProfile bfstest-vmware : vmware-image : "haiku-bfstest.vmdk" ;
HAIKU_IMAGE_SIZE = 500 ;
AddOptionalHaikuImagePackages Development ;
AddOptionalHaikuImagePackages Subversion ;
Subversion is required for this particular test, and without the Development package also installed, the problem doesn't seem to occur (probably something to do with the number of files or the disk space used, I guess).
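With that profile in place, the image is built the usual way for a build profile (my assumption of the standard jam invocation; it is not spelled out in the comment, and the profile name comes from the UserBuildConfig above):

```sh
# Build the VMware image defined by the bfstest-vmware profile.
# -q makes jam stop on the first error.
jam -q @bfstest-vmware
```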
Once booted up, wait for the first install tasks to complete and then launch a terminal and type the following:
checkfs -c /boot
to verify a clean install.
After this, I executed the following:
svn co http://ports.haiku-files.org/svn/haikuporter/trunk haikuporter
followed by another:
checkfs -c /boot
which yields the output in the attached screenshots (uploaded separately to the ticket).
I can also supply the before-and-after vmdk if that helps - but I would hope others can repeat this scenario.
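For convenience, the whole sequence as a single script (a sketch of the steps above; the commands and URL are exactly as given in this comment):

```sh
#!/bin/sh
# Reproduction sketch for this comment: verify the freshly installed volume,
# create many files/directories via an svn checkout, then re-check the volume.
checkfs -c /boot
svn co http://ports.haiku-files.org/svn/haikuporter/trunk haikuporter
checkfs -c /boot   # on affected revisions this second run reports invalid b+trees
```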
comment:6 by , 13 years ago
This is so easy to duplicate: just load up a bunch of programs, wait for the cache to fill up, and start a heavy I/O operation. Bam, you'll get a b+tree error in a hurry. In fact, just copy a large directory with a large number of files and folders from one disk to another, and you can generate b+tree errors.
Really easy to duplicate. I have a verified clean install; it lasted 8 minutes before generating a b+tree error. Just download nightly 43886 gcc2 hybrid.
follow-ups: 8 9 comment:7 by , 13 years ago
If it's so easy to duplicate, chances are good that I messed something up with the test itself :-)
umccullough: could you run
bfsinfo -ib <path-to-device-with-the-error> 2068
from the BFS tools, and attach its output? That could help find the issue.
follow-up: 11 comment:8 by , 13 years ago
Replying to axeld:
If it's so easy to duplicate, chances are good that I messed something up with the test itself :-)
umccullough: could you run
bfsinfo -ib <path-to-device-with-the-error> 2068
from the BFS tools, and attach its output? That could help find the issue.
Would you like me to run the same test, since I have the same problem?
comment:9 by , 13 years ago
Replying to axeld:
umccullough: could you run
bfsinfo -ib <path-to-device-with-the-error> 2068
from the BFS tools, and attach its output? That could help find the issue.
I downloaded http://www.pinc-software.de/files/BFS-tools-0.7.6.zip and unzipped it onto the image.
I then ran bfsinfo -ib /boot 2084 (because I had to re-run the test and this time it was inode 2084) but got the following (transcribed from the screen output):
Copyright (c) 2001-2004 pinc Software.
Raw mode not selected, but raw device specified.
Could not open device or file: Initialization failed
I got the same thing using the path from df which was /dev/disk/ata/0/master/raw - keeping in mind that our build system still doesn't create partition tables on raw/vmware images.
follow-ups: 12 13 comment:10 by , 13 years ago
Hm, maybe it's an old version of it... I plan to add the tools to Haiku anyway, I just haven't found the time to do so yet. You didn't need to redo the test - or are you doing so with a clean image every time?
Thanks anyway!
I will try to add a more recent version of bfsinfo ASAP.
comment:11 by , 13 years ago
Replying to SeanCollins:
Would you like me to run the same test, since I have the same problem ?
Dunno, I haven't seen a syslog from you, so I cannot tell -- Urias' problem is nice and short, so it's a good candidate for looking into the issue.
follow-up: 15 comment:12 by , 13 years ago
Replying to axeld:
You didn't need to redo the test, or are you doing so with a clean image every time?
Yep, it is so repeatable for me (pretty much 100% with the scenario I have outlined), I wasn't that worried about losing it, so I just create a new test image for each subsequent test.
FWIW, I am also able to create a last_modified b+tree issue pretty consistently with this:
~> cp -a /boot/common .
And even once when I ran that, I got an ASSERTION FAILED panic in vfs.cpp, which appears to be #3159.
I'll keep trying to reproduce that particular failure as well, since it seems pretty important.
comment:13 by , 13 years ago
Replying to axeld:
I will try to add a more recent version of bfsinfo ASAP.
Added bfsinfo -ib /boot 2084 output text file using a newer gcc2 build.
comment:14 by , 13 years ago
From my syslog. It seemed frivolous to attach the entire log. Is this the same issue? And would my testing assist in debugging?
KERN: inode 4815: visited 425 from 426 nodes.
KERN: inode 4815: visited 433 from 434 nodes.
KERN: slab memory manager: created area 0xd8801000 (161335)
KERN: bfs: mounted "HaikuStorage" (root node at 524288, device = /dev/disk/ata/2/slave/raw)
KERN: slab memory manager: created area 0xd9001000 (162002)
KERN: slab memory manager: created area 0xd9801000 (162679)
KERN: slab memory manager: created area 0xda001000 (164368)
KERN: slab memory manager: created area 0xda801000 (164975)
KERN: inode 4815: visited 567 from 568 nodes.
KERN: bfs: bfs_access:1486: Operation not allowed
KERN: runtime_loader: Cannot open file crypto: No such file or directory
KERN: runtime_loader: Cannot open file crypto.so.1.0.0: No such file or directory
KERN: runtime_loader: Cannot open file libcrypto: No such file or directory
KERN: runtime_loader: Cannot open file ssl: No such file or directory
KERN: runtime_loader: Cannot open file ssl.so.1.0.0: No such file or directory
KERN: runtime_loader: Cannot open file libssl: No such file or directory
KERN: bfs: bfs_access:1486: Operation not allowed
KERN: runtime_loader: Cannot open file resolv: No such file or directory
KERN: runtime_loader: Cannot open file resolv.so: No such file or directory
KERN: runtime_loader: Cannot open file libresolv: No such file or directory
KERN: runtime_loader: Cannot open file libresolv.so: No such file or directory
KERN: bfs: bfs_read_link:1501: Invalid Argument
KERN: Last message repeated 4 times.
KERN: inode 4815: visited 575 from 576 nodes.
KERN: slab memory manager: created area 0xdb001000 (167907)
KERN: inode 4815: duplicate at 590848 has invalid array size 0!
KERN: inode 4815: visited 581 from 582 nodes.
KERN: inode 4815: duplicate at 590848 has invalid array size 0!
KERN: inode 4815: visited 581 from 582 nodes.
KERN: inode 4815: duplicate at 590848 has invalid array size 0!
KERN: inode 4815: visited 584 from 585 nodes.
KERN: bfs: bfs_read_link:1501: Invalid Argument
follow-up: 16 comment:15 by , 13 years ago
Replying to umccullough:
Replying to axeld:
You didn't need to redo the test, or are you doing so with a clean image every time?
Yep, it is so repeatable for me (pretty much 100% with the scenario I have outlined), I wasn't that worried about losing it, so I just create a new test image for each subsequent test.
FWIW, I am also able to create a last_modified b+tree issue pretty consistently with this:
~> cp -a /boot/common .
And even once when I ran that, I got an ASSERTION FAILED panic in vfs.cpp, which appears to be #3159.
I'll keep trying to reproduce that particular failure as well, since it seems pretty important.
Try unzipping something with a lot of folders to the disk with the bfs error. That seems to do the trick for reproducing that vfs issue. Works for me at least. Expander and unzip do a handy job of triggering that issue. I see it all the time when the build completes and starts making the image.
by , 13 years ago
Attachment: IMG_1511.JPG added
comment:16 by , 13 years ago
Replying to SeanCollins:
Try unzipping something with a lot of folders to the disk with the bfs error. That seems to do the trick for reproducing that vfs issue. Works for me at least. Expander and unzip do a handy job of triggering that issue. I see it all the time when the build completes and starts making the image.
There is a separate ticket for that issue that you should post additional details in: #3159
Also, simple test cases that reliably cause the problem are what is most valuable for these types of issues.
follow-up: 18 comment:17 by , 13 years ago
Testing further, I've noticed that the corruption only seems to occur repeatedly when I've run checkfs *FIRST*. I'm guessing this immediately caches all the structures, which somehow increases the likelihood of failure.
I can repeat the process now without the Development or Subversion packages (although, I installed OpenSSH to help me copy the files to another machine).
~> checkfs -c /boot
~> cp -a /boot/system . && rm -rf system
~> checkfs -c /boot
Yields the new screenshot attached.
comment:18 by , 13 years ago
Replying to umccullough:
Testing further, I've noticed that the corruption only seems to occur repeatedly when I've run checkfs *FIRST*. I'm guessing this immediately caches all the structures, which somehow increases the likelihood of failure.
I can repeat the process now without the Development or Subversion packages (although, I installed OpenSSH to help me copy the files to another machine).
~> checkfs -c /boot
~> cp -a /boot/system . && rm -rf system
~> checkfs -c /boot
Yields the new screenshot attached.
Check your syslog after a clean boot; typically you won't see a b+tree error. Copy a directory with a large number of folders, like the source tree, or git clone the Haiku repo - that seems to be enough to trigger it. It happens to me regardless of whether checkfs is used or not.
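For reference, that test written out as a script (a sketch; the repository URL is deliberately left as a variable to set yourself, since the comment doesn't name a specific mirror):

```sh
#!/bin/sh
# Sketch of the reproduction above: a clone creates thousands of files and
# directories, which exercises the name/size/last_modified index trees.
: "${HAIKU_REPO_URL:?set HAIKU_REPO_URL to the Haiku git mirror you use}"

checkfs -c /boot                  # baseline: should come back clean
git clone "$HAIKU_REPO_URL" haiku
checkfs -c /boot                  # re-check after the heavy I/O
```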
follow-up: 20 comment:19 by , 13 years ago
I've got this nailed down to sometime between 43769 and 43810. I have a 2-week-old 43769; I ran all the tests for invalid b+trees and came up with nothing. Drives seem fine. I start getting file corruption and data loss around 43810. On rev 43810, I experienced data loss after a KDL. I copied a bunch of files, got thrown into a KDL during an unzip, and literally nothing I had copied from my data drive to the main install drive, including a few apps, actually stayed on the drive when the KDL happened - and it was a good 45 minutes from the time I unzipped the folders to the KDL occurring.
Best info I can give you, as I am reverting all my systems back to rev 43769.
follow-up: 21 comment:20 by , 13 years ago
Replying to SeanCollins:
I've got this nailed down to sometime between 43769 and 43810. I have a 2-week-old 43769; I ran all the tests for invalid b+trees and came up with nothing. Drives seem fine, no detectable corruption. I start getting file corruption and data loss around 43810. On rev 43810, I experienced data loss after a KDL. I copied a bunch of files, got thrown into a KDL during an unzip, and literally nothing I had copied from my data drive to the main install drive, including a few apps, actually stayed on the drive when the KDL happened - and it was a good 45 minutes from the time I unzipped the folders to the KDL occurring.
That's more or less completely meaningless since that's precisely the range where B+ tree validation was actually added. hrev43769's checkfs wouldn't report it because it doesn't check for it.
follow-up: 22 comment:21 by , 13 years ago
Replying to anevilyak:
That's more or less completely meaningless since that's precisely the range where B+ tree validation was actually added. hrev43769's checkfs wouldn't report it because it doesn't check for it.
I have a 43769 that, after 2 weeks of continuous use, does not have a b+tree error. That's why I tested this 43769 this morning. I didn't realize the machine I am on right now had that rev, but it does, and it also does not show signs of b+tree corruption.
Conversely, rev 43810 does corrupt within minutes of use, in particular if you unzip lots of directories or mkdir gets called when downloading the repo from git.
There is also a data loss issue that crops up in the same time frame. I can only report the info I have on hand.
comment:22 by , 13 years ago
Replying to SeanCollins:
I have a 43769 that, after 2 weeks of continuous use, does not have a b+tree error. That's why I tested this 43769 this morning. I didn't realize the machine I am on right now had that rev, but it does, and it also does not show signs of b+tree corruption.
You missed the point. 43769 will not tell you if you have tree corruption because it can't. Its checkfs doesn't know how to check for it, ergo it will never report it, whether it's there or not. The whole reason people have suddenly been noticing it is precisely because checkfs can now detect it, and that ability was introduced in the range that you're claiming to be the "problem".
comment:23 by , 13 years ago
Checkfs from rev 43894 ("last night's nightly") on hrev43769 produces this result:
~> checkfs
Usage: checkfs <options> <device> <volume name>
Options:
  -h, --help        - print this help text
  -c, --check-only  - do not make any changes to the file system
~> checkfs /boot
ru (inode = 3840218), has blocks already set
210995 nodes checked,
  0 blocks not allocated,
  1 blocks already set,
  4 blocks could be freed
files                 173210
directories            36870
attributes               577
attr. dirs               311
indices                   27
direct block runs             211293 (11.04 GiB)
indirect block runs             2964 (in 138 array blocks, 6.37 GiB)
double indirect block runs         0 (in 0 array blocks, 0 bytes)
~>
comment:24 by , 13 years ago
Cc: added
comment:25 by , 13 years ago
@SeanCollins
I'm not sure what you're implying. The BPlusTree::Validate() logic wasn't added until hrev43829, so no amount of testing on revisions prior to that will ever yield the issue we're discussing in this ticket. You can't just copy a new checkfs to an old rev, because the validation logic was added to the filesystem code itself, not the checkfs tool.
Furthermore, as Axel implied, it's entirely possible that his "Validate()" check code added in that rev is problematic, and that it's detecting a false positive.
We should probably wait for Axel to evaluate the situation (assuming he can repro it on his end). The problem seems to be easy to reproduce here, and that should be enough to start on for a developer who knows the internals well enough.
comment:26 by , 13 years ago
Milestone: R1 → R1/alpha4
Priority: normal → blocker
As umccullough found a way to reproduce this easily, I am going to mark this as an Alpha 4 blocker.
This will raise visibility to bug fix weekend as well :)
comment:27 by , 13 years ago
Copying /boot/system to ~ and erasing it several times, running checkfs in between each copy/erase, resulted in the attached b+tree issues.
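A small script along the lines of what is described here (a sketch, not the exact commands used; the iteration count is arbitrary):

```sh
#!/bin/sh
# Repeat the copy/erase cycle a few times, checking the volume in between,
# roughly as described in this comment.
cd ~
for i in 1 2 3 4 5; do
    cp -a /boot/system .
    rm -rf system
    checkfs -c /boot   # read-only pass; reports any invalid b+trees
done
```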
comment:28 by , 13 years ago
Thanks Urias! Unfortunately, the /system copy results in many more errors than the neat case from before, but I was able to reproduce the issue, and it'll do :-).
From the looks of it, I'd say I messed up the duplicate array somehow. I'll have a look over the next few days.
comment:29 by , 13 years ago
Resolution: → fixed
Status: new → closed
Should be fixed with hrev43915. Please note that checkfs can now repair broken index B+trees as well, so you can at least fix those issues now :-)
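Given the --check-only flag shown in the usage text earlier (comment 23), a repair run is simply checkfs without that flag. A sketch of how one might verify and then repair a volume with the fixed build:

```sh
checkfs -c /boot   # report-only: list any remaining invalid b+trees
checkfs /boot      # without --check-only, checkfs is allowed to make changes,
                   # which now includes rebuilding broken index b+trees
```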
comment:30 by , 13 years ago
Hmm, I'm still seeing this problem on hrev43918 using the "svn co" method outlined above.
I'll have to delete and re-create all my generated/objects to be sure it's not some weird leftover.
comment:31 by , 13 years ago
Resolution: fixed
Status: closed → reopened
by , 13 years ago
Attachment: bfstest_serial.txt added
serial output during testing - hit error in #3159 at the end
comment:32 by , 13 years ago
Resolution: → fixed
Status: reopened → closed
Let's try again: this should be fixed for real in hrev43924; apparently I was thorough and made the same bug twice.
comment:34 by , 13 years ago
Pulling the git repo isn't corrupting the b+tree on my end; should be good to go. That was causing my failures.
syslog: checkfs /boot