#15123 closed bug (fixed)
KDL on unpacking LibreOffice tarball
Reported by: | diver | Owned by: | waddlesplash |
---|---|---|---|
Priority: | normal | Milestone: | R1/beta2 |
Component: | Drivers/Disk/NVMe | Version: | R1/Development |
Keywords: | | Cc: | |
Blocked By: | | Blocking: | #15897 |
Platform: | All | | |
Description
hrev53209 x86_64 in VMware Fusion.
The kernel panics when I try to build LO using haikuporter libreoffice. Sometimes it goes through without crashing, but the unpacked source code is incomplete (many folders are missing).
Attachments (8)
Change History (45)
comment:1 by , 5 years ago
comment:2 by , 5 years ago
After a reboot, the checksum of the tarball is wrong, so it looks like it gets corrupted. Could it be related to hrev53159? My /data partition is an NVMe virtual disk.
comment:5 by , 5 years ago
When I switched from NVMe to IDE in the VM settings and reformatted the volume to BFS, the crashes went away.
comment:6 by , 5 years ago
Looks like some kind of disk corruption so that's not surprising. Next time this happens please see if it occurs on IDE first, instead of just reformatting without even retesting.
comment:7 by , 5 years ago
I also note that the NVMe driver is not designed for non-NVMe latencies. I.e., using the NVMe driver on a drive that is really backed by a spinning disk will lead to *worse* CPU usage and access times than using IDE.
comment:8 by , 5 years ago
Summary: | [kenel] panics on unpacking LibreOffice tarball → KDL on unpacking LibreOffice tarball |
---|
comment:9 by , 5 years ago
The VMware NVMe disk image is on a (non-NVMe) Samsung SSD EVO 860 1TB here.
Well, I switched this image back from IDE to NVMe in the VM settings, reformatted it, and tried to unpack the libreoffice translations tarball, and again it wasn't unpacked completely. Running checkfs crashed to KDL with the same backtrace as in the description (which was continuable).
I'm guessing this is something with the nvme driver, as I can't reproduce it with the ata one. Or perhaps the BFS driver doesn't play well with it.
wget https://github.com/LibreOffice/translations/archive/libreoffice-6.2.5.1.tar.gz
tar xzfv libreoffice-6.2.5.1.tar.gz
After unpacking, check translations-libreoffice-6.2.5.1/source; there should be 125 folders. If there are fewer, it is possible that BFS has just been corrupted.
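The folder check described above can be scripted so that each retest gives a comparable number. This is only a minimal sketch in Python: the function names and the `EXPECTED_FOLDERS` constant are made up for illustration, and the value 125 is simply the count reported in this ticket for libreoffice-6.2.5.1.

```python
from pathlib import Path

# Assumption: 125 is the folder count reported in this ticket for the
# libreoffice-6.2.5.1 translations tarball; adjust for other versions.
EXPECTED_FOLDERS = 125


def count_subdirs(source_dir):
    """Return the number of immediate subdirectories of source_dir."""
    return sum(1 for entry in Path(source_dir).iterdir() if entry.is_dir())


def extraction_looks_complete(source_dir, expected=EXPECTED_FOLDERS):
    """True when source_dir contains exactly the expected number of folders."""
    return count_subdirs(source_dir) == expected
```

For example, calling `count_subdirs("translations-libreoffice-6.2.5.1/source")` after unpacking gives the number to compare against 125.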
comment:10 by , 5 years ago
Component: | System/Kernel → Drivers/Disk/NVMe |
---|---|
Owner: | changed from | to
comment:11 by , 5 years ago
Can you reproduce these KDLs at all? Getting a syslog from such a KDL would be nice...
comment:12 by , 5 years ago
hrev53404 x86_64.
- Freshly created 10GB NVMe disk in VMware mounted as /data.
- Started building libreoffice off it and Ctrl+C after a few minutes.
- Started checkfs /data and got a KDL.
comment:14 by , 5 years ago
hrev53416 adds some better overflow checking, so please retest after that.
comment:15 by , 5 years ago
Diver reports this: https://imgur.com/a/yCYt2nL
So, it appears there is somehow object_cache corruption. Guess it's time I try to reproduce this under the guarded heap.
comment:17 by , 5 years ago
hrev53962. Tested with:
wget https://github.com/LibreOffice/translations/archive/libreoffice-6.2.5.1.tar.gz
tar xzfv libreoffice-6.2.5.1.tar.gz
Result: 88 folders created out of 125. Extraction failed with:
tar: Skipping to next header
ranslations-libreoffice-6.2.5.1/source/sr-Latn/forms/
tar: Skipping to next header
ranslations-libreoffice-6.2.5.1/source/sr-Latn/formula/
tar: Skipping to next header
gzip: stdin: invalid compressed data--crc error
gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
comment:20 by , 5 years ago
Yes, I wonder if there's a race condition in the FS cache. Since NVMe is non-sequential (that is, when two threads call read() on an NVMe device one after the other, the second call may return before the first), and I don't think any other disk driver supports that (not even ramdisks, iirc), we may be hitting a very interesting bug indeed.
comment:21 by , 5 years ago
Well, I can't reproduce this in QEMU by running 4 "git gc"s (with 4 CPUs) on an NVMe disk at once, repeatedly, for multiple minutes. All is still OK, and checkfs still succeeds. I'll try the libreoffice tarfile test later, I guess.
comment:22 by , 5 years ago
I managed to reproduce this, finally. hrev53997 seems to fix the KDLs, but the disk corruption remains. Actually, running sha256sum on the tar.gz gives a different result every time...!
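The "different checksum on every run" symptom can be checked mechanically by hashing the same file several times and collecting the distinct digests. The following is a rough Python sketch, not the procedure actually used in this ticket; `sha256_of` and `distinct_digests` are hypothetical helper names. Repeated reads may also be served from the file cache rather than the disk, so a single stable digest does not by itself prove the on-disk data is intact.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, runs=5):
    """Hash the same file `runs` times and return the set of digests seen.

    On a healthy system the set has exactly one element; more than one
    means the reads returned different data between runs.
    """
    return {sha256_of(path) for _ in range(runs)}
```

Something like `len(distinct_digests("libreoffice-6.2.5.1.tar.gz")) > 1` would then flag the unstable reads described above.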
comment:23 by , 5 years ago
Full details of the different checksums in #12698. But I actually can't reproduce that anymore; it's possible my FS was still corrupted and this contributed to the strange cache behavior somehow.
After a full checkfs, deleting all files, and redownloading, I can extract the archive successfully; all directories, no KDLs, and all checksums still look good after a reboot, and extracting it again and running a recursive diff also produces no differences. Huzzah!
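The recursive comparison of two extractions can be reproduced without an external diff tool. The Python sketch below is an assumption about how such a check could look, not what was actually run here; `tree_differences` is a made-up name. It compares file contents byte by byte (`shallow=False`) because corruption of this kind would not necessarily change file sizes or timestamps.

```python
import filecmp
import os


def tree_differences(left, right):
    """Return sorted relative paths that are missing or differ between trees."""
    differences = set()
    for root, dirs, files in os.walk(left):
        rel = os.path.relpath(root, left)
        for name in dirs:
            # Directories are only checked for presence on the other side.
            if not os.path.exists(os.path.join(right, rel, name)):
                differences.add(os.path.normpath(os.path.join(rel, name)))
        for name in files:
            other = os.path.join(right, rel, name)
            # shallow=False forces a byte-by-byte content comparison.
            if not os.path.exists(other) or not filecmp.cmp(
                os.path.join(root, name), other, shallow=False
            ):
                differences.add(os.path.normpath(os.path.join(rel, name)))
    # Second pass: entries that exist only in the right-hand tree.
    for root, dirs, files in os.walk(right):
        rel = os.path.relpath(root, right)
        for name in dirs + files:
            if not os.path.exists(os.path.join(left, rel, name)):
                differences.add(os.path.normpath(os.path.join(rel, name)))
    return sorted(differences)
```

An empty list from `tree_differences(first_extract, second_extract)` corresponds to the "no differences" result reported above.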
Running a full checkfs now produces no errors. However, a lot of this appears in the syslog:
PageWriteWrapper: Failed to write page 0xffffffff82d558c0: General system error
So, that seems concerning, but I can't find any reason for that. At least after a reboot, everything (including checkfs) all comes out OK. So maybe that was always there and it's "harmless"...?
Please do retest and let me know if this is solved for you, though; at least as far as I can see, it's solved here.
comment:24 by , 5 years ago
hrev54010 x86_64.
tar: Skipping to next header
gzip: stdin: invalid compressed data--format violated
tar: Child returned status 1
tar: Error is not recoverable: exiting now
sha256sum on the tar.gz gives a different result every other time. After several successful checkfs runs it KDL'ed again:
comment:25 by , 5 years ago
I hope you don't perform these tests on important data, because checkfs can wipe a lot of files even if these files are valid.
comment:26 by , 5 years ago
Yes, I'm aware that checkfs can sometimes corrupt BFS; I think it happened to PulkoMandy at least once, and to somebody else too, so I'm not using it for anything important.
comment:27 by , 5 years ago
For me it did not "corrupt" BFS, it ran out of memory (I think) and deleted almost all files. But the filesystem itself was fine, just mostly empty.
comment:28 by , 5 years ago
hrev54010 x86_64. Removed the work-* and download dirs for LibreOffice on an already existing /data (.vmdk) partition. It was used to successfully build LO with the ata driver. I tried again with the nvme driver and the patch didn't apply; the sha256 was also different, and the same tar error as above was observed. Something still isn't right.
comment:29 by , 5 years ago
So, the bug is actually a virtual/physical buffer sizing mixup which went undetected due to the get_memory_map usage being incorrect. hrev54077 makes this a panic(), so at least it won't occur silently anymore.
I have a bunch of work to do to refactor the driver to not go through that function in the first place and actually resolve this ticket, still.
comment:30 by , 5 years ago
Blocking: | 15897 added |
---|
by , 5 years ago
Attachment: | photo_2020-04-21_23-24-42.jpg added |
---|
No boot on hrev54077 64-bit with an NVMe SSD:
device Mass storage controller (Non-Volatile memory controller, NVM Ex [1|8|2]
vendor 15b7: Sandisk Corp
device 5002: WD Black 2018/PC SN720 NVMe SSD
hrev54010 works fine!
comment:32 by , 5 years ago
To waddlesplash:
Is it possible to make a quick fix for this problem, such as preallocating memory and/or not allowing more than one I/O operation at the same time? Some solution should be in place before Beta 2; in the worst case, the NVMe driver should be disabled.
comment:33 by , 5 years ago
Yes I know. It was just extremely late last night when I was making those commits, and when I realized how broken the driver really was, I was considering removing it from the image entirely but went with just committing the assert fix. It's still the better part of 2 weeks until the branch date, I'll have time to fix this before then for sure.
Actually reading the docs further, 0 is an invalid argument for the size anyway; I need to pass in the real sizes here for the assert.
comment:34 by , 5 years ago
I have mostly completed the refactor in my local branch of the NVMe driver to use SGLs (Scatter-Gather Lists) instead of PRPs. Mounting disks works again, but larger I/O randomly fails; I may need to backport some changes from SPDK that add I/O segmentation.
Unfortunately, at least QEMU and VirtualBox have no support for these (there is a set of pending patches for QEMU that add them, which is how I have been testing.) I did not look at VMware yet, but depending on how good support is there and on bare metal, I may wind up having to implement a "slow path" instead of forgoing support for these altogether.
hrev54080 adds some preliminary work here, but the bulk of it is still offline.
comment:35 by , 5 years ago
Please stop attaching screenshots, I know it is still broken and will update this ticket when it is not.
comment:36 by , 5 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
KDL on boot *and* remaining disk corruption from unpacking the tarballs solved in hrev54102. In addition, the new driver is significantly faster and will use a lot less CPU than before.
comment:37 by , 5 years ago
Milestone: | Unscheduled → R1/beta2 |
---|
Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone
Another crash: