Opened 12 months ago

Closed 2 months ago

Last modified 2 months ago

#15123 closed bug (fixed)

KDL on unpacking LibreOffice tarball

Reported by: diver
Owned by: waddlesplash
Priority: normal
Milestone: R1/beta2
Component: Drivers/Disk/NVMe
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking: #15897
Platform: All

Description

hrev53209 x86_64 in VMware Fusion.

The kernel panics when I try to build LibreOffice using haikuporter libreoffice. Sometimes it goes through without crashing, but the unpacked source code is incomplete (many folders are missing).

Attachments (8)

kdl.png (110.6 KB) - added by diver 12 months ago.
kdl2.png (151.4 KB) - added by diver 12 months ago.
kdl3.png (138.7 KB) - added by diver 12 months ago.
kdl4.png (286.6 KB) - added by diver 10 months ago.
kdl5.png (206.4 KB) - added by diver 4 months ago.
kdl6.png (268.2 KB) - added by diver 3 months ago.
photo_2020-04-21_23-24-42.jpg (111.4 KB) - added by kim1963 2 months ago.
No boot 54077 64 bit NVMe SSD device Mass storage controller (Non-Volatile memory controller, NVM Ex [1|8|2] vendor 15b7: Sandisk Corp device 5002: WD Black 2018/PC SN720 NVMe SSD rev 54010 work fine !
photo_2020-04-24_20-02-21.jpg (134.3 KB) - added by kim1963 2 months ago.
No boot 54081


Change History (45)

by diver, 12 months ago

Attachment: kdl.png added

comment:1 by diver, 12 months ago

Another crash:

by diver, 12 months ago

Attachment: kdl2.png added

comment:2 by diver, 12 months ago

After reboot checksum of a tarball is wrong, so it looks like it gets corrupted. Could it be related to hrev53159? My /data partition is NVMe virtual disk.

comment:3 by diver, 12 months ago

Yet another crash:

by diver, 12 months ago

Attachment: kdl3.png added

comment:4 by diver, 12 months ago

Running checkfs /data afterwards freezes Haiku while checking.

comment:5 by diver, 12 months ago

When I switched NVME to IDE in VM settings and reformatted the volume to BFS the crashes went away.

comment:6 by waddlesplash, 12 months ago

Looks like some kind of disk corruption so that's not surprising. Next time this happens please see if it occurs on IDE first, instead of just reformatting without even retesting.

comment:7 by waddlesplash, 12 months ago

I also note that the NVMe driver is not designed for non-NVMe latencies. I.e., using the NVMe driver on a drive that is really backed by a spinning disk will lead to *worse* CPU usage and access times than using IDE.

comment:8 by pulkomandy, 12 months ago

Summary: changed from "[kenel] panics on unpacking LibreOffice tarball" to "KDL on unpacking LibreOffice tarball"

comment:9 by diver, 12 months ago

The VMware NVMe disk image is on a (non-NVMe) Samsung SSD EVO 860 1TB here.

Well, I switched this image back from IDE to NVMe in the VM settings, reformatted it and tried to unpack the LibreOffice translations tarball, and again it wasn't unpacked completely. Running checkfs crashed to KDL with the same backtrace as in the description (which was continuable).

I'm guessing this is something with the nvme driver, as I can't reproduce it with the ata one. Or perhaps the BFS driver doesn't play well with it.

wget https://github.com/LibreOffice/translations/archive/libreoffice-6.2.5.1.tar.gz
tar xzfv libreoffice-6.2.5.1.tar.gz

After unpacking, check translations-libreoffice-6.2.5.1/source: there should be 125 folders. If there are fewer, it is possible that BFS has just been corrupted.
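The folder count can also be checked with a short script. This is only a minimal sketch and not part of the original report: the function names are made up for illustration, and 125 is simply the figure quoted above.

```python
import os

# Number of folders expected under .../source, as quoted in the comment above.
EXPECTED_DIRS = 125


def count_subdirs(path):
    """Return the number of immediate subdirectories of `path`."""
    with os.scandir(path) as entries:
        return sum(1 for entry in entries if entry.is_dir(follow_symlinks=False))


def extraction_looks_complete(source_dir, expected=EXPECTED_DIRS):
    """True if `source_dir` contains exactly the expected number of folders."""
    return count_subdirs(source_dir) == expected
```

For example, extraction_looks_complete("translations-libreoffice-6.2.5.1/source") would return False for a partial extraction such as the 88-folder result reported later in this ticket.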

comment:10 by diver, 10 months ago

Component: changed from System/Kernel to Drivers/Disk/NVMe
Owner: changed from nobody to waddlesplash

comment:11 by waddlesplash, 10 months ago

Can you reproduce these KDLs at all? Getting a syslog from such a KDL would be nice...

comment:12 by diver, 10 months ago

hrev53404 x86_64.

  • Freshly created 10GB NVMe disk in VMware mounted as /data.
  • Started building libreoffice off it and Ctrl+C after a few minutes.
  • Started checkfs /data and got a KDL.

comment:13 by diver, 10 months ago

Nothing in KDL syslog about nvme at all.

by diver, 10 months ago

Attachment: kdl4.png added

comment:14 by waddlesplash, 10 months ago

hrev53416 adds some better overflow checking; so please retest after that.

comment:15 by waddlesplash, 10 months ago

Diver reports this: https://imgur.com/a/yCYt2nL

So, it appears there is somehow object_cache corruption. Guess it's time I try to reproduce this under the guarded heap.

comment:16 by waddlesplash, 4 months ago

Please retest after hrev53947.

comment:17 by diver, 4 months ago

hrev53962. Tested with:

wget https://github.com/LibreOffice/translations/archive/libreoffice-6.2.5.1.tar.gz
tar xzfv libreoffice-6.2.5.1.tar.gz

Result: 88 folders created out of 125. Extraction failed with:

tar: Skipping to next header
ranslations-libreoffice-6.2.5.1/source/sr-Latn/forms/
tar: Skipping to next header
ranslations-libreoffice-6.2.5.1/source/sr-Latn/formula/
tar: Skipping to next header

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now

comment:18 by diver, 4 months ago

Repeated a few more times and got this KDL:

by diver, 4 months ago

Attachment: kdl5.png added

comment:19 by X512, 4 months ago

Issue looks similar to #12698.

comment:20 by waddlesplash, 4 months ago

Yes, I wonder if there's a race condition in the FS cache. Since NVMe is non-sequential (that is, when two threads call read() on an NVMe device one after the other, the second call may return before the first), and I don't think any other disk driver supports that (not even ramdisks, iirc), we may be hitting a very interesting bug indeed.
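A toy model of that out-of-order completion, in Python rather than kernel code. Everything here is an assumption made for illustration: simulate_reads and the latency values are invented, and time.sleep merely stands in for device latency. Nothing in it reflects the actual driver or FS cache code.

```python
import threading
import time


def simulate_reads(latencies):
    """Start one 'read' per latency, in order, and return the completion order.

    Request i is issued before request i+1, but each finishes only after its
    own (simulated) latency, so a later request can complete first.
    """
    completed = []
    lock = threading.Lock()

    def read(request_id, latency):
        time.sleep(latency)           # stand-in for the device doing the I/O
        with lock:
            completed.append(request_id)

    threads = [
        threading.Thread(target=read, args=(i, latency))
        for i, latency in enumerate(latencies)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return completed
```

With latencies such as [0.3, 0.0], request 1 is issued second but finishes first. Code layered above a device that behaves this way cannot assume that requests finish in the order they were issued.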

comment:21 by waddlesplash, 4 months ago

Well, I can't reproduce this in QEMU by running 4 "git gc"s (with 4 CPUs) on an NVMe disk at once, repeatedly, for multiple minutes. All is still OK, and checkfs still succeeds. I'll try the libreoffice tarfile test later, I guess.

comment:22 by waddlesplash, 3 months ago

I managed to reproduce this, finally. hrev53997 seems to fix the KDLs, but the disk corruption remains. Actually, running sha256sum on the tar.gz gives a different result every time...!
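The "different checksum every time" check can be scripted as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and repeated reads of the same file may be served from the file cache, so a stable result here does not by itself prove that the data on disk is intact.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_digests(path, runs=5):
    """Hash the same file `runs` times and return the set of digests seen.

    A healthy read path yields exactly one digest; more than one means that
    reads of unchanged data returned different bytes.
    """
    return {sha256_of(path) for _ in range(runs)}
```

For example, len(distinct_digests("libreoffice-6.2.5.1.tar.gz")) > 1 would correspond to the symptom described above.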

comment:23 by waddlesplash, 3 months ago

Full details of the different checksums in #12698. But I actually can't reproduce that anymore; it's possible my FS was still corrupted and this contributed to the strange cache behavior somehow.

After a full checkfs, deleting all files, and redownloading, I can extract the archive successfully; all directories, no KDLs, and all checksums still look good after a reboot, and extracting it again and running a recursive diff also produces no differences. Huzzah!

Running a full checkfs now produces no errors. However, a lot of this appears in the syslog:

PageWriteWrapper: Failed to write page 0xffffffff82d558c0: General system error

So, that seems concerning, but I can't find any reason for that. At least after a reboot, everything (including checkfs) all comes out OK. So maybe that was always there and it's "harmless"...?

Please do retest and let me know if this is solved for you, though; at least as far as I can see, it's solved here.
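The recursive diff of two extractions mentioned above can be approximated with the sketch below. It is not the command that was actually run; tree_differences is a hypothetical helper. It compares file contents byte by byte (shallow=False) because a size-and-timestamp comparison would miss corrupted data of the same length.

```python
import filecmp
import os


def tree_differences(left, right):
    """Return sorted relative paths that differ between two directory trees.

    A path is reported if it exists on one side only, if it is a directory on
    one side and a file on the other, or if the file contents differ.
    """
    diffs = []
    pending = [""]  # relative directories still to be compared
    while pending:
        rel = pending.pop()
        left_names = set(os.listdir(os.path.join(left, rel)))
        right_names = set(os.listdir(os.path.join(right, rel)))
        for name in left_names ^ right_names:      # present on one side only
            diffs.append(os.path.join(rel, name))
        for name in left_names & right_names:      # present on both sides
            l_path = os.path.join(left, rel, name)
            r_path = os.path.join(right, rel, name)
            child = os.path.join(rel, name)
            if os.path.isdir(l_path) and os.path.isdir(r_path):
                pending.append(child)
            elif os.path.isdir(l_path) or os.path.isdir(r_path):
                diffs.append(child)
            elif not filecmp.cmp(l_path, r_path, shallow=False):
                diffs.append(child)
    return sorted(diffs)
```

An empty list corresponds to the "no differences" outcome reported above.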

comment:24 by diver, 3 months ago

hrev54010 x86_64.

tar: Skipping to next header

gzip: stdin: invalid compressed data--format violated
tar: Child returned status 1
tar: Error is not recoverable: exiting now

sha256sum on the tar.gz gives a different result every other time. After several successful checkfs runs it KDL'ed again:


by diver, 3 months ago

Attachment: kdl6.png added

comment:25 by X512, 3 months ago

I hope you don't perform these tests on important data, because checkfs can wipe a lot of files even if these files are valid.

comment:26 by diver, 3 months ago

Yes, I'm aware that checkfs can sometimes corrupt bfs, I think it happened to PulkoMandy at least once and somebody else too, so I'm not using it for anything important.

comment:27 by pulkomandy, 3 months ago

For me it did not "corrupt" BFS, it ran out of memory (I think) and deleted almost all files. But the filesystem itself was fine, just mostly empty.

comment:28 by diver, 3 months ago

hrev54010 x86_64. Removed work-* and download dirs for LibreOffice on an already existing /data (.vmdk) partition. It was used to successfully build LO with ata driver. Tried again with nvme driver and patch didn't apply, also sha256 was different and the same tar error as above was observed. Something isn't right still.

comment:29 by waddlesplash, 2 months ago

So, the bug is actually a virtual/physical buffer sizing mixup which went undetected due to the get_memory_map usage being incorrect. hrev54077 makes this a panic(), so at least it won't occur silently anymore.
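To illustrate why the size of a virtual buffer and the sizes of the physical ranges behind it are different things: a buffer that is contiguous in virtual memory can straddle several pages, and those pages need not be adjacent in physical memory. The sketch below is not the driver's code. The 4 KiB page size and both function names are assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed page size for this illustration


def pages_spanned(virt_addr, length, page_size=PAGE_SIZE):
    """Return how many pages the range [virt_addr, virt_addr + length) touches."""
    if length <= 0:
        raise ValueError("length must be positive")
    first_page = virt_addr // page_size
    last_page = (virt_addr + length - 1) // page_size
    return last_page - first_page + 1


def split_at_page_boundaries(virt_addr, length, page_size=PAGE_SIZE):
    """Split a virtual range into (address, size) pieces that each stay in one page.

    Each piece could map to a different physical page, so each needs its own
    physical address and size; the total virtual length describes none of them.
    """
    pieces = []
    addr, remaining = virt_addr, length
    while remaining > 0:
        room_in_page = page_size - (addr % page_size)
        size = min(room_in_page, remaining)
        pieces.append((addr, size))
        addr += size
        remaining -= size
    return pieces
```

A 200-byte buffer starting at address 4000 already spans two pages, so treating its virtual length as the length of a single physical range would be wrong.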

I have a bunch of work to do to refactor the driver to not go through that function in the first place and actually resolve this ticket, still.

comment:30 by waddlesplash, 2 months ago

Blocking: 15897 added

by kim1963, 2 months ago

Attachment: photo_2020-04-21_23-24-42.jpg added

No boot 54077 64 bit NVMe SSD device Mass storage controller (Non-Volatile memory controller, NVM Ex [1|8|2] vendor 15b7: Sandisk Corp device 5002: WD Black 2018/PC SN720 NVMe SSD rev 54010 work fine !

comment:31 by X512, 2 months ago

Please stop spamming, developers are already aware of this problem.

comment:32 by X512, 2 months ago

To waddlesplash:

Is it possible to make a quick fix for this problem, like preallocating memory and/or not allowing more than one I/O operation at the same time? Some solution should be in place before Beta 2; in the worst case, the NVMe driver should be disabled.

comment:33 by waddlesplash, 2 months ago

Yes I know. It was just extremely late last night when I was making those commits, and when I realized how broken the driver really was, I was considering removing it from the image entirely but went with just committing the assert fix. It's still the better part of 2 weeks until the branch date, I'll have time to fix this before then for sure.

Actually reading the docs further, 0 is an invalid argument for the size anyway; I need to pass in the real sizes here for the assert.

comment:34 by waddlesplash, 2 months ago

I have mostly completed the refactor in my local branch of the NVMe driver to use SGLs (Scatter-Gather Lists) instead of PRPs. Mounting disks works again, but larger I/O randomly fails; I may need to backport some changes from SPDK that add I/O segmentation.

Unfortunately, at least QEMU and VirtualBox have no support for these (there is a set of pending patches for QEMU that add them, which is how I have been testing.) I did not look at VMware yet, but depending on how good support is there and on bare metal, I may wind up having to implement a "slow path" instead of forgoing support for these altogether.

hrev54080 adds some preliminary work here, but the bulk of it is still offline.
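For readers unfamiliar with the term, I/O segmentation means splitting one large request into several requests that each fit within a transfer limit. The sketch below shows only that general idea. It is not SPDK's implementation or the Haiku driver's, and segment_io and max_transfer are names invented for the example.

```python
def segment_io(offset, length, max_transfer):
    """Split the request [offset, offset + length) into (offset, size) segments.

    Every segment is at most `max_transfer` bytes; together the segments cover
    the original request exactly once, in order.
    """
    if max_transfer <= 0:
        raise ValueError("max_transfer must be positive")
    segments = []
    end = offset + length
    while offset < end:
        size = min(max_transfer, end - offset)
        segments.append((offset, size))
        offset += size
    return segments
```

For instance, a 10-byte request with a 4-byte limit becomes three segments of 4, 4 and 2 bytes.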

by kim1963, 2 months ago

Attachment: photo_2020-04-24_20-02-21.jpg added

No boot 54081

comment:35 by waddlesplash, 2 months ago

Please stop attaching screenshots, I know it is still broken and will update this ticket when it is not.

comment:36 by waddlesplash, 2 months ago

Resolution: fixed
Status: changed from new to closed

KDL on boot *and* remaining disk corruption from unpacking the tarballs solved in hrev54102. In addition, the new driver is significantly faster and will use a lot less CPU than before.

comment:37 by nielx, 2 months ago

Milestone: changed from Unscheduled to R1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone
