Opened 16 years ago
Closed 16 years ago
#2902 closed bug (fixed)
USB File corruption with SMP enabled.
Reported by: | netster403 | Owned by: | axeld |
---|---|---|---|
Priority: | critical | Milestone: | R1/alpha1 |
Component: | System/Kernel | Version: | R1/pre-alpha1 |
Keywords: | Cc: | olivier.coursiere@… | |
Blocked By: | Blocking: | ||
Platform: | x86 |
Description
Dell Core 2 duo system page faults while booting.
Booting from USB flash drive dd'ed with haiku image. Two drives tried to eliminate flash device failure as an option.
Serial log attached.
vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x80788154, ip 0x80788154, write 1, user 0, thread 0xd PANIC: vm_page_fault: unhandled page fault in kernel space at 0x80788154, ip 0x80788154
system locks up at kdebug prompt so BT is not possible.
Attachments (8)
Change History (32)
by , 16 years ago
Attachment: | 10-22-2008-105659.log added |
---|
comment:1 by , 16 years ago
I updated a bit and installed hrev28283 onto a USB HDD and tried again.
Similar error, lots of "Not an executable" errors. I disabled SMP during boot and the system came up. This looks like an SMP issue on Core 2 Duo systems.
Serial logs:
10-23-2008-105028.log -- SMP enabled, lots of Not an "executable" errors, KDL reached but no BT could be run due to complete lockup.
10-23-2008-105223.log -- SMP disabled on boot, system boots and appears normal.
follow-up: 3 comment:2 by , 16 years ago
I saw you had a problem attaching the file. Could you please give me any details on error messages or the like? Can you try to attach the file again?
by , 16 years ago
Attachment: | 10-23-2008-105028.log added |
---|
by , 16 years ago
Attachment: | 10-23-2008-105223.log added |
---|
comment:3 by , 16 years ago
Replying to nielx:
I saw you had a problem attaching the file. Could you please give me any details on error messages or the like? Can you try to attach the file again?
It's working now. Thanks!
comment:4 by , 16 years ago
Component: | - General → System/Kernel |
---|---|
Platform: | All → x86 |
Summary: | Haiku Kernel page fault on Core 2 duo. → Haiku SMP page fault on Core 2 duo. |
comment:5 by , 16 years ago
Summary: | Haiku SMP page fault on Core 2 duo. → Haiku SMP page fault on Core 2 duo booting from USB |
---|
comment:6 by , 16 years ago
I verified that booting the same revision of the OS from a sata hard disk instead of USB HDD works with SMP.
The SMP issue only seems to effect USB boot drives. (10-23-2008-105028.log)
comment:7 by , 16 years ago
Summary: | Haiku SMP page fault on Core 2 duo booting from USB → USB File corruption with SMP enabled. |
---|
Hmm, it seems usb_storage transfers with SMP enabled corrupt files... thats odd and explains the painful boot with SMP enabled.
I have a zip file on a flash disk, freshly formatted fat 16 MD5 sum is correct for file on flash drive File copied over to sata haiku drive the MD5 changes. (smp enabled) File copied over to sata haiku drive the MD5 is correct. (smp disabled)
10-24-2008-112945.log attached.
by , 16 years ago
Attachment: | 10-24-2008-112945.log added |
---|
comment:8 by , 16 years ago
Component: | System/Kernel → Drivers/USB |
---|---|
Owner: | changed from | to
Priority: | normal → critical |
comment:9 by , 16 years ago
I have this system available for testing if needed (it's my work's property so testing abilities may become limited in the future)
Here are my testing steps above retyped to clarify (trac seems to cut some line breaks)
--- Booting test, where I first noticed the issue ---
- Booting from USB stick (SMP enabled) fails with errors in 10-23-2008-105028.log
- Booting from USB stick (SMP disabled) works as expected. 10-23-2008-105223.log
--- Further tests find exact problem location ---
- I have a zip file on a flash disk, freshly formatted fat 16.
- MD5 sum is correct for file on flash drive.
- File from USB to sata haiku drive the MD5 changes. (smp enabled). 10-24-2008-112945.log
- File from USB to sata haiku drive the MD5 is correct. (smp disabled).
Thanks!
comment:10 by , 16 years ago
Cc: | added |
---|
comment:12 by , 16 years ago
Revision: 28521
Same results while trying to boot from USB with SMP enabled. Logs look the same so I didn't do the MD5 test. Booting with SMP disabled from USB works fine.
see 11-06-2008-094320.log (SMP enabled) (failed boot from USB)
see 11-06-2008-094948.log (SMP disabled) (successful boot from USB)
by , 16 years ago
Attachment: | 11-06-2008-094320.log added |
---|
by , 16 years ago
Attachment: | 11-06-2008-094948.log added |
---|
comment:13 by , 16 years ago
Ok, I've tried copying the file over from my USB flash drive to my Haiku disk drive in 28521 and the same thing happens while SMP is enabled on boot.
Verified this behavior with FAT and BeFS on the flash drive, so it does not seem to be a file system issue.
There is some positive results from this latest experiment though! Disabling the one of the two cpu cores via the task manager menu, then plugging in the usb flash drive causes all USB transfer's to occur successfully. This definitely has to do with the USB storage portion not working well with two cpus.
I am attaching my test results.
Midnight_City_by_IsometricPixel.png == PNG transferred from USB flash drive (FAT) to Haiku drive (SMP disabled) (good md5) Midnight_City_by_IsometricPixel.broke.png == PNG transferred from USB flash drive (fat 16) to Haiku drive. (SMP enabled) (bad md5)
alex@houvonglucka02:~/usbtest$ md5sum Midnight_City_by_IsometricPixel.broke.png b0474700c6f7687e1cdaff237dbe85a4 Midnight_City_by_IsometricPixel.broke.png alex@houvonglucka02:~/usbtest$ md5sum Midnight_City_by_IsometricPixel.png efbdeeb1504a038cf102e3043969b703 Midnight_City_by_IsometricPixel.png
by , 16 years ago
Attachment: | Midnight_City_by_IsometricPixel.png added |
---|
by , 16 years ago
Attachment: | Midnight_City_by_IsometricPixel.broke.png added |
---|
comment:14 by , 16 years ago
I opened the two files in a hex editor and it seems the broke one has a huge section of 0's from around 0x00002000 to 0x00002FFF
It is REALLY strange how the missing data is exactly from 2000h - 2FFFh
Maybe one cpu reads a section of 1000h and the next cpu reads the next section and looses it?
comment:15 by , 16 years ago
HA! figured it out.
Blocks 0x00002000 to 0x00002FFF are getting moved to 0x00001000 to 0x00001FFF
The 0x00002000 to 0x00002FFF section is left blanked at 0;
comment:16 by , 16 years ago
I can reproduce this problem and I see similar behaviour. The corruption isn't consistent though, the blocks that get corrupted vary and they are not necessarily 0. I've had it multiple times that contents of the syslog were present in some blocks. These cannot possibly come from USB though. The only way these can end up in a file would be that the blocks were allocated on the heap or on some cache and then aren't filled by the effective file content. I am hesitant to accept that this is a USB problem for one reason though: All usb_disk actions are completely serialized. Every read/write/ioctl hook call will at the very first lock the device at hand through the device mutex and unlock it when everything is done. So unless mutexes are broken, it can't really be a race condition in usb_disk or lower levels, because all transfers are serial.
follow-up: 18 comment:17 by , 16 years ago
Milestone: | R1 → R1/alpha1 |
---|
If the corruption always happen on page boundaries this should at least hint to where the problem happens (and that doesn't sound a lot like USB indeed).
However, you might want to try what happens if you copy blocks using 'dd' as that would at least work around the file system level. Also, you could try copying to another USB device or /dev/dprintf (or some test driver) to see when the corruption appears.
Before BeGeistert, I copied around 50 GB from a USB disk under Haiku on a dual core system. AFAICT the data was not corrupted this way, although I have checked only some files so far.
In any case, I'll bump this to R1/alpha, as it seems to be pretty critical to me.
comment:18 by , 16 years ago
Replying to axeld:
Before BeGeistert, I copied around 50 GB from a USB disk under Haiku on a dual core system. AFAICT the data was not corrupted this way, although I have checked only some files so far.
I'm absolutely sure that I've copied a lot of data without problems back and forth on my dual and quad core machine some weeks ago. Now it happens a lot, corrupting practically every file unless you turn off the other cores. I've tried copying a few mp3 files from my memory stick which were all not recognized and were unplayable after that, so you'd have noticed easily if this happened to you.
In any case, I'll bump this to R1/alpha, as it seems to be pretty critical to me.
Absolutely. Will try with dd now and add some debug memsets to see if it really happens on the USB layer and if, on which one.
comment:19 by , 16 years ago
OK, I've verified that it is a problem within the USB stack itself. The buffers are simply not filled with any incoming data it seems. With dd I cannot reproduce the problem at all, but with data transfers from usb_disk it happens all the time. That the corruptions are on page boundaries comes from the fact that the read requests usb_disk gets are all 4096 in size. These whole buffers are then not filled with any data, so whatever was previously in that page will be written to the target file.
The "funny" thing is that the buffer address is always exactly the same one. So if one transfer doesn't overwrite that page, it should in fact just be the last block repeated again. But it isn't. It is random content, syslog data for example. Possibly the same buffer was used for a read/write on another device though.
I don't quite understand how this can happen, because the mechanism worked previously and wasn't changed. It looks a bit like if the buffer is supposed to be written from another core that it doesn't reach the right page. Some caching/TLB invalidation issue maybe? I remember seeing changes in that direction recently.
comment:20 by , 16 years ago
I don't know if it could be related, but hrev28223 exposed some race condition for SMP machines. It could be the same here.
comment:21 by , 16 years ago
I changed the physical page mapper for x86 in hrev28244. So besides a newly exposed race condition, this could be another origin of the problem. As far as I've seen the USB code doesn't map physical pages directly, which makes it rather unlikely though, since all other code doing it, does it very locally only (e.g. for copying physical pages or PIO).
For what kind of pages does the problem occur. Is it virtual or physical memory? And where does it come from?
comment:22 by , 16 years ago
Component: | Drivers/USB → System/Kernel |
---|---|
Owner: | changed from | to
It looks like the large physical page mapper is the culprit; it maps the page, but invalidates the TLB only on the current CPU.
comment:23 by , 16 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
serial log