Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#10880 closed bug (fixed)

haiku_loader can hang for minutes when one or more BIOS reported drives are unreadable

Reported by: MasterM Owned by: pdziepak
Priority: high Milestone: R1/beta1
Component: System/Boot Loader Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: yes Platform: x86

Description

The problem is caused by find_unique_check_sums() that tries to avoid checksum clash for a maximum of 200 iterations. If one or more drives report read errors during that loop it can take a very long time to complete.

Setting maximum tries to a low value would have a side effect of potentially increased change of collision. Better solution is to check if we can read from a drive before adding it to the block device list.

I marked this issue as high priority because the boot loader is a critical part of an operating system. Feel free to change the priority to something lower if necessary.

Attachments (1)

0001-haiku_loader-now-ignores-unusable-drives-reported-by.patch (1.5 KB) - added by MasterM 5 years ago.
Ok, let's try again. :)

Download all attachments as: .zip

Change History (15)

comment:1 Changed 5 years ago by MasterM

Has a Patch: set

comment:2 Changed 5 years ago by diver

Owner: changed from axeld to bonefish
Status: newassigned

comment:3 Changed 5 years ago by jessicah

Owner: changed from bonefish to jessicah

comment:4 Changed 5 years ago by waddlesplash

Milestone: UnscheduledR1/alpha5

comment:5 Changed 5 years ago by pdziepak

  1. Why try to read 512 bytes? Wouldn't 1 byte be enough?
  2. Format string, driveID is uint8 not unsigned.
  3. That's probably out of scope of this patch but if I understand correctly (basing on th description of this ticket) we should try some perfect hashing here (without any bound on number of attempts) that guarantees generation of collision-free hash table in O(n). I admit though that I haven't fully read and understood the code in question yet, so this suggestion may be wrong.
  4. (that's to the "general audience", not only for you, MasterM) Why do we expect people to send patches here? It is very inconvenient to review anything if I cannot directly address the part of code I am unsure of. We already have haiku-dev ML and git already has git send-email. Wouldn't it be better to encourage people to use these?

comment:6 Changed 5 years ago by waddlesplash

I say we use GitHub, but manually pull from repositories and don't use their automerge feature.

comment:7 in reply to:  5 ; Changed 5 years ago by MasterM

Replying to pdziepak:

  1. Why try to read 512 bytes? Wouldn't 1 byte be enough?

BIOSDrive::ReadAt uses BIOS interrupt 13h for disk access. You can only read a multiple of sector size that way and since there are virtually no devices with sector size less than 512 bytes, there is really no performance difference when reading anything <=512 bytes.

Since find_unique_check_sums() uses chunks of 512b anyway I figured it would be good heuristic to check if one could read at least chunk that big without any explicit read errors.

  1. Format string, driveID is uint8 not unsigned.

On most compilers driveID should be promoted to unsigned int automatically. Should I add explicit cast there?

  1. That's probably out of scope of this patch but if I understand correctly (basing on th description of this ticket) we should try some perfect hashing here (without any bound on number of attempts) that guarantees generation of collision-free hash table in O(n). I admit though that I haven't fully read and understood the code in question yet, so this suggestion may be wrong.

Well there are really two seperate problems with that piece of code. One of them is indeed a simple sum used as a poor's man hashing algorithm (a good candidate for replacement would be crc32 for example) and the other one is that it tries to read random sectors from a drive in an attempt to uniquely identify it. So if two drives contain exactly the same data, then it will fail anyway after max tries. This scenario is of course highly unlikely but nevertheless possible.

It is worth noting that this algorithm only runs if haiku_loader is unable to get drive identification data from BIOS (which happens quite often on real hardware but I haven't looked at that code so can't say why).

comment:8 in reply to:  7 Changed 5 years ago by pdziepak

Replying to MasterM:

Replying to pdziepak:

  1. Why try to read 512 bytes? Wouldn't 1 byte be enough?

BIOSDrive::ReadAt uses BIOS interrupt 13h for disk access. You can only read a multiple of sector size that way and since there are virtually no devices with sector size less than 512 bytes, there is really no performance difference when reading anything <=512 bytes.

is_drive_readable doesn't have to care about the implementation of BIOSDrive::ReadAt. All we need to do is to check whether read from the device are successful - 1 byte is enough for that purpose.

  1. Format string, driveID is uint8 not unsigned.

On most compilers driveID should be promoted to unsigned int automatically. Should I add explicit cast there?

Actually, it is promoted to int (assuming sizeof(int) > 1), but that doesn't really matter. It is a good practice to use PRI* macros whenever trying to print [u]int*_t (well, in this case it would be B_PRI* and [u]int*.

comment:9 Changed 5 years ago by pdziepak

I should have mentioned earlier that this patch really needs a proper description. Something similar to what you wrote in this bug report should be good (reference "fixes #...." is not enough).

Also, because of C++11 user-defined literals there has to be space between " and B_PRI. Obviously, we like symmetry very much so you should also add space after B_PRI.

The comments seem a bit redundant, but it is not very important so it can stay the way it is if you want.

comment:10 in reply to:  9 Changed 5 years ago by jessicah

Replying to pdziepak:

I should have mentioned earlier that this patch really needs a proper description. Something similar to what you wrote in this bug report should be good (reference "fixes #...." is not enough).

It already does :) That is the second line, not the first.

Changed 5 years ago by MasterM

Ok, let's try again. :)

comment:11 Changed 5 years ago by pdziepak

Owner: changed from jessicah to pdziepak

In newly added code we prefer BIOSDrive* drive over BIOSDrive *drive. Please stick to that in future, otherwise you risk that your committer will embarrass themselves by failing to use git. ;)

Anyway, merged in hrev47279. Thanks!

comment:12 Changed 5 years ago by pdziepak

Resolution: fixed
Status: assignedclosed

comment:13 Changed 5 years ago by axeld

Some general remarks (just got aware of this ticket):

  • The hash value is only used to identify disks later on in the kernel, if there is no other mapping available (most BIOS implementations do not provide that info, unfortunately). It doesn't really need to be a strong hashing mechanism, as collisions are still not very likely.
  • If the disks are identical, this algorithm will indeed fail, but then it does not really matter, anyway ;-)
  • The real problem in the randomness, however, is that it may fail for non-identical drives, too. That may be an actual problem, however unlikely. The only way to improve that would be to smartly choose the blocks, but that would require intimidate knowledge of the underlying file system. And may also fail unless you read in the complete disk.
  • As discussed on the ML a short time ago, we use Trac for tickets to be able to not lose patches. Whether or not this is a good idea might be debatable, but IMO for simple bug fixes it's not really important to have the committer to stay around to push the same patch over and over until someone finally accepts it. Of course, there are tons of ways to improve this workflow, like using Gerrit instead, or having a "Commit" button in Trac directly.

I think removing drives from the list is a good solution, although I'm not sure it should happen at that point (it could also happen at the initial disk detection, as that would save the extra read).

comment:14 Changed 5 years ago by pulkomandy

Milestone: R1/alpha5R1/beta1
Note: See TracTickets for help on using tickets.