Opened 5 years ago

Last modified 3 years ago

#15660 new task

BOpenHashTable becomes corrupt if duplicates are inserted

Reported by: ambroff Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

In the investigation of #13927 we found the culprit to be the use of the BOpenHashTable template.

Despite its name, BOpenHashTable is a chained hash table, meaning that a hash function is used to map a key to a bucket, and each bucket is a linked list of elements. The array of buckets grows and shrinks automatically as the load changes in the hash table. The linked-list nodes are the items inserted into the hash table, so the design of BOpenHashTable requires that elements inserted into the hash table can be used as intrusive linked-list elements.

The main problem with BOpenHashTable is that inserting the same object multiple times is not supported, but by default there are no runtime checks to prevent it. In a typical chained hash table implementation, an item can safely be inserted multiple times, since a new linked-list node is allocated for every insertion. But because BOpenHashTable uses an intrusive linked list, inserting an object multiple times creates a cycle in the bucket's list.

There is an optional CheckDuplicates non-type template parameter, which defaults to false. If it is set to true, then every insertion or removal performs a search for duplicates, which requires scanning the entire bucket. If a duplicate is found, debugger() or panic() is called, depending on whether the code runs in userspace or in the kernel. This turns the O(1) insertion into an O(N) insertion.

For this task, the implementation of BOpenHashTable should be changed to introduce some extra safety without introducing much additional runtime cost. Duplicate insertion should be a no-op.

I see several approaches that make sense:

  1. Make BOpenHashTable a true open hash table (open addressing). This means potentially more copying as the buffer resizes, but it may perform better for some workloads and removes the need for the intrusive linked list.
  2. Keep the chained hash table design but drop the intrusive linked list. This would mean more allocations, but also more safety.
  3. Replace the intrusive linked list with a random-access collection such as a vector or a binary search tree. This would allow maintaining invariants for each bucket, for example keeping the items sorted so that duplicates can always be found in less than O(N) time.

Once https://review.haiku-os.org/c/haiku/+/2165 is merged, we will have a test suite for BOpenHashTable that we can use to catch at least some regressions introduced by changes to its implementation.

Change History (2)

comment:1 by korli, 5 years ago

This comment makes it clear why the linked list is used. Not failing on out-of-memory is a common feature in the kernel, and definitely one to retain: "The link between entries is part of the ValueType stored items, which makes sure the table can always accept new items and will never fail because it is out of memory (except at Init time)."

comment:2 by waddlesplash, 3 years ago

Component: General → System