Opened 18 months ago

Last modified 4 months ago

#13554 new enhancement

Switch system allocator

Reported by: waddlesplash
Owned by: nobody
Priority: high
Milestone: R2
Component: System/libroot.so
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description (last modified by waddlesplash)

We presently use an old version of the Hoard allocator, which is licensed under the LGPLv2. Newer versions are GPL or commercial license only, which is obviously not acceptable.

Benchmarks done by the Lockless group also show that the Hoard allocator has a very worrying performance drop-off for larger allocations; see https://locklessinc.com/benchmarks_allocator.shtml

Change History (6)

comment:1 Changed 6 months ago by waddlesplash

Description: modified (diff)
Summary: Investigate switching allocators → Switch system allocator

PulkoMandy has some (slightly old) notes in #7420. He also links to lockless' allocator benchmark page.

One of the musl libc developers has a more comprehensive analysis of the problems with jemalloc here: http://www.openwall.com/lists/musl/2018/04/23/2 -- so it looks like it's not the best choice for us.

comment:2 Changed 6 months ago by waddlesplash

So, here are the options, collected / condensed:

jemalloc

Homepage: https://github.com/jemalloc/jemalloc
License: BSD 2-clause
Notable users: FreeBSD, NetBSD, Firefox, Rust
Benchmarks: see Lockless page; [0]: https://github.com/ezrosent/allocators-rs/blob/master/info/elfmalloc-performance.md
Pros: Battle-tested, easily tweakable, very fast.
Cons: Bigger memory footprint vs. the others; see [1]: http://www.openwall.com/lists/musl/2018/04/23/2 for more details. Appears to be optimized for speed over memory, with up to 33% overhead on smaller (~30MB) heaps [2]: https://news.ycombinator.com/item?id=13912639

tcmalloc

Homepage: http://goog-perftools.sourceforge.net/doc/tcmalloc.html & https://github.com/gperftools/gperftools
License: BSD 3-clause
Notable users: Google Chrome
Benchmarks: see Lockless page, and most of the other pages linked here
Pros: ?
Cons: Does not release memory back to the system without very specific orders to do so; i.e., it's optimized for high-performance workloads whose memory usage is mostly constant or increasing (servers, browsers...), rather than as a generalized allocator.
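
To illustrate those "very specific orders" (a sketch, not from the original discussion): as far as I know, getting gperftools' tcmalloc to hand cached free pages back to the OS takes an explicit call through its MallocExtension interface, roughly like this:

// Sketch: explicitly asking gperftools' tcmalloc to return cached free
// pages to the OS. Without a call like this (or tuning the
// TCMALLOC_RELEASE_RATE environment variable), freed memory tends to stay
// inside the allocator's caches.
#include <gperftools/malloc_extension.h>
#include <cstdlib>

int
main()
{
	void* block = std::malloc(16 * 1024 * 1024);	// large temporary allocation
	std::free(block);	// goes back to tcmalloc's caches, not to the OS

	// The explicit order to release free memory back to the system:
	MallocExtension::instance()->ReleaseFreeMemory();
	return 0;
}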

rpmalloc

Homepage: https://github.com/rampantpixels/rpmalloc
License: Public domain or MIT
Notable users: RemObjects, LMMS
Benchmarks: see [0], [2], [3]: https://github.com/rampantpixels/rpmalloc/blob/master/BENCHMARKS.md
Pros: Very well balanced between speed vs. memory usage, yet still faster than the competition for most workloads. Well-documented and readable code.
Cons: Not used as widely as jemalloc.

others

I've left the following allocators out of this analysis for various reasons:

  • nedmalloc. Unmaintained since 2014, and all benchmarks show it underperforming vs. the options above.
  • bmalloc. Significantly worse-performing vs. the options on this list, with no apparent advantages.
  • hoard3. Now GPL only, which is obviously not acceptable for use as a system library.
  • ptmalloc2. glibc default allocator; underperforms vs. all of these.

comment:3 Changed 6 months ago by waddlesplash

The benchmarks in [0] show that rpmalloc really does have the best memory-performance tradeoff, and even the producer-consumer scenario that rpmalloc lists as a potential worst case given its design actually does better than most of the competition. Plus, I believe it's already used on Haiku (we have an LMMS port), so there shouldn't be any compatibility issues.

comment:4 Changed 6 months ago by pulkomandy

A quick read of the rpmalloc readme shows that it avoids fragmentation by using a given 64K page for allocations of the same size (32 bytes, 64 bytes, ...). As a consequence:

  • All allocations are rounded up at least to the next power-of-two size (I think the rounding is even coarser for larger allocations)
  • If you don't do a lot of allocations of the same size, you are left with a largely empty 64K page for that size

So while the memory overhead of the allocator itself is low, the drawback is potentially a lot of unused memory, which is technically free but only allocatable for specific sizes of allocations. You may have many free 64-byte blocks but end up unable to do a 32-byte allocation, if the system as a whole runs out of memory space.
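
To make the rounding effect concrete, here is a rough sketch assuming pure power-of-two size classes as described above (rpmalloc's actual size-class spacing is finer, so treat the numbers as illustrative only):

// Sketch: waste caused by rounding requests up to a power-of-two size
// class. Assumes a smallest class of 32 bytes and pure doubling, which is
// a simplification of what rpmalloc actually does.
#include <cstdio>
#include <cstddef>

static size_t
RoundToSizeClass(size_t size)
{
	size_t sizeClass = 32;
	while (sizeClass < size)
		sizeClass *= 2;
	return sizeClass;
}

int
main()
{
	const size_t requests[] = { 33, 70, 200, 1500, 40000 };
	for (size_t request : requests) {
		size_t sizeClass = RoundToSizeClass(request);
		std::printf("request %6zu -> class %6zu (%.0f%% overhead)\n",
			request, sizeClass,
			100.0 * (sizeClass - request) / request);
	}
	return 0;
}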

We should see how this affects our use cases, and maybe it's possible to tweak the settings (several things are configurable). In any case, benchmarks in our real-life situation will be needed; there can't be a clear winner without them.

comment:5 Changed 5 months ago by miqlas

I have a jemalloc recipe somewhere, lemme check it. Here you go: https://github.com/haikuports/haikuports/pull/1965

Last edited 5 months ago by miqlas (previous) (diff)

comment:6 Changed 4 months ago by waddlesplash

You may have many free 64-byte blocks but end up unable to do a 32-byte allocation, if the system as a whole runs out of memory space.

Well, if mmap returns properly aligned addresses (or we create an API that does -- actually, I think create_area can do what we want here?), then the wastage will be for size classes instead of alignment. And in that case, as the README says:

Super spans (spans a multiple > 1 of the span size) can be subdivided into smaller spans to fulfill a need to map a new span of memory. By default the allocator will greedily grab and break any larger span from the available caches before mapping new virtual memory.

So we only need to worry about wasting memory to get proper alignment.
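
To sketch what "wasting memory to get proper alignment" would look like in practice (hypothetical glue code: MapAlignedSpan and the over-mapping approach are made up for illustration, and whether create_area can hand out naturally span-aligned addresses is exactly the open question above):

// Hypothetical span mapper on top of Haiku's create_area(): over-map by one
// span size so a 64K-aligned span always fits inside the area. The slack is
// the alignment waste discussed above. A real implementation would remember
// the area_id so it can delete_area() the mapping later; this sketch leaks it.
#include <OS.h>
#include <stdint.h>
#include <stdio.h>

static const size_t kSpanSize = 64 * 1024;	// rpmalloc's default span size

static void*
MapAlignedSpan(size_t size)
{
	void* address = NULL;
	size_t mapSize = size + kSpanSize;	// already a multiple of B_PAGE_SIZE here
	area_id area = create_area("rpmalloc span", &address, B_ANY_ADDRESS,
		mapSize, B_NO_LOCK, B_READ_AREA | B_WRITE_AREA);
	if (area < 0)
		return NULL;

	// Round the returned address up to the next span boundary.
	uintptr_t aligned = ((uintptr_t)address + kSpanSize - 1)
		& ~(uintptr_t)(kSpanSize - 1);
	return (void*)aligned;
}

int
main()
{
	void* span = MapAlignedSpan(kSpanSize);
	printf("span at %p, 64K-aligned: %s\n", span,
		((uintptr_t)span & (kSpanSize - 1)) == 0 ? "yes" : "no");
	return 0;
}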

But indeed, benchmarks are needed.
