Opened 21 months ago
Last modified 8 weeks ago
#13554 new enhancement
Switch system allocator
Reported by: | waddlesplash | Owned by: | nobody |
---|---|---|---|
Priority: | high | Milestone: | R2 |
Component: | System/libroot.so | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Has a Patch: | no | Platform: | All |
Description
We presently use an old version of the Hoard allocator, which is licensed under the LGPLv2. Newer versions are available only under the GPL or a commercial license, which is obviously not acceptable.
Benchmarks by the Lockless group also show that the Hoard allocator has a very worrying performance drop-off for larger allocations; see https://locklessinc.com/benchmarks_allocator.shtml
Change History (8)
comment:1 Changed 8 months ago by
Description: | modified (diff) |
---|---|
Summary: | Investigate switching allocators → Switch system allocator |
comment:2 Changed 8 months ago by
So, here are the options, collected / condensed:
jemalloc
Homepage: https://github.com/jemalloc/jemalloc
License: BSD 2-clause
Notable users: FreeBSD, NetBSD, Firefox, Rust
Benchmarks: see Lockless page; [0]: https://github.com/ezrosent/allocators-rs/blob/master/info/elfmalloc-performance.md
Pros: Battle-tested, easily tweakable, very fast.
Cons: Bigger memory footprint than the others; see [1]: http://www.openwall.com/lists/musl/2018/04/23/2 for more details. Appears to be optimized for speed over memory usage, with up to 33% overhead for smaller (~30MB) heaps [2]: https://news.ycombinator.com/item?id=13912639
tcmalloc
Homepage: http://goog-perftools.sourceforge.net/doc/tcmalloc.html & https://github.com/gperftools/gperftools
License: BSD 3-clause
Notable users: Google Chrome
Benchmarks: see Lockless page, and most of the other pages linked here
Pros: ?
Cons: Does not release memory back to the system unless explicitly told to do so, i.e., it's optimized for high-performance workloads whose memory use is mostly constant or increasing (servers, browsers, ...) rather than being a general-purpose allocator.
rpmalloc
Homepage: https://github.com/rampantpixels/rpmalloc
License: Public domain or MIT
Notable users: RemObjects, LMMS
Benchmarks: see [0], [2], [3]: https://github.com/rampantpixels/rpmalloc/blob/master/BENCHMARKS.md
Pros: Very well balanced between speed vs. memory usage, yet still faster than the competition for most workloads. Well-documented and readable code.
Cons: Not used as widely as jemalloc.
others
I've left the following allocators out of this analysis for various reasons:
- nedmalloc. Unmaintained since 2014, and all benchmarks show it underperforming vs. the allocators above.
- bmalloc. Performs significantly worse than the options on this list, with no apparent advantages.
- hoard3. Now GPL only, which is obviously not acceptable for use as a system library.
- ptmalloc2. glibc default allocator; underperforms vs. all of these.
comment:3 Changed 8 months ago by
The benchmarks in [0] show that rpmalloc really does have the best memory-performance tradeoff, and even the producer-consumer scenario that rpmalloc lists as a potential worst case given its design is actually better than most of the competition. Plus, I believe it's already used on Haiku (we have an LMMS port), so there shouldn't be any compatibility issues.
comment:4 Changed 8 months ago by
A quick read of the rpmalloc readme shows that it avoids fragmentation by using a given 64K page for allocations of the same size (32 bytes, 64 bytes, ...). As a consequence:
- All allocations are rounded up to at least the next power-of-two size (I think the rounding is even coarser for larger allocations)
- If you don't do a lot of allocations of the same size, you are left with a largely empty 64K page for that size class
As a result, while the allocator's own memory overhead is low, the drawback is potentially a lot of unused memory: it is technically free, but only allocatable for specific allocation sizes. You may have many free 64-byte blocks yet be unable to satisfy a 32-byte allocation if the system as a whole runs out of memory space.
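To make that concern concrete, here is a small standalone illustration of the size-class effect described above; the 16-byte minimum class, pure power-of-two rounding, and 64K span size are assumptions for the example, not rpmalloc's actual configuration:

```c
/* Illustration of the size-class effect described above (not rpmalloc's
 * actual implementation): requests are rounded up to a size class, and
 * each 64K page only serves one class. */
#include <stdio.h>
#include <stddef.h>

#define SPAN_SIZE ((size_t)64 * 1024)   /* assumed 64K page per size class */

/* Round a request up to the next power of two (illustrative class scheme). */
static size_t size_class(size_t size)
{
    size_t class_size = 16;             /* assumed minimum class */
    while (class_size < size)
        class_size <<= 1;
    return class_size;
}

int main(void)
{
    size_t requests[] = { 24, 33, 100, 1000 };
    for (size_t i = 0; i < sizeof(requests) / sizeof(requests[0]); i++) {
        size_t cls = size_class(requests[i]);
        printf("request %4zu -> class %5zu, %zu blocks per 64K page\n",
               requests[i], cls, SPAN_SIZE / cls);
    }
    /* A single 33-byte allocation still commits a whole 64K page to the
     * 64-byte class: the remaining 1023 blocks are free, but only usable
     * for that class. */
    return 0;
}
```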
We should see how this affects our use cases, and maybe it's possible to tweak the settings (several things are configurable). In any case, benchmarks in our real-life situations will be needed; there can't be a clear winner without them.
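For reference, a minimal sketch of the kind of multi-threaded stress loop such benchmarks could start from (purely illustrative; thread count, sizes, and iteration counts are arbitrary, and a real comparison would also have to measure memory use and our actual application workloads, not just wall-clock time):

```c
/* Purely illustrative allocator stress loop, not a real benchmark suite:
 * several threads doing mixed-size malloc/free so the same binary can be
 * linked against the current hoard, rpmalloc, jemalloc, etc. and compared. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define THREADS     4
#define ITERATIONS  1000000
#define SLOTS       256

static void *worker(void *arg)
{
    void *slots[SLOTS] = { NULL };
    uint32_t seed = (uint32_t)(uintptr_t)arg;

    for (int i = 0; i < ITERATIONS; i++) {
        seed = seed * 1103515245u + 12345u;     /* tiny LCG, good enough here */
        int slot = (seed >> 16) % SLOTS;
        free(slots[slot]);
        seed = seed * 1103515245u + 12345u;
        /* Sizes from 16 bytes up to 64KB, to cross several size classes. */
        slots[slot] = malloc((size_t)16 << ((seed >> 16) % 13));
    }
    for (int i = 0; i < SLOTS; i++)
        free(slots[i]);
    return NULL;
}

int main(void)
{
    pthread_t threads[THREADS];
    struct timespec begin, end;

    clock_gettime(CLOCK_MONOTONIC, &begin);
    for (long i = 0; i < THREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < THREADS; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    printf("%.3f seconds\n", (double)(end.tv_sec - begin.tv_sec)
           + (double)(end.tv_nsec - begin.tv_nsec) / 1e9);
    return 0;
}
```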
comment:5 Changed 8 months ago by
I have a jemalloc recipe somewhere, lemme check it. Here you go: https://github.com/haikuports/haikuports/pull/1965
comment:6 Changed 6 months ago by
You may have many free 64-byte blocks yet be unable to satisfy a 32-byte allocation if the system as a whole runs out of memory space.
Well, if mmap returns addresses already aligned properly (or we create an API that does -- actually I think create_area can do what we want here?) then the wastage will be for size classes instead of alignment. And in that case, as the README says:
Super spans (spans a multiple > 1 of the span size) can be subdivided into smaller spans to fulfill a need to map a new span of memory. By default the allocator will greedily grab and break any larger span from the available caches before mapping new virtual memory.
So we only need to worry about wasting memory to get proper alignment.
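For illustration, the usual over-allocate-and-trim trick for handing the allocator span-aligned memory when mmap itself doesn't guarantee 64K alignment looks roughly like the sketch below; the helper is hypothetical (not rpmalloc code), assumes `size` is a multiple of the system page size, and a create_area()-based variant could reserve the address range instead of trimming:

```c
/* Hypothetical helper, not rpmalloc code: return a 64K-aligned mapping by
 * over-allocating with mmap and unmapping the unaligned slack. Assumes
 * `size` is a multiple of the system page size (spans always are). */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define SPAN_SIZE  ((size_t)64 * 1024)
#define SPAN_MASK  (SPAN_SIZE - 1)

static void *map_aligned_span(size_t size)
{
    /* Map enough extra that an aligned block of `size` bytes must fit. */
    size_t padded = size + SPAN_SIZE;
    uint8_t *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t aligned = ((uintptr_t)raw + SPAN_MASK) & ~(uintptr_t)SPAN_MASK;
    size_t head = aligned - (uintptr_t)raw;
    size_t tail = padded - head - size;

    /* Give the slack back so only `size` bytes stay mapped. */
    if (head != 0)
        munmap(raw, head);
    if (tail != 0)
        munmap((uint8_t *)aligned + size, tail);

    return (void *)aligned;
}
```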
But indeed, benchmarks are needed.
comment:7 Changed 2 months ago by
Blocking: | 7420 added |
---|
comment:8 Changed 8 weeks ago by
Blocking: | 7420 removed |
---|
PulkoMandy has some (slightly old) notes in #7420. He also links to Lockless' allocator benchmark page.
One of the musl libc developers has a more comprehensive analysis of the problems with jemalloc here: http://www.openwall.com/lists/musl/2018/04/23/2 -- so it looks like it's not the best choice for us.