undefined

upvote

points

by bmenrigh3 hours ago |

upvote

by skavi2 hours ago|

[-]

We evaluated a few allocators for some of our Linux apps and found (modern) tcmalloc to consistently win in time and space. Our applications are primarily written in Rust and the allocators were linked in statically (except for glibc). Unfortunately I didn't capture much context on the allocation patterns. I think in general the apps allocate and deallocate at a higher rate than most Rust apps (or more than I'd like at least).

Our results from July 2025:

rows are <allocator>: <RSS>, <time spent for allocator operations>

  app1:
  glibc: 215,580 KB, 133 ms
  mimalloc 2.1.7: 144,092 KB, 91 ms
  mimalloc 2.2.4: 173,240 KB, 280 ms
  tcmalloc: 138,496 KB, 96 ms
  jemalloc: 147,408 KB, 92 ms

  app2, bench1
  glibc: 1,165,000 KB, 1.4 s
  mimalloc 2.1.7: 1,072,000 KB, 5.1 s
  mimalloc 2.2.4:
  tcmalloc: 1,023,000 KB, 530 ms

  app2, bench2
  glibc: 1,190,224 KB, 1.5 s
  mimalloc 2.1.7: 1,128,328 KB, 5.3 s
  mimalloc 2.2.4: 1,657,600 KB, 3.7 s
  tcmalloc: 1,045,968 KB, 640 ms
  jemalloc: 1,210,000 KB, 1.1 s

  app3
  glibc: 284,616 KB, 440 ms
  mimalloc 2.1.7: 246,216 KB, 250 ms
  mimalloc 2.2.4: 325,184 KB, 290 ms
  tcmalloc: 178,688 KB, 200 ms
  jemalloc: 264,688 KB, 230 ms

tcmalloc was from github.com/google/tcmalloc/tree/24b3f29.

i don't recall which jemalloc was tested.

reply

upvote

by hedora1 hours ago|

[-]

I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).

tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).

Multithreaded async systems destroy that locality, so it constantly has to run through the exception case: A allocated a buffer, went async, the request wakes up on thread B, which frees the buffer, and has to synchronize with A to give it back.

Are you using async rust, or sync rust?

reply

upvote

by skavi1 hours ago|

[-]

modern tcmalloc uses per CPU caches via rseq [0]. We use async rust with multithreaded tokio executors (sometimes multiple in the same application). so relatively high thread counts.

[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...

reply

upvote

by usrnm35 minutes ago|

[-]

How do you control which CPU your task resumes on? If you don't then it's still the same problem described above, no?

reply

upvote

by ComputerGuru1 hours ago|

[-]

That’s a considerable regression for mimalloc between 2.1 and 2.2 – did you track it down or report it upstream?

Edit: I see mimalloc v3 is out – I missed that! That probably moots this discussion altogether.

reply

upvote

by skavi1 hours ago|

[-]

nope.

reply

upvote

by codexon26 minutes ago|

[-]

This is similar to what I experienced when I tested mimalloc many years ago. If it was faster, it wasn't faster by much, and had pretty bad worst cases.

reply

upvote

by pjmlp2 hours ago|

[-]

If you go into Dr Dobbs, The C/C++ User's Journal and BYTE digital archives, there will be ads of companies whose product was basically special cased memory allocator.

Even toolchains like Turbo Pascal for MS-DOS, had an API to customise the memory allocator.

The one size fits all was never a solution.

reply

upvote

by adgjlsfhk12 hours ago|

[-]

One of the best parts about GC languages is they tend to have much more efficient allocation/freeing because the cost is much more lumped together so it shows up better in a profile.

reply

upvote

by pjmlp2 hours ago|

[-]

Agreed, however there is also a reason why the best ones also pack multiple GC algorithms, like in Java and .NET, because one approach doesn't fit all workloads.

reply

upvote

by nevdka2 hours ago|

[-]

Then there’s perl, which doesn’t free at all.

reply

upvote

by hedora1 hours ago|

[-]

Perl frees memory. It uses refcounting, so you need to break heap cycles or it will leak.

(99% of the time, I find this less problematic than Java’s approach, fwiw).

reply

upvote

by cermicelli2 hours ago|

[-]

Freedom is overrated... :P

reply

upvote

by NooneAtAll32 hours ago|

[-]

doesn't java also?

I heard that was a common complaint for minecraft

reply

upvote

by xxs1 hours ago|

[-]

What do you mean - if Java returns memory to the OS? Which one - Java heap of the malloc/free by the JVM?

reply

upvote

by cogman101 hours ago|

[-]

Java is pretty greedy with the memory it claims. Especially historically it was pretty hard to get the JVM to release memory back to the OS.

To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak.

reply

upvote

by k_roy1 hours ago|

[-]

> Especially historically it was pretty hard to get the JVM to release memory back to the OS.

This feels like a huge understatement. I still have some PTSD around when I did Java professionally between like 2005 and 2014.

The early part of that was particularly horrible.

reply

upvote

by adgjlsfhk11 hours ago|

[-]

This only really ends up being a problem on windows. On systems with proper virtual memory setups, the cost of unused memory is very low (since the the OS can just page it out)

reply

upvote

by xxs1 hours ago|

[-]

Java has a quite strict max heap setting, it's very uncommon to let it allocate up to 25% of the system memory (the default). It won't grow past that point, though.

Baring bugs/native leaks - Java has a very predictable memory allocation.

reply

upvote

by bluGill1 hours ago|

[-]

When it works. Many programs in GC language end up fighting the GC by allocating a large buffer and managing it by hand anyway because when performance counts you can't have allocation time in there at all. (you see this in C all the time as well)

reply

upvote

by cogman101 hours ago|

[-]

That's generally a bad idea. Not always, but generally.

It was a better idea when Java had the old mark and sweep collector. However, with the generational collectors (which are all Java collectors now. except for epsilon) it's more problematic. Reusing buffers and objects in those buffers will pretty much guarantees that buffer ends up in oldgen. That means to clear it out, the VM has to do more expensive collections.

The actual allocation time for most of Java's collectors is almost 0, it's a capacity check and a pointer bump in most circumstances. Giving the JVM more memory will generally solve issues with memory pressure and GC times. That's (generally) a better solution to performance problems vs doing the large buffer.

Now, that said, there certainly have been times where allocation pressure is a major problem and removing the allocation is the solution. In particular, I've found boxing to often be a major cause of performance problems.

reply

upvote

by CyberDildonics42 minutes ago|

[-]

If people didn't need to do it, they wouldn't generally do it. Not always, but generally.

reply

upvote

by cogman1036 minutes ago|

[-]

People do stuff they shouldn't all the time.

For example, some code I had to clean up pretty early on in my career was a dev, for unknown reasons, reinventing the `ArrayList` and then using that invention as a set (doing deduplication by iterating over the elements and checking for duplicates). It was done in the name of performance, but it was never a slow part of the code. I replaced the whole thing with a `HashSet` and saved ~300 loc as a result.

This individual did that sort of stuff all over the code base.

reply

upvote

by CyberDildonics43 minutes ago|

[-]

Any extra throughput is far overshadowed by trying to control pauses and too much heap allocations happening because too much gets put on the heap. For anything interactive the options are usually fighting the gc or avoiding gc.

reply

upvote

by m4631 hours ago|

[-]

I remember in the early days of web services, using the apache portable runtime, specifically memory pools.

If you got a web request, you could allocate a memory pool for it, then you would do all your memory allocations from that pool. And when your web request ended - either cleanly or with a hundred different kinds of errors, you could just free the entire pool.

it was nice and made an impression on me.

I think the lowly malloc probably has lots of interesting ways of growing and changing.

reply

upvote

by Sesse__2 minutes ago|

[-]

This is called “an arena” more generally, and it is in wide use across many forms of servers, compilers, and others.

reply

upvote

by jra_samba1 hours ago|

[-]

Look into talloc, used inside Samba (and other FLOSS projects like sssd). Exactly this.

reply

upvote

by Dylan1680757 minutes ago|

[-]

I think some operating system improvements could get people motivated to use huge pages a lot better. In particular make them less fragile on linux and make them not need admin rights on windows. The biggest factor causing problems there is that neither OS can swap a 2MB page. So someone needs to care enough to fix that.

reply

upvote

by pocksuppet2 hours ago|

[-]

In many cases you can also do better than using malloc e.g. if you know you need a huge page, map a huge page directly with mmap

Yes, if you want to use huge pages with arbitrary alloc/free, then use a third-party malloc. If your alloc/free patterns are not arbitrary, you can do even better. We treat malloc as a magic black box but it's actually not very good.

reply

upvote

by codexon2 hours ago|

[-]

I've been using jemalloc for over 10 years and don't really see a need for it to be updated. It always holds up in benchmarks against any new flavor of the month malloc that comes out.

Last time I checked mimalloc which was admittedly a while ago, probably 5 years, it was noticebly worse and I saw a lot of people on their github issues agreeing with me so I just never looked at it again.

reply

upvote

by adgjlsfhk12 hours ago|

[-]

Mimalloc v3 has just come out (about a month ago) and is a significant improvement over both v2 and v1 (what you likely last tested)

reply

upvote

by hrmtst938372 hours ago|

[-]

Benchmarks age fast. Treating a ten-year-old allocator as done just because it still wins old tests is tempting fate, since distros, glibc, kernel VM behavior, and high-core alloc patterns keep moving and the failures usually show up as weird regressions in production, not as a clean loss on someone's benchmark chart.

reply

upvote

by codexon2 hours ago|

[-]

It still beat mimalloc when I checked 4-5 years ago.

reply

upvote

by imp0cat2 hours ago|

[-]

You really need to benchmark your workloads, ideally with the "big 3" (jemalloc, tcmalloc, mimalloc). They all have their strengths and weaknesses.

Jemalloc can usually keep the smallest memory footprint, followed by tcmalloc.

Mimalloc can really speed things up sometimes.

As usually, YMMV.

reply

upvote

by codexon1 hours ago|

[-]

I've benchmarked them every few years, they never seem to differ by more than a few percent, and jemalloc seems to fragment and leak the least for processes running for months.

Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing, so I am not inclined to trust it now.

reply

upvote

by ComputerGuru1 hours ago|

[-]

> Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing

That’s… ahistorical, at least so far as I remember. It wasn’t marketed as either of those; it was marketed as small/simple/consistent with an opt-in high-severity mode, and then its performance bore out as a result of the first set of target features/design goals. It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.

reply

upvote

by codexon32 minutes ago|

[-]

> It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.

That is true of basically every single malloc replacement out there, that is not a uniquely defining feature.

reply

upvote

by HackerThemAll47 minutes ago|

[-]

Look up the numbers in other comments above. When it comes to performance, the Google's tcmalloc is unconquered.

reply

upvote

by IshKebab2 hours ago|

[-]

I feel like the real thing that needs to change is we need a more expressive allocation interface than just malloc/realloc. I'm sure that memory allocators could do a significantly better job if they had more information about what the program was intending to do.

reply

upvote

by liuliu2 hours ago|

[-]

There are, look no further than jemalloc API surface itself:

https://jemalloc.net/jemalloc.3.html

One thing to call out: sdallocx integrates well with C++'s sized delete semantics: https://isocpp.org/files/papers/n3778.html

reply

upvote

by hedora1 hours ago|

[-]

You can also play tricks with inlining and constant propagation in C (especially on the malloc path, where the ground-truth allocation size is usually statically known).

reply

upvote

by anthk2 hours ago|

[-]

I used mimalloc to run zenlisp under OpenBSD as it would clash with the paranoid malloc of base.

reply

upvote

by jeffbee2 hours ago|

[-]

Just out of curiosity are you getting 1GB huge pages on Xeon or some other platform? I always thought this class of page is the hardest to exploit, considering that the machine only has, if I recall correctly, one TLB slot for those.

reply

upvote

by bmenrigh2 hours ago|

[-]

Modern x86_64 has supported multiple page sizes for a long time. I'm on commodity Zen 5 hardware (9900X) with 128 GiB of RAM. Linux will still use a base page size of 4kb but also supports both 2 MiB and 1 GiB huge pages. You can pass something like `default_hugepagesz=2M hugepagesz=1G hugepages=16` to your kernel on boot to use 2 MiB pages but reserve 16 1 GiB pages for later use.

The nice thing about mimalloc is that there are a ton of configurable knobs available via env vars. I'm able to hand those 16 1 GiB pages to the program at launch via `MIMALLOC_RESERVE_HUGE_OS_PAGES=16`.

EDIT: after re-reading your comment a few times, I apologize if you already knew this (which it sounds like you did).

reply

upvote

by jeffbee1 hours ago|

[-]

Right but on Intel the 1G page size has historically been the odd one. For example Skylake-X has 1536 L2 shared TLB entries for either 4K or 2M pages, but it only has 16 entries that can be used for 1G pages. It wasn't unified until Cascade Lake. But Skylake-like Xeon is still incredibly common in the cloud so it's hard to target the later ones.

reply

upvote

by Dylan1680743 minutes ago|

[-]

So for any process that's using less than 16GB, it's a significant performance boost. And most processes using more RAM, but not splitting accesses across more than 16 zones in rapid succession, will also see a performance boost.

My old Intel CPU only has 4 slots for 1GB pages, and that was enough to get me about a 20% performance boost on Factorio. (I think a couple percent might have been allocator change but the boost from forcing huge pages was very significant)

reply

upvote

by sylware3 hours ago|

[-]

If there is so much performance difference among generic allocators, it means you need semantic optimized allocators (unless performance is actually not that much important in the end).

reply

upvote

by codexon29 minutes ago|

[-]

Agreed mostly. Going from standard library to something like jemalloc or tcmalloc will give you around 5-10% wins which can be significant, but the difference between those generic allocators seem small. I just made a slab allocator recently for a custom data type and got speedups of 100% over malloc.

reply

upvote

by Cloudef2 hours ago|

[-]

You are not wrong and this is indeed what zig is trying to push by making all std functions that allocate take a allocator parameter.

reply