> Everyone groups all the requests into a batch, and the GPU computes them together.

You're only saving on fetching the read-only parameters, and not even on those with MoE models, where each sequence in the batch may route to a different expert (unless you rearrange batches so that sharing experts becomes more likely, which is hard since the routing changes per token or even per layer). Everything else - KV-cache, activations - gets multiplied by the batch size, so you scale compute and memory pressure by roughly the same amount. Yes, GPUs are great at hiding memory fetch latency, but that applies to n=1 inference as well.
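To make that concrete, here's a rough back-of-envelope sketch (all model and cache sizes below are made-up illustrative numbers, not measurements from any paper): weight reads are amortized across the batch, but per-sequence KV-cache reads and compute both scale with batch size, so arithmetic intensity improves with batching only until the per-sequence state dominates the traffic.

```python
# Back-of-envelope: why batching helps decode throughput, and why the saving
# is only on the shared, read-only weights. Hypothetical illustrative numbers.

BYTES_PER_PARAM = 2           # bf16/fp16 weights
PARAMS = 7e9                  # hypothetical 7B dense model
KV_BYTES_PER_TOKEN = 0.5e6    # hypothetical per-token KV-cache footprint (all layers)
CONTEXT_LEN = 4096            # KV-cache tokens each sequence streams per decode step
FLOPS_PER_TOKEN = 2 * PARAMS  # ~2 FLOPs per parameter per generated token

def decode_step(batch_size: int) -> tuple[float, float]:
    """Bytes moved and FLOPs done for one decode step at a given batch size."""
    weight_bytes = PARAMS * BYTES_PER_PARAM                    # read once, shared by the batch
    kv_bytes = batch_size * CONTEXT_LEN * KV_BYTES_PER_TOKEN   # per-sequence, scales with batch
    flops = batch_size * FLOPS_PER_TOKEN                       # compute also scales with batch
    return weight_bytes + kv_bytes, flops

for b in (1, 8, 64, 128):
    bytes_moved, flops = decode_step(b)
    # Arithmetic intensity grows with batch size only while the shared weights
    # dominate traffic; once per-sequence KV-cache dominates, it flattens out.
    print(f"batch={b:4d}  intensity={flops / bytes_moved:6.1f} FLOPs/byte")
```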

Their latency measurements comparing Mamba-2 and Mamba-3 are done with a batch size of 128. It doesn't seem like Mamba-2 was compute-bound even at that batch size.
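For what it's worth, here's a quick roofline-style check (hypothetical hardware numbers, not measurements) of how you'd tell whether a decode step at some batch size is compute-bound: compare its arithmetic intensity to the GPU's ridge point.

```python
# Roofline sanity check with hypothetical hardware numbers: a step is
# compute-bound only if its arithmetic intensity exceeds the ridge point.

PEAK_FLOPS = 1.0e15        # hypothetical peak bf16 throughput, FLOP/s
PEAK_BANDWIDTH = 3.3e12    # hypothetical HBM bandwidth, bytes/s
RIDGE_POINT = PEAK_FLOPS / PEAK_BANDWIDTH  # ~300 FLOPs/byte; below this, memory-bound

def bound_kind(flops: float, bytes_moved: float) -> str:
    return "compute-bound" if flops / bytes_moved >= RIDGE_POINT else "memory-bound"

# Plugging in the hypothetical batch-128 numbers from the sketch above
# (~1.8e12 FLOPs against ~2.8e11 bytes) still lands far below the ridge point:
print(bound_kind(1.8e12, 2.8e11))  # -> "memory-bound"
```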