The mechanism is that generating tokens (the "decode" phase) in an LLM is limited by the memory bandwidth needed to read the weights, so decoding multiple streams in the same batch amortizes each weight read across all of them, as long as you can keep every stream's context (its KV cache) in memory. This holds best for dense models and for the always-on expert in MoE models; for the routed experts it holds only when you have significantly more streams than experts.
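To put rough numbers on that, here is a back-of-the-envelope roofline sketch. Everything in it is an assumption for illustration: the 8B-parameter dense model, 16-bit weights, about 2 FLOPs per parameter per token, and the bandwidth and compute figures are made up rather than taken from any specific hardware, and it ignores KV-cache reads and attention cost entirely.

```python
# Hypothetical dense model and accelerator; none of these figures are measured.
PARAMS = 8e9                   # parameters
WEIGHT_BYTES = 2 * PARAMS      # 16-bit weights
FLOPS_PER_TOKEN = 2 * PARAMS   # rough rule of thumb: 2 FLOPs per parameter per token
MEM_BW = 1e12                  # bytes/s of memory bandwidth (assumed)
PEAK_FLOPS = 2e14              # FLOP/s of usable compute (assumed)


def decode_step_time(batch: int) -> float:
    """Roofline estimate of one decode step that serves `batch` streams."""
    # The weights are streamed from memory once per step, whatever the batch size.
    memory_time = WEIGHT_BYTES / MEM_BW
    # Compute grows linearly with the number of tokens produced in the step.
    compute_time = batch * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(memory_time, compute_time)


def tokens_per_second(batch: int) -> float:
    """Aggregate throughput across all streams in the batch."""
    return batch / decode_step_time(batch)


if __name__ == "__main__":
    for batch in (1, 8, 64, 200, 400):
        print(f"batch={batch:4d}  step={decode_step_time(batch) * 1e3:7.2f} ms  "
              f"throughput={tokens_per_second(batch):9.1f} tok/s")
```

With these assumed figures the step time stays flat until the batch is large enough for compute to catch up with the weight read, so aggregate throughput scales almost linearly with the number of streams up to that point and then plateaus.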

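In contrast, prompt prefill is more easily compute-bound, since all the prompt tokens go through the model in parallel and each weight read is already reused many times. So there are interesting trade-offs between decode latency and prefill latency when the LLM is under high utilization.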
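A minimal sketch of that trade-off, assuming a scheduler that can put prefill tokens and decode streams into the same forward pass. The model and hardware figures are again hypothetical, `step_time` is an illustrative helper rather than any real serving API, and attention and KV-cache traffic are ignored.

```python
# Hypothetical model and accelerator figures, for illustration only.
PARAMS = 8e9
WEIGHT_BYTES = 2 * PARAMS      # 16-bit weights
FLOPS_PER_TOKEN = 2 * PARAMS   # rough rule of thumb
MEM_BW = 1e12                  # bytes/s (assumed)
PEAK_FLOPS = 2e14              # FLOP/s (assumed)


def step_time(decode_streams: int, prefill_tokens: int) -> float:
    """Roofline estimate of one forward pass mixing decode and prefill work."""
    # Each decode stream contributes one token; prefill contributes its whole chunk.
    tokens = decode_streams + prefill_tokens
    memory_time = WEIGHT_BYTES / MEM_BW
    compute_time = tokens * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(memory_time, compute_time)


if __name__ == "__main__":
    decode_only = step_time(decode_streams=32, prefill_tokens=0)
    mixed = step_time(decode_streams=32, prefill_tokens=2048)
    print(f"decode-only step:        {decode_only * 1e3:.1f} ms")
    print(f"step with 2048 prefill:  {mixed * 1e3:.1f} ms")
```

Under these made-up numbers the 32 decode streams sit well inside the memory-bound regime on their own, but sharing a pass with a 2048-token prefill pushes the step into the compute-bound regime and makes every decoding stream wait roughly ten times longer for its next token. Splitting the prefill into smaller chunks shortens each step at the cost of a slower time to first token for the new prompt.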