That’s correct, and yes: it's not less total compute on the main model (actually slightly more, since verifying draft tokens that end up rejected is wasted work), but it's faster because inference is memory-bandwidth bound. Like you, I also think of it as a “mini prefill” (on top of the existing KV cache, of course); the code is very similar to prefill if you implement a simple toy version yourself.

Most of the complexity in implementing a simple toy version comes from having to get the KV cache back into a good state for the next cycle (e.g. if only the first half of your draft tokens were correct).
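
To make the KV cache point concrete, here is a minimal sketch of one verify cycle with greedy acceptance. Everything in it is a stand-in I made up for illustration: `ToyKVCache` stores token ids instead of real K/V tensors, `target_next_token` is a deterministic fake "main model", and the loop inside `verify_draft` replaces what a real implementation would do as a single batched forward pass.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a KV cache: one entry per token position."""
    entries: list = field(default_factory=list)

    def append(self, token):
        self.entries.append(token)  # a real cache would store K/V tensors here

    def truncate(self, length):
        del self.entries[length:]  # drop positions written for rejected tokens


def target_next_token(context):
    """Hypothetical deterministic 'main model' doing greedy decoding."""
    return (sum(context) * 31 + len(context)) % 50


def greedy_decode(prompt, n):
    """Plain autoregressive decoding, used as the reference output."""
    ctx, out = list(prompt), []
    for _ in range(n):
        tok = target_next_token(ctx)
        out.append(tok)
        ctx.append(tok)
    return out


def verify_draft(cache, last_token, draft):
    """Run one speculative cycle and return the tokens it emits.

    `cache` covers everything before `last_token`, which has been sampled
    but not yet fed to the model. The final returned token plays that same
    role for the next cycle.
    """
    base = len(cache.entries)
    preds = []
    # The "mini prefill": feed last_token plus all draft tokens.
    for tok in [last_token, *draft]:
        cache.append(tok)
        preds.append(target_next_token(cache.entries))
    # Accept draft tokens while they match what the main model predicted.
    n_ok = 0
    while n_ok < len(draft) and draft[n_ok] == preds[n_ok]:
        n_ok += 1
    # Restore the cache: keep last_token and the accepted drafts only.
    cache.truncate(base + 1 + n_ok)
    # The main model's own prediction at the first mismatch comes for free.
    return draft[:n_ok] + [preds[n_ok]]
```

With a draft whose third token is wrong, the cycle emits the two accepted tokens plus the main model's correction, and the cache ends up holding only the accepted prefix. Sampling-based acceptance (instead of exact greedy matching) changes the accept rule but not the cache bookkeeping.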

by zozbot234 6 hours ago

> But I think the key is that in the standard autoregressive case we get memory bandwidth bound, so there are tons of idle compute resources.

Right, this is the same way batching works. It's "free" until we exhaust available compute resources, at which point decode throughput becomes compute bound. (This is a good place to be, because scaling out compute is a lot easier than adding fast VRAM.) This is why MTP is mostly useful when you have one or a few users, which means compute is abundant. When you're running large batches you're better off using that compute to grow your batch size.

Of course, batch size is usually limited by things like bulky KV caches. So perhaps MTP has some residual use in that setting. But if you're sharing cached context in a subagent swarm, or running a model like the recent DeepSeek V4 with its tiny KV cache, you can go a lot further in processing a larger batch.
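
A back-of-the-envelope sketch of the "free until compute bound" point, as a crude roofline estimate. All the numbers are round hypothetical defaults rather than any real accelerator or model, and the estimate deliberately ignores KV cache traffic (which, as noted above, is often what actually limits batch size). It assumes weights are read once per step and each token in flight costs roughly 2 FLOPs per parameter.

```python
def step_time(tokens_in_flight, n_params=70e9, bytes_per_param=2,
              peak_flops=1e15, mem_bw=2e12):
    """Rough time for one decode step, in seconds.

    tokens_in_flight = batch_size * (draft_len + 1) when verifying drafts,
    or just batch_size for plain decoding. All defaults are hypothetical.
    """
    mem_time = n_params * bytes_per_param / mem_bw               # stream the weights once
    compute_time = 2 * n_params * tokens_in_flight / peak_flops  # matmul work per step
    return max(mem_time, compute_time)


def crossover_tokens(bytes_per_param=2, peak_flops=1e15, mem_bw=2e12):
    """Tokens in flight at which a step flips from memory bound to compute bound."""
    return peak_flops * bytes_per_param / (2 * mem_bw)


if __name__ == "__main__":
    for n in (1, 8, 64, 512, 4096):
        print(n, step_time(n))
    print("crossover:", crossover_tokens())
```

Under these assumptions the step time stays flat as tokens in flight grow, then rises linearly past the crossover. Extra draft tokens and extra batch entries both draw on the same headroom, which is why they compete once you get near the compute limit.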

by mike_hearn 3 hours ago

You can disaggregate though. So draft models can run on cheaper hardware with less RAM, saving time on the more expensive machines with more RAM.

by cma 4 hours ago

I think it also gets used in the /fast modes the providers sell at higher cost.

by gunalx 2 hours ago

They probably use it on all models. Fast is probably just a resource pool with less congestion, and therefore higher throughput per user, but it's less efficient.