
Multi-token prediction (MTP) is the same thing as speculative decoding. This is stated in the Google pages describing their MTP implementation.

Google has now released a small companion model for each of the previously released Gemma 4 models, e.g. "gemma-4-26B-A4B-it-assistant" for "gemma-4-26B-A4B-it".

The difference from Qwen is that each small model here is not a general-purpose smaller model but one optimized specifically for this task: predicting the output of the bigger model it is paired with.

This specialization lets the Google "gemma-4-*-assistant" models be much smaller, and thus much faster, than general-purpose small models.
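For anyone unfamiliar with the mechanism, here is a minimal sketch of greedy speculative decoding in Python. It is a toy under stated assumptions, not Google's or llama.cpp's implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for the big and small models, and the verification loop calls the target once per position for clarity, where a real engine would score all drafted positions in a single batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def greedy(target: NextToken, prefix: List[Token], n_new: int) -> List[Token]:
    """Plain decoding: one target call per generated token."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[Token], k: int) -> List[Token]:
    """One round of greedy speculative decoding; returns the new tokens."""
    # 1. The small model drafts k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The big model checks each drafted token in order.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            # First mismatch: keep the big model's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft matched, so the big model contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prefix: List[Token], n_new: int, k: int = 4) -> List[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prefix) + n_new]


# Hypothetical toy models: the target counts upward modulo 10;
# the draft agrees except that it guesses wrong right after a 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

The output is always identical to what the big model would produce on its own; the draft model only changes how many tokens each verification round yields. A better-matched drafter means fewer rejected drafts, which is the argument for a drafter trained specifically against its paired model.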

by julianlam7 hours ago|


So then these models could be used by llama.cpp today with the -md switch?

Interesting, must try tomorrow.

by OneDeuxTriSeiGo16 hours ago|


As far as I can tell, MTP differs from regular speculative decoding in that the small model is trained to consume the big model's hidden state and predict from it.
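If that reading is right, the difference is mostly in what the drafter takes as input. A rough Python sketch of the idea, with everything in it hypothetical: `toy_target`, `hidden_state_draft`, and the one-element "hidden state" are made-up stand-ins, not the Gemma architecture.

```python
from typing import List, Tuple

Token = int
Hidden = Tuple[float, ...]  # stand-in for a real activation vector


def toy_target(ctx: List[Token]) -> Tuple[Hidden, Token]:
    """Hypothetical big model: returns its hidden state and greedy next token.

    The next token depends on the first token of the context, which the
    hidden state summarizes.
    """
    hidden = (float(ctx[0]),)
    return hidden, (ctx[0] + ctx[-1]) % 10


def hidden_state_draft(hidden: Hidden, last_token: Token) -> Tuple[Hidden, Token]:
    """Hypothetical MTP-style drafter.

    It never sees the token history. It reads the big model's hidden state
    plus the most recent token, and passes its own hidden estimate forward
    so it can draft several tokens in a row.
    """
    return hidden, (int(hidden[0]) + last_token) % 10


def draft_k(hidden: Hidden, last_token: Token, k: int) -> List[Token]:
    """Draft k tokens starting from the big model's state and last token."""
    drafted: List[Token] = []
    for _ in range(k):
        hidden, last_token = hidden_state_draft(hidden, last_token)
        drafted.append(last_token)
    return drafted


# One big-model step produces a hidden state and a token ...
ctx = [3, 1]
hidden, tok = toy_target(ctx)
# ... and the drafter continues from there without re-reading ctx.
drafted = draft_k(hidden, tok, k=3)
```

A regular draft model would instead take the token list `ctx + [tok]` and re-encode it with its own network. The drafted tokens still go through the same verification by the big model either way.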

by dchftcs1 hours ago|


It's the same speculative decoding. The news is that it came out for a popular local model.
