Thanks for the clarification. Google does publish more than others, and I really appreciate the work they are doing with the Gemma models, which are truly competitive open models. I do wish they’d publish more in-depth papers on the Gemma models, but I appreciate that the weights are open.

by DiabloD3 4 hours ago

They weren't the first to do MTP like this, and arguably did it wrong: the MTP heads are kept in a separate file and have to be welded in by the inference engine.

Qwen 3.6 shipped with working MTP first, and had working MTP in llama.cpp first.

by spijdar 2 hours ago

Given the MTP drafter is basically a separate model, keeping it separate makes more sense IMO. It's out of my wheelhouse but it seems like you could adjust the MTP drafter model separately from the main model, too.

Ultimately, though, I think the real explanation is that Google doesn't care, since for their own purposes (in LiteRT-LM) they do bundle them. As far as I know, anyway.
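For anyone unfamiliar with how a drafter gets used at inference time, here is a toy sketch of greedy draft-and-verify decoding. It is only an illustration under simplifying assumptions: `target_next`, `draft_next`, and `speculative_step` are hypothetical names, both "models" are plain Python functions over integer tokens, and nothing here reflects how Gemma, Qwen, llama.cpp, or LiteRT-LM actually implement MTP.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round (greedy)."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposal in order. A real engine does
    #    this in one batched forward pass; the loop here is for clarity.
    accepted: List[int] = []
    ctx = list(context)
    for tok in draft:
        expected = target_next(ctx)
        if expected != tok:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target adds one more token.
    accepted.append(target_next(ctx))
    return accepted


# Toy "models": the target counts upward mod 10; the drafter agrees
# except right after a 3, where it guesses wrong.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step(target_next, draft_next, [1], k=4))  # [2, 3, 4]
```

The accepted tokens are always the ones the target model would have produced by itself under greedy decoding, so the drafter only affects speed, not output. Whether the drafter lives in a separate file or shares state with the main model is a packaging and architecture question that this toy does not capture.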

by DiabloD3 1 hour ago

MTP models share internal state with the main model, and also refer to parameters in the model.

They are more like a single model that has two separate attention head mechanisms.

by anaisbetts 2 hours ago

I mean, just as GGUFs aren't technically necessary yet are _way_ more convenient than using Safetensors and configuring the default Jinja prompt by hand, it makes sense to bundle the draft model too. For all intents and purposes, the only people who will train a draft model are the people who train the original model.

by kcb 1 hour ago

Nvidia's Nemotron 3 Super also shipped with MTP.