But with this patch I saw 46 t/s with qwen3.6 27B q8... this is insane: it's 250% faster than the original speed, and there's no GPU I could have upgraded to for that kind of boost. Amazing!
https://github.com/ollama/ollama/pull/15980
Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
theturtle32@ai1:~$ ollama run gemma4:31b-coding-mtp-bf16
pulling manifest
Error: pull model manifest: 412: this model requires macOS
For someone who's been running local models for a long while, these are very very exciting times.
I'm surprised the 26B-A4B is better? It should be faster too, interesting. I'm excited to try 31B with MTP, because MTP-2 is what makes 27B bearable on the GB10.
What are you using it for? Agent-based coding, or something else?
However, I find qwen unbeatable for tool calling. I think gemma wasn't trained on that at all.
The AI space moves so fast! I'll check it out again.
There are definitely differences in the eagerness to tool-call that you'll need to manage. And for all local models I've ever used, I've had to micromanage the tools provided by servers to eliminate any possibility that they reach for something wonky or confusing.
The Gemma4 chat template seems to have had multiple issues, at least with llama.cpp, and I'm not sure they're all fixed yet. It assumed simple types for tool parameters, for example.
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?
[retain(8), delete(6), insert("very very"), retain(10)]
I just asked: Write the operational transformation sequence and command to turn “this is really beautiful” to “this is very very beautiful”
and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].
In a large paragraph of text I would expect it to be way quicker and cheaper to generate “[retain(800), delete(6), insert("very very"), retain(10000)]” than repredict the entire remainder of the unedited text.
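The retain/delete/insert sequence above is simple to apply mechanically. Here's a minimal sketch in Python; the op names mirror the example in the thread, and `apply_ot` is a hypothetical helper for illustration, not any tool's actual API:

```python
def apply_ot(text, ops):
    """Apply a list of ("retain", n) / ("delete", n) / ("insert", s) ops."""
    out = []
    pos = 0  # virtual cursor into the original text
    for op, arg in ops:
        if op == "retain":
            out.append(text[pos:pos + arg])  # keep the next arg characters
            pos += arg
        elif op == "delete":
            pos += arg  # skip arg characters without emitting them
        elif op == "insert":
            out.append(arg)  # emit new text; cursor does not move
        else:
            raise ValueError(f"unknown op: {op}")
    out.append(text[pos:])  # keep any trailing text not covered by ops
    return "".join(out)

ops = [("retain", 8), ("delete", 6), ("insert", "very very"), ("retain", 10)]
print(apply_ot("this is really beautiful", ops))  # -> this is very very beautiful
```

The model only has to generate the short op list; the executor touches the rest of the text for free.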
class Foo {
    // ....
    int calculation() {
        return 42;
    }
    // more stuff
}
where the main model emits something like a casual, under-specified diff format and the merge model figures out how to interpret it as a patch. The current implementation ignores that head, but the PR lets the tool recognize it, plus does proper integration (run the MTP head while running the slower main path, then compare the results, I believe).
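The "draft then compare" step described above is the standard speculative-decoding acceptance rule. A toy sketch of that loop, with made-up token IDs (this is an illustration of the general technique, not Ollama's actual code):

```python
def accept_draft(draft_tokens, main_tokens):
    """Keep drafted tokens only up to the first disagreement with the
    main model's own greedy picks for the same positions."""
    accepted = []
    for d, m in zip(draft_tokens, main_tokens):
        if d != m:
            break  # main model disagrees; discard this and later drafts
        accepted.append(d)
    return accepted

# If the fast head drafts 4 tokens and the main model agrees on the
# first 3, the sequence advances 3 tokens for one main-model pass.
print(accept_draft([5, 9, 2, 7], [5, 9, 2, 4]))  # -> [5, 9, 2]
```

Every accepted token is one the main model would have produced anyway, which is why the output quality is unchanged while throughput goes up.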