But with this patch I saw 46 t/s with qwen3.6 27B q8... this is insane: it's 250% faster than the original speed, and there's no GPU I could have upgraded to for that kind of boost. Amazing!
https://github.com/ollama/ollama/pull/15980
Edit: Seems they also have a pre-release version out with the functionality added: https://github.com/ollama/ollama/releases/tag/v0.23.1-rc0
theturtle32@ai1:~$ ollama run gemma4:31b-coding-mtp-bf16
pulling manifest
Error: pull model manifest: 412: this model requires macOS
For someone who's been running local models for a long while, these are very very exciting times.
I'm surprised the 26B-A4B is better? It should be faster too, interesting. I'm excited to try 31B with MTP, because MTP-2 is what makes 27B bearable on the GB10.
What are you using it for? Agent-based coding, or something else?
However, I find qwen unbeatable for tool calling. I think gemma wasn't trained on that at all.
The AI space moves so fast! I'll check it out again.
There are definitely differences in the eagerness to tool-call that you'll need to manage. And for all local models I've ever used, I've had to micromanage the tools provided by servers to eliminate any possibility that they reach for something wonky or confusing.
The Gemma4 chat template seems to have had multiple issues, at least with llama.cpp, and I'm not sure they're all fixed yet. It assumed simple types for tool parameters, for example.
Why, when asking a model to change text in a minor way, are we not asking it to generate the operational transformations necessary to modify the text, and then just executing the OT on the existing text instead of reproducing every token? Maybe tools are doing that more than I realize?
[retain(8), delete(6), insert("very very"), retain(10)]
I just asked: Write the operational transformation sequence and command to turn “this is really beautiful” to “this is very very beautiful”
and in return got: You can map this out by moving a virtual cursor across the text and telling it what to keep, remove, or add. You start by retaining the first eight characters to keep "this is " untouched. Then you delete the next six characters to remove the word "really". In that exact spot, you insert the nine characters for "very very". You finish the operation by retaining the final ten characters, which preserves the space and the word "beautiful". You can code this specific command sequence as [retain(8), delete(6), insert("very very"), retain(10)].
In a large paragraph of text I would expect it to be way quicker and cheaper to generate “[retain(800), delete(6), insert("very very"), retain(10000)]” than repredict the entire remainder of the unedited text.
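The retain/delete/insert sequence above is simple to apply mechanically. Here's a minimal sketch in Python; the op names mirror the example in the thread, and `apply_ot` is a hypothetical helper for illustration, not any tool's actual API:

```python
def apply_ot(text, ops):
    """Apply a list of ("retain", n) / ("delete", n) / ("insert", s) ops."""
    out = []
    pos = 0  # virtual cursor into the original text
    for op, arg in ops:
        if op == "retain":
            out.append(text[pos:pos + arg])  # keep the next arg characters
            pos += arg
        elif op == "delete":
            pos += arg  # skip arg characters without emitting them
        elif op == "insert":
            out.append(arg)  # emit new text; cursor does not move
        else:
            raise ValueError(f"unknown op: {op}")
    out.append(text[pos:])  # keep any trailing text not covered by ops
    return "".join(out)

ops = [("retain", 8), ("delete", 6), ("insert", "very very"), ("retain", 10)]
print(apply_ot("this is really beautiful", ops))  # -> this is very very beautiful
```

The model only has to generate the short op list; the executor touches the rest of the text for free.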
class Foo {
    // ....
    int calculation() {
        return 42;
    }
    // more stuff
}
where the main model emits something like a casual, under-specified diff format and the merge model figures out how to interpret it as a patch. The current implementation ignores that head, but the PR lets the tool recognize it, plus does proper integration (run the MTP head while running the slower main path, then compare the results, I believe).
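The "draft then compare" step described above is the standard speculative-decoding acceptance rule. A toy sketch of that loop, with made-up token IDs (this is an illustration of the general technique, not Ollama's actual code):

```python
def accept_draft(draft_tokens, main_tokens):
    """Keep drafted tokens only up to the first disagreement with the
    main model's own greedy picks for the same positions."""
    accepted = []
    for d, m in zip(draft_tokens, main_tokens):
        if d != m:
            break  # main model disagrees; discard this and later drafts
        accepted.append(d)
    return accepted

# If the fast head drafts 4 tokens and the main model agrees on the
# first 3, the sequence advances 3 tokens for one main-model pass.
print(accept_draft([5, 9, 2, 7], [5, 9, 2, 4]))  # -> [5, 9, 2]
```

Every accepted token is one the main model would have produced anyway, which is why the output quality is unchanged while throughput goes up.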