I have not done in-depth, really controlled testing and there is much about performance tuning I don't understand, but it's fairly clear to me that on an M1 Max, MLX does not have the massive advantage it may have on other machines or other models.
It is wholly possible that MLX is _much_ better on the M3 and up, because the GPU is that much better (as far as I know, MLX runs on the GPU through Metal rather than on the Neural Engine).
Frankly I think llama.cpp may simply have caught up quite a lot.
MTP is the same story. There is always a chance that adding a separate MTP draft model costs more in compute overhead than it brings in speedup, and since I am using an older machine and the MoE models, I am not in a zone where MTP can add much. What happens is that there's an enormous speed advantage while handling the prompt and the early reasoning, and it then tails off dramatically to be worse, on average, than non-MTP.
(Qwen 3.5 35B shows, possibly, a small advantage if its internal MTP is enabled. But it is small — 10% maybe.)
For the 26B Gemma 4, MLX and MTP combined were noticeably slower than the GGUF is with llama.cpp.
If it were a newer machine with a larger, dense model, I'd definitely expect to see an advantage from MTP, and it is possible that there are some parameters I can tweak (duplicate token penalty, temperature, shared cache stuff) that give MTP more of an edge (keep its successful prediction rate higher).
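To put rough numbers on that trade-off, here is a minimal sketch of the usual back-of-envelope estimate for draft-and-verify decoding. It is not anything llama.cpp or MLX reports, and the function name, parameters and sample values are all my own illustrative assumptions rather than measurements. It assumes each drafted token is accepted independently with probability `alpha` (a simplification), that `k` tokens are drafted per step, that one drafted token costs `c` times a normal forward pass of the main model, and that verifying the whole draft costs `verify_cost` normal passes.

```python
def expected_speedup(alpha: float, k: int, c: float, verify_cost: float = 1.0) -> float:
    """Idealised speedup of draft-and-verify decoding over plain decoding.

    alpha       -- chance that each drafted token is accepted (assumed independent)
    k           -- tokens drafted per verification step
    c           -- cost of drafting one token, relative to one main-model pass
    verify_cost -- cost of the verification pass, relative to one main-model pass
                   (1.0 assumes checking k tokens is as cheap as generating one)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")

    # Expected tokens produced per step: the accepted run of drafted tokens
    # plus the one token the main model always contributes.
    if alpha == 1.0:
        tokens_per_step = float(k + 1)
    else:
        tokens_per_step = (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

    # Cost per step, measured in main-model passes.
    cost_per_step = k * c + verify_cost
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Illustrative values only: show how the gain shrinks as acceptance drops.
    for alpha in (0.9, 0.7, 0.5, 0.3):
        print(f"alpha={alpha:.1f}  speedup={expected_speedup(alpha, k=3, c=0.2):.2f}x")
```

Anything below 1.0x means the draft is costing more than it saves. In this toy model, a falling acceptance rate is enough on its own to push the result under 1.0x, which at least fits the pattern of a fast start followed by a tail-off. Raising `verify_cost` above 1.0 is a crude way to model a machine where checking several tokens at once is not nearly free.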
Either way, it feels like the smallish gain I will see on this particular bit of kit might not be worth the long, long journey down that rabbit hole right now.
I'm using the GGUF too; it appears slightly faster in llama.cpp now than in current LM Studio, but it's not clear to me whether that is down to LM Studio having a little more code overhead, an older llama.cpp under the hood, or just parameter differences.