undefined

points

[-]

Unsloth Studio [0] is what I recommend these days, open source alternative to the more widely known LM Studio, and also built by the people who make good quantizations of released models. With MTP support not merged in you should get 2x token generation speed with no accuracy difference. They also have MLX quants if you scroll down a bit, which is a format specifically for macOS' Metal GPU acceleration but that's not integrated into Unsloth Studio just yet.

[0] https://unsloth.ai/docs/models/qwen3.6#mtp-guide

by egorfine1 hours ago|

parent|

[-]

I have researched for quite a bit and so far the fastest runtime is the oMLX one. But there's a caveat: ttft on MLX on M4 Pro is enormous. On M5 Pro it has been greatly sped up.

by regexorcist8 seconds ago|

parent|

[-]

Curious if you tested llama.cpp and still found oMLX faster? I haven't tried the latter myself, might give it a go.

by mft_29 minutes ago|

parent|

prev|

[-]

I tried Unsloth Studio recently and was disappointed - in particular the downloading functionality is half-baked and didn’t cope with resuming downloads. As it seemed to just be a simple wrapper over llama.cpp, I found that huggingface hub, llama.cpp, and a couple of simple scripts actually offered better functionality once it was set up.

by busfahrer1 hours ago|

prev|

[-]

I have been contemplating a M5 Pro MBP, but for the life for me I wasn't able to find benchmarks for real-world models, do you happen to know how many tokens per second roughly you get with MoE models like Qwen 3.6 35B/A3B or Gemma 4 26B?

by embedding-shape30 minutes ago|

parent|

[-]

You need to ask macOS people for their prefill speed as well, there are two numbers you care about here, and current MacBooks have generally terrible numbers when it comes to prefill performance. Surely it'll get better with time, but if you already have a desktop, I'd go the "beefy GPU" route first.

by egorfine1 hours ago|

parent|

prev|

[-]

Qwen 3.6 35B running on oMLX 0.3.9rc1: on oMLX I get 86 t/s on Q4 and 74 t/s on Q6.

Bear in mind that ttft on MLX is much much faster on M5 Pro as compared to M4 Pro.

Also bear in mind that those figures are with NO optimizations whatsoever: no MCP, no DFlash. I am waiting for both to be released for the Qwen models.

by ahknight1 hours ago|

parent|

prev|

[-]

I'm not normally one to share videos as answers, but this particular fellow does a LOT of work with local AIs and Macs and happens to have a nuanced answer. https://youtu.be/XGe7ldwFLSE

by juancn34 minutes ago|

parent|

prev|

[-]

I'm running unsloth/Qwen3.6-35B-A3B-UD-Q8_K_XL on an M3 Max, 64GB at ~57 t/s with llama-server

by brcmthrowaway27 minutes ago|

parent|

[-]

Prefill speed and 27B number?