undefined

points

[-]

On paper the M4 should be roughly 1/3 of the M5, in practice it is only 1/2. With the right, optimized model like qwen3.6 35B MoE MLX you can get over 40 tok / sec on it. I run dozens of background jobs that are not time-critical on it.

by bfjvibybd6cuvu69 hours ago|

parent|

[-]

What kind of jobs?

by bigyabai19 hours ago|

prev|

[-]

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.

by fancyfredbot5 hours ago|

parent|

[-]

Normally people refer to the compute-bound phase as "prefill". Nothing wrong with saying it's building the kv cache though, it's accurate just unusual.