The new tensor cores, sorry, "Neural Accelerators", only really help with prompt processing (aka prefill), not with token generation. Token generation is memory-bound.
Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.
Most stuff ends up running Metal -> GPU I thought
https://creativestrategies.com/research/m5-apple-silicon-its...
That's actually the biggest growth area in LLMs: it is no longer about smart, it is about context windows (usable ones, not spec-sheet hypotheticals). Smart enough is mostly solved; combating larger problems is slowly improving with every major release (but there is no ceiling).
This seems likely, as the memory bandwidth hasn't increased enough for those kinds of speedups, and I guess prefill is more likely to be compute-bound (vs. memory-bandwidth-bound).
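To put rough numbers on the memory-bound argument, here is a back-of-envelope sketch. It assumes a dense model where every weight is read from memory once per generated token, and it ignores KV-cache traffic and any overhead, so it gives a ceiling rather than a prediction. The 150 GB/s and 6 GiB inputs are made up for illustration and are not specs for any particular chip or model.

```python
GIB = 2**30  # bytes in one GiB


def decode_ceiling_tps(bandwidth_gb_s: float, weights_gib: float) -> float:
    """Upper bound on tokens/s if all weights are read once per token."""
    bytes_per_second = bandwidth_gb_s * 1e9
    bytes_per_token = weights_gib * GIB
    return bytes_per_second / bytes_per_token


# Illustrative inputs only (assumptions, not measurements).
base = decode_ceiling_tps(bandwidth_gb_s=150, weights_gib=6.0)
faster_memory = decode_ceiling_tps(bandwidth_gb_s=150 * 1.25, weights_gib=6.0)

print(f"ceiling: {base:.1f} t/s")
print(f"with 25% more bandwidth: {faster_memory:.1f} t/s "
      f"({faster_memory / base:.2f}x)")
```

Under these assumptions the ceiling scales linearly with bandwidth and nothing else, which is why extra matmul throughput would show up in prefill rather than in generation. MoE models read only the active experts per token, so the full file size overstates their per-token traffic.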
I'd take that tradeoff. On my M3 Ultra, the inference is surprisingly fast, but the prompt processing speed makes it painful except as a fallback or experimentation, especially with agentic coding tools.
Wondering if local LLM (for coding) is a realistic option, otherwise I wouldn't have to max out the RAM.
For reference:
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen35 ?B Q5_K - Medium | 6.12 GiB | 8.95 B | MTL,BLAS | 6 | pp512 | 288.90 ± 0.67 |
| qwen35 ?B Q5_K - Medium | 6.12 GiB | 8.95 B | MTL,BLAS | 6 | tg128 | 16.58 ± 0.05 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | MTL,BLAS | 6 | pp512 | 615.94 ± 2.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | MTL,BLAS | 6 | tg128 | 42.85 ± 0.61 |
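For a sense of what those rates mean for an agentic coding turn, a rough sketch: time spent in prefill is prompt tokens divided by the pp rate, and time spent generating is output tokens divided by the tg rate. The 32,000-token prompt and 500-token reply are hypothetical sizes picked for illustration, and it assumes the pp512 and tg128 rates hold at longer contexts, which is optimistic.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 pp_tps: float, tg_tps: float) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for one request."""
    prefill = prompt_tokens / pp_tps
    decode = output_tokens / tg_tps
    return prefill, decode


# (pp512 t/s, tg128 t/s) copied from the benchmark tables above.
BENCH = {
    "qwen35 Q5_K - Medium": (288.90, 16.58),
    "gpt-oss 20B MXFP4 MoE": (615.94, 42.85),
}

PROMPT_TOKENS = 32_000  # hypothetical agentic-coding context size
OUTPUT_TOKENS = 500     # hypothetical reply length

for name, (pp, tg) in BENCH.items():
    prefill, decode = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, pp, tg)
    print(f"{name}: prefill {prefill:.0f}s, decode {decode:.0f}s")
```

With a large context and a short reply, the prefill term dominates the wait, so prompt processing speed matters more than generation speed for this kind of use.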
Klein 4B completes a 1024px generation in 72 seconds.
I think the truth is somewhere in the middle. Many people don't realize just how performant some of these models have become on Mac hardware (especially with MLX), or just how powerful the shared memory architecture they've built is, but there is also a lot of hype and misinformation about performance compared to dedicated GPUs. It's a tradeoff between available memory and performance, but often it makes sense.