Interesting benchmarks, thanks for sharing!

If you're optimizing for lower power draw + higher throughput on a Mac (especially with MLX), definitely keep an eye on the Desloth LLMs that are starting to appear.

Desloth models are basically aggressively distilled, QAT-optimized versions of larger instruction models (think 7B → 1.3B or 2B), designed specifically for high tokens/sec with minimal VRAM. They're tiny but surprisingly capable for structured outputs, fast completions, and lightweight agent pipelines.

I'm seeing Desloth-tier models consistently hit >50 tok/sec on M1/M2 hardware without needing active cooling ramps, especially when combined with low-bit quants like Q4_K_M or Q5_0.
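
If you want to sanity-check tok/sec on your own machine, here's a minimal timing sketch using mlx-lm (the model path and prompt are placeholders, not a specific Desloth release, and the exact generate() kwargs can vary a bit between mlx-lm versions):

    # pip install mlx-lm
    import time
    from mlx_lm import load, generate

    # Placeholder model path: swap in whatever 4-bit MLX model you're testing
    model, tokenizer = load("mlx-community/your-model-4bit")

    prompt = "Summarize the rules of chess in three sentences."
    start = time.time()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
    elapsed = time.time() - start

    # Rough end-to-end rate; pass verbose=True to generate() for mlx-lm's own stats
    n_tokens = len(tokenizer.encode(text))
    print(f"{n_tokens / elapsed:.1f} tok/sec")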

If you care about runtime efficiency per watt + low-latency inference (vs. maximum capability), these newer Desloth-style architectures are going to be a serious unlock.

reply
> apple silicon chips are also more energy efficient for llm (or gaming) than nvidia.

Which benchmarks are you working off of, exactly? Unless the workload is memory-bandwidth bound, neither raster nor compute workloads on the M4 are more energy-efficient than Nvidia's 50-series silicon: https://browser.geekbench.com/opencl-benchmarks

reply
NVIDIA GeForce RTX 5090: 376224, at 400-550W for the GPU plus 250-500W for CPU/RAM/cooling/etc.

Apple M3 Ultra: 131247, at ~200W for the whole system [1]

So the 5090 is about 2.9x faster in this benchmark, but the system draws at least 3.25x more power.
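
Back-of-the-envelope, score per watt from those same numbers:

    # Geekbench OpenCL points per watt of total system power
    rtx_5090 = 376224 / (400 + 250)  # best-case system draw -> ~579 pts/W
    m3_ultra = 131247 / 200          # -> ~656 pts/W
    print(f"5090 system: {rtx_5090:.0f} pts/W, M3 Ultra: {m3_ultra:.0f} pts/W")

So even granting the 5090 box its lowest plausible draw, the M3 Ultra comes out roughly 13% ahead on points per watt.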

[1] https://www.tweaktown.com/news/103891/apples-new-m3-ultra-ru...

reply
Thank you for the numbers.

What have you used those models for, and how would you rate them in those tasks?

reply
RPG prompts work very well with many of the models, but not the reasoning ones, because they end up thinking endlessly about how to be the absolute best game master possible...

reply
Great use case. And a very funny situation with the reasoning models! :)

reply
How does MLX compare with the llama.cpp backend in LM Studio?

reply