I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):
% llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
ggml_metal_device_init: GPU name: MTL0
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 134217.73 MB
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 | 189.67 ± 1.98 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 | 19.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d10000 | 168.92 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d10000 | 18.93 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d20000 | 152.42 ± 0.22 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d20000 | 17.87 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d30000 | 139.37 ± 0.28 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d30000 | 17.12 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d40000 | 128.38 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d40000 | 16.38 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d50000 | 118.07 ± 0.55 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d50000 | 15.66 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d60000 | 108.44 ± 0.38 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d60000 | 14.98 ± 0.01 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d70000 | 98.85 ± 0.18 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d70000 | 14.36 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d80000 | 91.39 ± 0.49 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d80000 | 13.84 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d90000 | 85.76 ± 0.24 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d90000 | 13.30 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d100000 | 80.19 ± 0.83 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d100000 | 12.82 ± 0.00 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d150000 | 54.46 ± 0.33 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d150000 | 10.17 ± 0.09 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d200000 | 47.05 ± 0.15 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d200000 | 9.04 ± 0.02 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | pp512 @ d250000 | 40.71 ± 0.26 |
| qwen35moe 397B.A17B Q8_0 | 113.41 GiB | 396.35 B | MTL,BLAS | 1 | 2048 | 1 | tg128 @ d250000 | 8.01 ± 0.02 |
build: d28961d81 (8299)
So it starts at 20 tps tg and 190 tps pp with empty context and ends at 8 tps tg and 40 tps pp with 250k prefill.I suspect that there are still a lot of optimizations to be implemented for Qwen 3.5 on llama.cpp, wouldn't be surprised to reach 25 tps in a few months.
> You're the guy who launched Neovim!
That's me ;D
> I use it every day.
So do I for the past 12 years! Though I admit in the past year I greatly reduced the amount of code I write by hand :/
@justinmk deserves the credit for this!
Have you compared against MLX? Sometimes I’m getting much faster responses but it feels like the quality is worse (eg tool calls not working, etc)
I don't think MLX supports similar 2-bit quants, so I never tried 397B with MLX.
However I did try 4-bit MLX with other Qwen 3.5 models and yes it is significantly faster. I still prefer llama.cpp due to it being a one in all package:
- SOTA dynamic quants (especially ik_llama.cpp) - amazing web ui with MCP support - anthropic/openai compatible endpoints (means it can be used with virtually any harness) - JSON constrained output which basically ensures tool call correctness. - routing mode