undefined

points

[-]

> What's the tok/s you get these days?

I ran llama-bench a couple of weeks ago when there was a big speed improvement on llama.cpp (https://github.com/ggml-org/llama.cpp/pull/20361#issuecommen...):

    % llama-bench -m ~/ml-models/huggingface/ubergarm/Qwen3.5-397B-A17B-GGUF/smol-IQ2_XS/Qwen3.5-397B-A17B-smol-IQ2_XS-00001-of-00004.gguf -fa 1 -t 1 -ngl 99 -b 2048 -ub 2048 -d 0,10000,20000,30000,40000,50000,60000,70000,80000,90000,100000,150000,200000,250000
    ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
    ggml_metal_library_init: using embedded metal library
    ggml_metal_library_init: loaded in 0.008 sec
    ggml_metal_rsets_init: creating a residency set collection (keep_alive = 180 s)
    ggml_metal_device_init: GPU name:   MTL0
    ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
    ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
    ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
    ggml_metal_device_init: simdgroup reduction   = true
    ggml_metal_device_init: simdgroup matrix mul. = true
    ggml_metal_device_init: has unified memory    = true
    ggml_metal_device_init: has bfloat            = true
    ggml_metal_device_init: has tensor            = false
    ggml_metal_device_init: use residency sets    = true
    ggml_metal_device_init: use shared buffers    = true
    ggml_metal_device_init: recommendedMaxWorkingSetSize  = 134217.73 MB
    | ------------------------------ | ---------: | ---------: | ---------- | ------: | -------: | -: | --------------: | -------------------: |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           pp512 |        189.67 ± 1.98 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |           tg128 |         19.98 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d10000 |        168.92 ± 0.55 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d10000 |         18.93 ± 0.02 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d20000 |        152.42 ± 0.22 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d20000 |         17.87 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d30000 |        139.37 ± 0.28 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d30000 |         17.12 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d40000 |        128.38 ± 0.33 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d40000 |         16.38 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d50000 |        118.07 ± 0.55 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d50000 |         15.66 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d60000 |        108.44 ± 0.38 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d60000 |         14.98 ± 0.01 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d70000 |         98.85 ± 0.18 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d70000 |         14.36 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d80000 |         91.39 ± 0.49 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d80000 |         13.84 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  pp512 @ d90000 |         85.76 ± 0.24 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 |  tg128 @ d90000 |         13.30 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d100000 |         80.19 ± 0.83 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d100000 |         12.82 ± 0.00 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d150000 |         54.46 ± 0.33 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d150000 |         10.17 ± 0.09 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d200000 |         47.05 ± 0.15 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d200000 |          9.04 ± 0.02 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | pp512 @ d250000 |         40.71 ± 0.26 |
    | qwen35moe 397B.A17B Q8_0       | 113.41 GiB |   396.35 B | MTL,BLAS   |       1 |     2048 |  1 | tg128 @ d250000 |          8.01 ± 0.02 |

    build: d28961d81 (8299)

So it starts at 20 tps tg and 190 tps pp with empty context and ends at 8 tps tg and 40 tps pp with 250k prefill.

I suspect that there are still a lot of optimizations to be implemented for Qwen 3.5 on llama.cpp, wouldn't be surprised to reach 25 tps in a few months.

> You're the guy who launched Neovim!

That's me ;D

> I use it every day.

So do I for the past 12 years! Though I admit in the past year I greatly reduced the amount of code I write by hand :/

by hnfong5 hours ago|

parent|

[-]

Apologies to others for the offtopic comment, but thank you so much for neovim. I started using Vim 25 years ago and I almost don't know how to type without a proper Vi-based editor. I don't write as much code these days, but I write other stuff (which definitely needs to be mostly hand written) in neovim and I feel so grateful that this tool is still receiving love and getting new updates.

by tarruda4 hours ago|

parent|

[-]

> in neovim and I feel so grateful that this tool is still receiving love and getting new updates.

@justinmk deserves the credit for this!

by terhechte5 hours ago|

parent|

prev|

[-]

Thank you for NeoVim! I also use it every day, mostly for thinking / text / markdown though these days.

Have you compared against MLX? Sometimes I’m getting much faster responses but it feels like the quality is worse (eg tool calls not working, etc)

by tarruda4 hours ago|

parent|

[-]

> Have you compared against MLX?

I don't think MLX supports similar 2-bit quants, so I never tried 397B with MLX.

However I did try 4-bit MLX with other Qwen 3.5 models and yes it is significantly faster. I still prefer llama.cpp due to it being a one in all package:

- SOTA dynamic quants (especially ik_llama.cpp) - amazing web ui with MCP support - anthropic/openai compatible endpoints (means it can be used with virtually any harness) - JSON constrained output which basically ensures tool call correctness. - routing mode

by arjie5 hours ago|

parent|

prev|

[-]

That's surprisingly fast. Thanks for sharing.