https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...
vLLM apparently already ships an implementation of turboquant, which is said to losslessly reduce the memory footprint by 6x and improve inference speed by 8x.
From what I understand, the steps are:
1. launch vLLM
2. execute a vLLM configure command along the lines of "use kv-turboquant for model xyz"
3. that's it
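For what it's worth, I haven't been able to verify a "turboquant" configure command in vLLM myself, so treat that flag name as unverified. The closest thing I know actually exists today is KV-cache quantization, which is enabled at launch time rather than via a separate configure step. A hedged sketch (the model name is just a placeholder):

```shell
# Sketch only: --kv-cache-dtype fp8 is a real vLLM serve flag that
# quantizes the KV cache. Whether a "turboquant" option exists, or
# maps onto this, is an assumption I haven't verified.
vllm serve <your-model-here> --kv-cache-dtype fp8
```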
I've got two kids under 8 years old, a full-time job, and a developer-tools project that takes like 105% of my mental interest... so finding the time to swap from Ollama to vLLM and find out whether that's true has been a bit of a challenge.
SO buyer beware :D - and if anyone tries it, please let me know whether it was worth the time!
pip install mlx_lm
python -m mlx_lm.convert --hf-path Qwen/Qwen3.6-27B --mlx-path ~/.mlx/models/Qwen3.6-27B-mxfp4 --quantize --q-mode mxfp4 --trust-remote-code
mlx_lm.generate --model ~/.mlx/models/Qwen3.6-27B-mxfp4 -p 'how cpu works' --max-tokens 300
Prompt: 13 tokens, 51.448 tokens-per-sec
Generation: 300 tokens, 35.469 tokens-per-sec
Peak memory: 14.531 GB
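That peak-memory number roughly checks out on napkin math, assuming mxfp4 works out to about 4.25 bits per weight (4-bit values plus one 8-bit shared scale per 32-element block; the exact block layout is an assumption on my part, and peak memory also includes KV cache and activations on top of the weights):

```python
# Napkin math: why a 27B-parameter model quantized to mxfp4 fits in ~15 GB.
# Assumption: ~4.25 effective bits/weight (4-bit values + 8-bit scale per
# 32-element block). Peak RSS will be a bit higher (KV cache, activations).
PARAMS = 27e9
bits_per_weight = 4 + 8 / 32                      # 4.25 bits
mxfp4_gb = PARAMS * bits_per_weight / 8 / 1e9     # bytes -> GB
fp16_gb = PARAMS * 16 / 8 / 1e9
print(f"mxfp4 weights: {mxfp4_gb:.1f} GB")        # ~14.3 GB
print(f"fp16 weights:  {fp16_gb:.1f} GB")         # 54.0 GB
```

Which lines up pretty well with the 14.531 GB peak reported above, versus ~54 GB for the same weights in fp16.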