https://docs.vllm.ai/en/latest/api/vllm/model_executor/layer...
vLLM apparently already ships an implementation of turboquant, which is said to losslessly reduce the memory footprint by 6x and improve inference speed by 8x.
From what I understand, the steps are:
1. launch vLLM
2. execute a vLLM configure command along the lines of "use kv-turboquant for model xyz"
3. that's it
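For what it's worth, I haven't been able to verify a "turboquant" configure command in vLLM myself, so treat that flag name as unverified. The closest thing I know actually exists today is KV-cache quantization, which is enabled at launch time rather than via a separate configure step. A hedged sketch (the model name is just a placeholder):

```shell
# Sketch only: --kv-cache-dtype fp8 is a real vLLM serve flag that
# quantizes the KV cache. Whether a "turboquant" option exists, or
# maps onto this, is an assumption I haven't verified.
vllm serve <your-model-here> --kv-cache-dtype fp8
```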
I've got two kids under 8 years old, a full-time job, and a developer-tools project that takes like 105% of my mental interest... so finding the time to swap from Ollama to vLLM and find out whether that's true has been a bit of a challenge.
SO buyer beware :D - and if anyone tries it, please let me know whether it was worth the time!
pip install mlx_lm
python -m mlx_lm.convert --hf-path Qwen/Qwen3.6-27B --mlx-path ~/.mlx/models/Qwen3.6-27B-mxfp4 --quantize --q-mode mxfp4 --trust-remote-code
mlx_lm.generate --model ~/.mlx/models/Qwen3.6-27B-mxfp4 -p 'how cpu works' --max-tokens 300
Prompt: 13 tokens, 51.448 tokens-per-sec
Generation: 300 tokens, 35.469 tokens-per-sec
Peak memory: 14.531 GB
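That peak-memory number roughly checks out on napkin math, assuming mxfp4 works out to about 4.25 bits per weight (4-bit values plus one 8-bit shared scale per 32-element block; the exact block layout is an assumption on my part, and peak memory also includes KV cache and activations on top of the weights):

```python
# Napkin math: why a 27B-parameter model quantized to mxfp4 fits in ~15 GB.
# Assumption: ~4.25 effective bits/weight (4-bit values + 8-bit scale per
# 32-element block). Peak RSS will be a bit higher (KV cache, activations).
PARAMS = 27e9
bits_per_weight = 4 + 8 / 32                      # 4.25 bits
mxfp4_gb = PARAMS * bits_per_weight / 8 / 1e9     # bytes -> GB
fp16_gb = PARAMS * 16 / 8 / 1e9
print(f"mxfp4 weights: {mxfp4_gb:.1f} GB")        # ~14.3 GB
print(f"fp16 weights:  {fp16_gb:.1f} GB")         # 54.0 GB
```

Which lines up pretty well with the 14.531 GB peak reported above, versus ~54 GB for the same weights in fp16.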