Heavily quantized, and offloading everything possible to system RAM. Run that way, it is just barely reachable on consumer hardware with 16 to 24GB of VRAM and 256GB of system RAM. Before the spike in prices you could just about build such a system for $2500, but the RAM alone probably adds another $2k onto that now. Nvidia DGX boxes and similar setups with 256GB of unified RAM can probably manage it more slowly, at ~1-2 tokens per second. Unsloth has the quantized models. I've tested Kimi, though I don't quite have the headroom at home for it, and I don't yet see a significant enough difference between it and the Qwen 3 models that run on more modest setups. I get a highly usable 50 tokens per second out of the A3B instruct, which fits into 16GB of VRAM with enough left over not to choke Netflix and other browser tasks. It performs on par with what I ask of Haiku in Claude Code, and gets better as my own tweaking improves and as the tooling, which seems to improve near weekly, keeps getting better.
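For anyone wanting to sanity-check whether a given quant fits in a VRAM plus system RAM budget like the one above, here is a rough back-of-envelope sketch. The parameter counts in the example calls (1000 billion for a Kimi-class model, 30 billion for the A3B) are my own illustrative assumptions, not figures from this comment, and the estimate only counts weights: KV cache, activations, and quantization metadata are ignored, so treat it as a lower bound.

```python
def quantized_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    params * bits / 8 gives bytes; with params in billions the 1e9 cancels.
    """
    return total_params_billion * bits_per_weight / 8.0


def fits(total_params_billion: float, bits_per_weight: float,
         vram_gb: float, sysram_gb: float, overhead_gb: float = 0.0) -> bool:
    """True if the weights (plus an optional overhead guess) fit in VRAM + system RAM."""
    needed = quantized_size_gb(total_params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb + sysram_gb


if __name__ == "__main__":
    # Assumed sizes, for illustration only.
    print(quantized_size_gb(1000, 2))   # hypothetical ~1T-parameter model at 2 bits
    print(fits(1000, 2, vram_gb=24, sysram_gb=256))
    print(fits(1000, 4, vram_gb=24, sysram_gb=256))
    print(quantized_size_gb(30, 4))     # hypothetical ~30B-parameter model at 4 bits
```

Under those assumed sizes, the arithmetic lines up with the "just barely reachable" description: a very aggressive quant squeezes into 24GB + 256GB, while a 4-bit quant of the same model would not.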

by oceanplexian 18 hours ago

I have an AMD Epyc machine with 512GB of RAM and a humble NVIDIA 3090. You will have to run a quantized version, but you can get a couple of tokens per second out of it, since these models are optimized to split across the GPU and RAM. It's about as good as Claude was 12 months ago.

Full disclosure: I use OpenRouter and pay for models most of the time, since it's more practical than 5-10 tokens per second, but the option to run it "if I had to, worst case" is good enough for me. We're also in a rapidly developing technology space, and the models are getting smaller and better by the day.
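As a rough illustration of why CPU-offloaded setups like this land in the single-digit tokens per second range: a common rule of thumb is that decoding is bound by memory bandwidth, because the active weights have to be read once per generated token. The sketch below applies that rule. The active parameter count, bit width, and bandwidth in the example call are hypothetical placeholders, not measurements from my machine, and real throughput will be lower than this ceiling.

```python
def decode_tokens_per_second_upper_bound(active_params_billion: float,
                                         bits_per_weight: float,
                                         memory_bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling on decode speed.

    Assumes each token requires reading every active weight once from memory,
    and ignores compute time, KV cache reads, and GPU/CPU transfer overhead.
    """
    active_gb_per_token = active_params_billion * bits_per_weight / 8.0
    if active_gb_per_token <= 0:
        raise ValueError("active weight size must be positive")
    return memory_bandwidth_gb_s / active_gb_per_token


if __name__ == "__main__":
    # Hypothetical numbers: 32B active parameters at 4 bits, 200 GB/s of RAM bandwidth.
    print(decode_tokens_per_second_upper_bound(32, 4, 200))
```

For mixture-of-experts models only the active parameters count per token, which is why a model too big for the GPU can still be usable when most of it sits in system RAM.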