I was using GLM on the ZAI coding plan (jury-rigged into Claude Code for $3/month), but kept finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.
To clarify, the code I was getting before mostly worked; it was just a lot less pleasant to look at and work with. That might be a matter of taste, but I found it had a big impact on my morale and productivity.
This is a very common sequence of events.
The frontier hosted models are so much better than everything else that it's not worth messing around with anything lesser if you're doing this professionally. The $20/month plans go a long way if context is managed carefully. For a professional developer or consultant, the $200/month plan is peanuts relative to compensation.
I've also been testing OpenClaw. It burned 8M tokens during my half hour of testing, which would have been like $50 with Opus on the API. (Which is why everyone was using it with the sub, until Anthropic apparently banned that.)
I was using GLM on Cerebras instead, so it was only $10 per half hour ;) I tried to get their Coding plan ("unlimited" for $50/mo), but it was sold out...
(My fallback is a whole year of GLM from ZAI for $20; it's just a bit too slow for interactive use.)
Kimi K2.5 is a trillion-parameter model. You can't run it locally on anything other than extremely well-equipped hardware. Even heavily quantized you'd still need 512GB of unified memory, and the quantization would hurt performance.
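Back-of-envelope math on why (a sketch, assuming a round 1T total parameter count and counting weights only; KV cache and runtime overhead come on top, and an MoE model still needs every expert resident in memory):

    # Approximate weight footprint for a ~1T-parameter model
    # at different quantization widths (weights only).
    PARAMS = 1.0e12  # assumed round parameter count

    for name, bits in [("f16", 16), ("q8", 8), ("q4", 4)]:
        gb = PARAMS * bits / 8 / 1e9
        print(f"{name}: ~{gb:,.0f} GB")

    # f16: ~2,000 GB; q8: ~1,000 GB; q4: ~500 GB.
    # Even a 4-bit quant barely squeezes into 512GB of unified memory.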
Also, the proprietary models of a year ago were not that good for anything beyond basic tasks.
Are there a lot of options for how far you quantize? How much VRAM does it take to get the 92-95% you are speaking of?
So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...
> How much VRAM does it take to get the 92-95% you are speaking of?
For inference, it's heavily dependent on the size of the weights (plus context). Quantizing to q4/mxfp4 won't necessarily save you 92-95% of VRAM: weight memory shrinks roughly in proportion to bit width, so q4 is about 87% smaller than f32 but only about 75% smaller than f16, and the KV cache for your context comes on top of that.
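To put rough numbers on that (a sketch; the effective bits-per-weight figures for common GGUF quants are approximate, and the example 70B model is hypothetical):

    def weight_gb(params_b: float, bits_per_weight: float) -> float:
        """Approximate weight memory in GB for a model of
        params_b billion parameters at a given quant width."""
        return params_b * bits_per_weight / 8

    # Example: a hypothetical 70B dense model
    for quant, bpw in [("f16", 16.0), ("q8_0", 8.5), ("q4_K_M", 4.8), ("mxfp4", 4.25)]:
        print(f"{quant}: ~{weight_gb(70, bpw):.0f} GB")

    # f16 ~140 GB, q8_0 ~74 GB, q4_K_M ~42 GB, mxfp4 ~37 GB:
    # a q4-class quant cuts weight VRAM by roughly 70-75% vs f16.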
I'm completely over these hypotheticals and 'testing-grade' setups.
I know Nvidia VRAM works, not some marketing about 'integrated RAM'. Heck, look at /r/LocalLLaMA/. There is a reason it's entirely Nvidia.
That's simply not true. Nvidia may be relatively popular, but people use all sorts of hardware there. Here's a random handful of recent self-reported hardware setups from comments:
- https://www.reddit.com/r/LocalLLaMA/comments/1qw15gl/comment...
- https://www.reddit.com/r/LocalLLaMA/comments/1qw0ogw/analysi...
- https://www.reddit.com/r/LocalLLaMA/comments/1qvwi21/need_he...
- https://www.reddit.com/r/LocalLLaMA/comments/1qvvf8y/demysti...
This is a _remarkably_ aggressive comment!
I'll add https://unsloth.ai/docs/models/qwen3-coder-next
The full model is supposedly comparable to Sonnet 4.5, but you can run the 4-bit quant on consumer hardware as long as your RAM + VRAM combined have room to hold 46GB. The 8-bit quant needs 85GB.
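Those figures line up with the usual bits-per-weight arithmetic for an ~80B model (approximate effective quant widths, weights only):

    # ~80B params at approximate effective quant widths
    print(80 * 4.6 / 8)  # ≈ 46 GB for the 4-bit quant
    print(80 * 8.5 / 8)  # ≈ 85 GB for the 8-bit quant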
But as a counterpoint: there are whole communities of people in this space who get significant value from models they run locally. I am one of them.
It is probably enough to handle a lot of what people use the big-3 closed models for. Somewhat slower and somewhat dumber, granted, but still extraordinarily capable. It punches way above its weight class for an 80B model.
PC or Mac? A PC, yeah, no way, not without beefy GPUs with lots of VRAM. A Mac? Depends on the chip; an M3 Ultra with 128GB of unified RAM is going to get closer, at least. You can have decent experiences with a Max chip + 64GB of unified RAM (well, that's my setup at least).
That said, Claude Code is designed to work with Anthropic's models. Agents have a buttload of custom work going on in the background to massage specific models to do things well.
That might help it perform slightly better from the get-go.
Whether it's a giant corporate model or something you run locally, there is no intelligence there. It's still just a lying engine. It will tell you the string of tokens most likely to come after your prompt based on training data that was stolen and used against the wishes of its original creators.