Yeah this is why I ended up getting Claude subscription in the first place.

I was using GLM on the ZAI coding plan (jerry-rigged Claude Code for $3/month), but found myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.

To clarify, the code I was getting before mostly worked; it was just a lot less pleasant to look at and work with. Might be a matter of taste, but I found it had a big impact on my morale and productivity.

reply
> but found myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.

This is a very common sequence of events.

The frontier hosted models are so much better than everything else that it's not worth messing around with anything lesser if you're doing this professionally. The $20/month plans go a long way if context is managed carefully. For a professional developer or consultant, the $200/month plan is peanuts relative to compensation.

reply
Did you eventually move to the $20/mo Claude plan, the $100/mo plan, the $200/mo plan, or API-based? If API-based, how much are you averaging a month?
reply
The $20 one, but it's hobby use for me; I'd probably need the $200 one if I was full time. Ran into the 5-hour limit in like 30 minutes the other day.

I've also been testing OpenClaw. It burned 8M tokens during my half hour of testing, which would have been like $50 with Opus on the API. (Which is why everyone was using it with the sub, until Anthropic apparently banned that.)

I was using GLM on Cerebras instead, so it was only $10 per half hour ;) Tried to get their Coding plan ("unlimited" for $50/mo) but it was sold out...

(My fallback is a whole year of GLM from ZAI for $20; it's just a bit too slow for interactive use.)

reply
The best open models, such as Kimi K2.5, are about as smart today as the big proprietary models were one year ago. That's not "nothing", and it's plenty good enough for everyday work.
reply
> The best open models, such as Kimi K2.5, are about as smart today as the big proprietary models were one year ago

Kimi K2.5 is a trillion-parameter model. You can't run it locally on anything other than extremely well-equipped hardware. Even heavily quantized (a trillion parameters at ~4 bits is already ~500GB of weights) you'd still need 512GB of unified memory, and the quantization would hurt performance.

Also the proprietary models a year ago were not that good for anything beyond basic tasks.

reply
Which takes a ~$20k Thunderbolt cluster of two 512GB Mac Studio Ultras to run at full quality…
reply
Most benchmarks show very little improvement from "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and keep 92-95% of the performance, with far less VRAM use.
reply
> You can shrink the model to a fraction of its "full" size and keep 92-95% of the performance, with far less VRAM use.

Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you are speaking of?

reply
> Are there a lot of options for "how far" you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, VRAM use is heavily dependent on the size of the weights (plus the KV cache for context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it gets pretty close at smaller contexts.
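
For a rough back-of-the-envelope, the usual estimate is weights plus KV cache. Here's a minimal Python sketch; the 70B size and the layer/head counts are illustrative assumptions, not measurements of any particular model:

    # Back-of-the-envelope VRAM estimate for inference: weights + KV cache.
    # Ignores activation and framework overhead, which adds a few GB in practice.

    def weights_gb(params_billions: float, bits_per_weight: float) -> float:
        # 16 bits for f16, roughly 4.5 bits for a q4_K_M-style quant
        return params_billions * 1e9 * bits_per_weight / 8 / 1e9

    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    context_len: int, bytes_per_elem: int = 2) -> float:
        # factor of 2 covers keys and values; f16 cache assumed
        return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

    print(weights_gb(70, 16))              # ~140 GB of weights at f16
    print(weights_gb(70, 4.5))             # ~39 GB at a 4-bit quant
    print(kv_cache_gb(80, 8, 128, 32768))  # ~11 GB of KV cache at 32K context

So a 70B-class dense model goes from roughly 140GB of weights at f16 to roughly 40GB at 4-bit, with the KV cache adding a handful of GB depending on context length.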

reply
Thank you. Could you give a tl;dr rough estimate, along the lines of "the full model needs ____ VRAM, and with ____, the most common quantization method, it will run in ____ VRAM", please?
reply
"Full quality" being a relative assessment, here. You're still deeply compute constrained, that machine would crawl at longer contexts.
reply
[flagged]
reply
deleted
reply
70B dense models are way behind SOTA. Even the aforementioned Kimi K2.5 has fewer active parameters than that, and it's then quantized at int4. We're at a point where some near-frontier models may run out of the box on Mac Mini-grade hardware, with perhaps no real need to even upgrade to a Mac Studio.
reply
>may

I'm completely over these hypotheticals and 'testing grade'.

I know Nvidia VRAM works, not some marketing about 'integrated RAM'. Heck, look at /r/LocalLLaMA. There is a reason it's entirely Nvidia.

reply
> Heck, look at /r/LocalLLaMA. There is a reason it's entirely Nvidia.

That's simply not true. Nvidia may be relatively popular, but people use all sorts of hardware there. Here's a random sample of recently self-reported hardware from comments:

- https://www.reddit.com/r/LocalLLaMA/comments/1qw15gl/comment...

- https://www.reddit.com/r/LocalLLaMA/comments/1qw0ogw/analysi...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvwi21/need_he...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvvf8y/demysti...

reply
Mmmm, not really. I have both a 4x 3090 box and a Mac M1 with 64GB. I find that the Mac performs about the same as a 2x 3090. That's nothing stellar, but you can run 70B models at decent quants with moderate context windows. Definitely useful for a lot of stuff.
reply
Are you an NVIDIA fanboy?

This is a _remarkably_ aggressive comment!

reply
Not at all. I don't even know why someone would be incentivized to promote Nvidia outside of holding a large amount of stock. Although I did stick my neck out suggesting we buy A6000s after the Apple M series didn't work. To no one's surprise, the 2x A6000s did work.
reply
Which, while expensive, is dirt cheap compared to a comparable Nvidia or AMD system.
reply
It's still very expensive compared to using the hosted models, which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.
reply
Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.
reply
Inference APIs are probably profitable, but I doubt the $20-$100 monthly plans are.
reply
For sure Claude Code isn’t profitable
reply
Neither was Uber and … and …
reply
Businesses will desire me for my insomnia once Anthropic starts charging congestion pricing.
reply
What speed are you getting at that level of hardware though?
reply
Kimi K2.5 is in fourth place for intelligence right now. It's not as good as the top frontier models at coding, but it's better than Claude 4.5 Sonnet: https://artificialanalysis.ai/models
reply
The article mentions https://unsloth.ai/docs/basics/claude-codex

I'll add on https://unsloth.ai/docs/models/qwen3-coder-next

The full model is supposedly comparable to Sonnet 4.5, but you can run the 4-bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. The 8-bit needs 85GB.
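
If you want to poke at it yourself, here's a minimal sketch using llama-cpp-python with a GGUF quant. The filename and the layer-offload count are placeholders (not the actual release names); point them at whatever you downloaded and however much VRAM you have:

    # Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwen3-coder-next-Q4_K_M.gguf",  # placeholder filename for the 4-bit quant
        n_ctx=32768,      # context window; larger contexts need more memory
        n_gpu_layers=30,  # tune to your VRAM; -1 offloads everything, 0 runs CPU-only
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Write a C function that reverses a linked list."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])

Layers you don't offload run from system RAM on the CPU, which is why the 46GB figure is RAM + VRAM combined.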

reply
LOCAL models. No one is running Kimi K2.5 on their MacBook or RTX 4090.
reply
On MacBooks, no. But there are a few lunatics like this guy:

https://www.youtube.com/watch?v=bFgTxr5yst0

reply
Having used K2.5, I'd judge it to be a little better than that. Maybe as good as the proprietary models from last June?
reply
Correct, nothing that fits on your desk or lap is going to compete with a rack full of datacenter equipment. Well spotted.

But as a counterpoint: there are whole communities of people in this space who get significant value from models they run locally. I am one of them.

reply
What do you use local models for? I'm asking generally about possible applications of these smaller models.
reply
Would you mind sharing your hardware setup and use case(s)?
reply
Not the GP, but the new Qwen3-Coder-Next release feels like a step change: 60 tokens per second on a single 96GB Blackwell. And that's at full 8-bit quantization and 256K context, which I wasn't sure was going to work at all.

It is probably enough to handle a lot of what people use the big-3 closed models for. Somewhat slower and somewhat dumber, granted, but still extraordinarily capable. It punches way above its weight class for an 80B model.

reply
Agree, these new models are a game changer. I switched from Claude to Qwen3-Coder-Next for day-to-day work on dev projects and don't see a big difference. I just use Claude when I need comprehensive planning or review. Running Qwen3-Coder-Next at Q8 with 256K context.
reply
IIRC, that new Qwen model has 3B active parameters, so it's going to run well enough even on far less than 96GB of VRAM. (Though more VRAM of course helps with enabling the full available context length.) Very impressive work from the Qwen folks.
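
That's the nice property of MoE: all the experts still have to fit in memory, but each decoded token only touches the active slice, so decode speed is roughly bounded by memory bandwidth over active bytes. A rough sketch using the ~80B-total / ~3B-active figures from this thread (the bandwidth numbers are illustrative, and real throughput lands well below these ceilings):

    # MoE intuition: weights for ALL experts must fit in memory, but each token
    # only reads the active subset, so decode is roughly bandwidth-bound on it.

    def q4_weights_gb(total_params_billions: float, bits: float = 4.5) -> float:
        return total_params_billions * 1e9 * bits / 8 / 1e9

    def decode_ceiling_tok_s(active_params_billions: float, bandwidth_gb_s: float,
                             bits: float = 4.5) -> float:
        active_gb = active_params_billions * 1e9 * bits / 8 / 1e9
        return bandwidth_gb_s / active_gb

    print(q4_weights_gb(80))             # ~45 GB just to hold the quantized weights
    print(decode_ceiling_tok_s(3, 250))  # ~150 tok/s ceiling at ~250 GB/s (iGPU-class)
    print(decode_ceiling_tok_s(3, 800))  # ~475 tok/s ceiling at ~800 GB/s (Ultra-class)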
reply
deleted
reply
> (ones you run on beefy 128GB+ RAM machines)

PC or Mac? A PC, yeah, no way, not without beefy GPUs with lots of VRAM. A Mac? Depends on the CPU: an M3 Ultra with 128GB of unified RAM is going to get closer, at least. You can have decent experiences with a Max CPU + 64GB of unified RAM (well, that's my setup at least).

reply
Which models do you use, and how do you run them?
reply
There are tons of improvements coming in the near future. Even Claude Code's developer said he aimed at delivering a product built for future models he bet would improve enough to fulfill his assumptions. Parallel vLLM MoE local LLMs on a 128GB Strix Halo have some life in them yet.
reply
The best local models are literally right behind Claude/Gemini/Codex. Check the benchmarks.

That said, Claude Code is designed to work with Anthropic's models. Agents have a buttload of custom work going on in the background to massage specific models to do things well.

reply
The benchmarks simply do not match my experience though. I don’t put that much stock in them anymore.
reply
This. It's a false economy if you value your time even slightly; pay for the extra tokens and use the premium models.
reply
Maybe add to the Claude system prompt that it should work efficiently or else its unfinished work will be handed off to a stupider junior LLM when its limits run out, and it will be forced to deal with the fallout the next day.

That might incentivize it to perform slightly better from the get-go.

reply
"You must always take two steps forward, for when you are off the clock, your adversary will take one step back."
reply
Exactly. The comparison benchmark in the local LLM community is often GPT _3.5_, and most home machines can’t achieve that level.
reply
And you really should be measuring based on the worst-case scenario for tools like this.
reply
> intelligence

Whether it's a giant corporate model or something you run locally, there is no intelligence there. It's still just a lying engine. It will tell you the string of tokens most likely to come after your prompt based on training data that was stolen and used against the wishes of its original creators.

reply