The best "free" experience I've found is using OpenCode with Big Pickle. It's not especially smart, so it often won't produce the correct result the first time, but the free tier is generous enough that I don't think I've hit the limit more than twice over around a month with frequent multi-hour sessions. If running locally is truly the goal, it's not going to fit the bill, but if the goal is just "get the best experience without having to pay for a sub or tokens", it's the least bad option I've found so far.
Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.
I haven’t tried any tool that compresses the tokens yet.
1. The hardware will eventually catch up.
2. This keeps the delta between frontier models smaller.
3. We can still fine tune and own the weights.
4. The models will be more useful, faster, and reliable.
RTX is hobbyist tier, not professional tier.
Gated cloud models from hyperscalers treat us like hobbyists in their own right.
We need equivalent scale models, but open.
I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.
I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.
Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.
For me, paying from $200 - $500 / month is reasonable if I can sustain a disruption free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).
There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.
I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to deal with having access to client interactions with LLMs at all. Whatever standards and protocols have to be come up with in order to reassure clients.
Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.
I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.
If you throw the right GPUs at the problem, they become much better - but still not quite in the realm of Sonnet or DeepSeek 4 Flash, let alone Opus / DeepSeek Pro or Mythos/Fable/GPT-5.5.
Given enough budget, power, and cooling, you can run some pretty good data pipelines, but for code, I think it still makes sense to shell out to an API provider most of the time.
Ultimately if you skip over the opportunity to play with these models on your own machine you are losing out on a lot of really interesting educational opportunities — it helps make a lot of stuff feel more concrete in a way that only tinkering can.
But then I think once I had an idea of something that I was building against Gemma 4 or Qwen 3.6 I would be looking at openrouter etc., to stabilise it for the next tier of experimentation (and to get back a kind of multi-device access without tailscale/lm link etc.).
Are they good enough to replace what people seem to want to do with Claude? Maybe not. But it's an unparalleled learning opportunity.
Even faster with the MLX builds.
Then when I need more heavy lifting I fire up a larger model.
IMHO the issue isn't the models. I've had OpenClaw give the same results as Claude using open models locally. Slower but does the job. Something that can do optimal model switching is what's needed.
> but still not quite in the realm of Sonnet or DeepSeek 4 Flash
these are not mutually exclusive anymore. DS4 has set the bar for me these days. https://github.com/antirez/ds4
This setup is extremely optimized down to the last flag. Changing any param above the temp flag craters performance.
I don't have enough system RAM to properly handle the large context windows so I don't use local models.
# 1,257 tokens 17s 72.18 t/s
$env:CUDA_DEVICE_SCHEDULE = "SPIN"
cd D:\src\llama.cpp\
.\build\bin\Release\llama-server.exe `
--port 8080 `
--host 127.0.0.1 `
-m "D:\LLM\Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf" `
-fitt 2048 `
-c 98304 `
-n 32768 `
-fa on `
-np 1 `
--kv-unified `
-ctk q8_0 `
-ctv q8_0 `
-ctkd q8_0 `
-ctvd q8_0 `
-ctxcp 64 `
--mlock `
--no-warmup `
--spec-type draft-mtp `
--spec-draft-n-max 2 `
--spec-draft-p-min 0.1 `
--chat-template-kwargs '{\"preserve_thinking\": true}' `
--temp 0.6 `
--top-p 0.95 `
--top-k 20 `
--min-p 0.0 `
--presence-penalty 0.0 `
--repeat-penalty 1.0Also, funny lumping the M4 "and" the M5, I find them 15% to 45% different performance, depending.
And for a good deal of work, an M3 Studio Ultra outpaces the M4 and ties the M5 on single work at a time, outpaces both doing multiple work at a time.
The Q4_K_XL bit for those not in the know.
local models do involve some context engineering to get it okay, but it's not that rough
It outperforms all the Qwen models (even 100B+) for rule following/automation style tasks in my experience. Its image interpretation is also very good, and out-benchmarks Opus.
Qwen seems to ignore instructions and consistently outputs incorrect formats (when token generation format is not explicitly constrained)
But yes, on the DGX Spark Gemma 31B Q4 with MTP runs around 20 tok/s and Gemma 26B A4B around 60 tok/s. Still quite slow. But on a high end Nvidia card would run significantly faster and still fit in memory.
I'd recommend for anyone getting into local models to focus on memory bandwidth over RAM. Models under 100B parameters are now sufficient and hugely useful for automation.
I agree that for coding/creation use cases, there's still not a compelling argument for local models.
But e.g. if you want to scan a list of stocks and interpret news/high pass filtering, interpreting logs, interpreting screenshots, the local models are more than sufficient already.
Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...
E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.
Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).
Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.
Try asking the smaller Qwen models to output a JSON in a specific format. It basically can't do it consistently with a moderately sized prompt unless you constrain the token generation via GGML or are extremely repetitive and specific about it. (Thinking disabled)
Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)
Applies to other rule following as well in my experience.
Qwen may be better at toolcalling and certainly probably codegen.
It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.
I'm really surprised how much slower a DGX spark is for the same price.
1. Here's my command.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \ vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \ --dtype auto \ --gpu-memory-utilization 0.95 \ --kv-cache-dtype fp8 \ --enable-chunked-prefill \ --enable-prefix-caching \ --trust-remote-code \ --enable-auto-tool-choice \ --tool-call-parser gemma4 \ --reasoning-parser gemma4 \ --max-num-batched 16000 \ --max-model-len 64000 \ --max-num-seqs 12 --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'
You can run multiple instances of these models in parallel on the DGX Spark which somewhat mitigates the difference if your task is parallelizable.
But I'd take the simplicity of a single thread and higher throughput personally.
Overall of course still better to wait for next gen devices if you can.
I was expecting it would run Q8 in 50 tok/s.
I guess that’s good I stopped thinking about buying it because I would be disappointed.
If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.
In fact it really feels like OpenAI models have taken a nose dive this week compared with Claude. At least for my specific workloads (these things are so variable it’s like trying to compare Google results…)
I don't care how many tokens per second of nonsense it can generate.
Well, you aren't going to give it a 20k line sec and have it churn out a full app after 4 hours hours.
But, you can get it to write code for you if you do the design.
It is probably not smart enough for "design this whole architecture of this complex system from scratch, make no mistakes", but that is not something I want from a coding tool anyway. I want a model that I can point to a file and tell it to make some changes to the file and related files. Or that I can ask to review a PR with regards to certain aspects.
My suggestion is to simply try it and see what it feels like.
I find devstral (even though it’s weak generally) much better at writing and documentation than Opus. I’m actually now delegating all documentation to devstral and away from Claude, which makes a mess.
(geddit?)
The carpenter has to get up close and personal with the wood. He can't match the crew's throughput, but maybe that's not what he's trying to do.
you get a macbook for work, you run the macbook
they're not going to start giving GPUs to employees to run local models
Our GPU computer server cost $110k.
1) a "programming desktop" with a $500 upper mid range Ryzen (idr exact), 8GB VRAM Radeon card I bought solely for RuneScape, and 64GB ram
2) a maxed out Alienware 16 Area51, so it's a 5090 with 24GB vram and 64GB system ram. I bought it for gaming, of course.
I run qwen 3.6 35B A3B Q6 with 200k context window. I compare this to Claude pro max or whatever that I use at work.
The main difference between the machines is that the one with the RuneScape gpu does 10 TPS while the Alienware does 30-40tps. Both are fine though the 30-40tps is obviously a lot snappier.
I find with both models that:
- they do really well at "be a 30GB zip file of reddit and stackoverflow answers"
- they do really well at point fixing random bullshit errors that would otherwise waste my time (this is related to above of course)
- they do quite well at, given a pretty good specification of what you want, figuring it out, even if you've specified several steps needed
- they both cannot really be given a large ish task and left to just drive it on their own
The main difference between the two is with that last one, Claude is somewhat better and figuring SOMETHING out, but if Claude is having to figure it out, it's probably because I don't know what I want and it's very likely to not make a sane choice, and will generally produce slop given even the slightest amount of leash still.
I've also found that the boundary between "well specified small to medium thing" and "idk just do thing and figure it out" is the difference between you keeping control of the code and losing control. There's an "escape velocity" of AI use that, when you hit it, you're doomed to slop forever. (Or you have to deorbit... enjoy that).
So, are these models "worse" than Claude? Yeah. Am I looking forward to continued improvements? Yeah. But I now also have no desire to pay anthropic any amount of money, which has the nice side effect that i won't be helping them end up with so much money that they can distort our democracy.
You generally want to run q8 or some kind of "6bit" quantization at least.
40GB of VRAM is the entry-point in my experience, you can run qwen 3.6 35b a3b with full context or qwen 27b with about 92k of context.
Before you get fully discouraged, you don't need 1 gpu with 40GBs you can use multiple cards, with minimum impact on performance.
i use it usecases like that latter and they are fine.