https://pchalasani.github.io/claude-code-tools/integrations/...
The 26BA4B is the most interesting to run on such hardware, and I get nearly double the token-generation speed (40 tok/s) compared to Qwen3.5 35BA3B. However, the tau2-bench results[1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don't expect the former to do well on tool-heavy agentic tasks:
Related note from someone building in this space: I've been working on cloclo (https://www.npmjs.com/package/cloclo), an open-source coding agent CLI, and this is exactly the direction I'm excited about. It natively supports LM Studio, Ollama, vLLM, Jan, and llama.cpp as providers alongside cloud models, so you can swap between local and hosted backends without changing how you work.
Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.
cloclo is a runtime for agent toolkits. You plug it into your own agents and it gives them multi-agent orchestration (AICL protocol), 13 providers, skill registry, native browser/docs/phone tools, memory, and an NDJSON bridge. Zero native deps.
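For readers unfamiliar with the NDJSON-bridge pattern mentioned above: the idea is simply one JSON object per line over stdio, which keeps the bridge dependency-free. A minimal generic sketch, assuming a toy "type"/"echo" message shape (this is illustrative, not cloclo's actual wire format):

```python
import json
import sys

def handle(msg: dict) -> dict:
    # Illustrative echo handler; a real agent would dispatch on msg["type"].
    return {"type": "ack", "echo": msg.get("type")}

def serve(stream_in, stream_out, handler=handle):
    # NDJSON bridge loop: read one JSON object per line, write one reply per line.
    for line in stream_in:
        line = line.strip()
        if not line:
            continue
        reply = handler(json.loads(line))
        stream_out.write(json.dumps(reply) + "\n")
        stream_out.flush()

if __name__ == "__main__":
    serve(sys.stdin, sys.stdout)
```

Because each message is a self-delimiting line, the bridge needs no framing protocol and works over plain pipes.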
For example, this article was posted recently: "Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed" [0].
Users are definitely going to get more software and more features and redesigns in the software they use, but I have strong doubts that it's going to get better.
If pre-LLM developer productivity was used to build all sorts of deranged anti-user promo-padding bullshit, imagine how much more of it we can do with a 2x more productive employee base.
ollama launch claude --model gemma4:26b

OLLAMA_CONTEXT_LENGTH=64000 ollama serve
or, if you're using the app, open the Ollama app's Settings dialog and adjust it there. Codex also works:
ollama launch codex --model gemma4:26b

ollama launch claude --model gemma4:26b-a4b-it-q8_0

UPD: tried ollama-vulkan. It works: gemma4:31b-it-q8_0 with 64k context!
Bump it to native (or -c 0 may work too)
I found this visualisation helpful - https://vectree.io/c/sparse-activation-patterns-and-memory-e...
I mean, yeah, true, but it depends on how big the model is. The example I gave (Qwen3.5 35BA3B) fits a 35B Q4_K_M model (roughly 20 GB) in 12 GB VRAM. With a 4070 Ti plus fast 32 GB DDR5 RAM you can easily get 700 tokens/sec prompt processing and 55-60 tokens/sec generation, which is quite fast.
On the other hand, if I try to fit a 120B model in 96 GB of DDR5 plus the same 12 GB VRAM, I get 2-5 tokens/sec generation.
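As a rough sanity check on numbers like these: decode on a bandwidth-bound setup is approximately memory bandwidth divided by the bytes read per token (active parameters × bytes per weight), ignoring KV cache and any GPU offload. A back-of-envelope sketch with assumed, plausible figures (~80 GB/s dual-channel DDR5, 3B active params, ~0.56 bytes/weight for a 4-bit quant):

```python
def est_decode_tps(bandwidth_gbs: float, active_params_b: float,
                   bytes_per_weight: float) -> float:
    # Each generated token streams every active weight once, so
    # tokens/sec ≈ bandwidth / (active params × bytes per weight).
    weight_bytes_gb = active_params_b * bytes_per_weight
    return bandwidth_gbs / weight_bytes_gb

# ~48 tok/s from system RAM alone; offloading some experts to the GPU
# pushes this toward the 55-60 tok/s reported above.
print(round(est_decode_tps(80, 3, 0.56), 1))
```

The same formula explains the 120B case: with far more active weights streaming through the same DDR5 bus, throughput collapses to a few tokens per second.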
There's also the Nvidia DGX Spark.
The MCP piece is where the workflow gets interesting. Instead of building a client that calls endpoints, you describe tools declaratively and the model decides when to invoke them. For financial data this is surprisingly effective — a query like "compare this company's leverage trend to sector peers over 10 years" gets decomposed automatically into the right sequence of tool calls without you hardcoding that logic.
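Concretely, a declarative tool in this style is just a name, a natural-language description, and a JSON schema; the model plans its call sequence against that metadata rather than against hardcoded client logic. A minimal illustration (the field names follow the MCP tool shape; the finance tool itself is a hypothetical stand-in, and the validator is a toy required-field check, not full JSON Schema validation):

```python
# Declared as data, not wired as an endpoint call: the model reads the
# description and schema, then decides when and how to invoke the tool.
LEVERAGE_TOOL = {
    "name": "get_leverage_ratio",
    "description": "Return debt-to-equity for a ticker and fiscal year.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string"},
            "year": {"type": "integer"},
        },
        "required": ["ticker", "year"],
    },
}

def validate_args(tool: dict, args: dict) -> bool:
    # Toy check: every required field must be present.
    return all(k in args for k in tool["inputSchema"]["required"])

print(validate_args(LEVERAGE_TOOL, {"ticker": "ACME", "year": 2015}))
```

A multi-step query like the peer-comparison example then becomes the model repeatedly selecting among such declarations, with the server only ever validating and executing individual calls.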
One thing I haven't seen discussed much: tool latency sensitivity is much higher in conversational MCP use than in batch pipelines. A 2s tool response feels fine in a script but breaks conversational flow. We ended up caching frequently accessed tables in-memory (~26MB) to get sub-100ms responses. Have you noticed similar thresholds where latency starts affecting the quality of the model's reasoning chain?
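The in-memory caching described above can be as simple as a TTL dict in front of the slow table loader; a sketch under assumed parameters (the loader, the 5-minute TTL, and the function names are illustrative, not the commenter's actual code):

```python
import time

_CACHE: dict = {}
TTL_SECONDS = 300  # assumed freshness window for cached tables

def cached_table(name: str, load, now=time.monotonic):
    # Serve from memory while fresh (sub-ms), otherwise hit the slow loader
    # and remember the result with its fetch timestamp.
    hit = _CACHE.get(name)
    if hit is not None and now() - hit[0] < TTL_SECONDS:
        return hit[1]
    table = load(name)
    _CACHE[name] = (now(), table)
    return table
```

For ~26 MB of hot tables this trades a trivial amount of RAM for keeping every tool response inside the conversational latency budget.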
Here’s an article from Anthropic explaining why, though it's 5 months old, so perhaps it's irrelevant ancient history at this point.
https://www.anthropic.com/engineering/code-execution-with-mc...
Using Ollama's API doesn't have the same issue, so I've stuck with Ollama for local development work.
And even if you somehow manage to open up a big enough VRAM playground, the open-weights models are not quite as good at wrangling such large context windows (even Opus is hardly capable) without basically getting confused about what they were doing before they finish parsing it.
I'd rate their coding agent harness as slightly to significantly less capable than claude code, but it also plays better with alternate models.
I'm with you on this. I've tried Gemma with Claude Code and it's not good. It forgets it can use bash!
However, Gemma running locally with Pi as the harness is a beast.
I measured a 4bit quant of this model at 1300t/s prefill and ~60t/s decode on Ryzen 395+.
So Framework laptops are great for chatting but nearly useless for agentic coding.
My Radeon W7900 answers a question ("what is this project?") in 2 minutes; my Framework 16 with the 5070 add-on takes around 11 minutes, and without the add-on around 23 (Qwen 3.5 27B, Claude Code).
There are benefits too. Some developers might learn to use Claude Code outside of work with cheaper models and then advocate for it at work (where their companies will just buy access from Anthropic, Bedrock, etc.). It's similar to how free ESXi licenses for personal use helped infrastructure folks gain skills with that product, which created a healthy supply of labor and VMware evangelists eager to spread the gospel. Anthropic can't just give away access to Claude models because of the cost, so there is value in allowing alternative ways for developers to learn Claude Code and develop a workflow with it.
And is running a local model with Claude Code actually usable for any practical work compared to the hosted Anthropic models?
It's an okay-enough tool, but I don't see a lot of point in using it when open-source tools like Pi and OpenCode exist (or octofriend, or forge, or droid, etc.).
$ llama-server --reasoning auto --fit on -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64
$ uvx swival --provider llamacpp
Done.
Why/why not?
It's so jank; there are far superior CLI coding harnesses out there.