Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

upvote

Running Gemma 4 locally with LM Studio's new headless CLI and Claude Code

(ai.georgeliu.com)

384 points

by vbtechguy1 days ago |

upvote

by d4rkp4ttern8 hours ago|

[-]

You can use llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I’ve collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:

https://pchalasani.github.io/claude-code-tools/integrations/...

The 26BA4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5 35BA3B. However the tau2 bench results[1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don’t expect the former to do well on heavy agentic tool-heavy tasks:

[1] https://news.ycombinator.com/item?id=47616761

reply

upvote

by peder7 hours ago|

[-]

Did you have any Anthropic vs OpenAI specification issues with Claude Code? I have been using mlx_vlm and vMLX and I get 400 Bad Request errors from Claude Code. Presumably you're not seeing those issues with llama-server ?

reply

upvote

by d4rkp4ttern4 hours ago|

[-]

Correct, no issues because since at least a few months, llama.cpp/server exposes an Anthropic messages API at v1/messages, in addition to the OpenAI-compatible API at v1/chat/completions. Claude Code uses the former.

reply

upvote

by selectodude5 hours ago|

[-]

I’ve jumped over to oMLX. A ton of rough edges but I think it’s the future.

reply

upvote

by vlowther3 hours ago|

[-]

Same. Opencode + oMLX (0.3.4) + unsloth-Qwen3-Coder-Next-mlx-8bit on my M5 Max w 128GB is the sweet spot for me locally. The prompt decode caching keeps things coherent and fast even when contexts get north of 100k tokens.

reply

upvote

by tatrions6 hours ago|

[-]

[flagged]

reply

upvote

by seifbenayed199212 hours ago|

[-]

Local models are finally starting to feel pleasant instead of just "possible." The headless LM Studio flow is especially nice because it makes local inference usable from real tools instead of as a demo.

Related note from someone building in this space: I've been working on cloclo (https://www.npmjs.com/package/cloclo), an open-source coding agent CLI, and this is exactly the direction I'm excited about. It natively supports LM Studio, Ollama, vLLM, Jan, and llama.cpp as providers alongside cloud models, so you can swap between local and hosted backends without changing how you work.

Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.

reply

upvote

by SeriousM10 hours ago|

[-]

How does cloclo differ from pi-mono?

reply

upvote

by seifbenayed19923 hours ago|

[-]

pi-mono is a great toolkit — coding agent CLI, unified LLM API, web UI, Slack bot, vLLM pods.

cloclo is a runtime for agent toolkits. You plug it into your own agents and it gives them multi-agent orchestration (AICL protocol), 13 providers, skill registry, native browser/docs/phone tools, memory, and an NDJSON bridge. Zero native deps.

reply

upvote

by hackerman7000012 hours ago|

[-]

The real story here isn't Gemma 4 specifically, it's that the harness and the model are now fully decoupled. Claude Code, OpenCode, Pi, Codex all work with any backend. The coding agent is becoming a commodity layer and the competition is moving to model quality and cost. Good for users, bad for anyone whose moat was the harness

reply

upvote

by satvikpendem5 hours ago|

[-]

Sounds like the exact opposite, models are being commoditized while the harness and tooling around a model is what actually gets significant gains, especially with RL around specific models.

For example, this article was posted recently, Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed [0].

[0] https://news.ycombinator.com/item?id=46988596

reply

upvote

by bckr5 hours ago|

[-]

I think it’s ALL getting commoditized. The winners here are engineers (who are onboard with the agentic surge) and, hopefully, users who get more and better software.

reply

upvote

by vkou1 hours ago|

[-]

> hopefully, users who get more and better software.

Users are definitely going to get more software and more features and redesigns in the software they use, but I have strong doubts that it's going to get better.

If pre-LLM developer productivity was used to build all sorts of deranged anti-user promo-padding bullshit, imagine how much more of it we can do with a 2x more productive employee base.

reply

upvote

by chappyasel4 hours ago|

[-]

[dead]

reply

upvote

by Havoc4 hours ago|

[-]

You could always point Claude Code and open code at a local http endpoint

reply

upvote

by jeremie_strand10 hours ago|

[-]

[dead]

reply

upvote

by trvz1 days ago|

[-]

  ollama launch claude --model gemma4:26b

reply

upvote

by gcampos19 hours ago|

[-]

You need to increase the context window size or the tool calling feature wont work

reply

upvote

by mil2218 hours ago|

[-]

For those wondering how to do this:

  OLLAMA_CONTEXT_LENGTH=64000 ollama serve

or if you're using the app, open the Ollama app's Settings dialog and adjust there.

Codex also works:

  ollama launch codex --model gemma4:26b

reply

upvote

by datadrivenangel1 days ago|

[-]

It's amazing how simple this is, and it just works if you have ollama and claude installed!

reply

upvote

by pshirshov23 hours ago|

[-]

For some reason, that doesn't work for me, claude never returns from some ill loop. Nemotron, glm and qwen 3.5 work just fine, gemma - doesn't.

reply

upvote

by trvz22 hours ago|

[-]

Since that defaults to the q4 variant, try the q8 one:

  ollama launch claude --model gemma4:26b-a4b-it-q8_0

reply

upvote

by pshirshov21 hours ago|

[-]

Even tried gemma4:31b and gemma4:31b with 128k context (I have 72GiB VRAM). Nothing. I'm cursed I guess. That's ollama-rocm if that matters (I had weird bugs on Vulkan, maybe gemma misbehaves on radeons somehow?..).

UPD: tried ollama-vulkan. It works, gemma4:31b-it-q8_0 with 64k context!

reply

upvote

by alfiedotwtf12 hours ago|

[-]

The default context is 128k for the smaller Gemma 4’s and 256k for the bigger ones, so you’re cutting off context and it doesn’t know how to continue.

Bump it to native (or -c 0 may work too)

reply

upvote

by pshirshov9 hours ago|

[-]

In that case the model descriptor on ollama.com is incorrect, because it defaults to 16k. So I have to manually change that to 64/128k. I think you are talking about maximum context size.

reply

upvote

by trvz8 hours ago|

[-]

No, the default context in Ollama varies by the memory available: https://docs.ollama.com/context-length

reply

upvote

by martinald1 days ago|

[-]

Just FYI, MoE doesn't really save (V)RAM. You still need all weights loaded in memory, it just means you consult less per forward pass. So it improves tok/s but not vram usage.

reply

upvote

by functional_dev4 hours ago|

[-]

This confused me at first as well.. inactive experts skip compute, but weights are sill loaded. So memory does not shrink at all.

I found this visualisation helpful - https://vectree.io/c/sparse-activation-patterns-and-memory-e...

reply

upvote

by IceWreck1 days ago|

[-]

It does if you use an inference engine where you can offload some of the experts from VRAM to CPU RAM. That means I can fit a 35 billion param MoE in let's say 12 GB VRAM GPU + 16 gigs of memory.

reply

upvote

by Yukonv22 hours ago|

[-]

With that you are taking a significant performance penalty and become severely I/O bottlenecked. I've been able to stream Qwen3.5-397B-A17B from my M5 Max (12 GB/s SSD Read) using the Flash MoE technique at the brisk pace of 10 tokens per second. As tokens are generated different experts need to be consulted resulting in a lot of I/O churn. So while feasible it's only great for batch jobs not interactive usage.

reply

upvote

by IceWreck21 hours ago|

[-]

> So while feasible it's only great for batch jobs not interactive usage.

I mean yeah true but depends on how big the model is. The example I gave (Qwen 3.5 35BA3B) was fitting a 35B Q4 K_M (say 20 GB in size) model in 12 GB VRAM. With a 4070Ti + high speed 32 GB DDR5 ram you can easily get 700 token/sec prompt processing and 55-60 token/sec generation which is quite fast.

On the other hand if I try to fit a 120B model in 96 GB of DDR5 + the same 12 GB VRAM I get 2-5 token/sec generation.

reply

upvote

by zozbot23421 hours ago|

[-]

Your 120B model likely has way more active parameters, so it can probably only fit a few shared layers in the VRAM for your dGPU. You might be better off running that model on a unified memory platform, slower VRAM but a lot more of it.

reply

upvote

by zozbot23421 hours ago|

[-]

10 tok/s is quite fine for chatting, though less so for interaction with agentic workloads. So the technique itself is still worthwhile for running a huge model locally.

reply

upvote

by charcircuit23 hours ago|

[-]

You never need to have all weights in memory. You can swap them in from RAM, disk, the network, etc. MOE reduces the amount of data that will need to be swapped in for the next forward pass.

reply

upvote

by martinald22 hours ago|

[-]

Yes you're right technically, but in reality you'd be swapping them the (vast?) majority in and out per inference request so would create an enormous bottleneck for the use case the author is using for.

reply

upvote

by zozbot23421 hours ago|

[-]

With unified memory, reading from RAM to GPU compute buffer is not that painful, and you can use partial RAM caching to minimize the impact of other kinds of swapping.

reply

upvote

by mikkupikku9 hours ago|

[-]

In practical terms, is this kind of architecture available to consumers except through Apple?

reply

upvote

by the_pwner2248 hours ago|

[-]

AMD Strix Halo. Available in the Framework desktop, various mini PCs, and the Asus Rog Flow Z13 "gaming tablet." The Z13 is still at $2700 for 128 GB which is an incredible deal with today's RAM prices.

There's also the Nvidia DGX Spark.

reply

upvote

by charcircuit19 hours ago|

[-]

You don't have to only have the experts being actively used in VRAM. You can load as many weights as will fit. If there is a "cache miss" you have to pay the price to swap in the weights, but if there is a hit you don't.

reply

upvote

by vbtechguy1 days ago|

[-]

Here is how I set up Gemma 4 26B for local inference on macOS that can be used with Claude Code.

reply

upvote

by canyon2891 days ago|

[-]

This is a nice writeup!

reply

upvote

by edinetdb18 hours ago|

[-]

Claude Code has become my primary interface for iterating on data pipeline work — specifically, normalizing government regulatory filings (XBRL across three different accounting standards) and exposing them via REST and MCP.

The MCP piece is where the workflow gets interesting. Instead of building a client that calls endpoints, you describe tools declaratively and the model decides when to invoke them. For financial data this is surprisingly effective — a query like "compare this company's leverage trend to sector peers over 10 years" gets decomposed automatically into the right sequence of tool calls without you hardcoding that logic.

One thing I haven't seen discussed much: tool latency sensitivity is much higher in conversational MCP use than in batch pipelines. A 2s tool response feels fine in a script but breaks conversational flow. We ended up caching frequently accessed tables in-memory (~26MB) to get sub-100ms responses. Have you noticed similar thresholds where latency starts affecting the quality of the model's reasoning chain?

reply

upvote

by mjlee8 hours ago|

[-]

I find MCP beneficial too, but do be aware of token usage. With a naive implementation MCP can use significantly more input tokens (and context) than equivalent skills would. With a handful of third party MCPs I’ve seen tens of thousands of tokens used before I’ve started anything.

Here’s an article from Anthropic explaining why, but it is 5 months old so perhaps it's irrelevant ancient history at this point.

https://www.anthropic.com/engineering/code-execution-with-mc...

reply

upvote

by chappyasel4 hours ago|

[-]

[dead]

reply

upvote

by tatrions16 hours ago|

[-]

[flagged]

reply

upvote

by drob5185 hours ago|

[-]

Seems like this might be a great way to do web software testing. We’ve had Selenium and Puppeteer for a long time but they are a bit brittle with respect to the web design. Change something about the design and there’s a high likelihood that a test will break. Seems like this might be able to be smarter about adapting to changes. That’s also a great use for a smaller model like this.

reply

upvote

by robot_jesus1 hours ago|

[-]

Yeah. I think that's an interesting use case. Especially if I can kick it off or schedule it when I'm not actively working. Inference speed (especially with tool calling involved) won't be great on my machines, but if I schedule nightly usability tests of dev sites while I sleep, that could be really cool.

reply

upvote

by drob51855 minutes ago|

[-]

You’re right about inference speed being a concern. I was assuming it’s a small model but even then, one of the browser automation frameworks is going to be faster.

reply

upvote

by ttul5 hours ago|

[-]

I could see a future in which the major AI labs run a local LLM to offload much of the computational effort currently undertaken in the cloud, leaving the heavy lifting to cloud-hosted models and the easier stuff for local inference.

reply

upvote

by dominotw5 hours ago|

[-]

wouldnt that be counter to their whole business model?

reply

upvote

by ttul4 hours ago|

[-]

I don't think so. Acquiring hardware for inference is a chokepoint on growth. If they can offload some inference to the customer's machine, that allows them to use more of their online capacity to generate money.

reply

upvote

by jonplackett1 days ago|

[-]

So wait what is the interaction between Gemma and Claude?

reply

upvote

by unsnap_biceps1 days ago|

[-]

lm studio offers an Anthropic compatible local endpoint, so you can point Claude code at it and it'll use your local model for it's requests, however, I've had a lot of problems with LM Studio and Claude code losing it's place. It'll think for awhile, come up with a plan, start to do it and then just halt in the middle. I'll ask it to continue and it'll do a small change and get stuck again.

Using ollama's api doesn't have the same issue, so I've stuck to using ollama for local development work.

reply

upvote

by keerthiko1 days ago|

[-]

Claude Code is fairly notoriously token inefficient as far as coding agent/harnesses go (i come from aider pre-CC). It's only viable because the Max subscriptions give you approximately unlimited token budget, which resets in a few hours even if you hit the limit. But this also only works because cloud models have massive token windows (1M tokens on opus right now) which is a bit difficult to make happen locally with the VRAM needed.

And if you somehow managed to open up a big enough VRAM playground, the open weights models are not quite as good at wrangling such large context windows (even opus is hardly capable) without basically getting confused about what they were doing before they finish parsing it.

reply

upvote

by unsnap_biceps1 days ago|

[-]

I use CC at work, so I haven't explored other options. Is there a better one to use locally? I presumed they were all going to be pretty similar.

reply

upvote

by jaggederest23 hours ago|

[-]

If you want to experiment with same-harness-different-models Opencode is classically the one to use. After their recent kerfluffle with Anthropic you'll have to use API pricing for opus/sonnet/haiku which makes it kind of a non-starter, but it lets you swap out any number of cloud or local models using e.g. ollama or z.ai or whatever backend provider you like.

I'd rate their coding agent harness as slightly to significantly less capable than claude code, but it also plays better with alternate models.

reply

upvote

by blitzar22 hours ago|

[-]

I am hopeful the leaked claude code narrows the capability, perhaps even googles offering will be viable once they borrow some ideas from claude.

reply

upvote

by andhuman13 hours ago|

[-]

I have good experience with Mistral Vibe.

reply

upvote

by satvikpendem18 hours ago|

[-]

OpenCode

reply

upvote

by storus1 days ago|

[-]

Can't you use Claude caveman mode?

https://github.com/JuliusBrussee/caveman

reply

upvote

by aplomb102622 hours ago|

[-]

[dead]

reply

upvote

by tatrions16 hours ago|

[-]

[flagged]

reply

upvote

by mbesto22 hours ago|

[-]

I don't get why I would use Claude Code when OpenCode, Cursor, Zed, etc. all exist, are "free" and work with virtually any llm. Seems like a weird use case unless I'm missing something.

reply

upvote

by superb_dev21 hours ago|

[-]

From my experience, Claude Code is just better. Although I recently started using Zed and it’s pretty good

reply

upvote

by blitzar22 hours ago|

[-]

previously I have found claude code to be just better than the alternatives, using large models or local. It is, however, closer now and not much excuse for the competition after the claude code leak. Personally, I will be giving this a go with OpenCode.

reply

upvote

by panagathon16 hours ago|

[-]

> I don't get why I would use Claude Code when OpenCode, Cursor, Zed, etc. all exist, are "free" and work with virtually any llm. Seems like a weird use case unless I'm missing something.

I'm with you on this. I've tried Gemma and Claude code and it's not good. Forgets it can use bash!

However, Gemma running locally with Pi as the harness is a beast.

reply

upvote

by bdangubic22 hours ago|

[-]

this is like asking why use intellij or vscode or … when there is vim and emacs

reply

upvote

by NamlchakKhandro20 hours ago|

[-]

No it's more like, why use a Microsoft paid for distro of nvim when lazyvim, astronvim exist

reply

upvote

by asymmetric22 hours ago|

[-]

Is a framework desktop with >48GB of RAM a good machine to try this out?

reply

upvote

by pshirshov20 hours ago|

[-]

Only for chat sessions, not for agentic coding. It's just too slow to be practical (10 minutes to answer a simple question about a 2k LoC project - and that's with a 5070 addon card).

reply

upvote

by ac2913 hours ago|

[-]

This article is about a MoE model with only 4B active parameters, it shouldn't take 10 minutes to answer a question about a small project.

I measured a 4bit quant of this model at 1300t/s prefill and ~60t/s decode on Ryzen 395+.

reply

upvote

by nl16 hours ago|

[-]

Doesn't the framework desktop have a Ryzen 395 AI? That's a unified memory architecture like the Macs.

reply

upvote

by pshirshov6 hours ago|

[-]

Ah, forgot to add, it's not really "unified" you have to explicitly specify your allocations. You may have a reasonably good 48gb chunk assigned to the GPU, but that DDR5 is 5-10 times slower than GDDR/HBM and the GPU itself isn't stellar.

So, framework laptops are great for chatting but nearly useless in agentic coding.

My Radeon W7900 answers a question ("what is this project") in 2 minutes, it takes my Framework 16 with 5070 addon around 11 minutes without the addon - around 23 (qwen 3.5 27b, claude code)

reply

upvote

by pshirshov9 hours ago|

[-]

That's discrete DDR5, it's not as fast as your regular VRAM.

reply

upvote

by Imanari6 hours ago|

[-]

How well do the Gemma 4 models perform on agentic coding? What are your impressions?

reply

upvote

by janalsncm12 hours ago|

[-]

Qwen3-coder has been better for coding in my experience and has similar sizes. Either way, after a bunch of frustration with the quality and price of CC lately I’m happy there are local options.

reply

upvote

by AbuAssar11 hours ago|

[-]

omlx gives better performance than ollama on apple silicon

reply

upvote

by Someone12341 days ago|

[-]

Using Claude Code seems like a popular frontend currently, I wonder how long until Anthropic releases an update to make it a little to a lot less turn-key? They've been very clear that they aren't exactly champions of this stuff being used outside of very specific ways.

reply

upvote

by nerdix23 hours ago|

[-]

I don't think there is any incentive to do so right now because the open models aren't as good. The vast majority of businesses are going to just pay the extra cost for access to a frontier model. The model is what gives them a competitive advantage, not the harness. The harness is a lot easier to replicate than Opus.

There are benefits too. Some developers might learn to use Claude Code outside of work with cheaper models and then advocate for using Claude Code at work (where their companies will just buy access from Anthropic, Bedrock, etc). Similar to how free ESXi licenses for personal use helped infrastructure folks gain skills with that product which created a healthy supply of labor and VMware evangelists that were eager to spread the gospel. Anthropic can't just give away access to Claude models because of cost so there is use in allowing alternative ways for developers to learn how to use Claude Code and develop a workflow with it.

reply

upvote

by deskamess20 hours ago|

[-]

Are the Claude Code (desktop) models very different from what Bedrock has? I thought you could hook up VSCode (not Claude Desktop) to Bedrock Anthropic models. Are there features in Claude Desktop that are not in VSCode/cli?

reply

upvote

by chvid1 days ago|

[-]

Is it not about the same as using OpenCode?

And is running a local model with Claude Code actually usable for any practical work compared to the hosted Anthropic models?

reply

upvote

by falcor8422 hours ago|

[-]

Well, if they did, it would probably be shooting themselves in the foot, seeing that the Claude Code source is out there now, and people are waiting for an excuse to "clean-room" reimplement and fork it

reply

upvote

by moomin1 days ago|

[-]

Right now it suits them down to the ground. You pay for the product and you don’t cost their servers anything.

reply

upvote

by phainopepla21 days ago|

[-]

You don't pay anything to use Claude Code as a front end to non-Anthropic models

reply

upvote

by quinnjh1 days ago|

[-]

so no subscription is needed?

reply

upvote

by kenmacd20 hours ago|

[-]

not to use the cli tool. You can install it and change the settings to point to pretty much any other model.

It's an okay-enough tool, but I don't see a lot of point in using it when open sources tools like Pi and OpenCode exist (or octofriend, or forge, or droid, etc).

reply

upvote

by alfiedotwtf12 hours ago|

[-]

Yet Codex specifically aims out to be compatible with all backends! Up until Gemma 4 though it’s been pretty solid, but totally fails with unknown tool (I’m guessing a template issue)

reply

upvote

by wyre23 hours ago|

[-]

I think CC is popular because they are catering to the common denominator programmer and are going to continue to do that, not because CC is particularly turn-key.

reply

upvote

by jedisct15 hours ago|

[-]

Running Gemma 4 with llama.cpp and Swival:

$ llama-server --reasoning auto --fit on -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64

$ uvx swival --provider llamacpp

Done.

reply

upvote

by aetherspawn20 hours ago|

[-]

Can you use the smaller Gemma 4B model as speculative decoding for the larger 31B model?

Why/why not?

reply

upvote

by MeetRickAI20 hours ago|

[-]

[dead]

reply

upvote

by tiku6 hours ago|

[-]

I hate that my M5 with 24 gb has so much trouble with these models. Not getting any good speeds, even with simple models.

reply

upvote

by meidad_g7 hours ago|

[-]

[dead]

reply

upvote

by techpulselab20 hours ago|

[-]

[dead]

reply

upvote

by meidad_g23 hours ago|

[-]

[dead]

reply

upvote

by maxbeech10 hours ago|

[-]

[dead]

reply

upvote

by aplomb102620 hours ago|

[-]

[dead]

reply

upvote

by inzlab19 hours ago|

[-]

awesome, the lighter the hardware running big softwares the more novelty.

reply

upvote

by NamlchakKhandro20 hours ago|

[-]

I don't know why people bother with Claude code.

It's so jank, there are far superior cli coding harness out there

reply

upvote

by loveparade20 hours ago|

[-]

What do you recommend? I've tried both pi and opencode and both are better than claude imo, but I wonder if there are others.

reply

upvote

by tarruda19 hours ago|

[-]

Codex is the best out-of-box experience, especially due to its builtin sandboxing. Only drawback is that its edit tool requires the LLM to output a diff which only GPTs are trained to do correctly.

reply

upvote

by loveparade19 hours ago|

[-]

Interesting, I don't like codex exactly because of its built-in sandboxing. If I need a sandbox I rather do a simple bwrap myself around the agent process, I prefer that over the agent cli doing a bunch of sandboxing magic that gets in my way.

reply

upvote

by prettyblocks18 hours ago|

[-]

how is codex sandbox different from /sandbox on claude code?

reply

upvote

by dimgl19 hours ago|

[-]

Vagueposting in Hacker News?

reply

upvote

by z0mghii20 hours ago|

[-]

Can you elaborate what is jank about it?

reply

upvote

by threethirtytwo11 hours ago|

[-]

it has visual artifacts when inferencing.

reply

upvote

by smcleod9 hours ago|

[-]

Did you try the MLX model instead? In general MLX tends provide much better performance than GGUF/Llama.cpp on macOS.

reply