Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint that provides an OpenAI-compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
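For example, building an OpenAI-compatible chat completions request against a local server takes nothing but the standard library. A minimal sketch, assuming llama.cpp's llama-server on its default port 8080 (the model name here is a placeholder; local servers typically ignore the API key, but clients still send one):

```python
import json
import urllib.request

def build_chat_request(base_url, model, messages):
    """Build a request for the standard /v1/chat/completions endpoint."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            # Local servers usually don't check this, but clients send it.
            "Authorization": "Bearer not-needed",
        },
    )

req = build_chat_request(
    "http://localhost:8080",          # llama-server default; adjust to taste
    "local-model",                    # placeholder model name
    [{"role": "user", "content": "hello"}],
)
# Sending it is just urllib.request.urlopen(req) — omitted here since it
# requires a running server.
```

Any harness that lets you override the base URL (and most do) is speaking this same protocol underneath.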

I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.

The model needs to support tool calling, and many of the quantized GGUFs don't, so you have to check.

I've got a workaround for that called Petsitter: it sits as a proxy between the harness and the inference engine and emulates missing capabilities through prompt engineering and various algorithms.

These emulations are abstractly called "tricks", and you can stack them as you please.

https://github.com/day50-dev/Petsitter
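The general shape of one such trick — this is a hypothetical sketch of how tool calling can be emulated over a plain-text model, not Petsitter's actual implementation — is to describe the tools in the prompt and then parse a JSON "tool call" back out of the model's reply:

```python
import json
import re

def tools_to_prompt(tools):
    """Render a tool list into plain-text instructions for the model."""
    lines = [
        "You can call these tools. To call one, reply with a single",
        'JSON object like {"tool": "<name>", "arguments": {...}}.',
    ]
    for t in tools:
        lines.append(f'- {t["name"]}: {t["description"]}')
    return "\n".join(lines)

def parse_tool_call(text):
    """Return (name, arguments) if the reply contains a tool call, else None."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
        return obj["tool"], obj.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError):
        return None

reply = 'Sure. {"tool": "read_file", "arguments": {"path": "main.py"}}'
print(parse_tool_call(reply))  # ('read_file', {'path': 'main.py'})
```

The proxy then executes the parsed call on the harness's behalf and feeds the result back, so the harness sees a model that "supports" tool calling.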

You can run the quantized model on Ollama, put Petsitter in front of it, put the agent harness in front of that, and you're good to go.

If you have trouble, file bugs. Please!

Thank you

edit: just checked, the Ollama version supports everything:

    $ llcat -u http://localhost:11434 -m gemma4:latest --info
    ["completion", "vision", "audio", "tools", "thinking"]
so you can just use that.
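Since --info prints a plain JSON list, a harness-side sanity check is a couple of lines. A hypothetical sketch (not part of llcat; the capability names are taken from the output above, not an official schema):

```python
import json

# In practice this string would come from `llcat ... --info`.
info = '["completion", "vision", "audio", "tools", "thinking"]'
capabilities = json.loads(info)

if "tools" not in capabilities:
    print("model can't do tool calling; put Petsitter in front")
else:
    print("tool calling supported")
```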