This link [1] features some good insight on how to adapt your usage to smaller models which require more explicit or deliberate prompting. I have been using Gemma 4 31B a lot and have found it very competent. It can be a bit unstable and start spiraling or end up in infinite loops that you need to reset, but for the most part it's been really good.

[1]: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...

reply
Yeah. Context size matters a lot. With OpenCode dumping ~10k tokens into the system prompt, it only takes about 4 rounds before it has to compact at, say, 64k. It's not really worth running at anything below 100k, and even then the models aren't all that useful.
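To put rough numbers on that (the tokens-per-round figure is a made-up illustration, not measured; the 10k system prompt is the one mentioned above):

```python
# Rough budget for a 64k-context agent session.
CTX = 64_000
SYSTEM_PROMPT = 10_000       # OpenCode's system prompt, per the comment above
TOKENS_PER_ROUND = 12_000    # assumed: user turn + tool output + response

rounds = (CTX - SYSTEM_PROMPT) // TOKENS_PER_ROUND
print(rounds)  # 4 full rounds before the agent has to compact
```

Bump CTX to 128k with the same assumed round size and you get ~9 rounds, which is why the 100k+ threshold feels like the floor for agentic work.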

They're also pretty terrible at summarization. Almost every time, some file read or write in the middle of the task would cross the context margin, and the summary would then mark the task as completed. I think leaving the first prompt as well as the last few turns intact would improve this quite a lot, but at low context sizes that's pretty much the whole context ...
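A minimal sketch of that idea: keep the first prompt and the last few turns verbatim, and only summarize the middle. `summarize` here is a hypothetical stand-in for whatever the harness actually uses:

```python
def compact(messages, keep_tail=4, summarize=lambda msgs: {
        "role": "system", "content": f"[summary of {len(msgs)} messages]"}):
    """Compact a chat history: first prompt and last `keep_tail` turns
    survive verbatim; only the middle is replaced by a summary."""
    if len(messages) <= 1 + keep_tail:
        return messages  # nothing worth compacting
    head, middle, tail = messages[:1], messages[1:-keep_tail], messages[-keep_tail:]
    return head + [summarize(middle)] + tail

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)
print(len(compacted))  # 1 head + 1 summary + 4 tail = 6
```

The trade-off the comment points at: at 64k total, head + tail can already be most of the window, so there's barely any middle left to summarize away.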

reply
You're not sharing what quantization you're using. In my experience, anything below Q8 and smaller than ~30B tends to be basically useless locally, at least for what you'd typically use codex et al. for. I'm sure it works for very simple prompts.

But as soon as you go below Q8, the models get stuck in repetition loops, get the tool-calling syntax wrong, or just start outputting gibberish after a short while.
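Back-of-the-envelope for why quant choice bites on a typical GPU (bits-per-weight figures are approximate averages; K-quants like Q4_K_M land above a flat 4 bits because some tensors stay at higher precision):

```python
# Approximate weight file size for a 30B-parameter model at different
# quantizations: params * bits_per_weight / 8 bytes, overhead ignored.
PARAMS = 30e9
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}  # approx.

for quant, bpw in BITS_PER_WEIGHT.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant}: ~{gb:.0f} GB")
# A Q8 30B model (~32 GB) already overflows a 24 GB card before you
# budget anything for the KV cache -- which is exactly the squeeze
# that pushes people down to Q4.
```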

reply
will do that in an edit to the post
reply
Sure, waiting :)

In the meantime, Ollama seems to default to "Q4_K_M", which is barely usable for anything and really won't work for agentic coding; the quantization level is just too low. Not sure why Ollama defaults to basically unusable quantizations, but that train left a long time ago. They're more interested in people thinking they can run stuff than in flagging the limitations up front, and have been since day 1.

reply
Ollama is definitely not the way to go once your interest shifts from "how quickly can I run a new LLM" to "how do I use a local LLM to do things in a remotely optimal way".
reply
I'm currently giving club3090 a try; it seems to have lots of pre-configured setups depending on the workflow. I'm trying vllm first, then llama.cpp.
reply
I can see that, and I don't know your setup, but there are people pushing >70 t/s with MTP on a single 3090, and still >50 t/s with big contexts. 64k is not a lot for agentic coding, and IIRC 128k should be possible for you with turboquant and the like. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.

EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090

EDIT-2: this can also shave off a lot of context need for tool calling -> https://github.com/rtk-ai/rtk

reply
will give more info in the post

EDIT: thanks for the links!

reply
I see your updated post. Switch over to llamacpp and look up recommended quants and settings. A good place for this info is on /r/localllama
reply
Yep! I'm currently trying vllm, then I'll give llamacpp a try too
reply
Qwen3.6 supports 266k context out of the box. Try using q8 kv cache to enable more of it.
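With llama.cpp, the q8 KV cache is the `--cache-type-k`/`--cache-type-v` flags. A sketch of the invocation (model path and context size are placeholders, and flag spellings can vary between llama.cpp versions):

```shell
# Serve with an 8-bit quantized KV cache to fit more context in VRAM.
# Flash attention is needed for a quantized V cache in llama.cpp.
llama-server -m ./model.gguf \
  -c 131072 -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0
```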
reply
I limited it to 64k expecting 24GB of VRAM not to be enough for the entire context window, but I'll try the other suggestions.
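For reference, KV cache size scales linearly with context: 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. With made-up but GQA-typical 30B-class numbers (not any specific model):

```python
# Assumed architecture, purely illustrative:
layers, kv_heads, head_dim = 64, 8, 128

def kv_cache_gb(ctx, bytes_per_elem):
    # K and V tensors per layer, per KV head, per position.
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

print(f"fp16 KV @ 64k: {kv_cache_gb(64_000, 2):.1f} GB")
print(f"q8   KV @ 64k: {kv_cache_gb(64_000, 1):.1f} GB")
# Halving the bytes per KV element roughly doubles the context that
# fits in whatever VRAM is left over after the weights.
```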
reply
I agree that for planning it's not there yet. But I wouldn't be surprised if something came out that was competitive in a similar weight class.
reply
Try oh-my-openagent plan mode.
reply