This link [1] features some good insight on how to adapt your usage to smaller models which require more explicit or deliberate prompting. I have been using Gemma 4 31B a lot and have found it very competent. It can be a bit unstable and start spiraling or end up in infinite loops that you need to reset, but for the most part it's been really good.

[1]: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...

reply
Yeah. Context size matters a lot. With OpenCode dumping ~10k tokens into the system prompt, it only takes about 4 rounds before it has to compact at, say, 64k. It's not really worth running at anything below 100k, and even then the models aren't all that useful.
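To put rough numbers on that (the tokens-per-round figure is a made-up illustration, not measured; the 10k system prompt is the one mentioned above):

```python
# Rough budget for a 64k-context agent session.
CTX = 64_000
SYSTEM_PROMPT = 10_000       # OpenCode's system prompt, per the comment above
TOKENS_PER_ROUND = 12_000    # assumed: user turn + tool output + response

rounds = (CTX - SYSTEM_PROMPT) // TOKENS_PER_ROUND
print(rounds)  # 4 full rounds before the agent has to compact
```

Bump CTX to 128k with the same assumed round size and you get ~9 rounds, which is why the 100k+ threshold feels like the floor for agentic work.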

They're also pretty terrible at summarization. Almost every time, some file read or write in the middle of the task would cross the context margin, and the summary would then mark the task as completed. I think leaving the first prompt as well as the last few turns intact would improve this quite a lot, but at low context sizes that's pretty much the whole context ...
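A minimal sketch of that idea: keep the first prompt and the last few turns verbatim, and only summarize the middle. `summarize` here is a hypothetical stand-in for whatever the harness actually uses:

```python
def compact(messages, keep_tail=4, summarize=lambda msgs: {
        "role": "system", "content": f"[summary of {len(msgs)} messages]"}):
    """Compact a chat history: first prompt and last `keep_tail` turns
    survive verbatim; only the middle is replaced by a summary."""
    if len(messages) <= 1 + keep_tail:
        return messages  # nothing worth compacting
    head, middle, tail = messages[:1], messages[1:-keep_tail], messages[-keep_tail:]
    return head + [summarize(middle)] + tail

history = [{"role": "user", "content": f"turn {i}"} for i in range(10)]
compacted = compact(history)
print(len(compacted))  # 1 head + 1 summary + 4 tail = 6
```

The trade-off the comment points at: at 64k total, head + tail can already be most of the window, so there's barely any middle left to summarize away.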

reply
You're not sharing what quantization you're using. In my experience, anything below Q8 and smaller than ~30B tends to be basically useless locally, at least for what you'd typically use codex et al. for. I'm sure it works for very simple prompts.

But as soon as you go below Q8, the models get stuck in repetition loops, get the tool-calling syntax wrong, or just start outputting gibberish after a short while.
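Back-of-the-envelope for why quant choice bites on a typical GPU (bits-per-weight figures are approximate averages; K-quants like Q4_K_M land above a flat 4 bits because some tensors stay at higher precision):

```python
# Approximate weight file size for a 30B-parameter model at different
# quantizations: params * bits_per_weight / 8 bytes, overhead ignored.
PARAMS = 30e9
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8}  # approx.

for quant, bpw in BITS_PER_WEIGHT.items():
    gb = PARAMS * bpw / 8 / 1e9
    print(f"{quant}: ~{gb:.0f} GB")
# A Q8 30B model (~32 GB) already overflows a 24 GB card before you
# budget anything for the KV cache -- which is exactly the squeeze
# that pushes people down to Q4.
```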

reply
will do that in an edit to the post
reply
Sure, waiting :)

In the meantime, Ollama seems to default to "Q4_K_M", which is barely usable for anything and really won't work for agentic coding; the quantization level is just too low. Not sure why Ollama defaults to basically unusable quantizations, but that train left a long time ago. They're more interested in people thinking they can run stuff than in flagging the limitations up front, and have been since day 1.

reply
Ollama is definitely not the way to go once your interest shifts from "how quickly can I run a new LLM" to "how do I use a local LLM to do things in a remotely optimal way".
reply
I'm currently giving club3090 a try; it seems to have lots of pre-configured setups depending on the workflow. I'm trying vllm first, then llama.cpp.
reply
I can see that, and I don't know your setup, but there are people pushing >70 t/s with MTP on a single 3090, and still >50 t/s with big contexts. 64k is not a lot for agentic coding, and IIRC 128k should be possible for you with turboquant and the like. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.

EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090

EDIT-2: this can also shave off a lot of context need for tool calling -> https://github.com/rtk-ai/rtk

reply
will give more info in the post

EDIT: thanks for the links!

reply
I see your updated post. Switch over to llamacpp and look up recommended quants and settings. A good place for this info is on /r/localllama
reply
Yep! I'm currently trying vllm, then I'll give llamacpp a try too
reply
Qwen3.6 supports 266k context out of the box. Try using q8 kv cache to enable more of it.
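With llama.cpp, the q8 KV cache is the `--cache-type-k`/`--cache-type-v` flags. A sketch of the invocation (model path and context size are placeholders, and flag spellings can vary between llama.cpp versions):

```shell
# Serve with an 8-bit quantized KV cache to fit more context in VRAM.
# Flash attention is needed for a quantized V cache in llama.cpp.
llama-server -m ./model.gguf \
  -c 131072 -ngl 99 -fa \
  --cache-type-k q8_0 --cache-type-v q8_0
```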
reply
I limited it to 64k expecting 24GB of VRAM not to be enough for the entire context window, but I'll try the other suggestions.
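For reference, KV cache size scales linearly with context: 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. With made-up but GQA-typical 30B-class numbers (not any specific model):

```python
# Assumed architecture, purely illustrative:
layers, kv_heads, head_dim = 64, 8, 128

def kv_cache_gb(ctx, bytes_per_elem):
    # K and V tensors per layer, per KV head, per position.
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1e9

print(f"fp16 KV @ 64k: {kv_cache_gb(64_000, 2):.1f} GB")
print(f"q8   KV @ 64k: {kv_cache_gb(64_000, 1):.1f} GB")
# Halving the bytes per KV element roughly doubles the context that
# fits in whatever VRAM is left over after the weights.
```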
reply
I agree that for planning it's not there yet. But I wouldn't be surprised if something came out that was competitive in a similar weight class.
reply
Try oh-my-openagent plan mode.
reply