I see that going around, and either the test cases are too simplistic or I'm doing something wrong. I have a server with a 3090 in it, enough to run qwen3.6, but I haven't had much luck using it with either codex or oh-my-pi. They work, but the model gets really slow at ~64k context and the attention degrades quickly. Sometimes you'll execute a prompt, the model will load a test file and then say something like "I was presented with a test file but no command. What should I do with it?".

So yeah, while it's true that qwen3.6 is good for agentic coding, it's not very good for exploring the codebase and coming up with plans. For now you need to pair it with a model capable of ingesting the whole context and producing a detailed plan, and even then the implementation might take 10x as long as it would take Sonnet or Gemini 3 to crunch through the plan.

EDIT:

My setup is really as simple as possible. I run ollama on a remote server on my local network. On my laptop I set OLLAMA_HOST and do `ollama pull qwen3.6:27b`, which then becomes available to the agent harnesses. I'm not sure anymore how I set the context, but I think it was directly in oh-my-pi. So config- and quantization-wise, the server is running the defaults.
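Concretely it's something like this (the server address is a placeholder; the 64k context was set in oh-my-pi as far as I remember, though a Modelfile would be the server-side way to do it):

```
# On the laptop, point the ollama CLI at the server on the local network
export OLLAMA_HOST=http://192.168.1.50:11434   # placeholder address
ollama pull qwen3.6:27b                        # default tag, default quantization

# Server-side alternative for the context window: bake it into a Modelfile
#   FROM qwen3.6:27b
#   PARAMETER num_ctx 65536
# then: ollama create qwen3.6-64k -f Modelfile
```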

reply
This link [1] has some good insight into adapting your usage to smaller models, which require more explicit or deliberate prompting. I have been using Gemma 4 31B a lot and have found it very competent. It can be a bit unstable and start spiraling or end up in infinite loops that you need to reset, but for the most part it's been really good.

[1]: https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-...

reply
Yeah. Context size matters a lot. With OpenCode dumping something like 10k tokens into the system prompt, it takes maybe 4 rounds before it has to compact at, say, 64k. It's not really worth running at anything below 100k, and even then the models aren't all that useful.

They're also pretty terrible at summarization. Pretty much always, some file read or write in the middle of the task would cross the context margin and get marked as completed in the summary. I think leaving the first prompt as well as the last few turns intact would improve this quite a lot, but at low context sizes that's pretty much the whole context ...

reply
You're not sharing what quantization you're using. In my experience, anything below Q8 and less than ~30B tends to be basically useless locally, at least for what you'd typically use codex et al. for; I'm sure it works for very simple prompts.

But as soon as you go below Q8, the models get stuck in repeating loops, get the tool calling syntax wrong, or just start outputting gibberish after a short while.

reply
will do that in an edit to the post
reply
Sure, waiting :)

In the meantime: Ollama seems to default to "Q4_K_M", which is barely usable for anything and really won't work for agentic coding; the quantization level is just too low. Not sure why Ollama defaults to basically unusable quantizations, but that train left a long time ago. They're more interested in people thinking they can run stuff than in flagging things up front, and it's been that way since day 1.
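If you do stay on Ollama for now, at least pull an explicit higher-precision tag instead of the bare name (the exact tag below is a guess, check which variants are actually published for the model):

```
# The bare tag resolves to Q4_K_M; ask for a q8_0 build explicitly
ollama pull qwen3.6:27b-q8_0
```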

reply
Ollama is definitely not the way to go once your interest shifts from "how quickly can I run a new LLM" to "how do I use a local LLM to do things in a remotely optimal way".
reply
I'm currently giving club3090 a try; it seems to have lots of pre-configured setups depending on the workflow. I'm trying vllm first, then llama.cpp.
reply
I can see that, and I don't know your setup, but there are people pushing >70t/s with MTP on a single 3090, and still >50t/s with big contexts. 64k is not a lot for agentic coding, and IIRC 128k with turboquant and the like should be possible for you. r/LocalLLM/ and r/LocalLLaMA/ are worth a visit IMO.

EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090

EDIT-2: this can also shave off a lot of context need for tool calling -> https://github.com/rtk-ai/rtk
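EDIT-3: if you do end up on llama.cpp, a starting point might look roughly like this (the GGUF file name, layer count and context size are placeholders, and flag spellings shift between llama-server releases, so check --help):

```
# Serve an OpenAI-compatible endpoint with all layers offloaded to the 3090
llama-server -m qwen3.6-27b-q8_0.gguf -ngl 99 -c 131072 --port 8080
```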

reply
will give more info in the post

EDIT: thanks for the links!

reply
I see your updated post. Switch over to llama.cpp and look up recommended quants and settings. A good place for this info is /r/localllama.
reply
Yep! I'm currently trying vllm, then I'll give llamacpp a try too
reply
Qwen3.6 supports 266k context out of the box. Try using a q8 KV cache to fit more of it in.
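In llama.cpp terms that's the KV cache type options; q8_0 roughly halves KV memory versus f16, so more of the window fits in 24 GB. A rough sketch with the model file and context size as placeholders (the quantized V cache needs flash attention, and flag spellings vary between builds):

```
llama-server -m qwen3.6-27b-q8_0.gguf -ngl 99 -c 131072 \
  --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
```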
reply
I limited it to 64k, expecting 24 GB of VRAM not to be enough for the entire context window, but I'll try the others' suggestions.
reply
I agree it's not there yet for planning. But I wouldn't be surprised if something in a similar weight class came out that was.
reply
Try oh-my-openagent plan mode.
reply
Vibe coding on consumer hardware is still very limited; this is especially true on GPUs, where VRAM tops out at around 16, maybe 24, GB for the vast majority of cards (although Macs change the equation).

These are two real-world experiments whose results are disappointing for anyone expecting performance comparable to cloud services:

- https://deploy.live/blog/running-local-llms-offline-on-a-ten...

- https://betweentheprompts.com/40000-feet/

The first one even uses the 35b version of qwen3.6.

reply
I don't see how that's disappointing? 95% correct on a laptop, using the 35b model before the right quants came out? And they still got tons of code written for them.

On a real GPU, using the 27b with the latest quants, the experience is better. It's still not the same as Opus running on a subsidized GPU farm. Well, it is better for privacy at least.

I find it interesting how 2 people can read the same thing and come to very different conclusions.

reply
Eh. It is good in terms of results (accuracy, good recommendations and so on), but slow when it comes to actual inference. On a local 128 GB machine, it took over 5 minutes to brainstorm a garage door opening mechanism with some additional restrictions for spice.
reply
I find it hilarious that waiting 5 whole minutes to design software is considered slow to the point of being called not useful. My god lol.

Is that 128 GB of RAM or VRAM?

reply
It's unified memory in this case (Ryzen AI Max), so obviously there is some room for improvement there. Still, I would not dismiss the speed out of hand. Remember, we are trying to argue here that 'it is pretty close already'. In some ways it is. In others, it is not yet.
reply