So yeah, while it's true that qwen3.6 is good for agentic coding, it's not very good at exploring a codebase and coming up with a plan. For now you need to pair it with a model capable of ingesting the whole context and producing a detailed plan, and even then the implementation might take 10x the time it'd take Sonnet or Gemini 3 to crunch through that plan.
EDIT:
My setup is really as simple as possible. I run ollama on a remote server on my local network. On my laptop I set OLLAMA_HOST and do `ollama pull qwen3.6:27b`, which then becomes available to the agent harnesses. I'm not sure now how I set the context window, but I think it was directly in oh-my-pi. So server config- and quantization-wise, it's all defaults.
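For reference, the same flow can be scripted with the ollama Python client instead of the CLI; the host address below is a placeholder for whatever the remote box's LAN address is:

```python
# Same flow as the CLI above, via the ollama Python client (pip install ollama).
# The host address is a placeholder for the remote server on the local network.
from ollama import Client

client = Client(host="http://192.168.1.50:11434")
client.pull("qwen3.6:27b")  # equivalent to `ollama pull qwen3.6:27b`

resp = client.chat(
    model="qwen3.6:27b",
    messages=[{"role": "user", "content": "say hi"}],
)
print(resp["message"]["content"])
```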
They're also pretty terrible at summarization. Pretty much every time, some file read or write in the middle of the task would fall across the context boundary, and the summary would then mark that step as completed. I think leaving the first prompt as well as the last few turns intact would improve this quite a lot, but at low context sizes that's pretty much the whole context ...
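To make that concrete, here's a minimal sketch of that compaction strategy, assuming a plain list of chat messages; `summarize` is a placeholder stub standing in for whatever the harness actually does:

```python
def summarize(msgs: list[dict]) -> str:
    # Placeholder: a real harness would ask the model to summarize here.
    # This stub just concatenates truncated turns so the sketch runs standalone.
    return " / ".join(m["content"][:80] for m in msgs)

def compact(messages: list[dict], keep_tail: int = 4) -> list[dict]:
    """Keep the first (task) prompt and the last few turns verbatim,
    summarizing only the middle -- the idea suggested above."""
    if len(messages) <= keep_tail + 1:
        return messages  # at low context sizes there's nothing left to squeeze
    head = messages[:1]              # original task prompt, kept verbatim
    middle = messages[1:-keep_tail]  # only this part gets summarized
    tail = messages[-keep_tail:]     # recent turns, kept verbatim
    return head + [{"role": "system", "content": summarize(middle)}] + tail
```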
But as soon as you go below Q8, the models get stuck in repetition loops, get the tool-calling syntax wrong, or just start outputting gibberish after a short while.
Meanwhile, Ollama seems to default to Q4_K_M, which is barely usable for anything and certainly not for agentic coding; the quantization level is just too low. Not sure why Ollama defaults to basically unusable quantizations, but that train left a long time ago. They're more interested in people thinking they can run stuff than in flagging the tradeoff up front, and that's been the case since day 1.
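If you want something above the default, you can usually pull an explicit quant tag instead. Note the exact tag below is an assumption on my part; available tags vary per model, so check the registry page first:

```python
# Pull an explicit quant instead of Ollama's default (q4_k_m).
# The "27b-q8_0" tag is an assumption -- tag names vary per model,
# so check the model's tags page on the registry first.
from ollama import Client

client = Client(host="http://192.168.1.50:11434")
client.pull("qwen3.6:27b-q8_0")  # hypothetical Q8 tag
```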
EDIT: just found this recipe repo, may wanna give it a go: https://github.com/noonghunna/club-3090
EDIT-2: this can also shave off a lot of the context needed for tool calling -> https://github.com/rtk-ai/rtk
EDIT: thanks for the links!
Here are two real-world experiments whose results are disappointing for anyone expecting performance comparable to cloud services:
- https://deploy.live/blog/running-local-llms-offline-on-a-ten...
- https://betweentheprompts.com/40000-feet/
The first one even uses the 35b version of qwen3.6.
On a real GPU, running the 27b with the latest quants, the experience is better. It's still not the same as Opus running on a subsidized GPU farm, but at least it's better for privacy.
I find it interesting how two people can read the same thing and come to very different conclusions.
Is that 128GB of RAM or VRAM?