I use llama-server that comes with llama.cpp instead of using ollama. Here are the exact settings I use.
llama-server -ngl 99 -c 192072 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-27B-Q4_K_M.gguf
How did you land on that model? Hard to tell if I should be a) going to 3.5, b) going to fewer parameters, c) going to a different quantization/variant.
I didn't consider those other flags either, cool.
Are you having good luck with any particular harnesses or other tooling?
If you want to keep using the same model, these settings worked for me.
llama-server -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --sleep-idle-seconds 300 -m Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf
For the harness, I use pi (https://pi.dev/). And sometimes, I use the Roo Code plugin for VS Code. (https://roocode.com/)
I prefer simplicity in my tooling, so I can understand them easier. But you might have better luck with other harnesses.
You can keep the KV cache on GPU which means it's pretty damn fast and you should be able to hold a reasonable context window size (on your GPU).
I've had really impressive results locally with this.
I'd strongly recommend cloning llama.cpp locally btw (in wsl2) and asking a frontier model in eg Claude code to set it up for you and tweak it. In my experience the apps that sit on top of llama.cpp don't expose all the options and flags and one wrong flag can mean terrible performance (eg context windows not being cached). If you compile it from source with a coding agent it can look up the actual code when things go wrong.
You should be able to get at least 20-40tok/s on that machine on Gemma 4 which is very usable, probabaly faster on qwen3.6 since it's only 3b active params.
Aside: what is your tooling setup? Which harness you're using (if any), what's running the inference and where, what runs in WSL vs Windows, etc.
I struggle to even ask the right questions about the workflow and environment.
Working well so far.
Grab a recent win-vulkan-x64 build of llama.cpp here: https://github.com/ggml-org/llama.cpp/releases - llama.cpp is the engine used by Ollama and common wisdom is to just use it directly. You can try CUDA as well for a speedup but in my experience Vulkan is most likely to "just work" and is not too far behind in speed.
For best quality, download the biggest version of Qwen 3.5 27B you can fit on your 4090 while still leaving room for context and overhead: https://huggingface.co/unsloth/Qwen3.5-27B-GGUF - I would try the UD-Q5_K_XL but you might have to drop down to Q5_K_S. For best speed, you could use Qwen 3.6 35B-A3B (bigger model but fewer parameters are active per token): https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF - probably the UD-Q4_K_S for this one.
Now you need to make sure the whole model is fitting in VRAM on the 4090 - if anything gets offloaded to system memory it's going to slow way down. You'll want to read the docs here: https://github.com/ggml-org/llama.cpp/tree/master/tools/serv... (and probably random github issues and posts on r/localllama as well), but to get started:
llama-server -m /path/to/above/model/here.gguf --no-mmap --fit on --fit-ctx 20000 --parallel 1
This will spit out a whole bunch of info; for now we want to look just above the dotted line for "load_tensors: offloading n/n layers to GPU" - if fewer than 100% of the layers are on GPU, inference is going to be slower and you probably want to drop down to a smaller version of the model. The "dense" 27B will be slowed more by this than the "mixture-of-experts" 35B-A3B, which has to move fewer weights per token from memory to the GPU.Go to the printed link (localhost:8080 by default) and check that the model seems to be working normally in the default chat interface. Then, you're going to want more context space than 20k tokens, so look at your available VRAM (I think the regular Windows task manager resource monitor will show this) and incrementally increase the fit-ctx target until it's almost full. 100k context is enough for basic coding, but more like 200k would be better. Qwen's max native context length is 262,144. If you want to push this to the limit you can use `--fit-target <amount of memory in MB>` to reduce the free VRAM target to less than the default 1024 - this may slow down the rest of your system though.
Finally, start hooking up coding harnesses (llama-server is providing an OpenAI-compatible API at localhost:8080/v1/ with no password/token). Opencode seems to work pretty reliably, although there's been some controversy about telemetry and such. Zed has a nice GUI but Qwen sometimes has trouble with its tools. Frankly I haven't found an open harness I'm really happy with.
> nothing you can run locally, on that machine anyways, is going to compare with Opus
Definitely not expecting that. Just wanted to find a setup that individuals were content with using a coding harness and a model that is usable locally.
What does your setup look like? Model, harness, etc.