undefined

points

by rayboy199512 hours ago|

[-]

I'm running Qwen 3.6 27B Q5 K M GGUF on a Tesla P40 and koboldcpp using pi.dev as the harness, I gotta say I am impressed. Took some setup and configuring but I already have some code it has made commited and pushed. It can be slow on my hardware at >50k tokens, but the fact I bought this one P40 for like $150 back when the LLM trend started I can't complain. (I have a second one too but I couldn't physically fit the card in my server unfortunately.)

The setup I had to do was important and I had to compile koboldcpp with a few special params for my hardware, I mostly just had Claude figure it out. I don't remember everything I did now but it was very slow and would often stop mid task, it seems it was mostly a parsing issue. It made the model seem broken/dumb, but once I had all that settled I actually am able to use this how I use Claude Code. Disclaimer, I am pretty explicit with requirements, I imagine this fails more when you leave it to figure out things on its own but for my flow its pretty rad.

Currently setting it up as an automated agent now to pull Trello cards, create PRs for them, and move the card to be reviewed.

Command I am using to run: python koboldcpp.py \ --port 61514 --quiet --multiuser --gpulayers 999 --contextsize 262144 --quantkv 2 \ --usecublas normal --threads 4 --jinja --jinja_tools --jinja_kwargs '{"enable_thinking":true, "preserve_thinking":false}' \ --skiplauncher --model /data/models/Qwen3.6-27B-Q5_K_M.gguf --smartcache 5

by lostmsu10 hours ago|

parent|

[-]

Qwen recommends to preserve_thinking: true for agentic/coding workloads.

by rayboy19958 hours ago|

parent|

[-]

Thanks!! I had disabled that previously while debugging, I can confirm this is helping accuracy from what I can tell so far. (And speed since the cache is preserved more often!)

by satvikpendem6 hours ago|

parent|

[-]

Use the MTP models which 2x token generation speed, for example: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

by rayboy19951 hours ago|

parent|

[-]

Very interesting I'll have to check this out thank you. This is why I love HN.

by vibe4211 hours ago|

prev|

[-]

I'm using the pi-mono coding agent (open source, free) without any extensions and very simple prompts. The 3.6 27B model (BF16, 250k context) uses 67GB VRAM on an RTX PRO 9000.

It's very capable on almost any coding task I've thrown at it, and very good for easy-to-medium hard scripts, new code bases.

It struggles on some complex tasks in larger code bases, e.g. using to debug and fix bugs in llama.cpp it gets close to working code but often introduces errors. For such tasks its still very useful as a search/explore tool and drafting fixes.