undefined

points

[-]

I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly saying on Vulkan.

The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.

But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you have a long sequence of tool calls with interleaved thinkging, as soon as you had your next turn in the chat, it would have to re-process all of that as it would drop all of the reasoning.

Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, becasue you're not dropping the thinking every turn, but it re-uses the cache better, not causing you to have to reprocess a whole turn at a time each time.

In my models.ini, I have this for the Qwen3.6 models:

  chat-template-kwargs = {"preserve_thinking": true}

There are still occasional issues I hit where it will have to re-process, but getting up to date and enabling preserve_thinking has helped a ton.

by ndom911 hours ago|

parent|

[-]

+1 using llama.cpp Vulkan releases with the Qwen models - runs much better than the ROCm releases.

I'll have to give the preserve_thinking a shot.

by LoganDark58 minutes ago|

prev|

[-]

What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.

I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)