I'm getting around 45 tps on a single R9700 for Q6 27B with build b9811 (using https://github.com/kyuz0/amd-r9700-ai-toolboxes) with the following parameters:

llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K -c 135000 -ngl 999 -np 2 -t 16 --temp 0.0 --top-p 0.95 --top-k 20 --min-p 0.00 -b 4096 -ub 4096 --chat-template-kwargs '{"preserve_thinking": true}' -fa 1 --spec-type draft-mtp --spec-draft-n-max 2

by ThunderSizzle 6 hours ago

I'll give 27B-MTP a try. I think I can tolerate 45 tps if the results are technically better. 35B is pretty good, but it definitely shows its limitations at times (probably due to either the heavy cache quantization I'm doing or the heavy model quantization compared to what 2 GPUs could run).

My biggest gripe is that both pi and opencode seem to have trouble parsing the thinking blocks at times, and the model sometimes cuts off mid-thinking or prints out weird character tokens. I don't know if that's because of llama.cpp, pi/opencode, Qwen3.6, or some weird combination of them all, as I haven't investigated the problem fully yet.
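
One way to start narrowing this down is to capture the raw text the server returns and check it before pi or opencode ever sees it. The sketch below is only that, a rough diagnostic under assumptions: it assumes the model wraps its reasoning in `<think>` and `</think>` tags (the convention used by Qwen3-family chat templates, not verified for this exact model), and `diagnose_thinking` is a hypothetical helper name, not part of llama.cpp or either client. It counts opening and closing tags to spot a cut-off thinking block, and counts U+FFFD replacement characters, which usually point to broken UTF-8 decoding rather than to the model itself.

```python
# Hypothetical helper: inspect raw model output for thinking-block problems.
# Assumption: reasoning is delimited by <think> ... </think> tags.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def diagnose_thinking(raw: str) -> dict:
    """Return simple counts that hint at where a parsing failure comes from."""
    opens = raw.count(THINK_OPEN)
    closes = raw.count(THINK_CLOSE)
    return {
        "opens": opens,
        "closes": closes,
        # More opens than closes: generation likely stopped mid-thinking.
        "truncated": opens > closes,
        # More closes than opens: the opening tag was probably consumed
        # upstream (for example by the chat template or the server).
        "stray_close": closes > opens,
        # U+FFFD appears when a multi-byte character was decoded in pieces.
        "replacement_chars": raw.count("\ufffd"),
    }


if __name__ == "__main__":
    sample = "<think>checking the build flags"
    print(diagnose_thinking(sample))
```

If the raw output already shows `truncated` or a nonzero `replacement_chars`, the problem is upstream of the client; if the raw output looks clean, the client's parser is the more likely suspect.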