$uname -r
6.8.0-107-generic
$ollama --version
ollama version is 0.20.2
$ollama run "gemma4:31b" --verbose "write fizzbuzz in python."
[...]
total duration: 45.141599637s
load duration: 143.633498ms
prompt eval count: 21 token(s)
prompt eval duration: 48.047609ms
prompt eval rate: 437.07 tokens/s
eval count: 1057 token(s)
eval duration: 44.676612241s
eval rate: 23.66 tokens/s

The model currently loaded full time for all workloads on this machine is Unsloth's Q3_K_M quant of Qwen 3.5 122b, which has 10b active parameters. With almost no context usage it will generate 59 tok/sec. At 10,000 input tokens it will prefill at about 1,500 tok/sec and generate at 51 tok/sec. At 110,000 input tokens it will prefill at about 950 tok/sec and generate at 30 tok/sec.
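As a sanity check, the rates in the `--verbose` footer above are just counts divided by durations; a minimal sketch reproducing them from the reported numbers:

```python
# Reproduce ollama's --verbose rates: rate = token count / duration.
# Numbers are taken from the transcript above.
prompt_count, prompt_duration_s = 21, 0.048047609
eval_count, eval_duration_s = 1057, 44.676612241

prompt_rate = prompt_count / prompt_duration_s
eval_rate = eval_count / eval_duration_s
print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")  # ~437.07
print(f"eval rate: {eval_rate:.2f} tokens/s")           # ~23.66
```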
Smaller MoE models with 3b active parameters will push 70 tok/sec at 10,000 context. Dense models like Qwen 3.5 27b and Devstral Small 2 at 24b will only generate around 13-15 tok/sec at 10,000 context.
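These decode rates line up with the usual rule of thumb that decode is bandwidth-bound: each generated token streams the active weights once. A back-of-envelope estimate for the 59 tok/sec short-context figure above, assuming Q3_K_M averages roughly 3.9 bits/weight (an approximation; llama.cpp quant sizes vary per tensor):

```python
# Rough effective-bandwidth estimate for MoE decode: active weights are
# read once per generated token. bits_per_weight is an assumed average
# for Q3_K_M, not an exact figure.
active_params = 10e9      # 10b active parameters (from the post)
bits_per_weight = 3.9     # assumed Q3_K_M average
tok_per_s = 59            # short-context decode rate reported above

bytes_per_token = active_params * bits_per_weight / 8
bandwidth_gb_s = bytes_per_token * tok_per_s / 1e9
print(f"~{bandwidth_gb_s:.0f} GB/s of weight streaming")  # ~288 GB/s
```

The same arithmetic explains why 3b-active MoE models decode faster and 24-27b dense models decode slower: the per-token read scales with active, not total, parameters.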
This is all on llama.cpp with the Vulkan backend. I didn't get too far in testing or using anything that requires ROCm because of a ROCm bug where the GPU clock stays pinned at 100% (drawing around 60 watts) even when the model is not processing anything. The issue has since been closed, but multiple commenters report it is still a problem. Using the Vulkan backend, my per-card idle draw is between 1 and 2 watts with the display outputs shut down and no kernel frame buffer.
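For reference, per-card draw figures like these can be read from the amdgpu hwmon interface, which reports power in microwatts (the hwmon index varies by system, and some cards expose `power1_input` instead of `power1_average`; treat this as a sketch):

```python
from pathlib import Path

def read_gpu_power_watts(hwmon_dir: str) -> float:
    """Read a card's power draw from hwmon; values are in microwatts.
    hwmon_dir is e.g. a /sys/class/hwmon/hwmonN directory for amdgpu
    (some cards use power1_input rather than power1_average)."""
    raw = Path(hwmon_dir, "power1_average").read_text().strip()
    return int(raw) / 1_000_000  # microwatts -> watts

# A reading of 1500000 uW corresponds to 1.5 W idle draw.
```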