undefined

points

[-]

Right. Tokens/s decode isn't the most important thing to me: wall clock time for task completion is. And tracking all of that, on my GB10-based Asus box, Step 3.7 Flash at IQ4_XS beats Qwen 3.6 27B despite the latter having MTP, on all of my actual coding task evaluations in real codebases.

Qwen seems better at one-shotting things based on vague prompts to an acceptable degree, but thats literally not what I use these things for!

One thing if people do play with it, is it seems very very sensitive to quantisation of the K part of the KV cache. F16 K and Q8 V got rid of a lot of the loops that it was otherwise hitting.

There's also a regression in llama.cpp wrt. Step Flash, where quantisation is getting worse KLD and Perplexity than it otherwise was previously, for the exact same quants. Very odd, but it's being looked into at least!