I’m just pleased by the competition. I agree with the ideal of free and local, but sustainable competition is key: driving $200/month down to a much, much lower number.
If they release a Qwen 3.6 that also makes good use of the card, I may move to it.
I intend to evaluate all four on some evals I have, as long as I don't get squirrelled again:
- Implement a numerically stable backward pass for layer normalization from scratch in NumPy.
- Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode).
- Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
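For context on the first task: layer norm has a well-known closed form for the backward pass, and the usual "numerically stable" trick is to differentiate through the normalized activations rather than expanding the quotient rule term by term. A minimal NumPy sketch (function names, shapes, and the `eps` default are my own choices, not from any of the models' answers):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize over the last axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * inv_std
    y = gamma * xhat + beta
    cache = (xhat, inv_std, gamma)
    return y, cache

def layernorm_backward(dy, cache):
    xhat, inv_std, gamma = cache
    # Parameter grads sum over the batch axis
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    # Stable form: subtract the two per-row means of the upstream grad
    # instead of materializing d(var)/dx and d(mu)/dx separately
    dxhat = dy * gamma
    dx = inv_std * (dxhat
                    - dxhat.mean(axis=-1, keepdims=True)
                    - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```

A finite-difference check against a random linear loss is the easiest way to grade model answers to this task automatically.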
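And for the KV-cache task, the core idea is small enough to sketch: preallocate key/value buffers up to a max sequence length, append the new tokens' K/V each decode step, and hand attention a view over the valid prefix so nothing is recomputed or copied. A bare-bones single-layer version (shapes and names are illustrative, not from any model's answer):

```python
import numpy as np

class KVCache:
    """Preallocated key/value cache for autoregressive decoding.
    Layout: (batch, heads, max_seq, head_dim)."""

    def __init__(self, batch, heads, max_seq, head_dim, dtype=np.float32):
        self.k = np.zeros((batch, heads, max_seq, head_dim), dtype=dtype)
        self.v = np.zeros((batch, heads, max_seq, head_dim), dtype=dtype)
        self.max_seq = max_seq
        self.len = 0  # number of valid cached positions

    def append(self, k_new, v_new):
        # k_new / v_new: (batch, heads, t, head_dim) for t new tokens
        t = k_new.shape[2]
        if self.len + t > self.max_seq:
            raise ValueError("KV cache overflow")
        self.k[:, :, self.len:self.len + t] = k_new
        self.v[:, :, self.len:self.len + t] = v_new
        self.len += t
        # Return views over the valid prefix; attention reads these
        # each step without any reallocation or copy
        return self.k[:, :, :self.len], self.v[:, :, :self.len]
```

In a real implementation you'd keep one of these per layer and evict or roll the buffer for contexts past `max_seq`, but this is the part most answers get wrong (reallocating/concatenating every step).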
and tested Qwen3.6-27B (IQ4_NL on a 3090) against MiniMax-M2.7 and GLM-5, with kimi k2.6 as the judge (imperfect, I know, it was 2AM). Qwen surpassed MiniMax and won 2/3 of the implementations against GLM-5 according to kimi k2.6, which still sounds insane to me. The env was a pi-mono with basic tools + a websearch tool pointing to my searxng (I don't think any of the models used it), with a slightly customized, shorter system prompt. TurboQuant was at 4-bit during all Qwen tests. Full results: https://github.com/sleepyeldrazi/llm_programming_tests.
I'm also periodically testing small models on a https://www.whichai.dev -style task to see their designs, and Qwen3.6 27B also obliterated (imo) the others I tested: https://github.com/sleepyeldrazi/llm-design-showcase .
Needless to say, these tests are non-exhaustive and have flaws, but the trend from the official benchmarks seems to be confirmed in my testing. If only it were a little faster on my 3090; we'll see how it performs once a DFlash for it drops.
What context size are you using for that?
Btw, are you using flash attention in Ollama for this model? I think it's required for this model to operate ok.
-- Q5_K_M Unsloth quantization on Linux llama.cpp
-- context 81k, flash attention on, 8-bit K/V caches
-- pp (prompt processing) 625 t/s, tg (token generation) 30 t/s
Q8 with the same context wouldn't fit in 48GB of VRAM, while the Q5_K_M did even with 128k of context.
You’re much better off adding a second GPU if you’ve already got a PC you’re using.