- Implement a numerically stable backward pass for layer normalization from scratch in NumPy.
- Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode).
- Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
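For context, the first task (the layer norm backward) can be sketched roughly like this in NumPy. This is my own reference sketch of what the task asks for, not any model's output; the mean-subtraction form of the input gradient and the cached `inv_std` are choices I made for stability, and the variable names are mine:

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D); normalize over the last axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * inv_std
    y = gamma * xhat + beta
    return y, (xhat, inv_std, gamma)

def layernorm_backward(dy, cache):
    xhat, inv_std, gamma = cache
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dxhat = dy * gamma
    # Subtracting per-row means instead of expanding the full quotient
    # rule avoids large cancelling terms when the variance is small.
    dx = inv_std * (
        dxhat
        - dxhat.mean(axis=-1, keepdims=True)
        - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta
```

A quick sanity check is that `dx` sums to ~0 along the normalized axis (the normalization is invariant to a constant shift per row), plus a finite-difference comparison on a few elements.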
and tested Qwen3.6-27B (IQ4_NL on a 3090) against MiniMax-M2.7 and GLM-5, with kimi k2.6 as the judge (imperfect, I know, it was 2AM). Qwen surpassed MiniMax and won 2/3 of the implementations against GLM-5 according to kimi k2.6, which still sounds insane to me. The env was a pi-mono with basic tools + a websearch tool pointing to my searxng (I don't think any of the models used it), with a slightly customized shorter system prompt. TurboQuant was at 4bit during all Qwen tests. Full results: https://github.com/sleepyeldrazi/llm_programming_tests.
I am also periodically testing small models on a https://www.whichai.dev style task to see their designs, and Qwen3.6-27B also obliterated (imo) the other ones I tested: https://github.com/sleepyeldrazi/llm-design-showcase.
Needless to say, those tests are non-exhaustive and have flaws, but the trend from the official benchmarks looks like it's being confirmed in my testing. If only it were a little faster on my 3090; we'll see how it performs once a DFlash for it drops.