It’s no surprise that models keep leapfrogging each other when engineers can incorporate better code examples and reasoning traces, which in turn produce higher-quality outputs.
That's just, like, your opinion, man.
> You really can't compare a model that's got trillions of parameters to a 27B one.
Parameter count doesn't matter much when coding. You don't need in-depth general knowledge or multilingual support in a coding model.
Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it.
Impressive for the size, though!
If you can't afford to do that, look at a lot of them. E.g. artificialanalysis.com merges multiple benchmarks across weighted categories into an Intelligence Score, a Coding Score and an Agentic Score.
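To make the "weighted categories" idea concrete, here is a minimal sketch of a weighted-mean composite. This is not the site's actual formula, which isn't described here. The benchmark names, weights and scores below are all made up for illustration.

```python
from typing import Mapping


def composite_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted mean of benchmark scores, using only benchmarks that have a score."""
    shared = [name for name in weights if name in scores]
    if not shared:
        raise ValueError("no overlapping benchmarks between scores and weights")
    total_weight = sum(weights[name] for name in shared)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in shared) / total_weight


# Hypothetical benchmark results on a 0-100 scale (not real data).
scores = {"bench_a": 60.0, "bench_b": 40.0, "bench_c": 80.0}

# Hypothetical weighting for a "coding" composite: bench_a counts double.
coding_weights = {"bench_a": 2.0, "bench_b": 1.0, "bench_c": 1.0}

coding_score = composite_score(scores, coding_weights)
print(f"coding composite: {coding_score:.1f}")
```

The point of averaging across several benchmarks is that a model tuned for one of them moves the composite less than it moves that single number.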
GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%.
Gemini Flash was just as good as Pro for most tasks with good prompts, tools, and context. Gemma 4 was nearly as good as Flash, and Qwen 3.6 appears to be even better.
What matters is the motion in the tokens
But when actually employed to write code, they fall over as soon as they leave that specific domain.
Basically, they might have skill but lack wisdom. Certainly at this size they won't have anywhere close to the same contextual knowledge.
Still, these things could be useful in the context of more specialized tooling, in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews the results.
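A rough sketch of that last setup, assuming a larger model plans and reviews while the small model does the individual steps. Everything here is hypothetical: `run_with_supervisor`, the `Model` type and the stub functions are stand-ins for real model calls, not any particular framework's API.

```python
from typing import Callable, List

# A "model" here is just a function from prompt text to response text.
Model = Callable[[str], str]


def run_with_supervisor(
    task: str,
    planner: Model,
    worker: Model,
    reviewer: Callable[[str, str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Large model plans and reviews; small model executes one step at a time."""
    steps = [line.strip() for line in planner(task).splitlines() if line.strip()]
    results: List[str] = []
    for step in steps:
        prompt = step
        for _ in range(max_retries + 1):
            output = worker(prompt)
            if reviewer(step, output):
                results.append(output)
                break
            # Rejected: retry with a nudge appended to the original step.
            prompt = f"{step}\nThe previous attempt was rejected; try again."
        else:
            raise RuntimeError(f"worker could not complete step: {step!r}")
    return results


# Stubs standing in for real model calls (assumptions, for illustration only).
def stub_planner(task: str) -> str:
    return "write the function\nwrite the tests"


def stub_worker(prompt: str) -> str:
    return "done: " + prompt.splitlines()[0]


def stub_reviewer(step: str, output: str) -> bool:
    return output.startswith("done: ")


results = run_with_supervisor("add a slugify helper", stub_planner, stub_worker, stub_reviewer)
print(results)
```

The small model never sees the whole task, only one narrow step plus feedback, which is where it is least likely to fall over.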