This applies even to OpenAI & Anthropic, who often don't eval on the same datasets.
Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.
Realistically I assume they hope readers don’t notice the fine details.
The Qwen models are great for open weights, but in my experience every past release has performed worse in practice than its benchmarks suggested. They’re optimizing for benchmark numbers because they know it works.
The pool of people reading such articles while ignoring such details can't be big.
On Hacker News I wonder whether most people even open the article at all, most of the time.
e: which itself is a modification of RTFM from Usenet
If they say it's comparable to 4.7, that anchors 4.7 in your head as the model to evaluate against.