The marketing departments touting each model do want to claim superiority on the basis of slivers of percentage points, and that's probably always a stronger claim than the test results can reasonably support. And the benchmarks are obviously susceptible to cheating and overfitting. But when the scores aren't saturated and do show a big discrepancy, that kind of result usually seems to align with what people report from actually trying to use the models in the relevant problem space.
but yeah you're correct anyone optimizing for public-bench rank instead of their own task-distribution eval has been pointing at the wrong thing for a while
still I guess useful signal to know which one model to consider, negative signal is still signal, assuming everyone is gaming benchmark in certain ways, lack of performance do result in a real workload effect
That being said, they didn't audit the other 72.4%, right? So it's likely that there are way more flawed problems throughout the full set?