They're saying:
1. A large number of the tests are inaccurate, so correct solutions will be marked as incorrect.
2. Frontier models have already read and memorized the PRs the problems are based on.
3. In fact, many problems are essentially impossible to get right if you haven't memorized the solution: for example, the test cases will fail if you didn't happen to expose a helper function with a specific name (see the sketch just after this list). That name isn't mentioned in the problem, but frontier models pass that test anyway because they remember that such a helper function is necessary.
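To make point 3 concrete, here is a minimal sketch of that failure mode; the package, helper, and test names are hypothetical, not taken from any real SWE-bench task:

    # Hidden test shipped with a benchmark task (hypothetical example).
    # It imports a helper by name even though the issue text never mentions it.
    from somepackage.utils import _normalize_path  # name only knowable from the original PR

    def test_collapses_relative_segments():
        assert _normalize_path("a/./b/../c") == "a/c"

A patch that fixes the reported bug without defining a helper named "_normalize_path" fails at import time, while a model that has memorized the original PR "knows" to add a helper with exactly that name.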
If the next stage of benchmarks doesn't address these issues, it'll have the same problems, saturated or not.
But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions".
0.191 * 0.594 > 1 - 0.936
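Working that out with the figures from the quote (0.936 being the score implied by the "1 - 0.936" above): the audit puts a lower bound of roughly 11.3% on the share of all problems with flawed tests, while a 93.6% score means the model only fails 6.4% of problems.

    # Sanity check of the inequality above, using only the quoted figures.
    audited_share   = 0.191   # audited subset as a share of all problems
    flawed_in_audit = 0.594   # share of audited problems with flawed tests
    pass_rate       = 0.936   # the score implied by "1 - 0.936"

    flawed_lower_bound = audited_share * flawed_in_audit   # ~0.113 of all problems
    failure_rate       = 1 - pass_rate                     # 0.064 of all problems

    assert flawed_lower_bound > failure_rate
    # So the model "passes" at least ~5% of problems whose tests were judged to
    # reject functionally correct submissions -- unless the audited subset
    # overstates the flaw rate. Hence the question below.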
Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high scores through some shady means?
You can’t trust that a model that scores 93% is better at software engineering than a model that scores 90%, because at that point it’s impossible to distinguish between recall and reasoning.
40% vs 90%? Sure.
70% vs 90%? _Absolutely meaningless_, as you are not measuring coding intelligence but “how well can the model exploit flaws in SWEBench Verified”. The former can certainly be better at coding even assuming no deliberate benchmaxxing / foul play.
But how do you know whether the model was over-optimized for it or is just really good?
It would be interesting to see a deeper investigation into how the models are dealing with this and whether the successful ones appear to have been trained on the benchmark.
SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.
I don't have the solution; I'm just noticing the pattern.
However, both kinds of tests are susceptible to over-fitting: an LLM can be trained on the exact test questions, and a CPU can be designed with, e.g., branch predictors and cache sizes tuned specifically to handle a particular benchmark or workload.
Both that and the SPEC compiler shenanigans are cheating by changing the test, not just over-specializing the product being benchmarked.
An industry-standard benchmark shouldn't be hosted or designed by a lab producing the models, regardless.
But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.
Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.