They're saying:
1. A large number of the tests are inaccurate, so correct solutions will be marked as incorrect.
2. Frontier models have already read and memorized the PRs the problems are based on.
3. In fact, many problems are essentially impossible to get right if you haven't memorized the solution: for example, the test cases will fail if you didn't happen to expose a helper function with a specific name (see the sketch just after this list). That name isn't mentioned in the problem, but frontier models pass that test anyway because they remember that such a helper function is necessary.
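To make point 3 concrete, here is a minimal sketch of that failure mode; the package, helper, and test names are hypothetical, not taken from any real SWE-bench task:

    # Hidden test shipped with a benchmark task (hypothetical example).
    # It imports a helper by name even though the issue text never mentions it.
    from somepackage.utils import _normalize_path  # name only knowable from the original PR

    def test_collapses_relative_segments():
        assert _normalize_path("a/./b/../c") == "a/c"

A patch that fixes the reported bug without defining a helper named "_normalize_path" fails at import time, while a model that has memorized the original PR "knows" to add a helper with exactly that name.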
If the next stage of benchmarks doesn't address these issues, it'll have the same problems, saturated or not.
But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions".
0.191 * 0.594 > 1 - 0.936
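Working that out with the figures from the quote (0.936 being the score implied by the "1 - 0.936" above): the audit puts a lower bound of roughly 11.3% on the share of all problems with flawed tests, while a 93.6% score means the model only fails 6.4% of problems.

    # Sanity check of the inequality above, using only the quoted figures.
    audited_share   = 0.191   # audited subset as a share of all problems
    flawed_in_audit = 0.594   # share of audited problems with flawed tests
    pass_rate       = 0.936   # the score implied by "1 - 0.936"

    flawed_lower_bound = audited_share * flawed_in_audit   # ~0.113 of all problems
    failure_rate       = 1 - pass_rate                     # 0.064 of all problems

    assert flawed_lower_bound > failure_rate
    # So the model "passes" at least ~5% of problems whose tests were judged to
    # reject functionally correct submissions -- unless the audited subset
    # overstates the flaw rate. Hence the question below.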
Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high scores through some shady means?
You can’t trust that a model that scores 93% is better at software engineering than a model that scores 90%, because at that point it’s impossible to distinguish between recall and reasoning.
40% vs 90%? Sure.
70% vs 90%? _Absolutely meaningless_, as you are not measuring coding intelligence but “how well can the model exploit flaws in SWEBench Verified”. The former can certainly be better at coding even assuming no deliberate benchmaxxing / foul play.
But how do you know whether the model was over-optimized for it or is just really good?
It would be interesting to see a deeper investigation into how the models are dealing with this and whether the successful ones appear to have been trained on the benchmark.
SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.
I don't have the solution; I'm just noticing the pattern.
However, both kinds of tests are susceptible to over-fitting: an LLM can be trained on the exact test questions, and a CPU can be designed with, e.g., branch predictors and cache sizes tuned specifically to handle a particular benchmark or workload.
Both that and the SPEC compiler shenanigans are cheating by changing the test, not just over-specializing the product being benchmarked.
An industry-standard benchmark shouldn't be hosted or designed by a lab producing the models, regardless.
But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.
Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.