1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.
2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated.
3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :)
They're saying:
1. A large number of the tests are inaccurate, so correct solutions will be marked as incorrect.
2. Frontier models have already read and memorized the PRs the problems are based on.
3. In fact, many problems are essentially impossible to get right if you haven't memorized the solution: for example, the test cases will fail if you didn't happen to expose a helper function with a specific name. That name isn't mentioned in the problem, but frontier models are passing that test anyway because they remember that such a helper function is necessary.
If the next stage of benchmarks doesn't address these issues, they'll continue to have the same problems, saturated or not.
But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"
0.191 * 0.594 ≈ 0.113 > 0.064 = 1 - 0.936
Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high scores through some shady means?
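A rough back-of-envelope check of those numbers (a sketch using only the figures quoted above, which I haven't verified independently):

```python
# All figures come from the quote above; nothing here is independently verified.
hard_subset_frac = 0.191       # problems models often fail, as a share of the full set
flawed_in_audit = 0.594        # audited hard problems found to have flawed tests
claimed_fail_rate = 1 - 0.936  # share of problems the top model reportedly misses

# If flawed tests only existed in that audited hard subset, at least this
# fraction of ALL problems would reject a functionally correct patch:
min_flawed_overall = hard_subset_frac * flawed_in_audit
print(round(min_flawed_overall, 3), round(claimed_fail_rate, 3))  # 0.113 vs 0.064

# A model that never reproduces the original reference patch could then score
# at most ~88.7%, below the reported ~93.6% -- hence the question above.
```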
But how do you know whether the model was over-optimized for it or just really good?
You can’t trust that a model that scores 93% is better at software engineering than a model that scores 90%, because at that point it’s impossible to distinguish between recall and reasoning.
40% vs 90%? Sure.
70% vs 90%? _Absolutely meaningless_, as you are not measuring coding intelligence but “how well can the model exploit flaws in SWE-bench Verified”; the former can certainly be better at coding even assuming no deliberate benchmaxxing / foul play.
It would be interesting to see a deeper investigation into how the models are dealing with this and whether the successful ones appear to have been trained on the benchmark.
SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.
I don't have the solution, just noticing the pattern.
However, both kinds of tests are susceptible to over-fitting: an LLM can be trained on the exact test questions, and a CPU can be designed with e.g. branch predictors and cache sizes tuned specifically to handle a particular benchmark or workload.
Both that and the SPEC compiler shenanigans are cheating by changing the test, not just over-specializing the product being benchmarked.
An industry-standard benchmark shouldn't be hosted or designed by a lab producing the models, regardless.
But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.
Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.
The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed not to already be in the training data, and that don't borrow anything from previous benchmarks.
In this regard I don't think any benchmark that was created before a given model is released should ever be considered valid or representative of model performance. The potential financial gain from including the data just to be able to market a minor improvement is too tempting. With that in mind, they should honestly just stop including benchmarks altogether in marketing material.
Let the model speak for itself and let the community decide, but of course that will never fly with corporate types with so much money on the line.
The LLMs I have tested have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.
[1]: https://entropicthoughts.com/updated-llm-benchmark
(more descriptions available in earlier evaluations referenced from there)
If you're doing an RPG, which I guess is where this is most obvious, you track the player and enemy positions, their health, their moods and perhaps top thoughts, and the state of important inanimate objects. If you break down the door, you update the door's state in the document. This is in contrast to just giving the LLM the previous turns and hoping it realizes the door is broken down later (just by statistical completion).
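For what it's worth, a minimal sketch of that kind of explicit state document (the structure and field names are my own illustration, not from any particular framework):

```python
# Minimal sketch of explicit state tracking; structure and names are illustrative.
game_state = {
    "player": {"position": (3, 4), "hp": 12, "mood": "wary"},
    "goblin": {"position": (5, 4), "hp": 7, "mood": "aggressive"},
    "objects": {"cellar_door": {"state": "intact"}},
}

def apply_action(state: dict, action: str) -> dict:
    """Update the state document directly instead of hoping the model re-infers it."""
    if action == "break down cellar door":
        state["objects"]["cellar_door"]["state"] = "broken"
    return state

# Each turn, the updated document is fed back to the LLM as context, rather than
# relying on it to notice from old turns that the door is already broken down.
apply_action(game_state, "break down cellar door")
```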
It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.
Which community are we talking about? The professionals with 10+ years of experience using LLMs, the vibe coders that have no experience writing code, or everyone in between? If you read some of the online communities, the experiences with the models are all over the place: some compare GPT 5.5 to the second coming of JC while others think it's stupider than 5.4.
I personally don't have time to build a set of private benchmarks to compare the models that are coming out so I'm mostly relying on private and semi-private benchmarks to get a feel for how models are improving before I subscribe to a service and start using it myself. At least it's something a bit more reliable than the vibes of random people and bots on reddit.
These benchmarks are always greenfield, but people want a model that can deal with a rotted context.
The only real way to evaluate a model is to test it yourself but that's exhausting for each new model and not comprehensive anyway.
I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three versions, I suddenly find that older model to be dramatically worse. But was that older model always like that, or am I now being served a different model under the same version name?
It's all just so opaque.
Regarding evaluation, I've found that tools like promptfoo (and in some cases custom tools built on top of it) are useful. These help when evaluating new models/versions and when modifying the system prompt to guide the model, especially if you can define visualizations and assertions to accurately test what you are trying to achieve.
This can be difficult for tasks like summarization, code generation, or creative writing that don't have clear answers. Though having some basic evaluation metrics and test cases can still be useful, as can being able to easily do side-by-side comparisons by hand.
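As a rough sketch of the idea (this is not promptfoo's actual config format, just the same assertion-style pattern in plain Python; `call_model` and the test cases are placeholders I made up):

```python
# Assertion-style eval sketch; call_model() is a placeholder for your provider SDK
# and the test cases below are invented for illustration.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider here")

TEST_CASES = [
    {"prompt": "Summarize our refund policy: ...", "must_contain": ["refund", "30 days"]},
    {"prompt": "Write a SQL query that totals sales per region ...", "must_contain": ["GROUP BY"]},
]

def run_suite(model: str) -> float:
    passed = 0
    for case in TEST_CASES:
        output = call_model(model, case["prompt"])
        if all(s.lower() in output.lower() for s in case["must_contain"]):
            passed += 1
    return passed / len(TEST_CASES)

# Side-by-side comparison when a new model version or system prompt lands:
# for m in ("old-model", "new-model"): print(m, run_suite(m))
```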
What you actually want to measure on these models is what they can SEE in production. Context shape, retrieval quality, tool use, ability to compose state across turns. None of that is in SWE-bench because SWE-bench is shaped like a one-shot problem set and frontier coding work isn't shaped like that anymore.
Even a perfectly contamination-free benchmark would mostly test the wrong axis. The model is already at human-grad-student level on isolated problems. The leverage is in how it operates inside a larger system. And that's almost like, a taste/preference issue, and virtually impossible to objectively measure.
Obligatory XKCD: https://xkcd.com/937/
Use frontier LLMs to help create the harness and identify problems, but put in the effort to ensure your verifier is actually good and robust.
Then you have your own private benchmark, which makes new model releases a breeze instead of purely vibes or contaminated public benchmarks.
For extra props, add things you care about, such as reliability (e.g. deliberate noise injection, simple typo introduction in problems, variants, running each test multiple times).
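A sketch of what the reliability part could look like (typo injection plus repeated runs; `solve` and `check` stand in for your own harness and verifier):

```python
import random

# Perturb each problem slightly and run it several times, so flaky or
# memorization-dependent passes show up as variance. solve()/check() are
# placeholders for your own harness and verifier.
def inject_typos(text: str, rate: float = 0.02, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def reliability_score(problem: str, solve, check, runs: int = 5) -> float:
    results = [check(solve(inject_typos(problem, seed=s))) for s in range(runs)]
    return sum(results) / runs  # 1.0 = solved consistently, regardless of noise
```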
At the end of the day, however, the best LLM is the one you’re the most productive with. Frontier intelligence might be the main factor, but it's far from the only one:
• How fast is it in the real world? How well does it understand your general style of prompting / guidance?
• How consistent and reliable is it? Does it exhibit laziness, or hallucinate having performed actions (and say it did) that it never actually performed?
• etc.
By determining whether a model gets better or not on a given benchmark, OpenAI selects models against benchmarks, implicitly using them in training.
In the end all it does is affirm what you're saying though. Benchmarks are essentially obsolete the moment they become recognized. I suppose it's just another iteration of Goodhart's Law.
As long as there's a test framework, you could gauge success deterministically.
ELT-Bench is another recent example. It was the first serious attempt at a benchmark for data engineering workloads, published about a year ago.
A few days ago, a follow-up paper from a group that includes one of the original authors audited the benchmark itself. The team found that the benchmark has structural issues that biased results.
Here’s the paper: https://arxiv.org/abs/2603.29399
None of this is new though; the industry has gone through all of that before, just at a smaller scale, and there’s a lot to learn from it. Here’s a post I wrote on the parallels between what we see today and the benchmarketing wars of the database systems era.
https://www.typedef.ai/blog/from-benchmarketing-to-benchmaxx...
You need new datasets perpetually.
I do have empirical experience, though, building classifiers whose precision can't be measured because the classifier invariably performs better than humans. They become the state-of-the-art benchmark themselves and can’t be benchmarked except against themselves. These are for tasks that are non-trivial and complex, but less logical than coding and with less sustained reasoning. There may come a day, though, when there is no calibrated benchmark that is independent of the models it’s measuring.
10 groups of 3 researchers, each with their own benchmarks that they do not share (testing without the authors knowing is a different problem; maybe they only run the benchmarks once the general population has access to the models).
That's 10 different tests. Aggregate the pass rates.
Many SWE-bench passing PRs would not be merged: https://news.ycombinator.com/item?id=47341645
Top model SWE bench scores may be skewed by git history leaks: https://news.ycombinator.com/item?id=45214670
Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy.
Leaderboard: https://arcprize.org/leaderboard
(Most premier models don't even pass 5 percent.)
Arc AGI seems to test that as well. Every game is a rectangular grid to make it as easy as possible yet the AIs still fail.
I'm fairly certain the way forward isn't through agents directly interfacing with UIs but through agents using scripts and other tools to interact with the interface. That's why harnesses are so critical to performance on tasks like this.
I would like a version of Arc AGI that tests the agent's ability to dynamically create these harnesses.
Meanwhile AI agents are expected to guess pixels and fail each time.
It's not a crazy idea. Have the older model interview the newer one and then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of times both sides agree the newer model won is the score.
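A sketch of that protocol, just to make the scoring concrete (`ask` is a placeholder for an actual API call, and the prompts are invented):

```python
# Sketch of the interview-and-referee scoring described above.
# ask() is a placeholder for a real API call; prompts are invented.
def ask(model: str, prompt: str, seed: int) -> str:
    raise NotImplementedError("wire up your provider here")

def new_model_wins(old: str, new: str, seed: int) -> bool:
    # Single-call stand-in for a full multi-turn interview.
    transcript = ask(old, f"Interview the other model and probe its reasoning. (seed={seed})", seed)
    old_verdict = ask(old, f"Based on this interview, which model is smarter?\n{transcript}", seed)
    new_verdict = ask(new, f"Based on this interview, which model is smarter?\n{transcript}", seed)
    return "newer" in old_verdict.lower() and "newer" in new_verdict.lower()

def score(old: str, new: str, rounds: int = 100) -> float:
    wins = sum(new_model_wins(old, new, seed) for seed in range(rounds))
    return wins / rounds  # fraction of rounds where both sides agree the newer model won
```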
Hehe
That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.
Opus otoh is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on gpt 5.5 to check its work.
The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will be after looking at this.
Congrats to your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!
Your comment makes it sound like they are miles apart, which the benchmark doesn't seem to support.
Edit: I looked at the data more and the two models are only basically equal when looking at the mean of all the tests. Gpt 5.5 significantly outperforms opus 4.7 in coding, while opus 4.7 significantly outperforms in "decision making." I'm not seeing details on what decision making explicitly means.
I'm not being a hater, I love Opus for different reasons, but I can't rely on it for its technical ability.
Because GPT 5.5 just launched and those games take longer to accumulate data for, it just doesn't have enough samples yet. It will end up with a wider lead on Opus, I am sure. Coding evals always have large sample sizes on day 1. Good find, we should probably better adjust the weighting here for decision games with low match counts.
Matches with my experience with Opus for C++.
C# results are empty - @gertlabs - any ETA for those?
Either that, or Flash is truly a better architecture and the Pro variant is heavily benchmaxxed. It wouldn't be the first time we saw something like that in our benchmarking. We collect samples every week so it'll be interesting to see if it rebalances over time as new providers host the model. Flash is great though; it's so fast and cheap.
Jan 2025 was Claude 3.5 Sonnet, Gemini 1.5 Pro and OpenAI had GPT-4o.
As someone who used all those models, as well as today's frontier models - today's models are a significant step up from those.
It will be interesting to see the implications of this. Tooling can only do so much in the long term.
This has, in my opinion, likely been the primary vector for getting better models thus far, but MIT mathematically proves that it yields diminishing returns for each new dimension added. It will get more and more expensive, and the cost-to-return ratio will make it infeasible, or probably already has.
Ilya appears to support this sentiment as well. [1]
[0] - https://openreview.net/forum?id=knPz7gtjPW [1] - https://www.businessinsider.com/openai-cofounder-ilya-sutske...
Is this saying a quarter* of the questions and answers were wrong, this whole time?!
If so, how was this ever, in any way, a valid measurement?
And what was the process for creating this benchmark and how did it end up with such an extraordinarily poor set of data? (There is a description later of how, which seems to be a high standard and I struggle to understand how it aligns with the other results they discuss.) Kudos to them for highlighting the issues, but I am left with questions.
[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!
No, they're saying 59.4% of the 27.6% subset had flawed test cases I think.
> If so, how was this ever, in any way, a valid measurement?
Benchmarks essentially aren't, for practical concerns anyway. They don't represent your use case, and they don't represent any and all use cases; they're valid for measuring exactly what's included in the benchmarks, nothing more and nothing less.
I don't understand the ecosystem's obsession with using public benchmarks; they hardly ever tell you anything of value. OK, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5, but does that mean it'll be 50% better for what you're using it for? Very unlikely.
I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real, actual cases where a LLM went wrong and I had to adjust the prompt, and over time I've built up a suite.
Most of the times when a new update comes out to a model, it moves maybe 2-3% in my own benchmarks, meanwhile they tout 30-40% increase or something ridiculous in public benchmarks, and we're supposed to believe the models' training data isn't contaminated...
but yeah, you're correct: anyone optimizing for public-bench rank instead of their own task-distribution eval has been pointing at the wrong thing for a while
Still, I guess it's useful signal for knowing which models to consider; negative signal is still signal. Assuming everyone is gaming the benchmarks in certain ways, a lack of performance does translate into a real effect on workloads.
The marketing departments touting each model do want to claim superiority on the basis of slivers of percentage points, and that's probably always a stronger claim than the test results can reasonably support. And the benchmarks are obviously susceptible to cheating and overfitting. But when the scores aren't saturated and do show a big discrepancy, that kind of result usually seems to align with what people report from actually trying to use the models in the relevant problem space.
That being said, they didn't audit the other 72.4%, right? So it's likely that there are way more flawed problems throughout the full set?
The answer is “it works because ML wants to work.” It’s surprising how far you can get with something flawed. It’s also why such huge breakthroughs are possible by noting flaws others haven’t.
I do these sorts of breakthroughs at home all the time! My wife would say the computer is doing something strange, and instead of just randomly clicking around, I read the error messages slowly and out loud, then follow what they say. Anyone can do this, yet it seems like a magical ability every time you employ it to help people.
Most machine-learning benchmarks have a fairly large fraction of incorrect labels, but when you just want to distinguish between different models, the time you'd need to ensure perfect scoring would usually be better spent on collecting a larger benchmark dataset, even if it ends up having more errors.
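A rough illustration of why (numbers invented): the standard error of a pass-rate estimate shrinks with benchmark size, while label noise, if it hits all models roughly equally, mostly shifts scores without reordering them.

```python
import math

# Standard error of an estimated pass rate p over n problems (illustrative numbers).
def stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

print(stderr(0.80, 300))   # ~0.023 -> a 2-point gap between two models is within noise
print(stderr(0.80, 3000))  # ~0.007 -> the same gap is resolvable, even if the larger
                           #           set carries somewhat more labeling errors
```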
So not one in four, but one in six problems have problems.
That is extraordinarily high and the point still stands: is this truly saying a [large proportion] of the questions and answers were wrong, this whole time, and if so how was it ever a valid measurement?
Huh, that is very curious and interesting indeed. If it's true that Anthropic claims that pass rate while OpenAI claims the test cases are flawed and broken, then clearly one of them isn't telling the whole story...
https://news.ycombinator.com/item?id=47911074
Citation for the claimed pass rates is: https://llm-stats.com/benchmarks/swe-bench-verified
I.e., a panel comes up with a series of problems.
Like Advent of Code or Project Euler, but more complex and more constrained.
Benchmark outcomes could be performance points plus measures of cost and time to solution (well, token count really).
It's run a couple of times per year.
It avoids overfitting.
Over time the tasks can become more complex if needed.
If they benchmax it into being able to complete full products from spec, with robust implementations, amazing.
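Something like this for the scorecard (the fields and metrics below are invented for illustration, not a spec):

```python
from dataclasses import dataclass

# Illustrative scorecard combining pass rate, cost, and token count as
# sketched above; field names and metrics are invented.
@dataclass
class Attempt:
    solved: bool
    tokens_used: int
    cost_usd: float

def scorecard(attempts: list[Attempt]) -> dict:
    solved = [a for a in attempts if a.solved]
    return {
        "pass_rate": len(solved) / len(attempts),
        "avg_tokens_to_solution": sum(a.tokens_used for a in solved) / len(solved) if solved else None,
        "total_cost_usd": sum(a.cost_usd for a in attempts),
    }
```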
Further, olympiad-style benchmarks are arguably easier to contaminate / memorize unless you refresh them regularly; but that goes for SWE-bench too.
Simple enough that anyone could run it with a regular subscription.
Really, unless we can get the providers to ditch the gameable benchmarks, they won't.
But industries love nothing more than a benchmark they can manipulate.
this statement alone seems to invalidate the SWE-bench tests
Also similar: Graduate student descent. https://sciencedryad.wordpress.com/2014/01/25/grad-student-d...
That doesn't help for measuring coding ability specifically (you fundamentally need a code-correctness oracle), but for capability axes where the "answer" is a stated position rather than a verifiable fact, public + stable can still be useful. The SWE-bench problem isn't really "public", it's "public + has a fixed correct answer".
They're saying they need to move on from it because the benchmark is flawed (without offering proof) and that's why they can't hit 100%.
It's not a "our models are so good that the benchmark is too easy" thing.
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.
Did we read the same article?
I think the core issue is static benchmarks, and the community needs to start moving beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) to dynamic evals that more closely simulate how we evaluate humans.
Once a benchmark is known and there are billions of dollars on the line, obviously every company will game it.
I want a model that can detect the actual units/models that are placed on top of the terrain/board so I can track how the models move during the game, but when I tried Gemini and ChatGPT they were absolutely rubbish.
I mean, it's fine as it's useful for many people, but where is the button for disabling it? Or why is it enabled by default?
"codage de pointe" sounds so weird and cringe in French.
See also: https://this.os.isfine.org/blog/posts/us-ai-labs-love-the-ai...
The other issue they mention is being overly constrained vs. what is asked for - such as requiring specific class or function names to pass that were not part of what was specified.
It might be possible that, even to the extent they are not contaminated, Claude is better at predicting what sort of function names would be used in the repository (this fits my experience using it on a number of projects with very different styles; I've found it to be good at "when in Rome"). That is a laudable trait, but it's also not what SWE-bench claims to be measuring.
So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
But if that hypothesis is the explanation, the interesting part is that Opus 4.7 (but not 4.6) seems to be doing the same.
Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely that their solutions are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, e.g. in the context of backwards compatibility.
[1] https://learn.microsoft.com/en-us/troubleshoot/microsoft-365...
No shit, Sherlock!
For example, you can have problems that are underspecified, with hardcoded tests for a particular solution (out of multiple possible solutions). If your solution works fine but used a different function name than the one hardcoded in the tests, you can unfairly score 0.
When an eval has underspecified problems like these, you can still score 100% if you remember the original solution from your training data or if you just have taste similar to the original human authors. And both of these qualities - good memory and good taste - are great, but they'll be rewarded unfairly relative to a model that still did exactly what it was asked but in a different way than the hardcoded tests expected.
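A hypothetical illustration of that failure mode (the problem, module, and function names below are invented, not taken from an actual SWE-bench task):

```python
# Problem statement (paraphrased): "Duplicate whitespace in config keys
# should be collapsed." A reasonable patch fixes this inline in the parser.
# But the benchmark's hidden test imports a specific helper by name:

def test_collapses_whitespace():
    from config.utils import normalize_key  # <- name never mentioned in the problem
    assert normalize_key("a   b") == "a b"

# A patch that fixes the bug without exposing `normalize_key` fails this test;
# a model that has memorized the original PR "knows" to add that helper.
```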
It is not impossible to solve in absolute terms, in the sense that all the necessary pieces of information are present in the repo + problem statement.
But it is impossible to solve in the sense that, unless you have read the ground truth, you are NOT able to solve it the way the test patch demands.
It's simply not plausible to me that a model can read the problem statement so precisely that it nails exactly, like 100%, what the test suite is trying to test.
Is this just the next level of the "they're serving quantized models!" theory?