1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.
2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated.
3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/ . And we'll have more to say soon :)
They're saying:
1. A large number of the tests are inaccurate, so correct solutions will be marked as incorrect.
2. Frontier models have already read and memorized the PRs the problems are based on.
3. In fact, many problems are essentially impossible to get right if you haven't memorized the solution: for example, the test cases will fail if you didn't happen to expose a helper function with a specific name. That name isn't mentioned in the problem, but frontier models are passing that test anyway because they remember that such a helper function is necessary.
If the next stage of benchmarks doesn't address these issues, they'll continue to have the same problems, saturated or not.
But the article says "We audited a 27.6% subset of the dataset that models often failed to solve [which is 19.1% of the problems at time of publication] and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submission"
0.191 * 0.594 ≈ 0.113 > 0.064 = 1 - 0.936
Does this mean that the audited subset wasn't representative? Or that Anthropic is getting high scores through some shady means?
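A rough back-of-envelope check of those numbers (a sketch using only the figures quoted above, which I haven't verified independently):

```python
# All figures come from the quote above; nothing here is independently verified.
hard_subset_frac = 0.191       # problems models often fail, as a share of the full set
flawed_in_audit = 0.594        # audited hard problems found to have flawed tests
claimed_fail_rate = 1 - 0.936  # share of problems the top model reportedly misses

# If flawed tests only existed in that audited hard subset, at least this
# fraction of ALL problems would reject a functionally correct patch:
min_flawed_overall = hard_subset_frac * flawed_in_audit
print(round(min_flawed_overall, 3), round(claimed_fail_rate, 3))  # 0.113 vs 0.064

# A model that never reproduces the original reference patch could then score
# at most ~88.7%, below the reported ~93.6% -- hence the question above.
```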
But how do you know whether the model was over-optimized for it or just really good?
You can’t trust that a model that scores 93% is better at software engineering than a model that scores 90%, because at that point it’s impossible to distinguish between recall and reasoning.
40% vs 90%? Sure.
70% vs 90%? _Absolutely meaningless_, as you are not measuring coding intelligence but “how well can the model exploit flaws in SWE-bench Verified”; the former can certainly be better at coding even assuming no deliberate benchmaxxing / foul play.
It would be interesting to see a deeper investigation into how the models are dealing with this and whether the successful ones appear to have been trained on the benchmark.
SPECint and SPECfp went through this exact movie: benchmark, saturate, retire, replace, repeat. The treadmill is the product.
I don't have the solution, just noticing the pattern.
However, both kinds of tests are susceptible to over-fitting: an LLM can be trained on the exact test questions, and a CPU can be designed with e.g. branch predictors and cache sizes tuned specifically to handle a particular benchmark or workload.
Both that and the SPEC compiler shenanigans are cheating by changing the test, not just over-specializing the product being benchmarked.
An industry-standard benchmark shouldn't be hosted or designed by a lab producing the models, regardless.
But if some or all players are bench-maxing it, then it becomes a much less useful metric for comparison.
Also, this doesn't address what OpenAI says about the test suite disallowing valid solutions.
The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed not to already be in the training data, and that don't borrow anything from previous benchmarks.
In this regard I don't think any benchmark that was created before a given model is released should ever be considered valid or representative of model performance. The potential financial gain from including the data just to be able to market a minor improvement is too tempting. With that in mind, they should honestly just stop including benchmarks altogether in marketing material.
Let the model speak for itself and let the community decide, but of course that will never fly with corporate types with so much money on the line.
The LLMs I have tested have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.
[1]: https://entropicthoughts.com/updated-llm-benchmark
(more descriptions available in earlier evaluations referenced from there)
If you're doing an RPG, which I guess is where this is most obvious, you track the player and enemy positions, their health, their moods and perhaps top thoughts, and the state of important inanimate objects. If you break down the door, you update the door's state in the document. This is in contrast to just giving the LLM the previous turns and hoping it realizes the door is broken down later (just by statistical completion).
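For what it's worth, a minimal sketch of that kind of explicit state document (the structure and field names are my own illustration, not from any particular framework):

```python
# Minimal sketch of explicit state tracking; structure and names are illustrative.
game_state = {
    "player": {"position": (3, 4), "hp": 12, "mood": "wary"},
    "goblin": {"position": (5, 4), "hp": 7, "mood": "aggressive"},
    "objects": {"cellar_door": {"state": "intact"}},
}

def apply_action(state: dict, action: str) -> dict:
    """Update the state document directly instead of hoping the model re-infers it."""
    if action == "break down cellar door":
        state["objects"]["cellar_door"]["state"] = "broken"
    return state

# Each turn, the updated document is fed back to the LLM as context, rather than
# relying on it to notice from old turns that the door is already broken down.
apply_action(game_state, "break down cellar door")
```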
It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.
Which community are we talking about? The professionals with 10+ years of experience using LLMs, the vibe coders that have no experience writing code, or everyone in between? If you read some of the online communities, the experiences with the models are all over the place: some compare GPT 5.5 to the second coming of JC while others think it's stupider than 5.4.
I personally don't have time to build a set of private benchmarks to compare the models that are coming out so I'm mostly relying on private and semi-private benchmarks to get a feel for how models are improving before I subscribe to a service and start using it myself. At least it's something a bit more reliable than the vibes of random people and bots on reddit.
These benchmarks are always greenfield, but people want a model that can deal with a rotted context.
The only real way to evaluate a model is to test it yourself but that's exhausting for each new model and not comprehensive anyway.
I also find it increasingly difficult to evaluate the models I actually do use. Sometimes each new release seems identical or only marginally better than the previous version, but when I then go back two or three versions, I suddenly find that older model to be dramatically worse. But was that older model always like that, or am I now being served a different model under the same version name?
It's all just so opaque.
Regarding evaluation, I've found that tools like promptfoo (and in some cases custom tools built on top of it) are useful. These help when evaluating new models/versions and when modifying the system prompt to guide the model, especially if you can define visualizations and assertions to accurately test what you are trying to achieve.
This can be difficult for tasks like summarization, code generation, or creative writing that don't have clear answers. Though having some basic evaluation metrics and test cases can still be useful, as can being able to easily do side-by-side comparisons by hand.
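As a rough sketch of the idea (this is not promptfoo's actual config format, just the same assertion-style pattern in plain Python; `call_model` and the test cases are placeholders I made up):

```python
# Assertion-style eval sketch; call_model() is a placeholder for your provider SDK
# and the test cases below are invented for illustration.
def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your provider here")

TEST_CASES = [
    {"prompt": "Summarize our refund policy: ...", "must_contain": ["refund", "30 days"]},
    {"prompt": "Write a SQL query that totals sales per region ...", "must_contain": ["GROUP BY"]},
]

def run_suite(model: str) -> float:
    passed = 0
    for case in TEST_CASES:
        output = call_model(model, case["prompt"])
        if all(s.lower() in output.lower() for s in case["must_contain"]):
            passed += 1
    return passed / len(TEST_CASES)

# Side-by-side comparison when a new model version or system prompt lands:
# for m in ("old-model", "new-model"): print(m, run_suite(m))
```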
What you actually want to measure on these models is what they can SEE in production. Context shape, retrieval quality, tool use, ability to compose state across turns. None of that is in SWE-bench because SWE-bench is shaped like a one-shot problem set and frontier coding work isn't shaped like that anymore.
Even a perfectly contamination-free benchmark would mostly test the wrong axis. The model is already at human-grad-student level on isolated problems. The leverage is in how it operates inside a larger system. And that's almost like, a taste/preference issue, and virtually impossible to objectively measure.
Obligatory XKCD: https://xkcd.com/937/
Use frontier LLMs to help create the harness and identify problems, but put in the effort to ensure your verifier is actually good and robust.
Then you have your own private benchmark, which makes new model releases a breeze instead of purely vibes or contaminated public benchmarks.
For extra props, add things you care about, such as reliability (e.g. deliberate noise injection, simple typo introduction in problems, variants, running each test multiple times).
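A sketch of what the reliability part could look like (typo injection plus repeated runs; `solve` and `check` stand in for your own harness and verifier):

```python
import random

# Perturb each problem slightly and run it several times, so flaky or
# memorization-dependent passes show up as variance. solve()/check() are
# placeholders for your own harness and verifier.
def inject_typos(text: str, rate: float = 0.02, seed: int = 0) -> str:
    rng = random.Random(seed)
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def reliability_score(problem: str, solve, check, runs: int = 5) -> float:
    results = [check(solve(inject_typos(problem, seed=s))) for s in range(runs)]
    return sum(results) / runs  # 1.0 = solved consistently, regardless of noise
```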
At the end of the day, however, the best LLM is the one you’re the most productive with. Frontier intelligence might be the main factor, but it's far from the only one:
• How fast is it in the real world? How well does it understand your general style of prompting / guidance?
• How consistent and reliable is it? Does it exhibit laziness, or hallucinate having performed actions (and say it did) that it never actually performed?
• etc.
By determining whether a model gets better or not on a given benchmark, OpenAI selects models against benchmarks, implicitly using them in training.
In the end all it does is affirm what you're saying though. Benchmarks are essentially obsolete the moment they become recognized. I suppose it's just another iteration of Goodhart's Law.
As long as there's a test framework, you could gauge success deterministically.
ELT-Bench is another recent example. It was the first serious attempt at a benchmark for data engineering workloads, published about a year ago.
A few days ago, a follow-up paper from a group that includes one of the original authors audited the benchmark itself. The team found that the benchmark has structural issues that biased results.
Here’s the paper: https://arxiv.org/abs/2603.29399
None of this is new though; the industry has gone through all of that before, just at a smaller scale, and there’s a lot to learn from it. Here’s a post I wrote on the parallels between what we see today and the benchmarketing wars of the database systems era.
https://www.typedef.ai/blog/from-benchmarketing-to-benchmaxx...
You need new datasets perpetually.
I do have empirical experience, though, building classifiers whose precision can't be measured because the classifier invariably performs better than humans. They become the state-of-the-art benchmark themselves and can’t be benchmarked except against themselves. These are for tasks that are non-trivial and complex, but less logical than coding and with less sustained reasoning. There may come a day, though, when there is no calibrated benchmark that is independent of the models it’s measuring.
10 groups of 3 researchers, each with their own benchmarks that they do not share (testing without the authors knowing is a different problem; maybe they only run the benchmarks once the general population has access to the models).
That's 10 different tests. Aggregate the pass rates.
Many SWE-bench passing PRs would not be merged: https://news.ycombinator.com/item?id=47341645
Top model SWE bench scores may be skewed by git history leaks: https://news.ycombinator.com/item?id=45214670
Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy.
Leaderboard: https://arcprize.org/leaderboard
(Most premier models don't even pass 5 percent.)
Arc AGI seems to test that as well. Every game is a rectangular grid to make it as easy as possible yet the AIs still fail.
I'm fairly certain the way forward isn't through agents directly interfacing with UIs but through agents using scripts and other tools to interact with the interface. That's why harnesses are so critical to performance on tasks like this.
I would like a version of Arc AGI that tests the agent's ability to dynamically create these harnesses.
Meanwhile AI agents are expected to guess pixels and fail each time.
It's not a crazy idea. Have the older model interview the newer one and then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of times both sides agree the newer model won is the score.
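A sketch of that protocol, just to make the scoring concrete (`ask` is a placeholder for an actual API call, and the prompts are invented):

```python
# Sketch of the interview-and-referee scoring described above.
# ask() is a placeholder for a real API call; prompts are invented.
def ask(model: str, prompt: str, seed: int) -> str:
    raise NotImplementedError("wire up your provider here")

def new_model_wins(old: str, new: str, seed: int) -> bool:
    # Single-call stand-in for a full multi-turn interview.
    transcript = ask(old, f"Interview the other model and probe its reasoning. (seed={seed})", seed)
    old_verdict = ask(old, f"Based on this interview, which model is smarter?\n{transcript}", seed)
    new_verdict = ask(new, f"Based on this interview, which model is smarter?\n{transcript}", seed)
    return "newer" in old_verdict.lower() and "newer" in new_verdict.lower()

def score(old: str, new: str, rounds: int = 100) -> float:
    wins = sum(new_model_wins(old, new, seed) for seed in range(rounds))
    return wins / rounds  # fraction of rounds where both sides agree the newer model won
```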
Hehe
That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.
Opus otoh is overrated in terms of its technical ability. It is certainly a better designer/developer for beautiful user experiences, but I'll always lean on gpt 5.5 to check its work.
The biggest surprise in the benchmark is Xiao-Mi. I haven't tried it yet, but I will be after looking at this.
Congrats to your team for putting together something meaningful to make sense of the ongoing AI speedrun! Great work!
Your comment makes it sound like they are miles apart, which the benchmark doesn't seem to support.
Edit: I looked at the data more and the two models are only basically equal when looking at the mean of all the tests. Gpt 5.5 significantly outperforms opus 4.7 in coding, while opus 4.7 significantly outperforms in "decision making." I'm not seeing details on what decision making explicitly means.
I'm not being a hater, I love Opus for different reasons, but I can't rely on it for its technical ability.
Because GPT 5.5 just launched and those games take longer to accumulate data for, it just doesn't have enough samples yet. It will end up with a wider lead on Opus, I am sure. Coding evals always have large sample sizes on day 1. Good find, we should probably better adjust the weighting here for decision games with low match counts.
Matches with my experience with Opus for C++.
C# results are empty - @gertlabs - any ETA for those?
Either that, or Flash is truly a better architecture and the Pro variant is heavily benchmaxxed. It wouldn't be the first time we saw something like that in our benchmarking. We collect samples every week so it'll be interesting to see if it rebalances over time as new providers host the model. Flash is great though; it's so fast and cheap.
Jan 2025 was Claude 3.5 Sonnet, Gemini 1.5 Pro and OpenAI had GPT-4o.
As someone who used all those models, as well as today's frontier models - today's models are a significant step up from those.
It will be interesting to see the implications of this. Tooling can only do so much in the long term.
This has, in my opinion, likely been the primary vector for getting better models thus far, but MIT mathematically proves that it yields diminishing returns for each new dimension added. It will get more and more expensive, and the cost-to-return ratio will make it infeasible, or probably already has.
Ilya appears to support this sentiment as well. [1]
[0] - https://openreview.net/forum?id=knPz7gtjPW [1] - https://www.businessinsider.com/openai-cofounder-ilya-sutske...
Is this saying a quarter* of the questions and answers were wrong, this whole time?!
If so, how was this ever, in any way, a valid measurement?
And what was the process for creating this benchmark and how did it end up with such an extraordinarily poor set of data? (There is a description later of how, which seems to be a high standard and I struggle to understand how it aligns with the other results they discuss.) Kudos to them for highlighting the issues, but I am left with questions.
[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!
No, they're saying 59.4% of the 27.6% subset had flawed test cases I think.
> If so, how was this ever, in any way, a valid measurement?
Benchmarks essentially aren't, for practical concerns anyway. They don't represent your use case, and they don't represent any and all use cases; they're valid for measuring exactly what's included in the benchmarks, nothing more and nothing less.
I don't understand the ecosystem's obsession with using public benchmarks; they hardly ever tell you anything of value. OK, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5, but does that mean it'll be 50% better for what you're using it for? Very unlikely.
I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real, actual cases where a LLM went wrong and I had to adjust the prompt, and over time I've built up a suite.
Most of the times when a new update comes out to a model, it moves maybe 2-3% in my own benchmarks, meanwhile they tout 30-40% increase or something ridiculous in public benchmarks, and we're supposed to believe the models' training data isn't contaminated...
but yeah, you're correct: anyone optimizing for public-bench rank instead of their own task-distribution eval has been pointing at the wrong thing for a while
Still, I guess it's useful signal for knowing which models to consider; negative signal is still signal. Assuming everyone is gaming the benchmarks in certain ways, a lack of performance does translate into a real effect on workloads.
The marketing departments touting each model do want to claim superiority on the basis of slivers of percentage points, and that's probably always a stronger claim than the test results can reasonably support. And the benchmarks are obviously susceptible to cheating and overfitting. But when the scores aren't saturated and do show a big discrepancy, that kind of result usually seems to align with what people report from actually trying to use the models in the relevant problem space.
That being said, they didn't audit the other 72.4%, right? So it's likely that there are way more flawed problems throughout the full set?
The answer is “it works because ML wants to work.” It’s surprising how far you can get with something flawed. It’s also why such huge breakthroughs are possible by noting flaws others haven’t.
I do these sorts of breakthroughs at home all the time! My wife would say the computer is doing something strange, and instead of just randomly clicking around, I read the error messages slowly and out loud, then follow what they say. Anyone can do this, yet it seems like a magical ability every time you employ it to help people.
Most machine-learning benchmarks have a fairly large fraction of incorrect labels, but when you just want to distinguish between different models, the time you'd need to ensure perfect scoring would usually be better spent on collecting a larger benchmark dataset, even if it ends up having more errors.
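A rough illustration of why (numbers invented): the standard error of a pass-rate estimate shrinks with benchmark size, while label noise, if it hits all models roughly equally, mostly shifts scores without reordering them.

```python
import math

# Standard error of an estimated pass rate p over n problems (illustrative numbers).
def stderr(p: float, n: int) -> float:
    return math.sqrt(p * (1 - p) / n)

print(stderr(0.80, 300))   # ~0.023 -> a 2-point gap between two models is within noise
print(stderr(0.80, 3000))  # ~0.007 -> the same gap is resolvable, even if the larger
                           #           set carries somewhat more labeling errors
```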
So not one in four, but one in six problems have problems.
That is extraordinarily high and the point still stands: is this truly saying a [large proportion] of the questions and answers were wrong, this whole time, and if so how was it ever a valid measurement?
Huh, that is very curious and interesting indeed. If it's true that Anthropic claims that pass rate while OpenAI claims the test cases are flawed and broken, then clearly one of them isn't telling the whole story...
https://news.ycombinator.com/item?id=47911074
Citation for the claimed pass rates is: https://llm-stats.com/benchmarks/swe-bench-verified
I.e., a panel comes up with a series of problems.
Like Advent of Code or Project Euler, but more complex and more constrained.
Benchmark outcomes could be performance points plus measures of cost and time to solution (well, token count really).
It's run a couple of times per year.
It avoids overfitting.
Over time the tasks can become more complex if needed.
If they benchmax it into being able to complete full products from spec, with robust implementations, amazing.
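Something like this for the scorecard (the fields and metrics below are invented for illustration, not a spec):

```python
from dataclasses import dataclass

# Illustrative scorecard combining pass rate, cost, and token count as
# sketched above; field names and metrics are invented.
@dataclass
class Attempt:
    solved: bool
    tokens_used: int
    cost_usd: float

def scorecard(attempts: list[Attempt]) -> dict:
    solved = [a for a in attempts if a.solved]
    return {
        "pass_rate": len(solved) / len(attempts),
        "avg_tokens_to_solution": sum(a.tokens_used for a in solved) / len(solved) if solved else None,
        "total_cost_usd": sum(a.cost_usd for a in attempts),
    }
```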
Further, olympiad-style benchmarks are arguably easier to contaminate / memorize unless you refresh them regularly; but that goes for SWE-bench too.
Simple enough that anyone could run it with a regular subscription.
Really, unless we can get the providers to ditch the gameable benchmarks, they won't.
But industries love nothing more than a benchmark they can manipulate.
this statement alone seems to invalidate the SWE-bench tests
Also similar: Graduate student descent. https://sciencedryad.wordpress.com/2014/01/25/grad-student-d...
That doesn't help for measuring coding ability specifically (you fundamentally need a code-correctness oracle), but for capability axes where the "answer" is a stated position rather than a verifiable fact, public + stable can still be useful. The SWE-bench problem isn't really "public", it's "public + has a fixed correct answer".
They're saying they need to move on from it because the benchmark is flawed (without offering proof) and that's why they can't hit 100%.
It's not a "our models are so good that the benchmark is too easy" thing.
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.
Did we read the same article?
I think the core issue is static benchmarks, and the community needs to start moving beyond measuring pass/fail (which worked when agents were incapable of doing much of the work) to dynamic evals that more closely simulate how we evaluate humans.
Once a benchmark is known and there are billions of dollars on the line, obviously every company will game it.
I want a model that can detect the actual units/models that are placed on top of the terrain/board so I can track how the models move during the game, but when I tried Gemini and ChatGPT they were absolutely rubbish.
I mean, it's fine as it's useful for many people, but where is the button for disabling it? Or why is it enabled by default?
"codage de pointe" sounds so weird and cringe in French.
See also: https://this.os.isfine.org/blog/posts/us-ai-labs-love-the-ai...
The other issue they mention is being overly constrained vs. what is asked for - such as requiring specific class or function names to pass that were not part of what was specified.
It might be possible that, even to the extent they are not contaminated, Claude is better at predicting what sort of function names would be used in the repository (this fits my experience using it on a number of projects with very different styles; I've found it to be good at "when in Rome"). That is a laudable trait, but it's also not what SWE-bench claims to be measuring.
So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
But if that hypothesis is the explanation, the interesting part is that Opus 4.7 (but not 4.6) seems to be doing the same.
Define "cheat". If it's just hacking the test harness to return "PASSED", surely this would be easily detected with some human auditing? It sounds far more likely that their solutions are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, e.g. in the context of backwards compatibility.
[1] https://learn.microsoft.com/en-us/troubleshoot/microsoft-365...
No shit, Sherlock!
For example, you can have problems that are underspecified, with hardcoded tests for a particular solution (out of multiple possible solutions). If your solution works fine but used a different function name than the one hardcoded in the tests, you can unfairly score 0.
When an eval has underspecified problems like these, you can still score 100% if you remember the original solution from your training data or if you just have taste similar to the original human authors. And both of these qualities - good memory and good taste - are great, but they'll be rewarded unfairly relative to a model that still did exactly what it was asked but in a different way than the hardcoded tests expected.
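A hypothetical illustration of that failure mode (the problem, module, and function names below are invented, not taken from an actual SWE-bench task):

```python
# Problem statement (paraphrased): "Duplicate whitespace in config keys
# should be collapsed." A reasonable patch fixes this inline in the parser.
# But the benchmark's hidden test imports a specific helper by name:

def test_collapses_whitespace():
    from config.utils import normalize_key  # <- name never mentioned in the problem
    assert normalize_key("a   b") == "a b"

# A patch that fixes the bug without exposing `normalize_key` fails this test;
# a model that has memorized the original PR "knows" to add that helper.
```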
It is not impossible to solve in absolute terms, in the sense that all the necessary pieces of information are present in the repo + problem statement.
But it is impossible to solve in the sense that, unless you have read the ground truth, you are NOT able to solve it the way the test patch demands.
It's simply not plausible to me that a model can read the problem statement so precisely that it nails exactly, like 100%, what the test suite is trying to test.
Is this just the next level of the "they're serving quantized models!" theory?