From the paper: We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), but they all share a common thread: the evaluation was not designed to resist a system that optimizes for the score rather than the task.
The purpose of a system is what it does.
AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"
We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.
There are ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.
I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions is that cheating or is that patching the eval to better align it with user value?
And this is something which has reached the public eye in one of the most anticipated videos basically. So I find it a bit rough as to think that OpenAI has the best practices for data, and if the public can be shown these inaccurate graphs themselves on based on benchmarks. I find it a bit harder to trust the benchmarks themselves and if OpenAI wants legitimate benchmarks.
Also I find it wild that after 1 month of this, nobody talked about it. I remember thinking that this is gonna be the highlight for a long time that a mega billion dollar company did such basic graph errors. I feel like we are all forgetting a lot of things as our news cycle keeps on moving faster.
(Another tangential point is about the OpenAI/Google employees who had signed the pledge yet nothing came out of it and this is something more recent & I also remember one of your comments on Hackernews.)
> I'm an OpenAI employee and I'll go out on a limb with a public comment. I agree AI shouldn't be used for mass surveillance or autonomous weapons. I also think Anthropic has been treated terribly and has acted admirably. My understanding is that the OpenAI deal disallows domestic mass surveillance and autonomous weapons, and that OpenAI is asking for the same terms for other AI companies (so that we can continue competing on the basis of differing services and not differing scruples). Given this understanding, I don't see why I should quit. If it turns out that the deal is being misdescribed or that it won't be enforced, I can see why I should quit, but so far I haven't seen any evidence that's the case. [1]
This is a bit off-topic so sorry about that, but I hope that you realize that you did say you will go out on a limb with public comment so please don't mind if I ask for some questions, everyone supported you then and heck, even I thought that maybe I was wrong and I thought that I should trust you more than my gut-instincts because you clearly must know so much more than me/us but that aged like fine milk.
I would really love some answers or your thoughts now on that off-topic thought as well if possible as these are just some questions which are unanswered by you and I would love to have a respectful discussion about it, sorry for catching you off guard, waiting for your reply and I wish you to have a nice day ted.
[0]: https://www.reddit.com/r/BetterOffline/comments/1mk6ofz/gpt5...
I am so tired of this saying.
It's not true, in general. Systems almost universally have unintended consequences and result in side effects their designers did not foresee.
Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench.
You are misunderstanding the saying. It is entirely about unintended consequences and viewing the system for what it actually does and not any stated intentions of the designers.
1. We must ignore the intentions of the designers (your claim), and instead see what the outcomes are
2. Therefore we should ignore Beer's intentions when designing the phrase POSWID, and instead see how it is used.
3. The overwhelming majority of people using it on the internet (including the GP comment) is to imply that the people perpetuating the system actually desire the outcome.
So the purpose of POSWID is clearly to imply intent.
IMHO the saying is meant to make you reflect.
That's not "true" in any demonstrable sense, but it can be a useful form of analysis. As it is with "purpose of a system"
Also worth remembering that most systems POSIWID is said about, and in fact ~all important systems affecting people, are not designed in the first place. Market forces, social, political, even organizational dynamics, are not designed top-down, they're emergent, and bottom-up wishes and intentions do not necessarily carry over to the system at large.
The idea is knowing what to try first today saves a bit of time.
also I keep hearing complaints that opus is nerfed, but IMO it's nice to have objective data to back that. I feel like half of the nerfing complaints are people getting past honeymoon phase...
Yeah the path forward is simple: check if the solutions actually contain solutions. If they contain exploits then that entire result is discarded.
It’s… remarkably poor, and as demonstrated in the paper, easily gamed. Worst yet, these benchmarks teach AIs to be very short-sighted and hyper-focused on completing the task, rather than figuring out the best solution.
The Artificial Analysis Omniscience benchmark does penalize guessing, so it actually helps you determine which LLMs are likely to just guess rather than telling you they don't know. Only a very few of the frontier models actually score higher than 0 on this, where 0 means that it's equally likely to return a correct answer as it is to return a hallucination on factual questions.
if bug { dont }
/s
Welcome to benchmarks in general, but especially reasoning. Robustness and sensitivity research says nothing is robust, everything is sensitive, feels like every paper says "yeah we made a new benchmark that shuffles the order of multiple choice options in the question set and found a 40% drop in model performance"
2003: Nvidia accused of cheating in 3DMark 03 https://www.gamespot.com/articles/nvidia-accused-of-cheating...
It's almost like the benchmarks were designed with zero understanding of the history of benchmark manipulation.
I like what LLM's are doing and providing. But the industry as a whole seems to live in a vacuum that ignores so much of the hard lessons that have been learned over the last 50 years of computing. It is doing itself a disservice.
I wonder if this common? We should call it Goodharts law while someone does the research on how common this is.
For real, I’ve assumed from the jump these things were all gamed, with the amount of money on the line.
Evaluating AI models has always relied largely on trust. If you want to game the benchmarks, you can. Simply train on your test data.
When an AI agent has autonomous control over the same computing environment where its scores are recorded, it's not surprising that it can, in principle, falsify its scores. A more interesting question would be whether agents behave in this way automatically, without manual tuning by the researcher.
That said, the main takeaway of "don't trust the number, trust the methodology" is valid. It's already a truism for researchers, and spreading the word to non-researchers is valuable.
This is modifying the test code itself to always print "pass", or modifying the loss function computation to return a loss of 0, or reading the ground truth data and having your model just return the ground truth data, without even training on it.
There if a presumption with benchmark scores that the score is only valid if the benchmark were properly applied. An AI that figures out how to reward hack represents a result not within the bounds of measurement, but still interesting, and necessitates a new benchmark.
Just saying 'Done it!' is not reward hacking. It is just a lie. Most data is analysed under the presumption that it is not a lie. If it turns out to be a lie the analysis can be discarded. Showing something is a lie has value. Showing that lying exists (which appears to be the level this publication is at) is uninformative. All measurements may be wrong, this comes as news to no-one.
Benchmarks are on the honor system. Even the tightest benchmark can be cheated. If the benchmark is so secret and air-gapped that it can't be cheated by models, it can be cheated by its own authors. You can't use benchmarks to gate out cheating.
If you don't have the honor system in mind when you're reading scores, you're wasting your time. Is it some unknown outfit with wild claims? Is it connected to Epstein, Russia, the real estate "industry", or sleazeballing in general? Do they have previous history of ratgaming the numbers? Replace its scores with asterisks and move on.
I think it likely is groundbreaking for a number of people (especially non-tech CTOs and VPs) who make decisions based on these benchmarks and who have never wondered what the scores are actually scoring.
Whether benchmark results are misleading depends more on the reporting organization than on the benchmark. Integrity and competence play large roles in this. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.
I guess I look at this less as an “ah ha! They’re all cheating!” and more of a “were you guys even aware of what the benchmarks represented and how they checked them?”
>No reasoning. No capability. Just exploitation of how the score is computed.
shudder
>No solution written, 100% score.
Its weird. Turns out that hardest problem for LLMs to really tackle is long-form text.
In theory I would expect them to be able to ingest the corpus of the new yorker and turn it into a template with sub-templates, and then be able to rehydrate those templates.
The harder part seems to be synthesizing new connection from two adjacent ideas. They like to take x and y and create x+y instead of x+y+z.
Just give them the right writing prompt. "You are a writer for the Economist, you need to write in the house style, following the house style rules, writing for print, with no emoji .." etc etc.
The large models have already ingested plenty of New Yorker, NYT, The Times, FT, The Economist etc articles, you just need to get them away from their system prompt quirks.
if you've worked on something diligently and understand it and have novel insight to share, let's hear _your_ damn voice.
- Contamination: AI models knowing the answers out of the gate b/c pretraining on the internet and everything big teams can afford to touch. At RSAC for example, we announced Anthropic's 4.6 series is the first frontier model to have serious training set contamination on Splunk BOTS.
- Sandboxing: Agents attacking the harness, as is done here - so run the agent in a sandbox, and keep the test harness's code & answerset outside
- Isolation: Frontier agent harnesses persist memory all over the place, where work done on one question might be used to accelerate the next. To protect against that, we do fresh sandboxing per question. This is a real feature for our work in unlocking long-horizon AI for investigations, so stay tuned for what's happening here :)
"You cannot improve what you cannot measure" - Lord Kelvin
As a researcher in the same field, hard to trust other researchers who put out webpages that appear to be entirely AI-generated. I appreciate it takes time to write a blog post after doing a paper, but sometimes I'd prefer just a link to the paper.
I’m convinced specialised models are the way but this means writing off the investment in existing assets which they won’t do for obvious reasons.
1. Should you care or even read SWE-bench etc. scores?
The answer is no, but it has nothing to do with the vulnerabilities presented in this article. There is absolutely no reason to care about a benchmark whose dataset has been publicly available for a while. Any other way to look at benchmark scores is cargo-culting.
2. What does this article actually tell us?
It means that even if you prepared a private set of problems as benchmark, you still need to pay extra attention to how AI actually solves them. You can't lie to yourself and think this process can be 100% automated, because LLMs, as this article shows, might get the tests passed without solving the problems in a meaningful way.
This team is doing a good job. They use problems that were created in last 30days to avoid training set leakage. https://swe-rebench.com/
They're good at solving well-defined puzzles under time constraints. It's interesting because that was the benchmark for hiring software engineers at big tech. The tech interview was and still is about fast puzzle-solving. Nothing about experience, architecture or system design in there... I suspect that's why it has a bias towards creating hacks instead of addressing the root cause.
Most frontier models are terrible at AGI-3 right now.
These models are already great no question, but are they really going be that much more intelligent when we hit 80% again?
(Not commenting on any other benchmarks, just this one.)
I don't understand the concern here
UC Berkley will be better placed if the grads spend their time in suggesting ways to make the benchmark better.. Instead of making such simple exploits
It's based on existing GitHub PRs and Issues, the full dataset is on HuggingFace and is one year old now. All frontier models 100% have those issues and PRs in their training data so obviously they are good at reproducing fixes for them when confronted with the same codebase and similar requests. Am I missing something? How is this considered the most reliable benchmark?
This is basically a paper about security exploits for the benchmarks. This isn't benchmark hacking like having hand coded hot paths for a microbenchmarks, this is hacking like modifying the benchmark computation code itself at runtime.
But then what about local models? You have hundreds of variations to test yourself. It's simply not doable unless it's your full time hobby.
You need benchmarks to at least separate the cream from the crop, so you're left with only a few choices to test yourself.
their collective butts are already glued to the hype train as they chase numbers they (often) manufactured to justify the latest round of tech spend.
lots of good use cases out there - like the incredible progress with medical imaging analysis or complex system models for construction - and lots of crap use cases that need benchmarks to cosplay relevance.
Unreadable.
People can't even write a two paragraph comment without ai now