You can’t benchmaxx an eval that comes after your model release.
Consider also benchmaxxing makes no sense from an incentive structure: the quality of these models is directly correlated by how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.
Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.
Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.
This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which zero lab is doing because its idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on say humanity's last exam for one benchmark example? You could trivially get those numbers to close to 100% if you wanted to.
Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.
"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."
* https://www.anthropic.com/engineering/eval-awareness-browsec...
"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"
* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...
"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"
* https://www.edtechinnovationhub.com/news/anthropic-says-clau...
To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.
You are making a technical point, which I am pointing out that while for _some_ benchmarks this is _technically_ possible, it's not true for plenty of benchmarks that all agree with the others.
> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking
yes this is incredibly common. I'm not talking about hypothetical scenarios.
> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.
Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.
Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.
These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.
throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.
This starts to break down in college when the professors often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.
I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)
> This starts to break down in college when the professors often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.
How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??
That is what benchmarks and intelligence tests are, which are vulnerable to benchmaxing etc. You wont be able to do this by gut feel though, you can create a personal benchmark though.
But point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.
Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?
They:
- hallucinate constantly
- can't follow basic instructions
- think they're Claude for some reason ;)
no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.
Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.
"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.
zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.
Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.
You mean benchmarks about the programming language that produce the fastest code?
That is not really the same.
So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...
Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.
I know how stupid that sounds but it's true.
Well what do they say... "If it sounds stupid but it works, then it's not stupid!"
I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.
ymmv
My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".
AV professionals always say "timecode" - timestamp is a programming term.
Using the right word pushes the model closer to the correct spot in the cloud of vectors that is it's "brain".
1. It's exponentially better
2. yet, somehow, hand coding still isn't dead, at least for me
I use Cursor and if I ran Claude models for 30min I might exhaust my mobthly budget! Maybe it's an API billing issue though
And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.
It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.
There is no true objective measure, can you mathematically determine which song is the best for everyone for example? Or which painting different people feel is the nicest to look at or what emotion it gives them.
Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.
You can also do benchmarks but how do you measure the output of those?
The easiest way is just to use them all and get the feels of which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than other, but I wouldn't write backend code with it.
I don't think this is that subjective or vague.
There are a couple of crisp metrics that can be used to evaluate a model:
- given a prompt, does it finish a task (times X tasks)
- how much did it cost to finish the task
- how long did it took?
If all models are able to handle a class of tasks, they perform equally well.
If a model costs much more to finish a task, it is worse than other models.
If a model takes longer to finish a task, it is worse than other models.
The ugly truth is that since the GPT4.1 days, new model releases have shown diminished returns. Context windows were increased, reasoning steps help improve the usefulness of a user's prompt,... That's it. Even those are UX improvements, instead of huge breakthroughs.
Or just that it's so much cheaper that the cost/benefit ratio is better?
Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?
I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.
And that says it all.
> Or just that it's so much cheaper that the cost/benefit ratio is better?
That too is another definition of quality, isn't it?
If you have two tools and one does the same job but is both cheaper and faster, it means it it objectively better.
> Also "finish a task" is also subjective.
No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked then it finished the task successfully.
> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?
Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as fable has such a radically different output that in comparison the output of older models is "shitty".
I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.
And the truth if the matter is that the models introduced in the oast year don't introduce any breakthrough and struggle to show significant improvements over older models.
"Don't make mistakes" does seem dumb. It's not guidance.
https://simonwillison.net/about/#disclosures
"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."
But I'm totally unbiased on my gut-feeling posts, trust me bro.
-- AI influencers.
I said "Anthropic didn't give me early access to this model, shouldn't that bias me against it?"
I was explicitly pointing out that their failure to give me early access had not, in this case, lead to me reviewing their model poorly.
I try very hard not to let things like early access affect my reviews of models. I was hoping this particular situation could help illustrate that.
Fable just got announced and I did a rush out article because people are curious. I released the post mere hours afterwards and it takes time to create the output, slice into videos, make a wordpress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.
If you check the links my previous articles have all the juicy stuff you are criticising me for not having with little preparation.
How is a side by side direct comparison NOT precise?
[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix
.
[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advanced warning.
I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.
I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.
Your entire site is obssessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and I nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.
In my opinion, if one cannot express themselves civilly, they should refrain from commenting.
AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.
It looks like you've just implied I'm entitled, unhinged, uncivil and and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.
I don't think it was "a huge waste of time" or needed your rant.
You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.
What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.
I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.
I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.
The issue here I feel is that LLMs are increasingly leading people think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.
> if one cannot express themselves civilly
It was neither unhinged nor uncivil. Maybe you responded to the wrong comment by accident?
> they have permission to insult someone's competence and work
If it's AI, it's not your work. And even if it was - criticism of your work is not a personal insult. This criticism is flatly invalid.
> this post gets me irrationally irritated and makes me want to shake you and shout
Yes, criticism of my work would not generally be a personal insult.
However, if you were to call my work 'slop', and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level. This is not a civil or respectful way to talk to someone.
>> this post gets me irrationally irritated and makes me want to shake you and shout
Did you read the rest of the comment? The rest of it is civil. It's normal for people to start by saying something like "this makes me frustrated" as a preface to indicate their feelings, and then not actually act frustrated and instead calmly work through their thoughts. That is a meatspace social convention (not just an online one) - are you not aware of it?
> However, if you were to call my work 'slop'
And, as previously established, if you use AI, it's not your work.
> and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level
...and those are still criticisms of your work, not yourself.
The actual problem here is that you are taking offense to things that are not offensive, not that the parent poster was being uncivil. Thinking that calling someone "inexperienced" is a personal insult is absolutely insane. That's a wildly miscalibrated sense of how social dynamics work and what it actually means to insult someone.
You and others are right though, that there's potentially interesting or enjoyable stuff in there (maybe I should have lead with that?). It's just a large volume of it is not useful in response to a question specifically looking for more quantitative or detailed usage analysis.