undefined

upvote

points

by aspenmartin4 days ago |

upvote

by ElevenLathe4 days ago|

[-]

> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

reply

upvote

by aspenmartin4 days ago|

[-]

> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which zero lab is doing because its idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on say humanity's last exam for one benchmark example? You could trivially get those numbers to close to 100% if you wanted to.

reply

upvote

by andai4 days ago|

[-]

Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

reply

upvote

by aspenmartin4 days ago|

[-]

Why does this have anything to do with what I’m saying, of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know they are being evaluated on a new benchmark.

Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.

reply

upvote

by Eisenstein4 days ago|

[-]

Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...

reply

upvote

by aspenmartin3 days ago|

[-]

Yes but so what right? This is a problem for both alignment evals and actual cheating (e.g. someone forgot to delete .git history and the model was able to back out the original PR, or they can decrypt something by finding a key, etc), but both of these are beyond the scope of what I'm talking about. The impact on these evals that are affected is small, and so what if you know you're being evaled when I ask you to give a new proof for a conjecture? I just care whether or not you can do it...

reply

upvote

by Eisenstein3 days ago|

[-]

I'm not responding to 'it doesn't matter if they know they are being evaluated', because that isn't what you mentioned in your comment. What you said was 'they won't know they are being evaluated', which is what my reply addressed.

reply

upvote

by aspenmartin3 days ago|

[-]

Oh ok well then you’re definitely right about that, they can tell and sometimes it really matters (I can’t remember if it was SWEBench or not but there was a major benchmark where the models were just inspecting git histories that were leaked into the dataset). The more insidious one is alignment but idk alignment research that well to know if this is a big deal or not.

reply

upvote

by ElevenLathe4 days ago|

[-]

I'm not suggesting anyone is doing anything, just stating the objective fact that it is definitely possible for closed-weight model developers, and would be super hard to detect outside of this limit scenario you posit, where it is provably impossible for the provider to have seen the benchmark before it was run (which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking).

To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

reply

upvote

by aspenmartin3 days ago|

[-]

Its not a limit scenario is my point: these models are evaluated constantly, new benchmarks both public and proprietary are in constant development, benchmarks are not always static either, they can often times be living benchmarks that update over time.

You are making a technical point, which I am pointing out that while for _some_ benchmarks this is _technically_ possible, it's not true for plenty of benchmarks that all agree with the others.

> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking

yes this is incredibly common. I'm not talking about hypothetical scenarios.

> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.

reply

upvote

by teiferer4 days ago|

[-]

> This is...just incredibly conspiratorial and a bit silly.

Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.

reply

upvote

by aspenmartin3 days ago|

[-]

They don't have control over measurement. Consider also it's easy to figure this out and it creates a scandal. Like I said, consider Llama 4 which a lot of people pointed out used a custom model in LMArena to inflate their scores; its never clear what the true underlying story for this, but regardless that model release spurred billions of dollars of spending on new talent and a complete gutting of that org.

These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.

reply

upvote

by bcrosby954 days ago|

[-]

Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.

reply

upvote

by aspenmartin4 days ago|

[-]

Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.

reply

upvote

by joquarky4 days ago|

[-]

Imagine unironically starting your comment with "Um" in 2026.

reply

upvote

by jaapz4 days ago|

[-]

As opposed to your incredibly useful contribution to this thread, thanks!

reply

upvote

by aspenmartin4 days ago|

[-]

You don't have to imagine!

reply