Evals come from a million places, and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. Individually, all of them are flawed. Taken together, the aggregate signal is highly useful because you more or less marginalize over a lot of different things. Not to mention that these companies have plenty of proprietary internal measurements: they build benchmarks themselves to probe their models, and they also have flywheel traffic and A/B tests.
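To make the "marginalize over a lot of different things" point concrete, here is a toy simulation, not real benchmark data. It assumes each benchmark reports the true gap between two models plus its own independent bias and noise; the function name, the gap of 2.0 points, and the standard deviations are all made up for illustration. If the flaws are correlated across benchmarks (for example, shared contamination), the averaging argument weakens accordingly.

```python
import random
from statistics import mean


def simulate_benchmark_gaps(true_gap, n_benchmarks, bias_sd=3.0, noise_sd=3.0, seed=0):
    """Hypothetical model: each benchmark measures the A-minus-B gap as
    true_gap + benchmark-specific bias + sampling noise, all independent."""
    rng = random.Random(seed)
    return [
        true_gap + rng.gauss(0, bias_sd) + rng.gauss(0, noise_sd)
        for _ in range(n_benchmarks)
    ]


gaps = simulate_benchmark_gaps(true_gap=2.0, n_benchmarks=400)

# Any single benchmark can easily point the wrong way...
wrong_sign_fraction = sum(g < 0 for g in gaps) / len(gaps)

# ...while the average over many independently flawed benchmarks is far tighter.
aggregate_gap = mean(gaps)

print(f"benchmarks with the wrong sign: {wrong_sign_fraction:.0%}")
print(f"aggregate estimate of the gap:  {aggregate_gap:.2f} (true value 2.0)")
```

The design choice that matters here is independence: the error of the average shrinks roughly with the square root of the number of benchmarks only when their flaws do not all push in the same direction.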
You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.
This is what my coworkers and I (and many other people in this thread) are doing on a daily basis with real stakes and real tasks, which these benchmarks all aim to be a proxy for. There's a real, tangible [cost]benefit to [not] using the highest-ROI models and harnesses.
The people with real incentives and skin in the game are telling you that the data diverges from "the data".
I don't mind if you don't take it seriously, our jobs are more important to us than a benchmark is.
But I wouldn't opt-out of using your own eyes and the eyes of others so easily, especially when there are literally hundreds of billions of dollars in invested capital with an interest in a certain outcome... this is how you end up in "Emperor's New Clothes" situations.
The eyes and ears of others are incredibly important. But you still seem to think benchmarks are somehow part of some giant conspiratorial cabal. You have institutions without ANY skin in the game making extremely high quality benchmarks. Consider that in academia there is little else to do outside of partnerships with these companies. But benchmarks you can do completely independently and with university grant level money (it costs maybe $10-100k for a reasonable benchmark in many cases). Not only that, “real tasks” are what many benchmarks measure. You have these companies with extremely good logging and well scaled measurements to really look at what works and what doesn’t.
I personally don't believe in any sort of cabal (Occam's Razor hasn't let me down yet). Ultimately, I don't really care *why* they're wrong as much as I care *that* they have diverged from my rubber-meets-the-road measures of value.
That is concerning to me, because people are investing 100s of B's of capital based on the putative RoI putatively available to people like ourselves. When the benchmarks support this RoI thesis, but none of the anecdata does... that's really concerning!
Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing. And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.
> but none of the anecdata does... that's really concerning!
But see this is not really true -- adoption, subjective benchmarks, verifiable benchmarks, task-dependent performance, internal product metrics, living benchmarks, all point in a pretty consistent direction. The plural of anecdote is not data. An anecdote is like a case study. It's there to motivate what we already have, which is a huge number of performance measures for a variety of different tasks.
> Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing.
But this isn't really true either -- you can get this data from a variety of sources that are licensable or open source, or data that you can commission. You can critique any one methodology for this but a blanket "they are hamstrung" is not really fair or accurate.
> And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.
But this is also not true -- you can have exclusive license agreements, data you hold close to the heart, or data to measure models that haven't had access to it because that data was created after these models were released.
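As a rough sketch of the "created after these models were released" idea: keep only the evaluation items whose creation date is later than the model's release, so the model cannot have trained on them. The item ids, the dates, and the `uncontaminated_items` helper are hypothetical, and a real pipeline would also have to worry about near-duplicates of older material.

```python
from datetime import date


def uncontaminated_items(items, model_release):
    """Keep only items created strictly after the model's release date."""
    return [item for item in items if item["created"] > model_release]


# Hypothetical evaluation items with creation dates.
items = [
    {"id": "task-001", "created": date(2023, 5, 1)},
    {"id": "task-002", "created": date(2024, 9, 15)},
    {"id": "task-003", "created": date(2025, 1, 20)},
]

# Hypothetical release date for the model under test.
fresh = uncontaminated_items(items, model_release=date(2024, 6, 1))
print([item["id"] for item in fresh])
```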
There are plenty of problems in model measurement, but the answer is not to just abandon it and be cavemen with zero respect for rigor or for the biases we are subject to as human beings.
Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.
Otherwise we agree that benchmarking is hard, the benchmarks contain hard problems, and that there are many hard working people trying to accurately gauge what is going on. It is getting harder to watch though as all that is on the line taints the overall endeavor.
Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).
Claude was heavily lobotomised for my work starting sometime in February.
I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)
I cancelled my subscription in March and talked to said friends, who are still on a plan, just last week: they are still not happy, but the company pays so whatever...
I am not willing to believe the contrary from strangers on the interwebs or PR departments of companies who want to sell me something.
If people I genuinely trust tell me about their experiences, I am willing to try again.
But yes, if it doesn't work for me (for whatever reason, could be that I am holding it wrong), then I can accept that it works for everyone but me and still not use it.
Also "scientific" doesn't mean what it used to mean. When the n is small, or it's just anecdotes (I am aware of the irony) blown out of proportion, I really can't take the data and conclusions seriously.
I am neither impressed nor offended by any kind of argumentum ad hominem. I sincerely hope you have a wonderful day!
> Benchmarks are not PR they are designed by a variety of institutions completely outside the control of frontier labs.
I don't give a crap about how good a shovel may be in a theoretical experiment when it's digging in sand, when I work with hard earth.
The ones I had a look at are mostly absolutely meaningless to my actual work.
> and what you’re describing is just putting your trust in a very poor quality benchmark.
And here is where we disagree fundamentally, so we can leave it at that.
Ex falso quodlibet
I don't know what this means, benchmark tasks are pretty hard and pretty in domain.
> The ones I had a look at are mostly absolutely meaningless to my actual work.
You've looked at 100,000 benchmarks?
> And here is where we disagree fundamentally, so we can leave it at that.
Yes we do disagree, yet one of us has statistics and rigor and one of us doesn't.
What about "The ones I had a look at" was unclear?
> Yes we do disagree, yet one of us has statistics and rigor and one of us doesn't.
Yup, that's true. So again, have a nice life!
It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"
- evaluations need to be done at the same time to avoid drift in your bias
- you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?
- which one did you do first? Raters have a tendency to bias in one direction or another
- you also know the label! You know which model is which! This biases your assessment…
And on and on and on. Careful science exists for a reason.
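For what it's worth, here is a minimal sketch of what a personal comparison that respects the points above could look like: same prompts for both models in one sitting, left/right order randomized per prompt, labels hidden until the end, and an exact sign test so a handful of wins isn't over-read. All function names are made up for illustration, and it only covers the mechanics; picking prompts that are representative of your work is still on you.

```python
import random
from math import comb


def blind_pairs(prompts, outputs_a, outputs_b, seed=0):
    """Randomize left/right position per prompt and keep the model labels hidden."""
    rng = random.Random(seed)
    trials = []
    for prompt, a, b in zip(prompts, outputs_a, outputs_b):
        flipped = rng.random() < 0.5  # if True, model B is shown on the left
        left, right = (b, a) if flipped else (a, b)
        trials.append({"prompt": prompt, "left": left, "right": right, "flipped": flipped})
    return trials


def unblind(trials, choices):
    """Map the rater's 'left' / 'right' / 'tie' choices back to model labels."""
    wins = {"A": 0, "B": 0, "tie": 0}
    for trial, choice in zip(trials, choices):
        if choice == "tie":
            wins["tie"] += 1
            continue
        picked_left = choice == "left"
        # The left slot holds model A unless the pair was flipped.
        wins["A" if picked_left != trial["flipped"] else "B"] += 1
    return wins


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test with ties dropped: how surprising is this
    split if the rater really had no preference between the models?"""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

Usage would be: show the rater only `prompt`, `left`, and `right` from each trial, collect the choices, then call `unblind` and pass the two win counts to `sign_test_p`. An 8-to-2 split over ten prompts, for instance, gives a p-value of about 0.11, which is a good reminder of how little ten prompts can tell you.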