undefined

points

by vintagedave22 hours ago |

comments

by embedding-shape22 hours ago|

[-]

> Is this saying a quarter of the questions and answers were wrong, this whole time?!

No, they're saying 59.4% of the 27.6% subset had flawed test cases I think.

> If so, how was this ever, in any way, a valid measurement?

Benchmarks essentially aren't, for practical concerns anyways. They don't represent your use case, and they don't represent any and all use cases, they're valid for measuring exactly what's included in the benchmarks, nothing more and nothing less.

I don't understand the ecosystems obsession with using public benchmarks, they hardly ever tell you anything of value. Ok, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5, does that mean it'll be 50% better for what you're using it for? Very unlikely.

I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real, actual cases where a LLM went wrong and I had to adjust the prompt, and over time I've built up a suite.

Most of the times when a new update comes out to a model, it moves maybe 2-3% in my own benchmarks, meanwhile they tout 30-40% increase or something ridiculous in public benchmarks, and we're supposed to believe the models' training data isn't contaminated...

by wtallis16 hours ago|

parent|

[-]

I'm not sure people are really trying to interpret this kind of benchmark as being accurate in gauging the magnitude of improvement. It seems pretty obvious that doubling your score on some benchmark where 100% means "correctly answered all of these specific problems" doesn't translate directly to performing twice as well on all problems. I think what people want from these benchmarks—and what they do get to some extent—is answering the question of "is model A better than model B", especially the subset of "is this local model better than last year's frontier online model".

The marketing departments touting each model do want to claim superiority on the basis of slivers of percentage points, and that's probably always a stronger claim than the test results can reasonably support. And the benchmarks are obviously susceptible to cheating and overfitting. But when the scores aren't saturated and do show a big discrepancy, that kind of result usually seems to align with what people report from actually trying to use the models in the relevant problem space.

by avereveard5 hours ago|

parent|

prev|

[-]

the ecosystem obsession with public benchmarks comes from the fact that running benchmark costs, and labs don't test on any given private benchmark

but yeah you're correct anyone optimizing for public-bench rank instead of their own task-distribution eval has been pointing at the wrong thing for a while

still I guess useful signal to know which one model to consider, negative signal is still signal, assuming everyone is gaming benchmark in certain ways, lack of performance do result in a real workload effect

by wavemode18 hours ago|

parent|

prev|

[-]

> No, they're saying 59.4% of the 27.6% subset had flawed test cases I think.

That being said, they didn't audit the other 72.4%, right? So it's likely that there are way more flawed problems throughout the full set?

by sillysaurusx22 hours ago|

prev|

[-]

Imagenet is one of the most popular datasets on the planet. Turns out, a significant fraction of its images are mislabeled. In the limit case the model would have to fit towards wrong answers to get higher than a certain percentage.

The answer is “it works because ML wants to work.” It’s surprising how far you can get with something flawed. It’s also why such huge breakthroughs are possible by noting flaws others haven’t.

by embedding-shape22 hours ago|

parent|

[-]

> It’s also why such huge breakthroughs are possible by noting flaws others haven’t.

I do these sort of breakthroughs at home all the time! My wife would say the computer is doing something strange, and instead of just randomly clicking around, I read the error messages slowly and out loud, then follow what they say. Anyone can do this, yet it seems like a magical ability every time you employ it to help people.

by cindyllm22 hours ago|

parent|

[-]

[dead]

by jmalicki21 hours ago|

parent|

prev|

[-]

Has it been reasonably possible to overfit to the errors in ImageNet, or are they effectively random noise?

by yorwba19 hours ago|

prev|

[-]

To be useful for identifying which model is better, benchmark scores only need to correlate with true performance, for which it's enough that the majority of tasks are scored correctly. You could have a terrible benchmark where 49% of the labels are wrong and a model that always answers correctly gets a score of 51%, but as long as it's higher than the always-wrong model at 49%, it's still directionally correct.

Most machine-learning benchmarks have a fairly large fraction of incorrect labels, but when you just want to distinguish between different models, the time you'd need to ensure perfect scoring would usually be better spent on collecting a larger benchmark dataset, even if it ends up having more errors.

by motoboi22 hours ago|

prev|

[-]

It’s saying that 16% of the problems have well, problems.

by vintagedave22 hours ago|

parent|

[-]

You're right - I did not apply the math. (I won't edit, in order to let the parent comment still make sense, and thankyou for the correction.)

So not one in four, but one in six problems have problems.

That is extraordinarily high and the point still stands: is this truly saying a [large proportion] of the questions and answers were wrong, this whole time, and if so how was it ever a valid measurement?

by motoboi21 hours ago|

parent|

[-]

Wait until you discover how many wrong labeled images in imagenet and that it still kickstarted the deeplearning revolution.

by 22 hours ago|

prev|

[-]

deleted

by embedding-shape22 hours ago|

parent|

[-]

> Curiously Opus 4.7 claims to have a 87.6% pass rate and Mythos claims to have a 93.9% pass rate... leading to the conclusion that it's actually possible to "solve" the problems that OpenAI claims are incorrect.

Huh, that is very curious and interesting indeed. If that's indeed true, that Anthropic claims that pass rate while OpenAI claims the test cases are flawed and broken, then clearly one of them aren't telling their whole side...

by gpm22 hours ago|

parent|

[-]

Oops, sorry, moved this portion of the comment to a top level comment simultaneously with you replying. Since the part of the comment that was replying to GP was addressed well in a simultaneous comment.

https://news.ycombinator.com/item?id=47911074

Citation for the claimed pass rates is: https://llm-stats.com/benchmarks/swe-bench-verified