Think of it less like a test suite and more like an exam. If you're trying to differentiate between the performance of different people/systems/models, you need to calibrate the difficulty accordingly.

When designing a benchmark, a pass rate of roughly 50% is useful because it gives you the most information about the relative performance of different models. If the pass rate is consistently 90%+, the test is too easy: you're wasting questions asking the model to do things you already know it can do, and getting no extra information. And if the pass rate is too low, you're wasting questions at the other end, setting tasks that are effectively impossible.
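One way to make the "most information" claim concrete is information-theoretic: a pass/fail question is a Bernoulli trial, and its Shannon entropy, the expected information per answer, peaks at a 50% pass rate. A toy sketch (the framing and function name are mine, not from the thread):

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy in bits of one pass/fail question with pass rate p."""
    if p in (0.0, 1.0):
        return 0.0  # outcome is certain, so the answer tells you nothing
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Entropy is maximal (1 bit/question) at p = 0.5 and falls off toward
# the extremes, where almost every answer is the one you expected.
for p in (0.05, 0.5, 0.95):
    print(f"pass rate {p:.0%}: {bernoulli_entropy(p):.3f} bits per question")
```

A question nearly everyone passes (or nearly everyone fails) yields a fraction of a bit, which is the sense in which easy or impossible questions are "wasted".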

reply
There's no shortage of benchmarks (coding or otherwise) that any competent coding model will now pass at ~100%.

But no one quotes those any more, because if everyone passes them, they serve no useful purpose in discriminating between models or identifying advances.

So people switch to new benchmarks, with more difficult tasks or some other artificial constraints that make them harder to pass in some way, until the scores are low enough to actually discriminate between models. And a 50% score is in some sense ideal for that: there's lots of room for variance around 50%.

(Whether the thing they're measuring correlates well with real coding performance is another question.)

So you can't infer anything in isolation from a given benchmark score being only 50%, other than that benchmarks are calibrated to make such scores the likely outcome.

reply
So it's the relative and not the absolute difference that matters - thanks.
reply