upvote
Is there a page where I could read more? What's unintuitive at a glance is that Opus 4.7 has a lower success rate than Sonnet 4.6 (90% vs 100%) while having a higher Avg Percentile (87.2% vs 70.9%).
reply
We calculate percentiles based on successful submissions only, and then apply success rate as a separate measurement, which is incorporated into our relative rankings.

So we do penalize evals where the player failed the game, just not in the percentile measurement (success rate counts playing incorrectly, failing to compile, runtime errors, and other non-infrastructure issues that can be blamed on the model). The design decision there is that percentile tells you how good the model's ideas are (when executed correctly), separately from how often it got something working at all, but I can see how that's not great UX, at least as presented now.

But the actual score itself is a combination of percentiles and success rates with some weighting for different categories, nothing fancy.
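A minimal sketch of that scheme, with hypothetical field names and weights (the actual categories and weighting aren't published): percentile is averaged over successful runs only, success rate is computed over all runs, and the final score is a weighted blend. This also shows how a model can rank higher on Avg Percentile while having a lower success rate, as in the Opus 4.7 vs Sonnet 4.6 example above.

```python
def combined_score(runs, percentile_weight=0.7, success_weight=0.3):
    """Hypothetical scoring sketch: percentile from successes only,
    success rate as a separate weighted term. Weights are made up."""
    successes = [r for r in runs if r["success"]]
    success_rate = len(successes) / len(runs)
    # Avg Percentile ignores failed runs entirely.
    avg_percentile = sum(r["percentile"] for r in successes) / len(successes)
    return percentile_weight * avg_percentile + success_weight * success_rate

# A model with a higher avg percentile but a lower success rate
# can still come out ahead overall, depending on the weights.
opus_like = [{"success": True, "percentile": 0.9}] * 9 + \
            [{"success": False, "percentile": 0.0}]
sonnet_like = [{"success": True, "percentile": 0.7}] * 10
```

With these (invented) weights, `combined_score(opus_like)` beats `combined_score(sonnet_like)` despite the failed run, because the percentile term dominates.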

I added a methodology page to the roadmap, thanks for pointing that out. We've converged on a benchmark methodology that should scale for a very long time, so it's time to document it better.

reply
Neat, thank you for explaining!
reply
Do your benchmark results indicate any level of regression on Opus 4.6 or 4.5 since their first release?
reply
We only have some basic time filtering (https://gertlabs.com/?days=30), but most of our samples are from the last 2 months. This is a visualization we plan to add when we've collected more historical data.

But we did heavily resample Claude Opus 4.6 during the height of the degraded performance fiasco, and my takeaway is that API-based eval performance was... about the same. Claude Opus 4.6 was just never significantly better than 4.5.

But we don't really know whether you get a different model when authenticated via OAuth/subscription vs. calling the API at usage-based prices. I definitely noticed performance issues recently too, so I suspect it had more to do with subscription-only degradation and/or hastily shipped harness changes.

reply
"but most of our samples are from the last 2 months."

There's your major issue. That's well within the brutal quantization window.

reply