We calculate percentiles based on successful submissions only, and then apply success rate as a separate measurement, which is incorporated into our relative rankings.

So we do penalize evals where the player failed the game, just not in the percentile measurement. Success rate captures failures that can be blamed on the model: playing incorrectly, code that did not compile, runtime errors, and other non-infrastructure issues. The design decision there is that percentile tells you how good the model's ideas are when executed correctly, separately from how often it got something working, but I can see how that's not great UX, at least as presented now.

But the actual score itself is a combination of percentiles and success rates with some weighting for different categories, nothing fancy.
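
To make that concrete, here is a rough sketch of the shape of the calculation, not the real implementation. The names (`EvalRun`, `category_stats`, `overall_score`), the 50/50 split between percentile and success rate, the per-category weights, and the use of a simple mean are all made up for illustration. It also assumes infrastructure failures have already been filtered out before scoring.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class EvalRun:
    category: str
    succeeded: bool              # False for model-attributable failures
    percentile: Optional[float]  # 0-100; only meaningful when succeeded


def category_stats(runs: List[EvalRun]) -> Tuple[float, float]:
    """Return (mean percentile over successful runs only, success rate)."""
    successful = [r.percentile for r in runs
                  if r.succeeded and r.percentile is not None]
    mean_pct = sum(successful) / len(successful) if successful else 0.0
    success_rate = sum(1 for r in runs if r.succeeded) / len(runs) if runs else 0.0
    return mean_pct, success_rate


def overall_score(runs: List[EvalRun],
                  category_weights: Dict[str, float],
                  percentile_weight: float = 0.5) -> float:
    """Blend percentile and success rate per category, then weight categories.

    Both the blend and the weights are hypothetical placeholders.
    """
    by_category: Dict[str, List[EvalRun]] = {}
    for run in runs:
        by_category.setdefault(run.category, []).append(run)

    weighted_sum = 0.0
    total_weight = 0.0
    for category, cat_runs in by_category.items():
        weight = category_weights.get(category, 1.0)
        mean_pct, success_rate = category_stats(cat_runs)
        # Failed runs lower success_rate but never touch mean_pct.
        cat_score = (percentile_weight * (mean_pct / 100.0)
                     + (1.0 - percentile_weight) * success_rate)
        weighted_sum += weight * cat_score
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

The point of the split is visible in `category_stats`: a failed run drags down the success rate, but the percentile only ever sees runs that worked.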

I added a methodology page to the roadmap, thanks for pointing that out. We've converged on a benchmark methodology that should scale for a very long time, so it's time to document it better.

Neat, thank you for explaining!