prior benchmarks relied mostly on unit tests or synthetic judges, which are easily benchmaxxed, so nobody trusts them
to judge code quality well, we need people manually checking the data
methodologically, this benchmark looks very good. a cog researcher checking the data themselves is very high signal (not scalable, so don't take the benchmark as gospel, but directionally good)
Nobody would have 800+ billion reasons to lie by commission or omission here.
they aren't married to a particular lab; most of their usage is their in-house model, i believe
I think it's safe to assume everything AI-related is heavily biased until proven otherwise, just like in pharma.
TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models still have hill-climbing to do, which raises the bar and inspires confidence. I'd say it's legit.