I'm not sure the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could, in theory, be gamed."

Whether benchmark results are misleading depends more on the reporting organization than on the benchmark itself. Integrity and competence play large roles here. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple of Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.

I think that’s totally fair!

I guess I look at this less as an "aha! They're all cheating!" and more as a "were you even aware of what the benchmarks represented and how they were checked?"
