undefined

points

[-]

This isn't even training on the test data.

This is modifying the test code itself to always print "pass", or modifying the loss function computation to return a loss of 0, or reading the ground truth data and having your model just return the ground truth data, without even training on it.

by Lerc10 hours ago|

parent|

[-]

If you're prepared to do that you don't even need to run any benchmark. You can just print up the sheets with scores you like.

There if a presumption with benchmark scores that the score is only valid if the benchmark were properly applied. An AI that figures out how to reward hack represents a result not within the bounds of measurement, but still interesting, and necessitates a new benchmark.

Just saying 'Done it!' is not reward hacking. It is just a lie. Most data is analysed under the presumption that it is not a lie. If it turns out to be a lie the analysis can be discarded. Showing something is a lie has value. Showing that lying exists (which appears to be the level this publication is at) is uninformative. All measurements may be wrong, this comes as news to no-one.

by jmalicki9 hours ago|

parent|

[-]

I think the point of the paper is to prod benchmark authors to at least try to make them a little more secure and hard to hack... Especially as AI is getting smart enough to unintentionally hack the evaluation environments itself, when that is not the authors intent.

by boring-human11 hours ago|

prev|

[-]

Yep. I think the idea that the benchmark is determinative is just as deluded as the notion that it should be unbreakable.

Benchmarks are on the honor system. Even the tightest benchmark can be cheated. If the benchmark is so secret and air-gapped that it can't be cheated by models, it can be cheated by its own authors. You can't use benchmarks to gate out cheating.

If you don't have the honor system in mind when you're reading scores, you're wasting your time. Is it some unknown outfit with wild claims? Is it connected to Epstein, Russia, the real estate "industry", or sleazeballing in general? Do they have previous history of ratgaming the numbers? Replace its scores with asterisks and move on.

by jmye11 hours ago|

prev|

[-]

> I'm not sure how groundbreaking the main insight is.

I think it likely is groundbreaking for a number of people (especially non-tech CTOs and VPs) who make decisions based on these benchmarks and who have never wondered what the scores are actually scoring.

by mzelling9 hours ago|

parent|

[-]

I'm not sure if the paper's findings are all that actionable. The paper doesn't say "here's how benchmarks are currently being gamed." It says "here's how benchmarks could in theory be gamed."

Whether benchmark results are misleading depends more on the reporting organization than on the benchmark. Integrity and competence play large roles in this. When OpenAI reports a benchmark number, I trust it more than when that same number is reported by a couple Stanford undergrads posting "we achieved SOTA on XYZ benchmark" all over Twitter.

by jmye8 hours ago|

parent|

[-]

I think that’s totally fair!

I guess I look at this less as an “ah ha! They’re all cheating!” and more of a “were you guys even aware of what the benchmarks represented and how they checked them?”

by hawk_aa14 hours ago|

prev|

[-]

[dead]