undefined

points

[-]

I meant in the sense of - you have benchmarkers and trainers. If you publicize your evaluation, trainers may likely have their models 'consume' it, even if only indirectly: another person creating their own benchmark from scratch may be influenced by yours, even if the new question sets are clean-room. That, and the rule of thumb that benchmark value dissipates like sqrt(age) [0]

So there is a definite advantage to never publicizing your internal benchmark. But then, no one else can replicate your findings. You should assume that the space of benchmarks that are actually decent at evaluating model performance is much larger and most of the good ones, the ones that were costliest to produce, are hidden, and might not even correspond very well with the public ones. And that the public expensive benchmarks are selective and have a bias towards marketing purposes.

[0] https://www.offconvex.org/2021/04/07/ripvanwinkle/

by sebastiennight1 hours ago|

prev|

[-]

I believe the correct term is "Goodhart's Law": https://en.wikipedia.org/wiki/Goodhart%27s_law