I agree it's misleading for them to hyper-focus on one metric, but public benchmarks are far from the only thing that matters. I place more weight on LMArena scores and private benchmarks.
Looking around, SWE-rebench seems to have decent protection against training-data leaks[1]. Kagi runs one that is fully private[2], and there's one on HuggingFace that claims to be fully private[3]. There's also SimpleBench[4], HLE (which apparently keeps a private test set)[5], LiveBench[6], Scale's private benchmarks (though not many models are tested on them)[7], vals.ai[8], FrontierMath[9], Terminal Bench Pro[10], and AA-Omniscience[11].
So I guess we do have some decent private benchmarks out there.
[1] https://swe-rebench.com/about
[2] https://help.kagi.com/kagi/ai/llm-benchmark.html
[3] https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard
[7] https://labs.scale.com/leaderboard
[9] https://epoch.ai/frontiermath/
[10] https://github.com/alibaba/terminal-bench-pro
[11] https://artificialanalysis.ai/articles/aa-omniscience-knowle...