False positives and poorly defined tasks and acceptance criteria have let some models rack up insanely inflated scores on bad benchmarks.
And sure, you can say they're not disclosed to prevent gaming, but if you're the only one who can review them, then they might as well be a random number generator display with an unreadable UI.