But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
When such benchmarks aren’t available what you often get instead is teams creating their own benchmark datasets and then testing both their and existing models’ performance against it. Which is eve worse because they probably still the rest multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter tune their own model for the dataset but reuse previously published hyperparameters for the other models. Which gives them an unfair advantage because those hyperparameters were tuned to a doffeeent dataset and may not have even been optimizing for the same task.
It's not just over-fitting to leading benchmarks, there's also too many degrees of freedom in how a model is tested (harness, etc). Until there's standardized documentation enabling independent replication, it's all just benchmarketing .