I reckon we'll have similar suites comparing different aspects of models.
And, at some point, we'll be dealing with models skewing results whenever they detect they're being benchmarked, as happened before with hardware. Some say that's already happening with the pelican test.
The problem is that hardware benchmarks are harder to game. Yes, a hardware manufacturer can tweak drivers to make, say, a particular game run better, but the benchmark is still representative of the workload the user faces. And they can't change the most important part, the hardware itself: they can't benchmark-gimmick their way through designing hardware.
Meanwhile, in LLM land the game is to tune the model for the current popular set of benchmarks, all while user experience is only vaguely related to those results.