However, both kinds of tests are susceptible to overfitting: an LLM can be trained on the exact test questions, and a CPU can be designed with, e.g., branch predictors and cache sizes tuned specifically to handle a particular benchmark or workload.
Both that and the SPEC compiler shenanigans are cheating by changing the test, not just over-specializing the product being benchmarked.