You need new datasets perpetually.
I have empirical experience though building classifiers that can have no precision measurement because the classifier performs invariably better than humans. They become the state of the art benchmark themselves and can’t be benchmarked except against themselves. These are for tasks that are non trivial and complex, but less logical than coding and less sustained reasoning. There may come a day though, when there is no calibrated benchmark that is independent of the models it’s measuring.
10 groups of 3 researchers, all have their own benchmarks that they do not share (testing it without the authors knowing is a different problem, maybe they only run the benchmarks when the gen-pop has access to the models).
that's 10 different tests. Aggregate pass rates