If you can't afford to do that, look at a lot of them, e.g. artificialanalysis.com merges multiple benchmarks across weighted categories to build an Intelligence Score, a Coding Score and an Agentic Score.
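To make the "weighted categories" idea concrete, here is a rough sketch of how a composite score like that could be computed. The category names, scores and weights below are made up for illustration, and `composite_score` is a hypothetical helper; this is not artificialanalysis.com's actual methodology, just a plain weighted mean.

```python
from typing import Dict


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-category scores (each on a 0-100 scale).

    Weights are normalized by their sum, so they don't have to add up to 1.
    """
    total_weight = sum(weights[category] for category in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[category] * weights[category] for category in scores)
    return weighted_sum / total_weight


# Hypothetical numbers, not real benchmark results.
category_scores = {"reasoning": 70.0, "coding": 50.0, "agentic": 40.0}
category_weights = {"reasoning": 0.5, "coding": 0.25, "agentic": 0.25}

index = composite_score(category_scores, category_weights)
print(index)  # 57.5
```

The point of averaging like this is that one outlier benchmark moves the overall number less, although the result still depends heavily on which weights you pick.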
GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%.