It avoided answering 2/21 tests in this specific benchmark, so that already caps the max score at about 90%.
Whatever it is you're measuring, it's not anything related to what I use models for.
What are you using Claude models for? Coding only? Computer use? Which harness?
I've experimented with a few models for all this and have found Gemini the best at OCR but quite a bit worse at the rest. Claude is worse than GPT at web research-shaped things, but Opus 4.8 wins my anecdote benchmark for the other tasks besides those two.
But really, for code or knowledge stuff Gemini is markedly worse than the others, while Opus and GPT 5.5 are very, very close.