I found it while trying to use 3.5 Flash to score the reasoning of some models: it gets the scores wrong because of centering bias, whereas 3 Flash scores them correctly.
How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?