It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash, Opus 4.5 through Opus 4.8, and gpt-5.5.
At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.
I found it while trying to use 3.5 Flash to score the reasoning of some models: 3.5 Flash gets the scoring wrong because of centering bias, whereas 3 Flash gets it right.
How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?