"Comparable" is doing some heavy lifting there. Comparable to Anthropic models in 1H'25, maybe.
But let's say, for the sake of discussion, that Opus is much better. That still doesn't justify the price disparity, especially considering that the other models are served by commercial inference providers while Anthropic's inference is in-house.
The problem here is that people think AI benchmarks are analogous to, say, CPU performance benchmarks. They're not:
* You can't control all the variables, only one (the prompt).
* The outputs, BY DESIGN, can fluctuate wildly for no apparent reason (e.g., utter failure on the first run, success on the second; rough sketch after this list).
* The biggest point: once a benchmark is known, future iterations of the model will be trained on it.
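
To make the fluctuation point concrete, here's a minimal Python sketch. The `run_model` stub and its 60% "true" pass rate are invented for illustration (a real benchmark would call an actual model at nonzero temperature); the point is that a single run gives you a near coin-flip verdict, and even repeated runs leave a sizable error bar.

```python
import random

# Hypothetical stand-in for a real model call; the 60% "true" pass rate
# is made up purely to illustrate run-to-run variance.
def run_model(prompt: str) -> bool:
    return random.random() < 0.60  # task passes ~60% of the time

prompt = "some benchmark task"
n_runs = 20

results = [run_model(prompt) for _ in range(n_runs)]
pass_rate = sum(results) / n_runs
# Rough binomial standard error: how much the measured rate wobbles.
std_err = (pass_rate * (1 - pass_rate) / n_runs) ** 0.5

print(f"single-run verdict: {'pass' if results[0] else 'fail'}")
print(f"pass rate over {n_runs} runs: {pass_rate:.2f} +/- {std_err:.2f}")
```

Even at 20 repeats the error bar on that assumed 60% rate is still around +/- 0.11, so two people running the "same" benchmark can land on visibly different scores without either doing anything wrong.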
Trying to objectively measure model performance is a fool's errand.