Yes, and the article author is fully aware of that. Thank you for pointing out this small mistake though.
reply
It looks like the author is specifically avoiding the model's name, because the results are really weird.

  Opus 4.8/4.7 scored 28%

  Opus 4.6 scored 37%

So the author probably thought: let's not get into that, just write "Claude".
reply
Not weird at all, given the variance in Opus' quality over the last few months.

Wild guess: I wouldn't be surprised if Opus 4.6 was run quantized for a while, and 4.7/4.8 have QAT for that nerfed size.

reply
Many people think Opus 4.6 was the best.
reply
Hello! Author here (Katie). Ty for your comments. 4.6 and 4.7 both scored 28% on our benchmark; I just wanted to have 10 things in the list because I wanted a round number.
reply
Where is the weird part?
reply
The dollar amount is meaningless without comparison - and no other model has a price tag. Sloppy article.
reply
It costs nothing to not be pedantic.
reply
Possibly, nothing other than accuracy
reply
"Kindly reach us in Cambridge for the lessons".
reply
Claude Code is the only way to get access to the actual amortized cost of running a Claude-scale model. The consumer non-enterprise API is extremely expensive (with increasing marginal costs for the user and fat profit margins for Anthropic). If you want to approximate a state-level attacker's cost, where they can run the model on their own hardware, Claude Code is probably the best guess at the amortized cost.
reply