by WithinReason 23 hours ago

by raincole 23 hours ago

Yes, and the article author is fully aware of that. Thank you for pointing out this small mistake though.

by mkagenius 22 hours ago

It looks like the author is specifically avoiding the model's name, because the results are really weird.

  Opus 4.8/4.7 scored 28%

  Opus 4.6 scored 37%

So the author thought: let's not get into that, just write "Claude".

by happycube 21 hours ago

Not weird at all, given the variance in Opus' quality over the last few months.

Wild guess: I wouldn't be surprised if Opus 4.6 was run quantized for a while, and 4.7/4.8 have QAT (quantization-aware training) for that nerfed size.

by andriy_koval 22 hours ago

Many people think Opus 4.6 was the best.

by insiderphd 10 hours ago

Hello! Author here (Katie). Thanks for your comments. 4.6 and 4.7 both scored 28% on our benchmark; I just wanted to have 10 things in the list because I wanted a round number.

by raincole 18 hours ago

Where is the weird part?

by croemer 20 hours ago

The dollar amount is meaningless without comparison - and no other model has a price tag. Sloppy article.

by tills13 22 hours ago

It costs nothing to not be pedantic.

by alienbaby 21 hours ago

Possibly nothing, other than accuracy.

by mdp2021 14 hours ago

"Kindly reach us in Cambridge for the lessons".

by Onavo 23 hours ago

Claude Code is the only way to get access to the actual amortized cost of running a Claude-scale model. The consumer non-enterprise API is extremely expensive (with increasing marginal costs for the user and fat profit margins for Anthropic). If you want to approximate a state-level attacker's cost, where they can run the model on their own hardware, Claude Code is probably the best guess at the amortized cost.