Actually, blinded ELO rankings of models do vary: https://the-frontier.app. That said, your point looks accurate for 5.3 to 5.5 on this chart, which shows a 40 to 50 point ELO gain.

I find I have to argue with 5.5 less than 5.3, and I therefore use it when I could reach for 5.3, but I don't think it's a major difference.
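For a sense of scale, here is a rough sketch of what a 40 to 50 point gap means as a head-to-head win rate. It assumes the chart uses the standard logistic Elo formula on the usual 400-point scale, which I haven't verified for that site; `elo_expected_score` is just an illustrative name.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score for the higher-rated side, given its rating lead.

    Assumes the standard logistic Elo model on a 400-point scale
    (an assumption about the leaderboard, not something it documents here).
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


for gap in (40, 50):
    # Prints roughly 55.7% for a 40-point gap and 57.1% for a 50-point gap.
    print(f"{gap}-point gap: {elo_expected_score(gap):.1%} expected win rate")
```

So under that assumption the newer model would be preferred in roughly 56 to 57 out of 100 blinded comparisons, which fits "not a major difference".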

by Y_Y, 4 hours ago

Electric Light Orchestra really stole Arpad Elo's thunder.

by anentropic, 10 hours ago

Exactly this. And it's not really possible to do repeatable trials, it's all just vibes. People have very little awareness of their own cognitive biases.

by spiorf, 10 hours ago

And companies are highly aware of all this.

They have a way to decrease cost and probably increase token consumption, with gradual changes and no abrupt jump in capabilities, and users have no way to reliably detect it.

The market will favor companies that do it.

And they are in the best position to automate shifting the online narrative (the real LLM killer application, IMO) towards "users are imagining it".

by airstrike, 4 hours ago

That's a pretty shallow dismissal, and I bet you $100 I can tell you which model I'm talking to between 4.6 and 4.8 without looking or asking after a handful of messages.

Anthropic famously had a terrible outage back when 4.6 was the latest and greatest, and it was never the same after it came back.

All evidence suggests they simply don't have the compute to keep serving their best models at their most powerful.
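A bet like that is easy to score if the trials are blinded. As a minimal sketch, assuming each trial is an independent guess between two models with a 50% chance of being right by luck, an exact one-sided binomial tail gives the odds of doing at least that well by guessing (`p_value_at_least` is a made-up helper name):

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of getting at least `correct` of `trials` right by pure guessing.

    Assumes independent trials with a fixed per-trial success rate `chance`.
    """
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# 9 of 10 blinded identifications: 11/1024, about 1.1% by luck alone.
print(f"{p_value_at_least(9, 10):.4f}")
# 8 of 10: 56/1024, about 5.5%, which is not very convincing.
print(f"{p_value_at_least(8, 10):.4f}")
```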

by pbgcp2026, 9 hours ago

You will be amused to hear that when Anthropic "refreshed" 4.6 on AWS Bedrock, I caught it in my tests and wrote about it, and they actually rolled it back. That is how much non-coding tests can tell you about a model.

by _puk, 5 hours ago

So Bedrock 4.6 is old school Opus?

I know you can point Claude Code at Bedrock, so it might be worth a play.