You’re completely overrating these benchmarks and it’s landing you at a nonsense opinion. Just actually use the models and you will see that the gap is significant.
reply
It should be easy for a company like Anthropic to prove this beyond a doubt. Why don't they? Why don't they have a collection of prompts and side-by-side comparisons with other models showing how far ahead they are?
reply
I think it's mainly because the difference in models at the frontier isn't "response to prompt X", but rather "coherence with 500K tokens of context and instructions in play"
reply
Good morning to the Anthropic office good sir
reply
I got to try using Fable for a day... it was a clear and definite shift in quality and in how independent it is.

It was almost like having another human using and shepherding Opus for me, instead of herding Opus directly myself.

reply
All that says is some benchmarks aren’t worth the tokens it takes to evaluate them. Mythos is clearly capable of finding zero days other models can’t, and Fable is close enough to be lumped with it.
reply
> Mythos is clearly capable of finding zero days other models can’t

I'm unconvinced that this is anything more than proof of work and a marginal improvement that other models will catch up with, perhaps as early as next week. Lots of other current-gen models will find vulns that can be chained together if you're willing to burn enough tokens on the task, and Fable is an absolute token incinerator.

reply
Did you use the models yourself?
reply