You’re completely overrating these benchmarks and it’s landing you at a nonsense opinion. Just actually use the models and you will see that the gap is significant.

by irthomasthomas 10 hours ago

It should be easy for a company like Anthropic to prove this beyond a doubt. Why don't they? Why don't they have a collection of prompts and side-by-side comparisons with other models showing how far ahead they are?

by largbae 8 hours ago

I think it's mainly because the difference between models at the frontier isn't "response to prompt X" but rather "coherence with 500K tokens of context and instructions in play."

by viking123 3 hours ago

Good morning to the Anthropic office good sir

by dagss 12 hours ago

I got to try using Fable for a day... it was a clear and definite shift in quality and how independent it is.

It was almost like having another human using and shepherding Opus for me, instead of herding Opus directly myself.

by rileyphone 14 hours ago

All that says is some benchmarks aren’t worth the tokens it takes to evaluate them. Mythos is clearly capable of finding zero days other models can’t, and Fable is close enough to be lumped with it.

by mullingitover 13 hours ago

> Mythos is clearly capable of finding zero days other models can’t

I'm unconvinced that this is anything more than proof of work and a marginal improvement that other models will catch up with, perhaps as early as next week. Lots of other current-gen models will find vulns that can be chained together if you're willing to burn enough tokens on the task, and Fable is an absolute token incinerator.

by kolinko 13 hours ago

Did you use the models yourself?

by lightbendover 2 hours ago

[dead]