It would have made things easier for us if Sonnet 4.6 scored lower, but it's a great model and the data is real.

It doesn't have a higher capability score than Fable, though. We break our coding evaluations into two parts. One of them, "one-shot coding", is where Fable significantly outperforms every other model, which is why it's ranked at the top even though Sonnet 4.6 has a slightly higher median (and a lower average) in long-horizon agentic workloads. One-shot coding tends to be the most correlated with other companies' model cards, whereas agentic coding is partly about how well a model can adapt to a custom harness. Fable also refused some tasks.
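
For anyone wondering how a model can have a higher median but a lower average at the same time, here is a minimal sketch. The scores are made up for illustration and are not our data; `model_a` and `model_b` are hypothetical names, not real models.

```python
from statistics import mean, median

# Hypothetical per-task scores (0-100), invented purely for illustration.
model_a = [70, 72, 74, 76, 78]  # consistent across tasks
model_b = [10, 20, 75, 80, 85]  # strong on most tasks, very weak on a few

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    # The median ignores how bad the worst tasks are; the mean does not.
    print(f"{name}: median={median(scores)}, mean={mean(scores)}")
```

A couple of very low scores pull the mean down without moving the median, so the two statistics can rank the same pair of models differently.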

Data at https://gertlabs.com/rankings?ow=1&mode=oneshot_coding
