I'll respond with more anecdotal evidence: the Llama family has been terrible at following directions in every test I've done. I'm not sure about the other models in RULER.

In the Chroma results, they look at Sonnet 4, which was also terrible in my experience. The same prompt that worked perfectly in Sonnet 4.5 would fail miserably in Sonnet 4.

It would be good to see newer tests with both SOTA and open-weight models. The SOTA ones always seem to follow directions and stay on topic better, but it'd be good to have some data to back that up.

But the studies are from 2024 and 2025. They don’t apply to current Claude models.