Interesting. I wonder if that's related to the phenomenon mentioned in the Opus 4.6 model card[1], where increased reasoning effort leads to 4.6 overthinking and convincing itself of the wrong answer on many questions. It seems to be unique to 4.6; I guess they fried it a bit too much during RL training.

[1] https://www.anthropic.com/claude-opus-4-6-system-card

reply
I tested this with Opus the day 4.6 came out; it failed then and still fails now. I've seen a lot of jokes about some people getting a 'dumber' model, and while there's probably a grain of truth to that, I pay for their highest subscription tier, so at the very least I can tell you it's not a paywall issue.
reply
You mean Sonnet 4.6? I ran 9 Claude models, including Haiku; swipe through the gallery in the link to see their responses.
reply
I don't see Sonnet 4.6 in the screenshots. I see the other Claude models though.

Edit: Found Haiku. Alas!

reply
Yeah, good catch: Sonnet 4.6 is not part of the test.
reply