Another thing that worked like magic prior to Feb/Mar was how likely Claude was to load a skill whenever it deduced that a skill might be useful. I personally use [superpowers][1] a lot, and I've noticed that I have to be very explicit when I want a specific skill to be used - to the point that I have to reference the skill by name.
This was a first for me with Sonnet. It completely veered off the prompt it was given (review a design document) and instead come out with a verbose suggestion to do a mechanical search and replace to use this newly fabricated function name - that it event spelled incorrectly. I had to Google numey to make sure Sonnet wasn't outsmarting me.
I told it to implement the server side one, it said ok, I tabbed away for a while, came to find the js implementation, checking the log Claude said “on second thought I think I’ll do the client side version instead”.
Rarely do I throw an expletive bomb at Claude - this was one such time.
It’s always “you’re using the tool wrong, need to tweak this knob or that yadda yadda”.
Also, it's probably very easy to spot such benchmarks and lock-in full thinking just for them. Some ISPs do the same where your internet speed magically resets to normal as soon as you open speedtest.net ...
> When thinking is deep, the model resolves contradictions internally before producing output.
> When thinking is shallow, contradictions surface in the output as visible self-corrections: "oh wait", "actually,", "let me reconsider", "hmm, actually", "no wait."
Yeah, THIS is something that I've seen happen a lot. Sometimes even on Opus with max effort.
I wonder if this is even more exaggerated now through Easter, as everyone’s got a bit extra time to sit down and <play> with Claude. That might be pushing capacity over the limit - I just don’t know enough about how Antropic provision and manage capacity to know if that could be a factor. However quality has gotten really bad over the holiday.