Yeah, that was an interesting discovery in a development meeting. A lot of people were chasing after the next best model, but for me, Sonnet 4.6 solves most tasks in 1-2 rounds. I mainly need to focus on context, instructions and keeping tasks well-bounded. Keeping the task narrow also simplifies review and keeps me in control, since I usually get smaller diffs back that I can understand quickly and manage or modify later.

I'll look at the new models, but increasing token consumption by a factor of 7 on Copilot, and then running into all of these budget management issues people talk about? That seems to introduce even more flow-breakers into my workflow, and I don't think it'll be 7 times better. Maybe for some planning and architectural topics where I used Opus 4.6 before.

reply
haven't people been complaining lately about 4.6 getting worse?
reply
People complain about a lot of things. Claude has been fine:

https://marginlab.ai/trackers/claude-code-historical-perform...

reply
I will be the first to acknowledge that humans are bad judges of performance and that some of the allegations are likely just hallucinations...

But... are you really going to rely on benchmarks, which have time and time again been shown to be gamed, as the complete story?

My take: it's pretty clear that the capacity crunch is real and the changes they made to effort are partly meant to reduce it. That likely changed the experience for users.

reply
While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval spans pass rates from 35% to 65%, a full factor-of-two difference in performance.

Moreover, on the companion Codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked, yet none corresponds to a visible break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.

reply
Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected, and they're plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e., probably through the API rather than in interactive mode).
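
For a rough sense of scale, here's a quick back-of-the-envelope check (a sketch assuming 50 independent pass/fail tasks per run and a baseline pass rate around 50%, per the numbers in this thread; not MarginLab's actual methodology):

    import math

    # Normal-approximation 95% confidence interval for a binomial pass rate.
    # Assumed numbers, taken from this thread rather than from MarginLab itself:
    #   n = 50 tasks per daily run, baseline pass rate p ~ 0.5
    n, p = 50, 0.5
    se = math.sqrt(p * (1 - p) / n)        # standard error ~ 0.071
    lo, hi = p - 1.96 * se, p + 1.96 * se  # ~ 0.36 to 0.64
    print(f"95% CI: {lo:.0%} to {hi:.0%}")

That lands right around the 35-65% spread mentioned above. Shrinking it to roughly +/-5 points would take on the order of 400 tasks per run, which is why day-to-day swings on a 50-task sample don't tell you much about real model changes.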
reply
Matrix also found that Claude was AB testing 4.6 vs 4.7 in production for the last 12 days.

https://matrix.dev/blog-2026-04-16

reply
Your link shows there have been huge drops.

How is it fine?

reply
That performance monitor is super easy to game if you cache responses to all the SWE-bench questions.
reply
You dramatically overestimate how much time engineers at hypergrowth startups have on their hands
reply
Caching some data is time-consuming? They can just ask Claude to do it.
reply
No, we increased our plans.
reply
How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for long, if you even still have one now.
reply
I was trying to figure out earlier today how to get 4.6 to run in Claude Code, and part of the output it gave me included "- Still fully supported — not scheduled for retirement until Feb 2027." Full caveat: I have no idea where it came up with this information, but as others have said, 4.5 is still available today and it is now 5, almost 6 months old.
reply
I'm still using 4.5 because it gets the niche work I'm using it for, where 4.6 would just fight me.
reply
Opus 4.5 is still available
reply
Wow, they hosted it for 6 months. Truly LTS territory :)
reply