Yeah, that was an interesting discovery in a development meeting. A lot of people were chasing after the next best model, but for me, Sonnet 4.6 solves most tasks in 1-2 rounds. I mainly need to focus on context, instructions and keeping tasks well-bounded. Keeping the task narrow also simplifies review and keeps me in control, since I usually get smaller diffs back that I can understand quickly and manage or modify later.

I'll look at the new models, but increasing token consumption by a factor of 7 on Copilot, and then running into all of these budget management issues people talk about? That seems to introduce even more flow-breakers into my workflow, and I don't think it'll be 7 times better. Maybe for some planning and architectural topics where I used Opus 4.6 before.

reply
haven't people been complaining lately about 4.6 getting worse?
reply
People complain about a lot of things. Claude has been fine:

https://marginlab.ai/trackers/claude-code-historical-perform...

reply
I will be the first to acknowledge that humans are bad judges of performance and that some of the allegations are likely just hallucinations...

But... are you really going to rely on benchmarks, which have time and time again been shown to be gamed, as the complete story?

My take: it's pretty clear that the capacity crunch is real and the changes they made to effort are partly meant to reduce it. That likely changed the experience for users.

reply
While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval spans pass rates from 35% to 65%, a full factor-of-two difference in performance.

Moreover, on the companion Codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked, yet none corresponds to a visible break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.

reply
Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected, and they're plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e., probably through the API rather than in interactive mode).
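
For a rough sense of scale, here's a quick back-of-the-envelope check (a sketch assuming 50 independent pass/fail tasks per run and a baseline pass rate around 50%, per the numbers in this thread; not MarginLab's actual methodology):

    import math

    # Normal-approximation 95% confidence interval for a binomial pass rate.
    # Assumed numbers, taken from this thread rather than from MarginLab itself:
    #   n = 50 tasks per daily run, baseline pass rate p ~ 0.5
    n, p = 50, 0.5
    se = math.sqrt(p * (1 - p) / n)        # standard error ~ 0.071
    lo, hi = p - 1.96 * se, p + 1.96 * se  # ~ 0.36 to 0.64
    print(f"95% CI: {lo:.0%} to {hi:.0%}")

That lands right around the 35-65% spread mentioned above. Shrinking it to roughly +/-5 points would take on the order of 400 tasks per run, which is why day-to-day swings on a 50-task sample don't tell you much about real model changes.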
reply
Matrix also found that Claude was AB testing 4.6 vs 4.7 in production for the last 12 days.

https://matrix.dev/blog-2026-04-16

reply
Your link shows there have been huge drops.

How is it fine?

reply
That performance monitor is super easy to game if you cache responses to all the SWE-bench questions.
reply
You dramatically overestimate how much time engineers at hypergrowth startups have on their hands
reply
Caching some data is time-consuming? They can just ask Claude to do it.
reply
No, we increased our plans.
reply
How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for long, if you even still have one now.
reply
I was trying to figure out earlier today how to get 4.6 to run in Claude Code, and part of the output it gave me included "- Still fully supported — not scheduled for retirement until Feb 2027." Full caveat: I have no idea where it came up with this information, but as others have said, 4.5 is still available today and it is now 5, almost 6 months old.
reply
I'm still using 4.5 because it gets the niche work I'm using it for, where 4.6 would just fight me.
reply
Opus 4.5 is still available
reply
Wow, they hosted it for 6 months. Truly LTS territory :)
reply