undefined

upvote

points

by andrewchilds4 hours ago |

upvote

by wongarsu4 hours ago|

[-]

Keep in mind that the people who experience issues will always be the loudest.

I've overall enjoyed 4.6. On many easy things it thinks less than 4.5, leading to snappier feedback. And 4.6 seems much more comfortable calling tools: it's much more proactive about looking at the git history to understand the history of a bug or feature, or about looking at online documentation for APIs and packages.

A recent claude code update explicitly offered me the option to change the reasoning level from high to medium, and for many people that seems to help with the overthinking. But for my tasks and medium-sized code bases (far beyond hobby but far below legacy enterprise) I've been very happy with the default setting. Or maybe it's about the prompting style, hard to say

reply

upvote

by evilhackerdude3 hours ago|

[-]

keep in mind that people who point out a regression and measure the actual #tok, which costs $money, aren't just "being loud" — someone diffed session context usaage and found 4.6 burning >7x the amount of context on a task that 4.5 did in under 2 MB⁣.

reply

upvote

by svachalek3 hours ago|

[-]

It's not that they don't have a point, it's that everyone who's finding 4.6 to be fine or great are not running out to the internet to talk about it.

reply

upvote

by marcus_cemes2 hours ago|

[-]

Being a moderately frequent user of Opus and having spoken to people who use it actively at work for automation, it's a really expensive model to run, I've heard it burn through a company's weekend's credit allocation before Saturday morning, I think using almost an order of magnitude more tokens is a valid consumer concern!

I have yet to hear anyone say "Opus is really good value for money, a real good economic choice for us". It seems that we're trying to retrofit every possible task with SOTA AI that is still severely lacking in solid reasoning, reliability/dependability, so we throw more money at the problem (cough Opus) in the hopes that it will surpass that barrier of trust.

reply

upvote

by SatvikBeri4 hours ago|

[-]

I've also seen Opus 4.6 as a pure upgrade. In particular, it's noticeably better at debugging complex issues and navigating our internal/custom framework.

reply

upvote

by drcongo3 hours ago|

[-]

Same here. 4.6 has been considerably more dilligent for me.

reply

upvote

by AustinDev3 hours ago|

[-]

Likewise, I feel like it's degraded in performance a bit over the last couple weeks but that's just vibes. They surely vary thinking tokens based on load on the backend, especially for subscription users.

When my subscription 4.6 is flagging I'll switch over to Corporate API version and run the same prompts and get a noticeably better solution. In the end it's hard to compare nondeterministic systems.

reply

upvote

by merlindru48 seconds ago|

[-]

That's very interesting!

Also, +1. Opus 4.6 is strictly better than 4.5 for me

reply

upvote

by perelin4 hours ago|

[-]

Mirrors my experience as well. Especially the pro-activeness in tool calling sticks out. It goes web searching to augment knowledge gaps on its own way more often.

reply

upvote

by galaxyLogic2 hours ago|

[-]

Do you need to upload your git for it to analyuze it? Or are they reading it off github ?

reply

upvote

by gpm1 hours ago|

[-]

They're probably running it with a claude code like tool and it has a local (to the tool, not to anthropic) copy of the git repo it can query using the cli.

reply

upvote

by MrCheeze4 hours ago|

[-]

In my experience with the models (watching Claude play Pokemon), the models are similar in intelligence, but are very different in how they approach problems: Opus 4.5 hyperfocuses on completing its original plan, far more than any older or newer version of Claude. Opus 4.6 gets bored quickly and is constantly changing its approach if it doesn't get results fast. This makes it waste more time on"easy" tasks where the first approach would have worked, but faster by an order of magnitude on "hard" tasks that require trying different approaches. For this reason, it started off slower than 4.5, but ultimately got as far in 9 days as 4.5 got in 59 days.

reply

upvote

by KronisLV4 hours ago|

[-]

I got the Max subscription and have been using Opus 4.6 since, the model is way above pretty much everything else I've tried for dev work and while I'd love for Anthropic to let me (easily) work on making a hostable server-side solution for parallel tasks without having to go the API key route and not have to pay per token, I will say that the Claude Code desktop app (more convenient than the TUI one) gets me most of the way there too.

reply

upvote

by alkhatib4 hours ago|

[-]

Try https://conductor.build

I started using it last week and it’s been great. Uses git worktrees, experimental feature (spotlight) allows you to quickly check changes from different agents.

I hope the Claude app will add similar features soon

reply

upvote

by bredren4 hours ago|

[-]

Can you explain what you mean by your parallel tasks limitation?

reply

upvote

by KronisLV3 hours ago|

[-]

Instead of having my computer be the one running Claude Code and executing tasks, I might want to prefer to offload it to my other homelab servers to execute agents for me, working pretty much like traditional CI/CD, though with LLMs working on various tasks in Docker containers, each on either the same or different codebases, each having their own branches/worktrees, submitting pull/merge requests in a self-hosted Gitea/GitLab instance or whatever.

If I don't want to sit behind something like LiteLLM or OpenRouter, I can just use the Claude Agent SDK: https://platform.claude.com/docs/en/agent-sdk/overview

However, you're not supposed to really use it with your Claude Max subscription, but instead use an API key, where you pay per token (which doesn't seem nearly as affordable, compared to the Max plan, nobody would probably mind if I run it on homelab servers, but if I put it on work servers for a bit, technically I'd be in breach of the rules):

> Unless previously approved, Anthropic does not allow third party developers to offer claude.ai login or rate limits for their products, including agents built on the Claude Agent SDK. Please use the API key authentication methods described in this document instead.

If you look at how similar integrations already work, they also reference using the API directly: https://code.claude.com/docs/en/gitlab-ci-cd#how-it-works

A simpler version is already in Claude Code and they have their own cloud thing, I'd just personally prefer more freedom to build my own: https://www.youtube.com/watch?v=zrcCS9oHjtI (though there is the possibility of using the regular Claude Code non-interactively: https://code.claude.com/docs/en/headless)

It just feels a tad more hacky than just copying an API key when you use the API directly, there is stuff like https://github.com/anthropics/claude-code/issues/21765 but also "claude setup-token" (which you probably don't want to use all that much, given the lifetime?)

reply

upvote

by Jach3 hours ago|

[-]

I haven't kept up with the Claude plays stuff, did it ever actually beat the game? I was under the impression that the harness was artificially hampering it considering how comparatively more easily various versions of ChatGPT and Gemini had beat the game and even moved on to beating Pokemon Crystal.

reply

upvote

by DaKevK3 hours ago|

[-]

Genuinely one of the more interesting model evals I've seen described. The sunk cost framing makes sense -- 4.5 doubles down, 4.6 cuts losses faster. 9 days vs 59 is a wild result. Makes me wonder how much of the regression complaints are from people hitting 4.6 on tasks where the first approach was obviously correct.

reply

upvote

by MrCheeze1 hours ago|

[-]

Notably 45 out of the 50 days of improvement were in two specific dungeons (Silph Co and Cinnabar Mansion) where 4.5 was entirely inadequate and was looping the same mistaken ideas with only minor variation, until eventually it stumbled by chance into the solution. Until we saw how much better it did in those spots, we weren't completely sure that 4.6 was an improvement at all!

https://docs.google.com/spreadsheets/u/0/d/e/2PACX-1vQDvsy5D...

reply

upvote

by data-ottawa4 hours ago|

[-]

I think this depends on what reasoning level your Claude Code is set to.

Go to /models, select opus, and the dim text at the bottom will tell you the reasoning level.

High reasoning is a big difference versus 4.5. 4.6 high uses a lot of tokens for even small tasks, and if you have a large codebase it will fill almost all context then compact often.

reply

upvote

by minimaxir4 hours ago|

[-]

I set reasoning to Medium after hitting these issues and it did not make much of a difference. Most of the context window is still filled during the Explore tool phase (that supposedly uses Haiku swarms) which wouldn't be impacted by Opus reasoning.

reply

upvote

by _zoltan_4 hours ago|

[-]

I'm using the 1M context 4.6 and it's great.

reply

upvote

by honeycrispy4 hours ago|

[-]

Glad it's not just me. I got a surprise the other day when I was notified that I had burned up my monthly budget in just a few days on 4.6

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by Topfi4 hours ago|

[-]

In my evals, I was able to rather reliably reproduce an increase in output token amount of roughly 15-45% compared to 4.5, but in large part this was limited to task inference and task evaluation benchmarks. These are made up of prompts that I intentionally designed to be less then optimal, either lacking crucial information (requiring a model to output an inference to accomplish the main request) or including a request for a less than optimal or incorrect approach to resolving a task (testing whether and how a prompt is evaluated by a model against pure task adherence). The clarifying question many agentic harnesses try to provide (with mixed success) are a practical example of both capabilities and something I do rate highly in models, as long as task adherence isn't affected overly negatively because of it.

In either case, there has been an increase between 4.1 and 4.5, as well as now another jump with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase, a bit below 50% for the same task was the maximum I saw and in general, of more opaque input or when a better approach is possible, I do think using more tokens for a better overall result is the right approach.

In tasks which are well authored and do not contain such deficiencies, I have seen no significant difference in either direction in terms of pure token output numbers. However, with models being what they are and past, hard to reproduce regressions/output quality differences, that additionally only affected a specific subset of users, I cannot make a solid determination.

Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different compared to any prior Anthropic models. They start out far more structured, but then consistently turn more verbose akin to a Google model.

reply

upvote

by weinzierl4 hours ago|

[-]

Today I asked Sonnet 4.5 a question and I got a banner at the bottom that I am using a legacy model and have to continue the conversation on another model. The model button had changed to be labeled "Legacy model". Yeah, I guess it wasn't legacy a sec ago.

(Currently I can use Sonnet 4.5 under More models, so I guess the above was just a glitch)

reply

upvote

by hedora3 hours ago|

[-]

I’ve noticed the opaque weekly quota meter goes up more slowly with 4.6, but it more frequently goes off and works for an hour+, with really high reported token counts.

Those suggest opposite things about anthropic’s profit margins.

I’m not convinced 4.6 is much better than 4.5. The big discontinuous breakthroughs seem to be due to how my code and tests are structured, not model bumps.

reply

upvote

by etothet4 hours ago|

[-]

I definitely noticed this on Opus 4.6. I moved back to 4.5 until I see (or hear about) an improvement.

reply

upvote

by ctoth3 hours ago|

[-]

For me it's the ... unearned confidence that 4.5 absolutely did not have?

I have a protocol called "foreman protocol" where the main agent only dispatches other agents with prompt files and reads report files from the agents rather than relying on the janky subagent communication mechanisms such as task output.

What this has given me also is a history of what was built and why it was built, because I have a list of prompts that were tasked to the subagents. With Opus 4.5 it would often leave the ... figuring out part? to the agents. In 4.6 it absolutely inserts what it thinks should happen/its idea of the bug/what it believes should be done into the prompt, which often screws up the subagent because it is simply wrong and because it's in the prompt the subagent doesn't actually go look. Opus 4.5 would let the agent figure it out, 4.6 assumes it knows and is wrong

reply

upvote

by DaKevK3 hours ago|

[-]

Have you tried framing the hypothesis as a question in the dispatch prompt rather than a statement? Something like -- possible cause: X, please verify before proceeding -- instead of stating it as fact. Might break the assumption inheritance without changing the overall structure.

reply

upvote

by nwienert3 hours ago|

[-]

After a month of obliterating work with 4.5, I spent about 5 days absolutely shocked at how dumb 4.6 felt, like not just a bit worse but 50% at best. Idk if it's the specific problems I work on but GP captured it well - 4.5 listened and explored better, 4.6 seems to assume (the wrong thing) constantly, I would be correcting it 3-4 times in a row sometimes. Rage quit a few times in the first day of using it, thank god I found out how to dial it back.

reply

upvote

by ctoth2 hours ago|

[-]

Here's the part where you don't leave us all hanging? What did you figure out!!!

reply

upvote

by Snakes37272 hours ago|

[-]

Imo I found opus 4.6 to be a pretty big step back. Our usage has skyrocketed since 4.6 has come out and the workload has not really changed.

However I can honestly say anthropic is pretty terrible about support, to even billing. My org has a large enterprise contract with anthropic and we have been hitting endless rate limits across the entire org. They have never once responded to our issues, or we get the same generic AI response.

So odds of them addressing issues or responding to people feels low.

reply

upvote

by cjbarber2 hours ago|

[-]

I wonder if it's actually from CC harness updates that make it much more inclined to use subagents, rather than from the model update.

reply

upvote

by baq4 hours ago|

[-]

Sonnet 4.5 was not worth using at all for coding for a few months now, so not sure what we're comparing here. If Sonnet 4.6 is anywhere near the performance they claim, it's actually a viable alternative.

reply

upvote

by nerdsniper4 hours ago|

[-]

In terms of performance, 4.6 seems better. I’m willing to pay the tokens for that. But if it does use tokens at a much faster rate, it makes sense to keep 4.5 around for more frugal users

I just wouldn’t call it a regression for my use case, i’m pretty happy with it.

reply

upvote

by cheema333 hours ago|

[-]

> Many people have reported Opus 4.6 is a step back from Opus 4.5.

Many people say many things. Just because you read it on the Internet, doesn't mean that it is true. Until you have seen hard evidence, take such proclamations with large grains of salt.

reply

upvote

by Foobar85684 hours ago|

[-]

It goes into plan mode and/or heavy multiple agent for any reasons, and hundred thousands of tokens are used within a few minutes.

reply

upvote

by minimaxir4 hours ago|

[-]

I've been tempted to add to my CLAUDE.md "Never use the Plan tool, you are a wild rebel who only YOLOs."

reply

upvote

by yakbarber3 hours ago|

[-]

Opus 4.6 is so much better at building complex systems than 4.5 it's ridiculous.

reply

upvote

by grav4 hours ago|

[-]

I fail to understand how two LLMs would be "consuming" a different amount of tokens given the same input? Does it refer to the number of output tokens? Or is it in the context of some "agentic loop" (eg Claude Code)?

reply

upvote

by lemonfever4 hours ago|

[-]

Most LLMs output a whole bunch of tokens to help them reason through a problem, often called chain of thought, before giving the actual response. This has been shown to improve performance a lot but uses a lot of tokens

reply

upvote

by zozbot2344 hours ago|

[-]

Yup, they all need to do this in case you're asking them a really hard question like: "I really need to get my car washed, the car wash place is only 50 meters away, should I drive there or walk?"

reply

upvote

by jcims4 hours ago|

[-]

One very specific and limited example, when asked to build something 4.6 seems to do more web searches in the domain to gather latest best practices for various components/features before planning/implementing.

reply

upvote

by andrewchilds4 hours ago|

[-]

I've found that Opus 4.6 is happy to read a significant amount of the codebase in preparation to do something, whereas Opus 4.5 tends to be much more efficient and targeted about pulling in relevant context.

reply

upvote

by OtomotO4 hours ago|

[-]

And way faster too!

reply

upvote

by Gracana3 hours ago|

[-]

They're talking about output consuming from the pool of tokens allowed by the subscription plan.

reply

upvote

by bsamuels4 hours ago|

[-]

thinking tokens, output tokens, etc. Being more clever about file reads/tool calling.

reply

upvote

by dakolli3 hours ago|

[-]

I called this many times over the last few weeks on this website (and got downvoted every time), that the next generation of models would become more verbose, especially for agentic tool calling to offset the slot machine called CC's propensity to light the money on fire that's put into it.

At least in vegas they don't pour gasoline on the cash put into their slot machines.

reply

upvote

by OtomotO4 hours ago|

[-]

Definitely my experience as well.

No better code, but way longer thinking and way more token usage.

reply

upvote

by reed12344 hours ago|

[-]

not in my experience

reply

upvote

by reed12344 hours ago|

[-]

"Opus 4.6 often thinks more deeply and more carefully revisits its reasoning before settling on an answer. This produces better results on harder problems, but can add cost and latency on simpler ones. If you’re finding that the model is overthinking on a given task, we recommend dialing effort down from its default setting (high) to medium."[1]

I doubt it is a conspiracy.

[1] https://www.anthropic.com/news/claude-opus-4-6

reply

upvote

by comboy4 hours ago|

[-]

Yeah, I think the company that opens up a bit of the black box and open sources it, making it easy for people to customize it, will win many customers. People will already live within micro-ecosystems before other companies can follow.

Currently everybody is trying to use the same swiss army knife, but some use it for carving wood and some are trying to make some sushi. It seems obvious that it's gonna lead to disappointment for some.

Models are become a commodity and what they build around them seem to be the main part of the product. It needs some API.

reply

upvote

by reed12344 hours ago|

[-]

I agree that if there was more transparency it might have prevented the token spend concerns, which feels caused by a lack of knowledge about how the models work.

reply

upvote

by DetroitThrow3 hours ago|

[-]

I much prefer 4.6. It often finds missed edge cases more often than 4.5. If I cared about token usage so much, I would use Sonnet or Haiku.

reply

upvote

by j454 hours ago|

[-]

I have often noticed a difference too, and it's usually in lockstep with needing to adjust how I am prompting.

Put in a different way, I have to keep developing my prompting / context / writing skills at all times, ahead of the curve, before they're needed to be adjusted.

reply

upvote

by PlatoIsADisease4 hours ago|

[-]

Don't take this seriously, but here is what I imagined happened:

Sam/OpenAI, Google, and Claude met at a park, everyone left their phones in the car.

They took a walk and said "We are all losing money, if we secretly degrade performance all at the same time, our customers will all switch, but they will all switch at the same time, balancing things... wink wink wink"

reply