upvote
> One of our product principles is to avoid changing settings on users' behalf

Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.

reply
I was not aware the default effort had changed to medium until the quality of the output nosedived. This cost me perhaps a day of work to rectify. I now ensure effort is set to max and have not had a terrible session since. Please may I have an "always try as hard as you can" mode?
reply
I feel like the maximum-effort mode kind of wraps around and starts becoming "desperate," to the point of laziness or a monkey's paw, similar to what lower effort modes or a poor prompt produce.
reply
That's /effort max!
reply
You cannot control the effort setting sub-agents use and you also cannot use /effort max as a default (outside of using an alias).
reply
export CLAUDE_CODE_EFFORT_LEVEL=max
reply
bad citizen
reply
There's been more going on than just the default to medium level thinking - I'll echo what others are saying, even on high effort there's been a very significant increase in "rush to completion" behavior.
reply
Thanks for the feedback. To make it actionable, would you mind running /bug the next time you see it and posting the feedback id here? That way we can debug and see if there's an issue, or if it's within variance.
reply

  a9284923-141a-434a-bfbb-52de7329861d
  d48d5a68-82cd-4988-b95c-c8c034003cd0
  5c236e02-16ea-42b1-b935-3a6a768e3655
  22e09356-08ce-4b2c-a8fd-596d818b1e8a
  4cb894f7-c3ed-4b8d-86c6-0242200ea333
Amusingly (not really), this is me trying to resume sessions so I can get the feedback ids. It has been an absolute chore to get it to give me the commands to resume these conversations; it keeps messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef
reply
I'll have a look. The CoT switch you mentioned will help, I'll take a look at that too, but my suspicion is that this isn't a CoT issue - it's a model preference issue.

Comparing Opus vs. Qwen 27b on similar problems, Opus is sharper and more effective at implementation - but will flat out ignore issues and insist "everything is fine" that Qwen is able to spot and demonstrate solid understanding of. Opus understands the issues perfectly well, it just avoids them.

This correlates with what I've observed about the underlying personalities (and the paper you put out the other day shows you're starting to understand it in these terms: functionally modeling feelings in models). On the whole Opus is very stable personality-wise and an effective thinker, and I want to compliment you on that; it definitely contrasts with behaviors I've seen from OpenAI. But when I do see Opus miss things that it should get, it seems to be a combination of avoidant tendencies and too much of a push from RLHF to "just get it done and move on to the next task".

reply
How much of the code/context gets attached in the /bug report?
reply
When you submit a /bug we get a way to see the contents of the conversation. We don't see anything else in your codebase.
reply
Was there a change in Claude Code system prompt at that time that nudges Claude into simplistic thinking?

Here is a gist that tries to patch the system prompt to make Claude behave better https://gist.github.com/roman01la/483d1db15043018096ac3babf5...

I haven’t personally tried it yet. I certainly do battle Claude quite a lot with “no, I don’t want the quick-and-easy wrong solution just because it’s two lines of code, I want the best solution in the long run”.

If the system prompt indeed prefers laziness at a 5:1 ratio, that explains a lot.

I will submit /bug in the next few conversations, when it occurs again.

reply
That Gist does explain quite a few flaws Claude has. I wonder if MEMORY.md is sufficient to counteract the prompt without patching.
reply
Holy sweet LLM, this gist is crazy. Why did they do this to themselves? I am going to try this at home, it might actually fix Claude.
reply
Remember Sonnet 3.5 and 3.7? They were happy to throw abstraction on top of abstraction on top of abstraction. Still a lot of people have “do not over-engineer, do not design for the future” and similar stuff in their CLAUDE.md files.

So I think the system prompt just pushes it way too hard to “simple” direction. At least for some people. I was doing a small change in one of my projects today, and I was quite happy with “keep it stupid and hacky” approach there.

And in the other project I am like “NO! WORK A LOT! DO YOUR BEST! BE HAPPY TO WORK HARD!”

So it depends.

reply
[dead]
reply
There's also been tons of thinking leaking into the actual output. Recently it even added thinking into a code patch it made (a[0] &= ~(1 << 2); // actually let me just rewrite { .. 5 more lines setting a[0] .. }).
reply
Ultrathink is back? I thought that wasn't a thing anymore.

If I am following.. "Max" is above "High", but you can't set it to "Max" as a default. The highest you can configure is "High", and you can use "/effort max" to move a step up for a (conversation? session?), or "ultrathink" somewhere in the prompt to move a step up for a single turn. Is this accurate?

reply
Yep, exactly
reply
Mentioning ULTRATHINK in the prompt is equivalent to /effort max?
reply
Yes, but only for the message that includes it. Whereas /effort max keeps it at max effort for the entire convo, to my knowledge.
reply
deleted
reply
I think it is hilarious that there are four different ways to set settings (settings.json config file, environment variable, slash commands and magical chat keywords).

That kind of consistency has also been my own experience with LLMs.

reply
To be fair, I can think of reasons why you would want to be able to set them in various ways.

- settings.json - set for machine, project

- env var - set for an environment/shell/sandbox

- slash command - set for a session

- magical keyword - set for a turn

reply
You have yet to discover the joys of the managed settings scope. Those can be set three ways: the claude.ai admin console; one of two registry keys, e.g. HKLM\SOFTWARE\Policies\ClaudeCode; and an alphabetically merged directory of JSON files.
reply
It's not unique to LLMs. Take Bash: you've got `/etc/profile`, `~/.bash_profile`, `~/.bash_login`, `~/.bashrc`, `~/.profile`, environment variables, and shell options.
reply
Especially since some settings are in settings.json and others in .claude.json, I sometimes have to go through both to find the one I want to tweak.
reply
Here's the reply in context:

https://github.com/anthropics/claude-code/issues/42796#issue...

Sympathies: users now completely depend on their jet-packs. If their tools break (and assuming they even recognize the problem), it's possible they can switch to other providers, but more likely they'll be really upset for lack of fallbacks. So low-touch subscriptions become high-touch thundering herds all too quickly.

reply
All right, so what do I need to do so it does its job again? Disable adaptive thinking and set effort to high, and/or use ULTRATHINK again, which a few weeks ago Claude Code kept telling me is useless now?
reply
Run this: /effort high
reply
Imagine if all service providers were behaving like this.

> Ahh, sorry we broke your workflow.

> We found that `log_level=error` was a sweet spot for most users.

> To make it work as you expect, run `./bin/unpoop`; it will set log_level=warn

reply
You can't. This is Anthropic leveraging their dials, and ignoring their customers for weeks.

Switch providers.

Anecdotally, I've had no luck attempting to revert to prior behavior using either high/max level thinking (opus) or prompting. The web interface for me though doesn't seem problematic when using opus extended.

reply
How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving seeing how the models compare?
reply
A mix of evals and vibes.
reply
What's that ratio, exactly?
reply
Are you doing any Digital Twin testing or simulations? I imagine you can't test a product like Claude Code using traditional means.
reply
"Evals and vibes" can I put that on a t shirt?
reply
Remember when they shipped that version that didn't actually start or run? At work we were goofing on them a bit, until I said "Wait, how did their tests even run on that?" And we realized that whatever their CI/CD process is, it wasn't at the time running on the actual release binary... I can imagine their variation on how most engineers think about CI/CD is probably indicative of some other patterns (or a lack of traditional patterns).

As someone who used to work on Windows, I had a vision of a similarly scoped e2e testing harness, like the one for Windows Vista/7 (knowing about bugs/issues doesn't mean you can necessarily fix them... hence Vista, then 7), and imagined Anthropic must provide some enterprise guarantee backed by this testing matrix. Long way of saying: I think they might just YOLO regressions by constantly updating their testing/acceptance criteria.

Why not provide pinnable versions or something? This whole mess, and the two wasted months of suboptimal productivity, comes from constantly changing the user/system prompts and doing so much of the R&D and feature development in two brittle prompts with unclear interplay. Until there's a composable system/user prompt framework they reliably develop tests against, I would personally prefer pinned, selectable versions. But each version probably has known critical bugs they're dancing around, so there is no version they'd feel comfortable making a pinned stable release...

reply
I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow
reply
Happy to have my mind changed, but I am not 100% convinced that closing the issue as completed captures the feedback.
reply
From the contents of the issue, this seems like a fairly clear default effort issue. Would love your input if there's something specific that you think is unaddressed.
reply
From this reply, it seems that it has nothing to do with `/effort`: https://github.com/anthropics/claude-code/issues/42796#issue...

I hope you take this seriously. I'm considering moving my company off of Claude Code immediately.

Closing the GH issue without first engaging with the OP is just a slap in the face, especially given how much hard work they've done on your behalf.

reply
The OP “bug report” is a wall of AI slop generated from looking at its own chat transcripts
reply
Do you disagree with any of the data or conclusions?
reply
Yes
reply
I'm open to hearing, please elaborate
reply
I commented on the GH issue, but I've had effort set to 'high' for however long it's been available and have seen a marked decline since... checks notes... about 23 March, according to Slack messages I sent to the team to see if I was alone (I wasn't).

EDIT: actually, the first glaring issue I remember was on 20 March, when it hallucinated a full SHA from a short SHA while updating my GitHub Actions version pinning. That follows a pattern of it making really egregious assumptions about things without first validating or checking. I've also had it answer with hallucinated information instead of looking online first (to a higher degree than I've been used to after using these models daily for the past ~6 months).

reply
It hallucinated a GUID for me instead of using the one in the RFC for WebSockets. The fun part was that the beginning was the same. Then it hardcoded the unit tests to be green with the wrong GUID.
reply
Gotcha. It seemed though from the replies on the github ticket that at least some of the problem was unrelated to effort settings.
reply
While we have you here, could you fix the bash escaping bug? https://github.com/anthropics/claude-code/issues/10153
reply
> You can also use the ULTRATHINK keyword to use high effort for a single turn

First I've heard that ultrathink was back. Much quieter walkback of https://decodeclaude.com/ultrathink-deprecated/

reply
Pretty sure it's still gone and you should be using effort level now for this.
reply
Hi Boris, thanks for addressing this and providing feedback quickly. I noticed the same issue. My question is: is it enough to do /effort high, or should I also add CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING to my settings?
reply
Hey Boris, thanks for the awesomeness that is Claude! You've genuinely changed the life of quite a few young people across the world. :)

Not sure if the team is aware of this, but Claude Code (cc from here on) fails to install/initialize on Windows 10; precise version: 10.0.19045, build 19045. It fails mid-setup and sometimes fails to throw up a log. It simply calls it quits and terminates.

On macOS, I use Claude via the terminal, and there have been a few minor but persistent harness issues. For example, cc isn't able to use Claude for Chrome. It has worked once and only once, and never again. Currently, it fails without a descriptive log or issue; it simply states permission has been denied.

More generally, I use Claude a lot for a few sociological experiments, and I've noticed that token consumption has increased dramatically in the past 3 weeks. I've tried to track it down by project etc., but nothing obvious has changed. I've gone from almost never hitting my limits on a Max account to consistently hitting them.

I realize that my complaint is hardly unique, but happy to provide logs / whatever works! :)

And yeah, thanks again for Claude! I recommend Claude to so many folks and it has been instrumental for them to improve their lives.

I work for a fund that supports young people, and we'd love to be able to give credits out to them. I tried to reach out via the website etc. but wasn't able to get in touch with anyone. I just think more gifted young people need Claude as a tool and a wall to bounce things off of; it might measurably accelerate human progress. (that's partly the experiment!)

reply
I’ve seen you/Anthropic comment repeatedly over the last several months about the “thinking” in similar ways:

“most users don't look at it” (how do you know this?)

“our product team felt it was too visually noisy”

etc. etc. But every time something like this is stated, your power users (people here, for the most part) say that this is dead wrong. I know you are repeating the corporate line here, but it’s BS.

reply
It's to prevent distillation. Duh
reply
Anecdotally the “power users” of AI are the ones who have succumbed to AI psychosis and write blog posts about orchestrating 30 agents to review PRs when one would’ve done just fine.

The actual power users have an API contract and don’t give a shit about whatever subscription shenanigans Claude Max is pulling today

reply
Uh, no. Definitely not me at all.
reply
sure, that's what they all say!
reply
Whatever makes you feel better about yourself, I guess. My account history on this topic is pretty easily searchable, but I guess it's easier to make driveby comments like this than be informed.
reply
> Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

"This report was produced by me — Claude Opus 4.6 — analyzing my own session logs. ... Ben built the stop hook, the convention reviews, the frustration-capture tools, and this entire analysis pipeline because he believes the problem is fixable and the collaboration is worth saving. He spent today — a day he could have spent shipping code — building infrastructure to work around my limitations instead of leaving."

What a "fuckin'" circle jerk this universe has turned out to be. This note was produced by me and who the hell is Ben?

reply
I definitely noticed the mid-output self-correction reasoning loops mentioned in the GitHub issue in some conversations with Opus 4.6 with extended reasoning enabled on claude.ai. How do I max out the effort there?
reply
Do you guys realize that everyone is switching to Codex because Claude Code is practically unusable now, even on a Max subscription? You ask it to do tasks, and it does 1/10th of them. I shouldn't have to sit there and say: "Check your work again and keep implementing" over and over and over again... Such a garbage experience.

Does Anthropic actually care? Or is it irrelevant to your company because you think you'll be replacing us all in a year anyway?

reply
> I wanted to say I appreciate the depth of thinking & care that went into this.

The irony lol. The whole ticket is just AI-generated. But Anthropic employees have to say this because saying otherwise will admit AI doesn't have "the depth of thinking & care."

reply
It's also pretty standard corporate speak to make sure you don't alienate any users / offend anyone. That's why corporate speak is so bland.
reply
Ticket is AI generated but from what I've seen these guys have a harness to capture/analyze CC performance, so effort was made on the user side for sure.
reply
The note at the end of the post indicates the user asked Claude to review their own chat logs. It's impossible to tell if Claude used or built a performance harness or just wrote those numbers based on vibes.
reply
There is this 3rd party tracker: https://marginlab.ai/trackers/claude-code/
reply
> This beta header hides thinking from the UI, since most people don't look at it.

I look at it, and I am very upset that I no longer see it.

reply
There is a setting if you'd like to continue to see it: showThinkingSummaries.

See the docs: https://code.claude.com/docs/en/settings#available-settings
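
For anyone looking for it, a minimal settings.json fragment would look something like this (assuming the flag is a top-level boolean; treat the exact shape as illustrative and check the linked docs):

```json
{
  "showThinkingSummaries": true
}
```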

reply
> Thinking summaries will now appear in the transcript view (Ctrl+O).

Also: https://github.com/anthropics/claude-code/issues/30958

reply
I also have a similar experience with their API, i.e. some requests get stalled for minutes with zero events coming in from Anthropic. Presumably the model is doing this "extended thinking," but there is no way to see it. I treat these requests as stuck and retry. Same experience in Claude Code with Opus 4.6 when effort is set to "high": the model gets stuck for ten minutes (at which point I cancel) and the token count indicator doesn't increase.

I am not buying what this guy says. He is either lying or not telling us everything.

reply
> As I noted in the comment,

Piece of free PR advice: this is fine in a nerd fight, but don't do this in comments that represent a company. Just repeat the relevant information.

reply
Fair feedback, edited!
reply
Piece of free advice towards a better civilisation: people who didn't even read the comment they're replying to shouldn't be rewarded for their laziness.
reply
I read his comment and still replied. I think his claim that nobody reads thinking blocks and that thinking blocks increase latency is nonsense. I am not going to figure out which settings I need to enable because after reading this thread I cancelled my subscription and switched over to Codex. Because I had the exact same experience as many in this thread.

Also what is that "PR advice"—he might as well wear a suit. This is absolutely a nerd fight.

reply
Alright, I just tested that setting and it doesn't work.

https://i.imgur.com/MYsDSOV.png

I tested because I was porting memories from Claude Code to Codex, so I might as well test. I obviously still have subscription days remaining.

There is another comment in this thread linking a GitHub issue that discusses this. The GitHub issue this whole HN submission is about even says that Anthropic hides thinking blocks.

reply
Thinking time is not the issue. The issue is that Claude does not actually complete tasks. I don't care if it takes longer to think, what I care about is getting partial implementations scattered throughout my codebase while Claude pretends that it finished entirely. You REALLY need to fix this, it's atrocious.
reply
Thanks for the update.

Perhaps Max users could be included in defaulting to different effort levels as well?

reply
[flagged]
reply
Christopher, would you be able to share the transcripts for that repo by running /bug? That would make the reports actionable for me to dig in and debug.
reply
I’m not sure being confrontational like this really helps your case. There are real people responding, and even if you’re frustrated it doesn’t pay off to take that frustration out on the people willing to help.
reply
Fair point on tone. It's a bit of a bind isn't it? When you come with a well-researched issue as OP did, you get this bland corporate nonsense "don't believe your lyin' eyes, we didn't change anything major, you can fix it in settings."

How should you actually communicate in such a way that you are actually heard when this is the default wall you hit?

The author is in this thread saying every suggested setting is already maxed. The response is "try these settings." What's the productive version of pointing out that the answer doesn't address the evidence? Genuine question. I linked my repo because it's the most concrete example I have.

reply
I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option being off by default as of recently.
reply
Just use a different tool or stop vibe coding, it’s not that hard. I really don’t understand the logic of filing bug reports against the black box of AI
reply
People file tickets against closed source "black box" systems all the time. You could just as well say: Stop using MS SQL, just use a different tool, it's not that hard.
reply
Is somebody saying "you're holding it wrong" a "people willing to help"?
reply
They are if you are, in fact, holding it wrong.

As has usually been the case in the few years LLMs have existed in this world.

Think not of iPhone antennas - think of a humble hammer. A hammer has three ends you can hold it by, and no amount of UI/UX and product-design thinking will make the end you like to hold a good choice when you want to drive a Torx screw.

reply
[flagged]
reply
The stated policy of HN is "don't be mean to the openclaw people", let's see if it generalizes.
reply
I guess this is one of the things I don't understand: how do you expect a stochastic model, sold as a proprietary SaaS, with a proprietary (though briefly leaked) client, to be predictable in its behavior?

It seems like people are expecting LLM-based coding to work in a predictable and controllable way. And, well, no, that's not how it works, especially when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup it's running on, the harness, the system prompts, etc. It's all just vibes; you're vibe coding and expecting consistency.

Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.

reply
The problem is degradation. It was working much better before. There are many people, including a well-known example[0], my circle of friends, and me, who were working on projects around the Opus 4.6 rollout and suddenly saw our workflows degrade like crazy. If I did not have many quality gates between an LLM session and production, I would have faced certain data loss and production outages, just like some famous company did. The fun part is that the same workflow that was reliably passing the quality gates before suddenly failed on something trivial. I cannot pinpoint what exactly Claude changed, but the degradation is there for sure. We are currently evaluating alternatives to have an escape hatch (Kimi, ChatGPT, Qwen, and Nemotron are so far the best candidates). The only issue with the alternatives was (before the Claude leak) how well the agentic coding tool integrates with the model and tool use, and there are several improvements happening already, like [1]. I am hoping the gap narrows and we can move off permanently. No more hoops, and, you are right, no more "I should not have attempted to delete the production database" moments.

[0] https://x.com/theo/status/2041111862113444221

[1] https://x.com/_can1357/status/2021828033640911196

reply
Same as how I expect a coin to come up heads 50% of the time.
reply
If you get consistently nowhere near 50%, then surely you know you're not throwing a fair coin? What would complaining to the coin provider achieve? Switch coins.

reply
Well, I'm paying for the coin to be near 50%, and the coin's PM is listening to customers, so that's why.
reply
The coin's PM is spamming you trivial gaslighting corporate slop, most of it barely edited.
reply
> how you expect a stochastic model [...] is supposed to be predictable in its behavior.

I have used it often enough to know that it will almost certainly nail tasks I deem simple enough.

reply
It also completely ignores the increase in behavioral tracking metrics. A 68% increase in swearing at the LLM for doing something wrong needs to be addressed and isn't just "you're holding it wrong".
reply
I think a great marketing line for local/self-hosted LLMs in the future would be: “You can swear at your LLM and nobody will care!”
reply
Please don't post this aggressively to Hacker News. You can make your substantive points without that.

https://news.ycombinator.com/newsguidelines.html

reply
[dead]
reply
[flagged]
reply
Yep totally -- think of this as "maximum effort". If a task doesn't need a lot of thinking tokens, then the model will choose a lower effort level for the task.
reply
Technically speaking, models inherently do this - CoT is just output tokens that aren't included in the final response because they're enclosed in <think> tags, and it's the model that decides when to close the tag. You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work, but it's always going to be better in the long run to let the model make that decision entirely itself - the bias is a short term hack to prevent overthinking when the model doesn't realize it's spinning in circles.
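
To make the budget-bias idea concrete, here is an entirely illustrative sketch (not any real inference stack's API): once a token budget is exceeded, add a growing bias to the logit of the close-of-thinking token, so the model becomes more likely, but is never forced, to close the tag.

```python
# Entirely illustrative sketch (not any real API): a "thinking budget"
# implemented as a growing bias on the close-of-thinking token's logit.

def apply_budget_bias(logits, close_tag_id, tokens_so_far, budget, strength=0.1):
    """Nudge the model toward emitting </think> once the budget is spent.

    Below the budget the logits are untouched; past it, the close tag's
    logit is raised in proportion to the overshoot, so the model becomes
    increasingly (but not deterministically) likely to stop thinking.
    """
    overshoot = tokens_so_far - budget
    if overshoot <= 0:
        return logits
    biased = dict(logits)  # don't mutate the caller's logits
    biased[close_tag_id] = biased.get(close_tag_id, 0.0) + strength * overshoot
    return biased

# Toy usage: pretend token id 2 is "</think>".
logits = {0: 1.0, 1: 0.5, 2: -2.0}
print(apply_budget_bias(logits, close_tag_id=2, tokens_so_far=120, budget=100))
```

The point of the proportional bias is that it nudges rather than truncates: the model can still finish a thought, it just pays an increasing penalty for staying inside the thinking block.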
reply
> You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work

Do you have a source for this? I am interested in learning more about how this works.

reply
It's how temperature/top_p/top_k work. Anthropic also just put out a paper where they were doing a much more advanced version of this, mapping out functional states within the model and steering with them.
reply
Huh, I wonder if that's why you cannot change the temperature when thinking is enabled. Do you have a link for the paper?
reply
https://transformer-circuits.pub/2026/emotions/index.html

At the actual inference level temperature can be applied at any time - generation is token by token - but that doesn't mean the API necessarily exposes it.

reply
Thanks. I was referring to the fact that Anthropic, in their API, prohibits setting temperature when thinking is enabled.
reply
Hey Boris, would appreciate if you could respond to my DM on X about Claude erroneously charging me $200 in extra credit usage when I wasn't using the service. Haven't heard back from Claude Support in over a month and I am getting a bit frustrated.
reply