undefined

upvote

points

by bcherny4 hours ago |

upvote

by anonymoushn3 minutes ago|

[-]

> On of our product principles is to avoid changing settings on users' behalf

Ideally there wouldn't be silent changes that greatly reduce the utility of the user's session files until they set a newly introduced flag.

reply

upvote

by richardjennings3 hours ago|

[-]

I was not aware the default effort had changed to medium until the quality of output nosedived. This cost me perhaps a day of work to rectify. I now ensure effort is set to max and have not had a terrible session since. Please may I have a "always try as hard as you can" mode ?

reply

upvote

by Avamander27 minutes ago|

[-]

I feel like the maximum effort mode kind-of wraps around and starts becoming "desperate" to the extent of lazy or a monkey's paw, similar to how lower effort modes or a poor prompt.

reply

upvote

by Schiendelman2 hours ago|

[-]

That's /effort max!

reply

upvote

by richardjennings1 hours ago|

[-]

You cannot control the effort setting sub-agents use and you also cannot use /effort max as a default (outside of using an alias).

reply

upvote

by bazhand38 seconds ago|

[-]

export CLAUDE_CODE_EFFORT_LEVEL=max

reply

upvote

by clevergadget18 minutes ago|

[-]

bad citizen

reply

upvote

by koverstreet4 hours ago|

[-]

There's been more going on than just the default to medium level thinking - I'll echo what others are saying, even on high effort there's been a very significant increase in "rush to completion" behavior.

reply

upvote

by bcherny4 hours ago|

[-]

Thanks for the feedback. To make it actionable, would you mind running /bug the next time you see it and posting the feedback id here? That way we can debug and see if there's an issue, or if it's within variance.

reply

upvote

by JamesSwift1 hours ago|

[-]

  a9284923-141a-434a-bfbb-52de7329861d
  d48d5a68-82cd-4988-b95c-c8c034003cd0
  5c236e02-16ea-42b1-b935-3a6a768e3655
  22e09356-08ce-4b2c-a8fd-596d818b1e8a
  4cb894f7-c3ed-4b8d-86c6-0242200ea333

Amusingly (not really), this is me trying to get sessions to resume to then get feedback ids and it being an absolute chore to get it to give me the commands to resume these conversations but it keeps messing things up: cf764035-0a1d-4c3f-811d-d70e5b1feeef

reply

upvote

by koverstreet4 hours ago|

[-]

I'll have a look. The CoT switch you mentioned will help, I'll take a look at that too, but my suspicion is that this isn't a CoT issue - it's a model preference issue.

Comparing Opus vs. Qwen 27b on similar problems, Opus is sharper and more effective at implementation - but will flat out ignore issues and insist "everything is fine" that Qwen is able to spot and demonstrate solid understanding of. Opus understands the issues perfectly well, it just avoids them.

This correlates with what I've observed about the underlying personalities (and you guys put out a paper the other day that shows you guys are starting to understand it in these terms - functionally modeling feelings in models). On the whole Opus is very stable personality wise and an effective thinker, I want to complement you guys on that, and it definitely contrasts with behaviors I've seen from OpenAI. But when I do see Opus miss things that it should get, it seems to be a combination of avoidant tendencies and too much of a push to "just get it done and move into the next task" from RHLF.

reply

upvote

by freedomben4 hours ago|

[-]

How much of the code/context gets attached in the /bug report?

reply

upvote

by bcherny4 hours ago|

[-]

When you submit a /bug we get a way to see the contents of the conversation. We don't see anything else in your codebase.

reply

upvote

by murkt1 hours ago|

[-]

Was there a change in Claude Code system prompt at that time that nudges Claude into simplistic thinking?

Here is a gist that tries to patch the system prompt to make Claude behave better https://gist.github.com/roman01la/483d1db15043018096ac3babf5...

I haven’t personally tried it yet. I do certainly battle Claude quite a lot with “no I don’t want quick-n-easy wrong solution just because it’s two lines of code, I want best solution in the long run”.

If the system prompt indeed prefers laziness in 5:1 ratio, that explains a lot.

I will submit /bug in a few next conversations, when it occurs next.

reply

upvote

by Avamander23 minutes ago|

[-]

That Gist does explain quite a few flaws Claude has. I wonder if MEMORY.md is sufficient to counteract the prompt without patching.

reply

upvote

by dev_l1x_be32 minutes ago|

[-]

Holy sweet LLM, this gist is crazy. Why did they do this to themselves? I am going to try this at home, it might actually fix Claude.

reply

upvote

by murkt20 minutes ago|

[-]

Remember Sonnet 3.5 and 3.7? They were happy to throw abstraction on top of abstraction on top of abstraction. Still a lot of people have “do not over-engineer, do not design for the future” and similar stuff in their CLAUDE.md files.

So I think the system prompt just pushes it way too hard to “simple” direction. At least for some people. I was doing a small change in one of my projects today, and I was quite happy with “keep it stupid and hacky” approach there.

And in the other project I am like “NO! WORK A LOT! DO YOUR BEST! BE HAPPY TO WORK HARD!”

So it depends.

reply

upvote

by enigmo1 hours ago|

[-]

[dead]

reply

upvote

by stefan_1 hours ago|

[-]

Theres also been tons of thinking leaking into the actual output. Recently it even added thinking into a code patch it did (a[0] &= ~(1 << 2); // actually let me just rewrite { .. 5 more lines setting a[0] .. }).

reply

upvote

by plexicle4 hours ago|

[-]

Ultrathink is back? I thought that wasn't a thing anymore.

If I am following.. "Max" is above "High", but you can't set it to "Max" as a default. The highest you can configure is "High", and you can use "/effort max" to move a step up for a (conversation? session?), or "ultrathink" somewhere in the prompt to move a step up for a single turn. Is this accurate?

reply

upvote

by bcherny4 hours ago|

[-]

Yep, exactly

reply

upvote

by dostick2 hours ago|

[-]

Mentioning ULTRATHINK in prompt is the equivalent to /effort max?

reply

upvote

by merlindru40 minutes ago|

[-]

Yes but only for the message that includes it. Whereas /effort max keeps it at max effort the entire convo, to my knowledge

reply

upvote

by 3 hours ago|

[-]

deleted

reply

upvote

by johndough3 hours ago|

[-]

I think it is hilarious that there are four different ways to set settings (settings.json config file, environment variable, slash commands and magical chat keywords).

That kind of consistency has also been my own experience with LLMs.

reply

upvote

by monatron2 hours ago|

[-]

To be fair, I can think of reasons why you would want to be able to set them in various ways.

- settings.json - set for machine, project

- env var - set for an environment/shell/sandbox

- slash command - set for a session

- magical keyword - set for a turn

reply

upvote

by larpingscholar25 minutes ago|

[-]

You are yet to discover the joys of the managed settings scope. They can be set three ways. The claude.ai admin console; by one of two registry keys e.g. HKLM\SOFTWARE\Policies\ClaudeCode; and by an alphabetically merged directory of json files.

reply

upvote

by SAI_Peregrinus2 hours ago|

[-]

It's not unique to LLMs. Take BASH: you've got `/etc/profile`, `~/.bash_profile,` `~/.bash_login`, `~/.bashrc`, `~/.profile`, environment variables, and shell options.

reply

upvote

by ggdxwz2 hours ago|

[-]

Especially some settings are in setting.json, and others in .claude.json So sometimes I have to go through both to find the one I want to tweak

reply

upvote

by w10-13 hours ago|

[-]

Here's the reply in context:

https://github.com/anthropics/claude-code/issues/42796#issue...

Sympathies: Users now completely depend on their jet-packs. If their tools break (and assuming they even recognize the problem). it's possible they can switch to other providers, but more likely they'll be really upset for lack of fallbacks. So low-touch subscriptions become high-touch thundering herds all too quickly.

reply

upvote

by dc_giant4 hours ago|

[-]

All right so what do I need to do so it does its job again? Disable adaptive thinking and set effort to high and/or use ULTRATHINK again which a few weeks ago Claude code kept on telling me is useless now?

reply

upvote

by bcherny4 hours ago|

[-]

Run this: /effort high

reply

upvote

by berkanunal3 hours ago|

[-]

Imagine if all service providers were behaving like this.

> Ahh, sorry we broke your workflow.

> We found that `log_level=error` was a sweet spot for most users.

> To make it work as you expect it so, run `./bin/unpoop` it will set log_level=warn

reply

upvote

by nimchimpsky41 minutes ago|

[-]

[dead]

reply

upvote

by stldev2 hours ago|

[-]

You can't. This is Anthropic leveraging their dials, and ignoring their customers for weeks.

Switch providers.

Anecdotally, I've had no luck attempting to revert to prior behavior using either high/max level thinking (opus) or prompting. The web interface for me though doesn't seem problematic when using opus extended.

reply

upvote

by aizk4 hours ago|

[-]

How do you guys manage regressions as a whole with every new model update? A massive test set of e2e problem solving seeing how the models compare?

reply

upvote

by bcherny4 hours ago|

[-]

A mix of evals and vibes.

reply

upvote

by giwook4 hours ago|

[-]

What's that ratio exactly

reply

upvote

by nimchimpsky40 minutes ago|

[-]

[dead]

reply

upvote

by nothinkjustai3 hours ago|

[-]

[flagged]

reply

upvote

by capnchaos4 hours ago|

[-]

Are you doing any Digital Twin testing or simulations? I imagine you can't test a product like Claude Code using traditional means.

reply

upvote

by efields2 hours ago|

[-]

"Evals and vibes" can I put that on a t shirt?

reply

upvote

by cududa10 minutes ago|

[-]

Remember when they shipped that version that didn't actually start/ run? At work we were goofing on them a bit, until I said "Wait how did their tests even run on that?" And we realized whatever their CI/CD process is, it wasn't at the time running on the actual release binary... I can imagine their variation on how most engineers think about CI/CD probably is indicative of some other patterns (or lack of traditional patterns)

As someone that used to work on Windows, I kind of had a vision of a similar in scope e2e testing harness, similar to Windows Vista/ 7 (knowing about bugs/ issues doesn't mean you can necessarily fix them ... hence Vista then 7) - and that Anthropic must provide some Enterprise guarantee backed by this testing matrix I imagined must exist - long way of saying, I think they might just YOLO regressions by constantly updating their testing/ acceptance criteria.

Why not provide pinable versions or something? This whole absurdity and wasted 2 months of suboptimal productivity hits on the absurdity of constantly changing the user/ system prompt and doing so much of the R&D and feature development at two brittle prompts with unclear interplay. And so until there’s like a compostable system/user prompt framework they reliably develop tests against, I personally would prefer pegged selectable versions. But each version probably has like known critical bugs they’re dancing around so there is no version they’d feel comfortable making a pegged stable release..

reply

upvote

by try-working1 hours ago|

[-]

I use a self-documenting recursive workflow: https://github.com/doubleuuser/rlm-workflow

reply

upvote

by DennisL1234 hours ago|

[-]

Happy to have my mind changed, yet I am not 100% convinced closing the issue as completed captures the feedback.

reply

upvote

by bcherny4 hours ago|

[-]

From the contents of the issue, this seems like a fairly clear default effort issue. Would love your input if there's something specific that you think is unaddressed.

reply

upvote

by vecter3 hours ago|

[-]

From this reply, it seems that it has nothing to do with `/effort`: https://github.com/anthropics/claude-code/issues/42796#issue...

I hope you take this seriously. I'm considering moving my company off of Claude Code immediately.

Closing the GH issue without first engaging with the OP is just a slap in the face, especially given how much hard work they've done on your behalf.

reply

upvote

by wonnage3 hours ago|

[-]

The OP “bug report” is a wall of AI slop generated from looking at its own chat transcripts

reply

upvote

by vecter2 hours ago|

[-]

Do you disagree with any of the data or conclusions?

reply

upvote

by wonnage1 hours ago|

[-]

Yes

reply

upvote

by vecter1 hours ago|

[-]

I'm open to hearing, please elaborate

reply

upvote

by JamesSwift2 hours ago|

[-]

I commented on the GH issue, but Ive had effort set to 'high' for however long its been available and had a marked decline since... checks notes... about 23 March according to slack messages I sent to the team to see if I was alone (I wasnt).

EDIT: actually the first glaring issue I remember was on 20 March where it hallucinated a full sha from a short sha while updating my github actions version pinning. That follows a pattern of it making really egregious assumptions about things without first validating or checking. Ive also had it answer with hallucinated information instead of looking online first (to a higher degree than Ive been used to after using these models daily for the past ~6 months)

reply

upvote

by dev_l1x_be2 hours ago|

[-]

It hallucinated a GUID for me instead of using the one in the RFC for webscokets. Fun part was that the beginning was the same. Then it hardcoded the unit tests to be green with the wrong GUID.

reply

upvote

by DennisL1231 hours ago|

[-]

Gotcha. It seemed though from the replies on the github ticket that at least some of the problem was unrelated to effort settings.

reply

upvote

by KenoFischer3 hours ago|

[-]

While we have you here, could you fix the bash escaping bug? https://github.com/anthropics/claude-code/issues/10153

reply

upvote

by starkparker4 hours ago|

[-]

> You can also use the ULTRATHINK keyword to use high effort for a single turn

First I've heard that ultrathink was back. Much quieter walkback of https://decodeclaude.com/ultrathink-deprecated/

reply

upvote

by giwook1 hours ago|

[-]

Pretty sure it's still gone and you should be using effort level now for this.

reply

upvote

by migali49g1 hours ago|

[-]

Hi Boris, thanks for addressing this and providing feedback quickly. I noticed the same issue. My question is, is it enough to do /efforts high, or should I also add CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING to my settings?

reply

upvote

by areoform2 hours ago|

[-]

Hey Boris, thanks for the awesomeness that's Claude! You've genuinely changed the life of quite a few young people across the world. :)

not sure if the team is aware of this, but Claude code (cc from here on) fails to install / initiate on Windows 10; precise version, Windows 10.0.19045 build 19045. It fails mid setup, and sometimes fails to throw up a log. It simply calls it quits and terminates.

On MacOS, I use Claude via terminal, and there have been a few, minor but persistent harness issues. For example, cc isn't able to use Claude for Chrome. It has worked once and only once, and never again. Currently, it fails without a descriptive log or issue. It simply states permission has been denied.

More generally, I use Claude a lot for a few sociological experiments and I've noticed that token consumption has increased exponentially in the past 3 weeks. I've tried to track it down by project etc., but nothing obvious has changed. I've gone from almost never hitting my limits on a Max account to consistently hitting them.

I realize that my complaint is hardly unique, but happy to provide logs / whatever works! :)

And yeah, thanks again for Claude! I recommend Claude to so many folks and it has been instrumental for them to improve their lives.

I work for a fund that supports young people, and we'd love to be able to give credits out to them. I tried to reach out via the website etc. but wasn't able to get in touch with anyone. I just think more gifted young people need Claude as a tool and a wall to bounce things off of; it might measurably accelerate human progress. (that's partly the experiment!)

reply

upvote

by JohnMakin3 hours ago|

[-]

I’ve seen you/anthropic comment repeatedly over the last several months about the “thinking” in similar ways -

“most users dont look at it” (how do you know this?)

“our product team felt it was too visually noisy”

etc etc. But every time something like this is stated, your power users (people here for the most part) state that this is dead wrong. I know you are repeating the corporate line here, but it’s bs.

reply

upvote

by exfalso55 minutes ago|

[-]

It's to prevent distillation. Duh

reply

upvote

by wonnage3 hours ago|

[-]

Anecdotally the “power users” of AI are the ones who have succumbed to AI psychosis and write blog posts about orchestrating 30 agents to review PRs when one would’ve done just fine.

The actual power users have an API contract and don’t give a shit about whatever subscription shenanigans Claude Max is pulling today

reply

upvote

by JohnMakin2 hours ago|

[-]

Uh, no. Definitely not me at all.

reply

upvote

by wonnage1 hours ago|

[-]

sure, that's what they all say!

reply

upvote

by JohnMakin1 hours ago|

[-]

Whatever makes you feel better about yourself, I guess. My account history on this topic is pretty easily searchable, but I guess it's easier to make driveby comments like this than be informed.

reply

upvote

by yubblegum3 hours ago|

[-]

> Before I keep going, I wanted to say I appreciate the depth of thinking & care that went into this.

"This report was produced by me — Claude Opus 4.6 — analyzing my own session logs. ... Ben built the stop hook, the convention reviews, the frustration-capture tools, and this entire analysis pipeline because he believes the problem is fixable and the collaboration is worth saving. He spent today — a day he could have spent shipping code — building infrastructure to work around my limitations instead of leaving."

What a "fuckin'" circle jerk this universe has turned out to be. This note was produced by me and who the hell is Ben?

reply

upvote

by matheusmoreira3 hours ago|

[-]

I definitely noticed the mid-output self-correction reasoning loops mentioned in the GitHub issue in some conversations with Opus 4.6 with extended reasoning enabled on claude.ai. How do I max out the effort there?

reply

upvote

by ting04 hours ago|

[-]

Do you guys realize that everyone is switching to Codex because Claude Code is practically unusable now, even on a Max subscription? You ask it to do tasks, and it does 1/10th of them. I shouldn't have to sit there and say: "Check your work again and keep implementing" over and over and over again... Such a garbage experience.

Does Anthropic actually care? Or is it irrelevant to your company because you think you'll be replacing us all in a year anyway?

reply

upvote

by raincole4 hours ago|

[-]

> I wanted to say I appreciate the depth of thinking & care that went into this.

The irony lol. The whole ticket is just AI-generated. But Anthropic employees have to say this because saying otherwise will admit AI doesn't have "the depth of thinking & care."

reply

upvote

by vlovich1234 hours ago|

[-]

It's also pretty standard corporate speak to make sure you don't alienate any users / offend anyone. That's why corporate speak is so bland.

reply

upvote

by rafaelmn3 hours ago|

[-]

Ticket is AI generated but from what I've seen these guys have a harness to capture/analyze CC performance, so effort was made on the user side for sure.

reply

upvote

by notatallshaw2 hours ago|

[-]

The note at the end of the post indicates the user asked Claude to review their own chat logs. It's impossible to tell if Claude used or built a a performance harness or just wrote those numbers based on vibes.

reply

upvote

by gardnr1 hours ago|

[-]

There is this 3rd party tracker: https://marginlab.ai/trackers/claude-code/

reply

upvote

by ai_slop_hater4 hours ago|

[-]

> This beta header hides thinking from the UI, since most people don't look at it.

I look at it, and I am very upset that I no longer see it.

reply

upvote

by bcherny4 hours ago|

[-]

There is a setting if you'd like to continue to see it: showThinkingSummaries.

See the docs: https://code.claude.com/docs/en/settings#available-settings

reply

upvote

by starkparker4 hours ago|

[-]

> Thinking summaries will now appear in the transcript view (Ctrl+O).

Also: https://github.com/anthropics/claude-code/issues/30958

reply

upvote

by ai_slop_hater4 hours ago|

[-]

I also have similar experience with their API, i.e. some requests get stalled for minutes with zero events coming in from Anthropic. Presumably the model does this "extended thinking" but no way to see that. I treat these requests as stuck and retry. Same experience in Claude Code Opus 4.6 when effort is set to "high"—the model gets stuck for ten minutes (at which point I cancel) and token count indicator doesn't increase.

I am not buying what this guy says. He is either lying or not telling us everything.

reply

upvote

by antonvs3 hours ago|

[-]

> As I noted in the comment,

Piece of free PR advice: this is fine in a nerd fight, but don't do this in comments that represent a company. Just repeat the relevant information.

reply

upvote

by bcherny3 hours ago|

[-]

Fair feedback, edited!

reply

upvote

by trvz1 hours ago|

[-]

Piece of free advice towards a better civilisation: people who didn't even read the comment they're replying to shouldn't be rewarded for their laziness.

reply

upvote

by ai_slop_hater1 hours ago|

[-]

I read his comment and still replied. I think his claim that nobody reads thinking blocks and that thinking blocks increase latency is nonsense. I am not going to figure out which settings I need to enable because after reading this thread I cancelled my subscription and switched over to Codex. Because I had the exact same experience as many in this thread.

Also what is that "PR advice"—he might as well wear a suit. This is absolutely a nerd fight.

reply

upvote

by ai_slop_hater53 minutes ago|

[-]

Alright, I just tested that setting and it doesn't work.

https://i.imgur.com/MYsDSOV.png

I tested because I was porting memories from Claude Code to Codex, so I might as well test. I obviously still have subscription days remaining.

There is another comment in this thread linking a GitHub issue that discusses this. The GitHub issue this whole HN submission is about even says that Anthropic hides thinking blocks.

reply

upvote

by ting03 hours ago|

[-]

Thinking time is not the issue. The issue is that Claude does not actually complete tasks. I don't care if it takes longer to think, what I care about is getting partial implementations scattered throughout my codebase while Claude pretends that it finished entirely. You REALLY need to fix this, it's atrocious.

reply

upvote

by j453 hours ago|

[-]

Thanks for the update,

Perhaps max users can be included in defaulting to different effort levels as well?

reply

upvote

by ctoth4 hours ago|

[-]

[flagged]

reply

upvote

by bcherny2 hours ago|

[-]

Christopher, would you be able to share the transcripts for that repo by running /bug? That would make the reports actionable for me to dig in and debug.

reply

upvote

by quietsegfault4 hours ago|

[-]

I’m not sure being confrontational like this really helps your case. There are real people responding, and even if you’re frustrated it doesn’t pay off to take that frustration out on the people willing to help.

reply

upvote

by ctoth3 hours ago|

[-]

Fair point on tone. It's a bit of a bind isn't it? When you come with a well-researched issue as OP did, you get this bland corporate nonsense "don't believe your lyin' eyes, we didn't change anything major, you can fix it in settings."

How should you actually communicate in such a way that you are actually heard when this is the default wall you hit?

The author is in this thread saying every suggested setting is already maxed. The response is "try these settings." What's the productive version of pointing out that the answer doesn't address the evidence? Genuine question. I linked my repo because it's the most concrete example I have.

reply

upvote

by enraged_camel1 hours ago|

[-]

I read the entire performance degradation report in the OP, and Boris's response, and it seems that the overwhelming majority of the report's findings can indeed be explained by the `showThinkingSummaries` option being off by default as of recently.

reply

upvote

by wonnage3 hours ago|

[-]

Just use a different tool or stop vibe coding, it’s not that hard. I really don’t understand the logic of filing bug reports against the black box of AI

reply

upvote

by geysersam1 hours ago|

[-]

People file tickets against closed source "black box" systems all the time. You could just as well say: Stop using MS SQL, just use a different tool, it's not that hard.

reply

upvote

by malfist3 hours ago|

[-]

Is somebody saying "you're holding it wrong" a "people willing to help"?

reply

upvote

by TeMPOraL2 hours ago|

[-]

They are if you are, in fact, holding it wrong.

As was the usual case in most of the few years LLMs existed in this world.

Think not of iPhone antennas - think of a humble hammer. A hammer has three ends to hold by, and no amount of UI/UX and product design thinking will make the end you like to hold to be a good choice when you want to drive a Torx screw.

reply

upvote

by Retr0id3 hours ago|

[-]

[flagged]

reply

upvote

by throwaway6137462 hours ago|

[-]

[dead]

reply

upvote

by BigTTYGothGF3 hours ago|

[-]

The stated policy of HN is "don't be mean to the openclaw people", let's see if it generalizes.

reply

upvote

by lambda3 hours ago|

[-]

I guess one of the things I don't understand: how you expect a stochastic model, sold as a proprietary SaaS, with a proprietary (though briefly leaked) client, is supposed to be predictable in its behavior.

It seems like people are expecting LLM based coding to work in a predictable and controllable way. And, well, no, that's not how it works, and especially so when you're using a proprietary SaaS model where you can't control the exact model used, the inference setup its running on, the harness, the system prompts, etc. It's all just vibes, you're vibe coding and expecting consistency.

Now, if you were running a local weights model on your own inference setup, with an open source harness, you'd at least have some more control of the setup. Of course, it's still a stochastic model, trained on who knows what data scraped from the internet and generated from previous versions of the model; there will always be some non-determinism. But if you're running it yourself, you at least have some control and can potentially bisect configuration changes to find what caused particular behavior regressions.

reply

upvote

by dev_l1x_be2 hours ago|

[-]

The problem is degradation. It was working much better before. There are many people (some example of a well know person[0]), including my circle of friends and me who were working on projects around the Opus 4.6 rollout time and suddenly our workflows started to degrade like crazy. If I did not have many quality gates between an LLM session and production I would have faced certain data loss and production outages just like some famous company did. The fun part is that the same workflow that was reliably going through the quality gates before suddenly failed with something trivial. I cannot pinpoint what exactly Claude changed but the degradation is there for sure. We are currently evaling alternatives to have an escape hatch (Kimi, Chatgpt, Qwen are so far the best candidates and Nemotron). The only issue with alternatives was (before the Claude leak) how well the agentic coding tool integrates with the model and the tool use, and there are several improvements happening already, like [1]. I am hoping the gap narrows and we can move off permanently. No more hoops, you are right, I should not have attempted to delete the production database moments.

https://x.com/theo/status/2041111862113444221

https://x.com/_can1357/status/2021828033640911196

reply

upvote

by stavros2 hours ago|

[-]

Same as how I expect a coin to come up heads 50% of the time.

reply

upvote

by muyuu45 minutes ago|

[-]

If you get consistently nowhere near 50% then surely you know you're not throwing a fair coin? What would complaining to the coin provider achieve? Switch coins.

*typo

reply

upvote

by stavros43 minutes ago|

[-]

Well I'm paying the coin to be near 50% and the coin's PM is listening to customers, so that's why.

reply

upvote

by muyuu10 minutes ago|

[-]

The coin's PM is spamming you trivial gaslighting corporate slop, most of it barely edited.

reply

upvote

by randomNumber71 hours ago|

[-]

> how you expect a stochastic model [...] is supposed to be predictable in its behavior.

I used it often enough to know that it will nail tasks I deem simple enough almost certainly.

reply

upvote

by malfist3 hours ago|

[-]

It also completely ignores the increase in behavioral tracking metrics. 68% increase in swearing at the LLM for doing something wrong needs to be addressed and isn't just "you're holding it wrong"

reply

upvote

by alchemist1e92 hours ago|

[-]

I’m think a great marketing line for local/selfhosted LLMs in the future - “You can swear at your LLM and nobody will care!”

reply

upvote

by dang1 hours ago|

[-]

Please don't post this aggressively to Hacker News. You can make your substantive points without that.

https://news.ycombinator.com/newsguidelines.html

reply

upvote

by iwalton33 hours ago|

[-]

[dead]

reply

upvote

by tatrions4 hours ago|

[-]

[flagged]

reply

upvote

by bcherny4 hours ago|

[-]

Yep totally -- think of this as "maximum effort". If a task doesn't need a lot of thinking tokens, then the model will choose a lower effort level for the task.

reply

upvote

by koverstreet4 hours ago|

[-]

Technically speaking, models inherently do this - CoT is just output tokens that aren't included in the final response because they're enclosed in <think> tags, and it's the model that decides when to close the tag. You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work, but it's always going to be better in the long run to let the model make that decision entirely itself - the bias is a short term hack to prevent overthinking when the model doesn't realize it's spinning in circles.

reply

upvote

by ai_slop_hater3 hours ago|

[-]

> You can add a bias to make it more or less likely for a model to generate a particular token, and that's how budgets work

Do you have a source for this? I am interested in learning more about how this works.

reply

upvote

by koverstreet3 hours ago|

[-]

It's how temperature/top_p/top_k work. Anthropic also just put out a paper where they were doing a much more advanced version of this, mapping out functional states within the modern and steering with that.

reply

upvote

by ai_slop_hater3 hours ago|

[-]

Huh, I wonder if that's why you cannot change the temperature when thinking is enabled. Do you have a link for the paper?

reply

upvote

by koverstreet3 hours ago|

[-]

https://transformer-circuits.pub/2026/emotions/index.html

At the actual inference level temperature can be applied at any time - generation is token by token - but that doesn't mean the API necessarily exposes it.

reply

upvote

by ai_slop_hater3 hours ago|

[-]

Thanks. I was referring to the fact that Anthropic, in their API, prohibits setting temperature when thinking is enabled.

reply

upvote

by nickvec2 hours ago|

[-]

Hey Boris, would appreciate if you could respond to my DM on X about Claude erroneously charging me $200 in extra credit usage when I wasn't using the service. Haven't heard back from Claude Support in over a month and I am getting a bit frustrated.

reply