This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.
The default thinking level seems more forgivable, but given the churn in system prompts, I'll need to figure out how to intentionally choose a refresh cycle.
Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.
The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
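Back-of-envelope, the asymmetry Boris describes looks like this (the numbers are illustrative, not Anthropic's actual pricing or limits):

```python
def resume_cost_tokens(context_tokens: int, cache_alive: bool) -> int:
    """Tokens written to the prompt cache when the next message is sent.

    If the cache is still warm, only the newest message misses; if the
    session idled past the TTL, the entire context is re-cached at once.
    """
    NEW_MESSAGE_TOKENS = 500  # illustrative size of the user's new prompt
    if cache_alive:
        return NEW_MESSAGE_TOKENS
    return context_tokens + NEW_MESSAGE_TOKENS

# A 900k-token session resumed after the ~1h TTL re-caches everything:
warm = resume_cost_tokens(900_000, cache_alive=True)   # 500
cold = resume_cost_tokens(900_000, cache_alive=False)  # 900_500
```

The point is that the cold-resume cost scales with the whole context, not with what you just typed, which is why the hit only shows up on big idle sessions.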
We tried a few different approaches to improve this UX:
1. Educating users on X/social
2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
Hope this is helpful. Happy to answer any questions if you have any.
I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.
Kinda like when restaurants make me pay for ketchup or a takeaway box: I get annoyed. Just fold it into the listed price.
If they ignored this then all users who don’t do this much would have to subsidize the people who do.
I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.
Note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large; it's probably even larger in Claude.
Total VRAM: 16GB
Model: ~12GB
KV cache at 128k context: ~3.9GB
At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).
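For anyone who wants to sanity-check figures like these: the KV cache size follows mechanically from a few hyperparameters. A sketch with made-up values for a small GQA model (none of these are Claude's real dimensions, which aren't public):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a key and a value vector per layer, per KV head,
    per token position. bytes_per_elem=2 assumes an fp16/bf16 cache."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical config: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # 15.6 GiB for this made-up config
```

Bigger models with more layers and heads push this into the tens of GiB per conversation, which is why holding it in GPU memory for idle sessions is so costly.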
A sibling comment explains:
The UI could indicate this by showing a timer before context is dumped.
No need to gamify it. It's just UI.
But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).
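A client-side sketch of what that heuristic might look like. Both TTL values and the notion of a cheap keep-alive request are assumptions; no such endpoint exists today as far as I know:

```python
CACHE_TTL_S = 300    # hypothetical aggressive 5-minute server-side TTL
PING_MARGIN_S = 60   # refresh the cache this long before it would expire

def should_ping(last_user_activity: float, last_cache_touch: float,
                now: float) -> bool:
    """Ping only if the user looks active AND the cache is near expiry.

    'Active' here means typing, editing session files, etc. within one
    TTL window; otherwise let the cache die and eat the cold resume.
    """
    user_active = (now - last_user_activity) < CACHE_TTL_S
    cache_expiring = (now - last_cache_touch) > (CACHE_TTL_S - PING_MARGIN_S)
    return user_active and cache_expiring
```

The harness would call this on a timer and send a keep-alive when it returns True, so only demonstrably live sessions pay for cache residency.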
Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.
Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)
The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.
Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic, and the usual intuitions about caching you might have from e.g. web development don't apply. A single cache blob for a single request is in the tens of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache are internal model state; it's not your context or prompt or anything like that.

Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it, because it makes inference faster and cheaper.

If you want to read up on this subject, be careful: a lot of blogs will tell you about the KV cache as it is used within inference for a single request (a critical concept in how LLMs work), but they will gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
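The O(N^2) vs O(N) claim is easy to see with a toy model (per-message token counts are made up):

```python
def total_tokens_processed(n_turns: int, tokens_per_msg: int,
                           cached: bool) -> int:
    """Input tokens the server must actually process over a conversation."""
    if cached:
        # Each turn, only the newest message is uncached: O(N) total.
        return n_turns * tokens_per_msg
    # Each turn resends and reprocesses the entire history: O(N^2) total.
    return sum(turn * tokens_per_msg for turn in range(1, n_turns + 1))

print(total_tokens_processed(100, 1_000, cached=False))  # 5050000
print(total_tokens_processed(100, 1_000, cached=True))   # 100000
```

A 100-turn chat is ~50x cheaper with prefix caching in this toy accounting, which is roughly why providers cache aggressively but can't hold entries forever.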
You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
That's a bad estimate. Claude Code is explicitly a developer-shaped tool; we're not talking generic ChatGPT here. My guess is that probably closer to 75% of those users understand what caching is, with maybe 30% being able to explain what prompt caching actually is. Of course, the users who don't understand have access to Claude and can have it explain caching to them if they're interested.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:
You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.
Which is true of this issue too.
and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,
>then:
You would expect to be led to understand, like… a 1997 Prius.
“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)
Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.
It's not at all hard to find documentation on this topic. It could be made more prominent in the UI, but that's true of lots of things, and hammering on "AI 101" topics would clutter the UI for actual decision points the user may want to act on, ones you can't assume the user already knows about in the way you (should) be able to assume they know how LLMs eat up tokens in the first place.
So while I'd agree with your sarcasm that expecting users to be experts in the system is a big ask, where I disagree with you is that I think users should be curious and actively attempt to understand how it works around them. Given that the tooling changes often, this is an endless job.
Have you ever talked with users?
> this is an endless job
Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.
And then their vibe-coders tell us that we are to blame for using the product exactly as advertised (https://x.com/lydiahallie/status/2039800718371307603), all while silently changing how the product works.
Please stop defending hapless innocent corporations.
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artifact at this level of abstraction; effectively, at this level of abstraction, the cache is the prompt.)
The part of the prompt that has already been inferred no longer needs to be part of the input; it is replaced by the cached inference state. And none of this is tokens.
If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?
Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equivalently, linearly scales the expensive bits.
This is the operation that is basically done for each message in an LLM chat in the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on server-side (in KV cache). KV caches can be very large, e.g. tens of gigabytes.
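Concretely, the "logical level" is just this loop: every call carries the entire transcript, and only server-side prefix caching keeps that cheap (`call_model` is a stand-in for a chat API call, not a real SDK function):

```python
def call_model(messages: list[dict]) -> dict:
    # Stand-in for a real chat-completions call; returns a canned reply.
    return {"role": "assistant", "content": f"reply #{len(messages)}"}

history: list[dict] = []
for user_text in ["hello", "refactor this", "now add tests"]:
    history.append({"role": "user", "content": user_text})
    # The FULL history is sent every turn; nothing is incremental
    # on the client side. Incrementality lives in the server's KV cache.
    reply = call_model(history)
    history.append(reply)

print(len(history))  # 6 messages: 3 user turns + 3 assistant replies
```

Nothing in this loop tells the client whether the server still holds the KV state for the prefix, which is exactly the information people in this thread want surfaced.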
With this much cheaper setup backed by disks, they can offer a much better caching experience:
> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.
The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."
I'm really beginning to feel the lack of control when it comes to context, if I'm being honest.
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral, because keeping all that cache active is computationally expensive because...
You're still just running text through an extremely complex process and adding to that text; to avoid re-calculation of the entire chain, you need the cache.
Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.
I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.
You're acquiring users as a recurring revenue source. Consider stability and transparency of implementation details cost of doing business, or hemorrhage users as a result.
See also the difference between eg. MacOS (with large M, the older good versions) and waiting for "Year of linux on desktop".
I don't think the issue is making decisions for users, but trying to switch off the soup tap in the all-you-can-eat soup bar. Or, wrong business model setting wrong incentives to both sides.
I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
Especially in the analysis part of my work, I don't care about the actual text output itself most of the time; I try to make the model "understand" the topic.
In the first phase the actual text output is worthless; it just serves as an indicator that the context was processed correctly and that the future analysis work can depend on it. And they're… just throwing most of the relevant stuff out without any notice when I resume my session after a few days?
This is insane. Claude literally became useless to me and I didn't even know it until now, wasting a lot of my time building up good session context.
There would be nothing lost if they said "If you click yes, we will prune your old thinking, making Claude faster and saving you tons of tokens." Most people would probably say yes, so why not ask them? Make it an env variable (one that is announced, not secretly introduced to opt out of something new!) or at least write it in a changelog if they really don't want people to use it like before, so there'd be a chance to cancel the subscription in time instead of wasting tons of time on work patterns that no longer work.
1. Specifically, this suit was about price increases without clear consideration for both parties, but the same justifications apply to service restrictions without corresponding price decreases.
https://fortune.com/2026/04/20/italian-court-netflix-refunds...
> Our systems will smartly ignore any reasoning items that aren’t relevant to your functions, and only retain those in context that are relevant. You can pass reasoning items from previous responses either using the previous_response_id parameter, or by manually passing in all the output items from a past response into the input of a new one.
https://developers.openai.com/api/docs/guides/reasoning
Disclosure - work on AI@msft
The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.
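A rough sketch of what that stripping might look like, on a simplified message shape (this is not the real API schema, and not necessarily what Claude Code actually does):

```python
def strip_old_thinking(messages: list[dict], keep_last: int = 1) -> list[dict]:
    """Drop 'thinking' blocks from all but the most recent assistant turns.

    Saves re-cache tokens on a cold resume, at the risk of losing
    reasoning that later turns implicitly depended on; returns a new
    list and leaves the original transcript untouched.
    """
    assistant_idxs = [i for i, m in enumerate(messages)
                      if m["role"] == "assistant"]
    protected = set(assistant_idxs[-keep_last:]) if keep_last else set()
    out = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i not in protected:
            content = [b for b in m["content"] if b["type"] != "thinking"]
            out.append({**m, "content": content})
        else:
            out.append(m)
    return out
```

The failure mode several commenters describe is exactly when the pruned thinking was load-bearing for later turns, which is why doing this silently is so contentious.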
Anthropic already profited from generating those tokens. They can afford to subsidize reloading context.
And it's part of a larger problem of unannounced changes. It's just like when they introduced adaptive thinking in 4.6 a few weeks ago without notice.
Also, they seem to be completely unaware that some users might only use Claude Code because they are used to it not stripping thinking, in contrast to Codex.
Anyway, I'm happy that they saw it as a valid refund reason.
The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.
Granted, the "memory" can be available across sessions, as can docs...
The only issue is that it doesn't hit the cache, so it's expensive if you resume later.
Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.
Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?
As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.
The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is that it effectively "fine-tunes" that session into a place where your work product or questions are handled well.
(If this notion of sufficient context as fine-tuning seems surprising, the research is out there.)
Approaches tried need to deal with both of these:
1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.
2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.
I wish Anthropic's leadership would understand that the dev community is so vital that they should appreciate it a bit more (e.g., it's not nice to send lawyers after various devs without asking nicely first, ban accounts without notice, etc.). I appreciate it's not easy to scale.
OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
I think the best option would be to tell a user who is about to resurrect a conversation that has been evicted from cache that the session is no longer cached, and that they will face the full cost of replaying the session, not just the incremental question and answer.
(I understand that under the hood LLMs are O(n^2) by default, but it's very counterintuitive; and given how popular CC is becoming outside of nerd circles, probably a smaller and smaller fraction of users is aware of it.)
I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
Maybe the UI could do that for sessions that the user hasn't left yet, when the deadline comes near.
You ideally want to compact before the conversation is evicted from cache. If you knew you were going to use the conversation again later after cache expiry, you might do this deliberately before leaving a session.
Anthropic could do this automatically before cache expiry, though it would be hard to get right - they'd be wasting a lot of compute compacting conversations that were never going to be resumed anyway.
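The trade-off they'd have to tune could be as simple as a gate like this (the thresholds are pure guesses, and real triggers would presumably be richer than idle time alone):

```python
def should_compact(idle_seconds: float, context_tokens: int,
                   cache_ttl_s: float = 3600,
                   min_tokens: int = 200_000) -> bool:
    """Auto-compact shortly before the cache expires, for big contexts only.

    Compacting a small conversation wastes more compute than a cold
    resume would, so gate on context size as well as time-to-expiry.
    """
    near_expiry = idle_seconds > 0.9 * cache_ttl_s
    return near_expiry and context_tokens >= min_tokens
```

The hard part isn't the gate, it's the prediction problem behind it: most idle sessions are never resumed, so any proactive compaction burns compute on dead conversations.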
This feature has been live for a few days/weeks now, and with that knowledge I try to remember to at least get a progress report written when, for example, I'm close to the quota limit and the context is reasonably large. Or I continue with a /compact, but that tends to lead to having to repeat some things that didn't get included in the summary. Context management is just hard.
It's a little concerning that it's number 1 in your list.
I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?
To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.
Two questions if you see this:
1) if this isn't best practice, what is the best way to preserve highly specific contexts?
2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?
Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.
So then it comes to what you're talking about, which is processing the entire text chain (a different kind of cache), and generating the equivalent tokens is what's being charged for.
But once you realize that the efficiency of the product in extended sessions comes from caching in the immediate GPU hardware, it's obvious that the oversold product can't just idle the GPU when sessions idle.
Thank you.
I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
All the while all the official channels refused to acknowledge any problems.
Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
https://x.com/bcherny/status/2044291036860874901 https://x.com/bcherny/status/2044299431294759355
No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"
My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.
This cache information should probably get displayed somewhere within Claude Code
Love the product. <3
Mislead, gaslight, misdirect is the name of the game
Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message end the next one?
Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.
The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.
-- This answer directly contradicts your post. It seems like the biggest problem is a total lack of documentation for expected behavior. A similar thing happens if I ask Claude Code for the difference between plan mode and accept-edits-on.
Then Claude told me the only difference was that plan mode would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work and presents it in a totally different way. It is not just an "I will ask before applying changes" mode.
I have project folders/files and memory stored for each session, when I come back to my projects the context is drawn from the memory files and the status that were saved in my project md files.
Create a better workflow for yourself and your teams and do it the right way. Quit expecting the prompt to store everything for you.
For the Claude team: if you haven't already, I'd recommend you create some best practices for people that don't know any better. Otherwise people are going to expect things to be a certain way, and it's going to cause a lot of friction when people can't do what they expect to be able to do.
It’s hard to do it without killing performance and requires engineering in the DC to have fast access to SSDs etc.
Disclosure: work on ai@msft. Opinions my own.
Let's see what Boris Cherny himself and other Anthropic vibe-coders say about this:
https://x.com/bcherny/status/2044847849662505288
Opus 4.7 loves doing complex, long-running tasks like deep research, refactoring code, building complex features, iterating until it hits a performance benchmark.
https://x.com/bcherny/status/2007179858435281082
For very long-running tasks, I will either (a) prompt Claude to verify its work with a background agent when it's done... so Claude can cook without being blocked on me.
https://x.com/trq212/status/2033097354560393727
Opus 4.6 is incredibly reliable at long running tasks
https://x.com/trq212/status/2032518424375734646
The long context window means fewer compactions and longer-running sessions. I've found myself starting new sessions much less frequently with 1 million context.
https://x.com/trq212/status/2032245598754324968
I used to be a religious /clear user, but doing much less now, imo 4.6 is quite good across long context windows
---
I could go on
Yeah it's called lunch!
The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.
I'm curious why 1 hour was chosen?
Is increasing it a significant expense?
Ever since I heard about this behaviour I've been trying to figure out how to handle long-running Claude sessions, and so far every approach I've tried is suboptimal.
It takes time to create good context, which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs. something adjustable.
Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
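As a toy model of that proposal, with invented TTLs and ignoring the transfer-bandwidth problem raised elsewhere in the thread:

```python
class TieredKVCache:
    """Toy model of the proposal: hot (GPU) entries spill to a disk tier
    on expiry, and a later request rehydrates from disk instead of
    recomputing the whole prefix."""

    def __init__(self, hot_ttl_s: float = 3600,
                 disk_ttl_s: float = 7 * 86400):
        self.hot: dict = {}
        self.disk: dict = {}
        self.hot_ttl_s = hot_ttl_s
        self.disk_ttl_s = disk_ttl_s

    def put(self, key: str, blob: bytes, now: float) -> None:
        self.hot[key] = (blob, now)

    def get(self, key: str, now: float):
        """Returns (blob, tier) on a hit, or (None, 'miss')."""
        if key in self.hot:
            blob, t = self.hot[key]
            if now - t < self.hot_ttl_s:
                return blob, "hot"
            # Expired from GPU/RAM: demote to the disk tier.
            self.disk[key] = (blob, now)
            del self.hot[key]
        if key in self.disk:
            blob, t = self.disk[key]
            if now - t < self.disk_ttl_s:
                self.hot[key] = (blob, now)  # rehydrate: slow, but cheap
                return blob, "disk"
            del self.disk[key]
        return None, "miss"

cache = TieredKVCache()
cache.put("session-123", b"<tens of GB of KV state>", now=0.0)
blob, tier = cache.get("session-123", now=2 * 3600.0)  # past the hot TTL
print(tier)  # "disk": slower to serve, but no recompute and no token cost
```

The catch, as the sibling comments note, is that "blob" here is tens of GB per conversation, so the disk tier's capacity and the rehydration bandwidth are the real engineering constraints, not the logic.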
So it would probably be quite a long transfer to perform in these cases, and probably not very feasible to implement at scale.
The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!
Why did you lie 11 days ago, 3 days after the fix went in, about the cause of excess token usage?
how about acknowledging that you fucked up your own customers’ money and making a full refund for the affected period?
> Educating users on X/social
that is beyond me
You're no Boris; at best you're little Borya.
1) Is it okay to leave Claude Code CLI open for days?
2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?
Why not use a tiered cache?
Obviously storage is waaay cheaper than recalculating the KV state all the way from the very beginning of the session.
No matter how you put this explanation, it still sounds strange. Hell, you can even store the cache on the client if you must.
Please, tell me I'm not understanding what is going on... otherwise you really need to hire someone to look at this!
I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.
But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.
What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
It's kept for long enough that it's expensive to store in RAM, but short enough that the writes are frequent and will wear down SSD storage
But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier
What's driving the one-hour cache? Shouldn't people be able to have lunch, then come back and continue?
Are you expecting Claude Code users not to attend meetings?
I think product-wise you might need a better story on who uses claude-code, when and why.
Same thing with session logs, actually. I know folks who are definitely going to try to write a yearly R&D report and monthly timesheets based on text analysis of their Claude Code session files, and they're going to be incredibly unhappy when they find out it's all been silently deleted.
Either swallow the cost or be transparent to the user and offer both options each time.
You guys really need to communicate that better in the CLI for people not on social
It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.
I don't agree with this being characterized as a "corner case".
Isn't this how most long running work will happen across all serious users?
I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.
Don't CC users take lunch breaks?
How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?
I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.
In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthrophic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.
It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.
Probably that's why I hit my weekly limits 3-4 days ago, which were scheduled to reset later today. I just checked, and they are already reset.
Not sure if it's already done, but shouldn't there be a check somewhere that alerts when an outrageous number of tokens is getting written? That many tokens at once is a sign something's not right.
No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X
There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!
* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.
You lost huge trust with the A/B sham test. You lost trust with the enshittification of the tokenizer from 4.6 to 4.7. Why not just say "hey, due to huge increases in energy costs, GPU demand and compute constraints, we've had to increase Pro from $20 to $30"? You might lose 5% of customers. But the shady A/B thing and a dodgy tokenizer increasing burn rate tell everyone, incl. enterprise, that you don't care about honesty and integrity in your product.
I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.
Or even let the user control the cache expiry on a per-request basis, with a /cache command.
That way they decide whether to drop the cache right away, or extend it for 20 hours, etc.
It would cost tokens even though the underlying resource is memory/SSD space, not compute.
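For what it's worth, something like this already exists one layer down: the Anthropic Messages API lets callers set a per-block cache lifetime via `cache_control` (a 1-hour TTL sat behind a beta header, last I checked), so a /cache command could plausibly map onto it. A sketch of just the request shape, not sent anywhere; the beta flag, model id, and TTL value are assumptions to verify against current API docs:

```python
# Sketch of per-request cache control on the Anthropic Messages API.
# The "ttl" field and the beta header are assumptions based on the
# extended-cache-TTL beta; check the current docs before relying on them.
long_context = "...many tokens of carefully curated debugging context..."

headers = {
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "extended-cache-ttl-2025-04-11",  # assumed beta flag
}

payload = {
    "model": "claude-sonnet-4-5",  # hypothetical model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_context,
            # "1h" instead of the default 5m keeps the prefix warm longer,
            # at a higher cache-write price.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Pick up where we left off."}],
}
```

If the harness exposed this, "extend the cache before stepping away" would be a one-keystroke decision instead of a silent cost cliff.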
/loop 5m say "ok".
Will that keep the cache fresh?
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.
I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.
I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
Hard agree, would like to see a response to this.
How does this help me as a customer? If I have to redo the context from scratch, I will not only pay the high token cost again, but also pay with my own time to fill it.
The cost of reloading the window didn't go away; it just went up even more.
I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.
Construction of context is not an LLM pass -- it shouldn't even count towards token usage. The word 'caching' itself says: don't recompute me.
Since the devs on HN (& the whole world) are buying what looks like nonsense to me -- what am I missing?
1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.
2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.
I'm sure most companies and customers will consider compromising quality for an 80% cost reduction. If they're just honest about it, they'll be fine.
Glad I use kiro-cli which doesn't do this.
The deterioration was real and annoying, and it shines a light on the problematic lack of transparency about what exactly is going on behind the scenes, and on the somewhat arbitrary token-cost-based billing -- there are too many factors at play; if you wanted to trace it all as a user, you might as well do the work yourself instead.
The fact that waiting for a long time before resuming a convo incurs additional cost and lag seemed clear to me from having worked with LLM APIs directly, but it might be important to make this more obvious in the TUI.
Every one of these changes had the same goal: trading the intelligence users rely on for cheaper or faster outputs. Users adapt to how a model behaves, so sudden shifts without transparency are disorienting.
The timing also undercuts their narrative. The fixes landed right before another change with the same underlying intent rolled out. That looks more like they were just reacting to experiments rather than understanding the underlying user pain.
When people pay hundreds or thousands a month, they expect reliability and clear communication, ideally opt-in. Competitors are right there, and unreliability pushes users straight to them.
All of this points to their priorities not being aligned with their users’.
Framing this as "aligned" or "not aligned" ignores the interesting reality in the middle. It is banal to say an organization isn't perfectly aligned with its customers.
I'm not disagreeing with the commenter's frustration. But I think it can help to try something out: take say the top three companies whose product you interact with on a regular basis. Take stock of (1) how fast that technology is moving; (2) how often things break from your POV; (3) how soon the company acknowledges it; (4) how long it takes for a fix. Then ask "if a friend of yours (competent and hard working) was working there, would I give the company more credit?"
My overall feel is that people underestimate the complexity of the systems at Anthropic and the chaos of the growth.
These kind of conversations are a sort of window into people's expectations and their ability to envision the possible explanations of what is happening at Anthropic.
Making changes like reducing the usage window at peak times (https://x.com/trq212/status/2037254607001559305) without announcing it (until after the backlash) is the sort of thing that's making people lose trust in Anthropic. They completely ignored support tickets and GitHub issues about that for 3 days.
You shouldn't have to rely on finding an individual employee's posts on Reddit or X for policy announcements.
That policy hasn't even been put into their official documentation nearly one month on - https://support.claude.com/en/articles/11647753-how-do-usage...
A company with their resources could easily do better.
But they come after the team gaslit everyone, telling us it was a skill issue.
The near-instant transition from "there is no problem" to "we already fixed the problem so stop complaining" is basically gaslighting. (Admittedly the second sentiment comes more from the community, but they get that attitude after taking the "we fixed all the problems" posts at face value.)
That's the reason for the flak
> We take reports about degradation very seriously. We never intentionally degrade our models [...] On March 4, we changed Claude Code's default reasoning effort from high to medium
Anthropic is the best company of its kind, but that is badly worded PR. It is certainly true that they did a poor job communicating this change to users (I did not know that the default was “high” before they introduced it, I assumed they had added an effort level both above and below whatever the only effort choice was there before). On the other hand, I was using Claude Code a fair bit on “medium” during that time period and it seemed to be performing just fine for me (and saving usage/time over “high”), so it doesn't seem clear that that was the wrong default, if only it had been explained better.
I would say it does, and I'd be loath to use anything made by people who'd couch that change to defaults as "providing a selectable option to use a faster, cheaper version".
Yuck.
Did I miss something? I'm only looking at primary sources to start. Not Reddit. Not The Register. Official company communications.
Did Anthropic tell users i.e. "you are wrong, your experience is not worse."? If so, that would reach the bar of gaslighting, as I understand it (and I'm not alone). If you have a different understanding, please share what it is so I understand what you mean.
That said, the copy uses "we never intentionally degrade our models" to mean something like "we never degrade one facet of our models unless it improves some other facet of our models". This is a cop out, because it is what users suspected and complained about. What users want - regardless of whether it is realistic to expect - is for Anthropic to buy even more compute than Anthropic already does, so that the models remain equally smart even if the service demand increases.
Some terms: ... The model is the thing that runs inference. Claude Code is not a model; it is a harness. To summarize Anthropic's recent retrospective, their technical mistakes were about the harness.
I'm not here to 'defend' Anthropic's mistakes. They messed up technically. And their communication could have been better. But they didn't gaslight. And on balance, I don't see net evidence that they've "copped out" (by which I mean mischaracterized what happened). I see more evidence of the opposite. I could be wrong about any of this, but I'm here to talk about it in the clearest, best way I can. If anyone wants to point to primary sources, I'll read them.
I want more people to spend a few minutes and actually give the explanation offered by Anthropic a try. What if isolating the problems was genuinely hard? We all know hindsight is 20/20, and yet people still armchair-quarterback.
At the risk of sounding preachy, I'm here to say "people, we need to do better". Hacker News is a special place, but we lose it a little bit every time we don't put in a quality effort.
No worries about 'sounding preachy'; it's a good thing people want to uphold the sobriety that makes HN special.
They knew they had deliberately made their system worse, despite their lame promise published today that they would never do such a thing. And so they incorrectly assumed that their ham fisted policy blunder was the only problem.
Still plenty I prefer about Claude over GPT but this really stings.
> They knew they had deliberately made their system worse
Define "they". The teams that made particular changes? In real-world organizations, not all relevant information flows to all the right places at the right time. Mistakes happen because these are complex systems.
Define "worse". There are a lot of factors involved. With a given amount of capacity at a given time, some aspect of "quality" has to give. So "quality" is a judgment call. It is easy to use a non-charitable definition to "gotcha" someone. (Some concepts are inherently indefensible. Sometimes you just can't win. "Quality" is one of those things. As soon as I define quality one way, you can attack me by defining it another way. A particular version of this principle is explained in The Alignment Problem by Brian Christian, by the way, regarding predictive policing, iirc.)
I'm seeing a lot of moral outrage but not enough intellectual curiosity. It is embarrassingly easy to say "they should have done better" ... ok. Until someone demonstrates to me that they understand the complexity of a nearly-billion-dollar company rapidly scaling with new technology, growing faster than most people comprehend, I think ... they are just complaining and cooking up reasons to feel justified in complaining. This possible truth -- complex systems are hard to do well -- apparently doesn't scratch that itch for many people. So they reach for blame. This is not the way to learn. Blaming tends to cut off curiosity.
I suggest this instead: redirect if you can to "what makes these things so complicated?" and go learn about that. You'll be happier, smarter, and ... most importantly ... be building a habit that will serve you well in life. Take it from an old guy who is late to the game on this. I've bailed on companies because "I thought I knew better". :/
Accidentally/deliberately making your CS teams ill-informed should not function as a get out of jail free card. Rather the reverse.
1. Degraded service sucks.
2. Anthropic not saying i.e. "we're not seeing it" sucks.
3. Not getting a fix when you want it sucks.
Try to understand what I mean when I say none of the above meets the following sense of gaslighting: "Gaslighting is the manipulation of someone into questioning their perception of reality." Emphasis on understand what I mean. This says it well: [1]. If you can point me to an official communication from Anthropic where they say "User <so and so> is not actually seeing degraded performance" when Anthropic knows otherwise, that would clearly be gaslighting -- intent matters in my book.
But if their instrumentation was bad and they were genuinely reporting what they could see, that doesn't cross into gaslighting by my book. But I have a tendency to think carefully about ethical definitions. Some people just grab a word off the shelf with a negative valence and run with it: I don't put much stock in what those people say. Words are cheap. Good ethical reasoning is hard and valuable.
It's fine if you have a different definition of "gaslighting". Just remember that some of us have actually been gaslit by people, so we prefer to save the word for situations where the original definition applies. People like us are not opposed to being disappointed, upset, or angry at Anthropic, but we have certain epistemic standards that we don't toss out when an important tool fails to meet our expectations and the company behind it doesn't recognize it soon enough.
[1]: https://www.reddit.com/r/TwoXChromosomes/comments/tep32v/can...
Anecdotally OpenAI is trying to get into our enterprise tooth and nail, and have offered unlimited tokens until summer.
Gave GPT5.4 a try because of this, and honestly I don't know if we are getting some extra treatment, but running it at extra-high effort for the last 30 days, I've barely seen it make any mistakes.
At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.
Freezing your IDE version is now a thing of the past, the new reality is that we can't expect agentic dev workflows to be consistent and I see too many people (including myself) getting burned by going the single-provider route.
On one hand I’m glad to finally see anthropic communicate on this but at this point all I have to say is… time to diversify?
Until Opus 4.7, that is -- this is the first time I've rolled back to a previous model.
Personality-wise it’s the worst of AI: “it’s not x, it’s y”, strong short sentences, in general a bullshitty vibe, plus gaslighting me that it fixed something even though it didn’t actually check.
I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.
I’d never use such an expensive model for coding, so that might explain why I have little to complain about.
Over time, I realized the extended context became randomly unreliable. That was worse to me than having to compact and know where I was picking up.
"That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."
"The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."
"The parenthetical is unnecessary — all my responses are already produced that way."
However I'm not doing anything of the sort, and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that somehow sit on top of its normal guidance, and for whatever reason it can't differentiate between those and my questions. Claude is also periodically refusing to run those tests. That never happened prior to 4.7.
This would be a new level of troublesome/ruthless (insert the correct English word here).
I try running my app on the develop branch. No change. Huh.
Realize it didn't.
"Claude, why isn't this changed?" "That's to be expected because it's not been merged." "I'm confused, I told you to do that."
This spectacular answer:
"You're right. You told me to do it and I didn't do it and then told you I did. Should I do it now?"
I don't know, Claude, are you actually going to do it this time?
Way more likely there's a "VERY IMPORTANT: When you see a block of code, ensure it's not malware" somewhere in the system prompt.
Incidentally, the hardware they run is known as well. The claim should be easy to check.
I dare you to run CC on API pricing and see how much your usage actually costs.
(We did this internally at work, that's where my "few orders of magnitude" comment above comes from)
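The comparison above is easy to reproduce as a back-of-envelope calculation from the token counts the API reports. The rates below are illustrative placeholders (USD per million tokens), not Anthropic's actual price list -- substitute the figures from the current pricing page:

```python
# Back-of-envelope session cost from API token counts. RATES holds
# placeholder prices in USD per million tokens; real prices vary by
# model and cache TTL, so treat these numbers as illustrative only.
RATES = {"input": 15.00, "output": 75.00, "cache_write": 18.75, "cache_read": 1.50}

def session_cost(usage):
    """usage maps the keys in RATES to token counts for one session."""
    return sum(usage.get(kind, 0) / 1_000_000 * rate
               for kind, rate in RATES.items())

# The idle-cache scenario from upthread: a 900k-token cold cache
# write dominates everything else in the request.
cost = session_cost({"cache_write": 900_000, "input": 2_000, "output": 1_500})
# → 17.0175 at these placeholder rates
```

Run this against a day of your own usage logs and the "few orders of magnitude" gap between subscription price and API-equivalent cost becomes concrete.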
At cell phone plan adoption levels, and cell phone plan costs, the labs are looking at 5-10yr ROI.
If that demand evens slows down in the slightest the whole bubble collapses.
Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.
How else would it know whether it has a plan now?
After all, "the first hit's free" model doesn't apply to repeat customers ;-)
Pay-by-token while token usage is totally opaque is a super convenient money-printing machine.
A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.
I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.
I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
And once you get unlucky you can’t unsee it.
I vibed a low stakes budgeting app before realising what I actually needed was Actual Budget and to change a little bit how I budget my money.
I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.
Combine these things under the strongest interpretation instead of an easy-to-attack one, and it's very reasonable to posit that a critical mass has been reached: enough people report issues that others try their own investigations, while the negative outliers get the most online attention.
I'm not convinced this is the story (or, at least the biggest part of it) myself but I'm not ready to declare it illogical either.
When everyone's talking about the real degradation, you'll also get everyone who experiences "random"[1] degradation thinking they're experiencing the same thing, and chiming in as well.
[1] I also don't think we're talking the more technical type of nondeterminism here, temperature etc, but the nondeterminism where I can't really determine when I have a good context and when I don't, and in some cases can't tell why an LLM is capable of one thing but not another. And so when I switch tasks that I think are equally easy and it fails on the new one, or when my context has some meaningless-to-me (random-to-me) variation that causes it to fail instead of succeed, I can't determine the cause. And so I bucket myself with the crowd that's experiencing real degradation and chime in.
Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.
We did both -- we did a number of UI iterations (e.g. improving thinking loading states, making it clearer how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).
For instance:
Is Haiku supposed to hit a warm system-prompt cache in a default Claude code setup?
I had `DISABLE_TELEMETRY=1` in my env and found the Haiku requests would not hit a warm-cached system prompt. E.g. on first request just now with the most recent version (v2.1.118, but it happened on others):
with telemetry off: input_tokens: 10, cache_read: 0, cache_write: 28897, output: 249
with telemetry on: input_tokens: 10, cache_read: 24344, cache_write: 7237, output: 243
I used to think having so many users was leading to people hitting a lot of edge cases, 3 million users is 3 million different problems. Everyone can't be on the happy path. But then I started hitting weird edge cases and started thinking the permutations might not be under control.
https://github.com/anthropics/claude-code/issues/36286 https://github.com/anthropics/claude-code/issues/25018
UI is UI. It is naive to build some UI and expect users to "just magically" discover that they should use it -- in a terminal app, no less.
Anthropic: removes thinking output
Users: see long pauses, complain
Anthropic: better reduce thinking time
Users: wtf
To me it really, really seems like Anthropic is trying to undo the transparency they always had around reasoning chains, and a lot of issues are due to that.
Removing thinking blocks from the convo after 1 hour of being inactive without any notice is just the icing on the cake, whoever thought that was a good idea? How about making “the cache is hot” vs “the cache is cold” a clear visual indicator instead, so you slowly shape user behavior, rather than doing these types of drastic things.
They had droves of Claude devs vehemently defending and gaslighting users when this started happening
I don't know about others, but sessions that are idle > 1h are definitely not a corner case for me. I use Claude code for personal work and most of the time, I'm making it do a task which could say take ~10 to 15mins. Note that I spend a lot of time back and forth with the model planning this task first before I ask it to execute it. Once the execution starts, I usually step away for a coffee break (or) switch to Codex to work on some other project - follow similar planning and execution with it. There are very high chances that it takes me > 1h to come back to Claude.
- Claude Code is _vastly_ more wasteful of tokens than anything else I've used. The harness is just plain bad. I use pi.dev and created https://github.com/rcarmo/piclaw, and the gaps are huge -- even the models through Copilot are incredibly context-greedy when compared to GPT/Codex
- 4.7 can be stupidly bad. I went back to 4.6 (which has always been risky to use for anything reliable, but does decent specs and creative code exploration) and Codex/GPT for almost everything.
So there is really no reason these days to pay either their subscription or their insanely high per/token price _and_ get bloat across the board.
Still use CC at work because team standards, but I'd take my OpenCode stack over it any day.
Care to share what you changed, maybe even the code?
1) Curated a set of models I like and heavily optimized all possible settings, per agent role and even per skill (had to really replumb a lot of stuff to get it as granular as I liked)
2) Ported from sqlite to postgresql, with heavily extended schema. I generate embeddings for everything, so every aspect of my stack is a knowledge graph that can be vector searched. Integrated with a memory MCP server and auditing tools so I can trace anything that happens in the stack/cluster back to an agent action and even thinking that was related to the action. It really helps refine stuff.
3) Tight integration of Gitea server, k3s with RBAC (agents get their own permissions in the cluster), every user workspace is a pod running opencode web UI behind Gitea oauth2.
4) Codified structure of `/projects/<monorepo>/<subrepos>` with a simpler browser so non-technical family members can manage their work more easily (agents handle all the management and there are sidecars handling all gitops, transparent to the user)
5) Transparent failover across providers with cooldown by making model definitions linked lists in the config, so I can use a handful of subscriptions that offer my favorite models, and fail over from one to the next as I hit quota/rate limits. This has really cut my bill down lately, along with skipping OpenRouter for my favorite models and going direct to Alibaba and Xiaomi so I can tailor caching and stuff exactly how I want.
6) Integrated filebrowser, a fork of the Milkdown Crepe markdown editor, and the CodeMirror editor so I don't even need an IDE anymore. I just work entirely from the OpenCode web UI on whatever device is nearest at the moment. I added support for using Gemma 4 locally on CPU from my phone yesterday while waiting in line at a store.
Those are the big ones off the top of my head. Im sure there's more. I've probably made a few hundred other changes, it just evolves as I go.
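The linked-list failover in point 5 is the most transferable idea here, so here is a toy sketch of the mechanism. All names, the config class, and the error type are stand-ins for illustration, not OpenCode's actual config schema:

```python
# Toy sketch of point 5: model definitions form a linked list, so a
# quota/rate-limit failure hops to the next provider in the chain.
class ModelConfig:
    def __init__(self, name, next_cfg=None):
        self.name = name
        self.next = next_cfg  # next provider to try after a failure

def call_with_failover(cfg, call):
    """Walk the chain until one provider's call succeeds."""
    while cfg is not None:
        try:
            return call(cfg.name)
        except RuntimeError:  # stand-in for a 429/quota error
            cfg = cfg.next
    raise RuntimeError("all providers exhausted")

chain = ModelConfig("primary-sub", ModelConfig("backup-sub"))

def flaky(name):
    # Simulate the primary subscription being rate limited.
    if name == "primary-sub":
        raise RuntimeError("rate limited")
    return f"ok:{name}"

result = call_with_failover(chain, flaky)  # → "ok:backup-sub"
```

The nice property is that cooldowns and priorities live entirely in the chain ordering, so juggling several subscriptions needs no changes to the calling code.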
There are a few out there (latest example is Zed's new multi-agent UI), but they still rely on the underlying agent's skill and plugin system. I'm experimenting with my own approach that integrates a plugin system that can dynamically change the agent skillset & prompts supplied via an integrated MCP server, allowing you to define skills and workflows that work regardless of the underlying agent harness.
It's very clear that there's money or influence exchanging hands behind the scenes with certain content creators, The Information, and OpenAI.
These bugs have all of the same symptoms: undocumented model regressions at the application layer, and engineering cost optimizations that resulted in real performance regressions.
I have some follow up questions to this update:
- Why didn't September's "Quality evaluations in more places" catch the prompt change regression, or the cache-invalidation bug?
- How is Anthropic using these satisfaction questions? My own analysis of my own Claude logs showed strong material declines in satisfaction here, and I always answer those surveys honestly. Can you share what the data looked like, and whether you were using it to identify some of these issues?
- There was no refund or comped tokens in September. Will there be some sort of comp to affected users?
- How should subscribers of Claude Code trust that Anthropic side engineering changes that hit our usage limits are being suitably addressed? To be clear, I am not trying to attribute malice or guilt here, I am asking how Anthropic can try and boost trust here. When we look at something like the cache-invalidation there's an engineer inside of Anthropic who says "if we do this we save $X a week", and virtually every manager is going to take that vs a soft-change in a sentiment metric.
- Lastly, when Anthropic changes Claude Code's prompt, how much performance against the stated Claude benchmarks are we losing? I actually think this is an important question to ask, because users subscribe to the model's published benchmark performance and are sold a different product through Claude Code (as other harnesses are not allowed).
[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...
Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
I finally got fired.
To what degree do I predict another person/org will give me what I need and why?
This shifts "trust" away from all-or-nothing, and it gets me thinking about things like "what are the moving parts?" and "what are the incentives?" and "what is my plan B?". In my life experience, looking back, when I've found myself swinging from "high trust" to "low trust", the change was usually rooted in my expectations; it was usually rooted in me having a naive understanding of the world that was rudely shattered.
Will you force trust to be a bit? Or can you admit a probability distribution? Bits (true/false or yes/no or trust/don't trust) thrash wildly. Bayesians update incrementally: this is (a) more pleasant; (b) more correct; (c) more curious; (d) easier to compare notes with others.
I'm using Zed and Claude Code as my harnesses.
I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.
However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode
PS: I’m not referencing a well-known book to suggest the solution is trite product-group-think, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the latter recently.
That said, that may not have been obvious at all in the Jan/Feb time frame when they got a wave of customers due to ethical concerns.
But worse, based on the pronouncements of Dario et al., I suspect management is entirely unsympathetic, because they believe we (SWEs) are on the chopping block, to be replaced. And any intimation that guard rails should be put around these tools for quality reasons is, I suspect, being ignored or discouraged.
In the end, I feel like Claude Code itself started as a bit of a science experiment and it doesn't smell to me like it's adopted mature best practices coming out of that.
I really appreciate these little touches.
vim ~/.claude/settings.json

{
  "model": "claude-opus-4-6",
  "fastMode": false,
  "effortLevel": "high",
  "alwaysThinkingEnabled": true,
  "autoCompactWindow": 700000
}
In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.
At least tell users when the system prompt has changed.
https://techtrenches.dev/p/the-snake-that-ate-itself-what-cl...
But in either case, if compute is so limited, they’ll have to compete with local coding agents. Qwen3.6-27B is good enough to beat having to wait until 5PM for your Claude Code limit to reset.
Also I don’t know how “improving our Code Review tool” is going to improve things going forward, two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.
Of course, all their vibe coding is being done with effectively infinite tokens, so...
I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.
Will they restore my account usage limits? Since I no longer have Max?
Is that one week usage restored, or the entire buggy timespan?
I’ll stay on 4.6 for awhile. Seems to be better. What’s frustrating, though you cannot rely on these tools. They are constantly tinkering and changing with things and there’s no option to opt out.
I mean, yes, even testing in production with some of your customer is better than.. testing with ALL of your customers?
Agents are not deterministic; they are probabilistic. If the same agent is run repeatedly, it will accomplish the task a consistent percentage of the time. I wish I were better at math or English so I could explain this.
I think they call it an EVAL, but developers don't discuss that too much. All they discuss is how frustrated they are.
A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of the time. Remove a sentence and it will solve the problem 70% of the time.
It is so friggin' easy to set up -- to steal the word from the AI sphere -- a TEST HARNESS.
Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn't pass/fail. It's whether the agent still solves the problem at the same rate it consistently has.
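To make that concrete, here's a minimal sketch of such a harness in Python. The `run_agent` function is purely a simulated stand-in (a real harness would call the actual agent); the hard-coded solve rates are made up for illustration:

```python
import random

def run_agent(prompt, task):
    # Stand-in for a real agent call; we pretend one extra sentence
    # in the prompt is worth 10 points of solve rate.
    p = 0.90 if "step by step" in prompt else 0.80
    return random.random() < p

def solve_rate(prompt, task, trials=1000):
    """Run the same agent many times and estimate its solve rate."""
    wins = sum(run_agent(prompt, task) for _ in range(trials))
    return wins / trials

random.seed(0)
baseline = solve_rate("Fix the failing test.", "bug-123")
changed = solve_rate("Fix the failing test. Think step by step.", "bug-123")
print(f"baseline={baseline:.2f}  changed={changed:.2f}  delta={changed - baseline:+.2f}")
```

Swap the stand-in for real agent runs and track whether the rate drifts between releases; that's the whole idea.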
Thank you for the perfect explanation.
Last week, in my confusion about the word (Anthropic was using test, eval, and harness in the same sentence, so I thought Anthropic had made a test harness), I asked Google "in computer science, what is a harness?" It responded only discussing test harnesses, which solidified my thinking that that's what it was.
I wish Google had responded as clearly as you did. In my defense, we don't know whether we understand something until we discuss it.
The first tries to answer what happens when I give the models harder and harder arithmetic problems, to the point that Sonnet will burn 200k tokens over 20 minutes. [0]
The other is a very deep dive into the math of a reasoning model in the only way I could think to approach it, with data visualizations, seeing the computation of the model in real time in relation to all the parts.[1]
Two things I've learned. One is that the behavior of an agent that reverse engineers a website and the behavior of an agent that does arithmetic are the same kind of thing: for a given agent and task, the probability of solving the intended task is a distribution. The other is that models have a blind spot: creating a red-team adversary bug-hunter agent will not surface a bug if the same model originally wrote the code.
Understanding that, and knowing that I can verify at the end or use majority of votes (MoV), using agents to automate extremely complicated tasks can be very reliable, with a quantifiable amount of certainty.
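For what it's worth, the majority-of-votes part is easy to sketch. Here `ask()` is a hypothetical stand-in for one nondeterministic model sample (the 3-in-4 accuracy is made up):

```python
import random
from collections import Counter

def ask(question):
    # Stand-in for one model sample: right 3 times out of 4 on average.
    return random.choice(["42", "42", "42", "41"])

def majority_vote(question, k=25):
    """Sample k answers and keep the most common one, plus its vote share."""
    votes = Counter(ask(question) for _ in range(k))
    answer, count = votes.most_common(1)[0]
    return answer, count / k

random.seed(0)
answer, share = majority_vote("What is 6 * 7?")
print(answer, f"{share:.0%}")
```

Sampling more (larger k) tightens the certainty, at the cost of tokens.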
Claude caveman in the system prompt confirmed?
Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.
I would not suspect quantization before I would suspect harness changes.
Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.
I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.
For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.
A better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for a given quality of output, though.
Frontier LLMs still suck a lot, you can't afford planned degradation yet.
Right now my solution is to run CC in tmux and keep a 2nd CC pane with /loop watching the first pane and killing CC if it detects plan mode being bypassed. Burning tokens to work around a bug.
if only there were a place with 9,881 pieces of feedback waiting to be triaged...
and maybe not by a duplicate-bot that goes wild and just autocloses everything; just blessing some of the stuff there with a "you've been seen" label would go a long way...
A reminder: your vibe-coded slop required peak 68GB of RAM, and you had to hire actual engineers to fix it.
... But then again, many of us are paying out of pocket $100, $200USD a month.
Far more than any other development tools.
Services that cost that much money generally come with expectations.
A month prior, their vibe-coders were unironically telling the world that their TUI wrapper for their own API is a "tiny game engine" while they were (and still are) struggling to output a couple hundred characters on screen: https://x.com/trq212/status/2014051501786931427
Meanwhile Boris: "Claude fixes most bugs by itself," while breaking the most trivial functionality all the time: https://x.com/bcherny/status/2030035457179013235 https://x.com/bcherny/status/2021710137170481431 https://x.com/bcherny/status/2046671919261569477 https://x.com/bcherny/status/2040210209411678369 and while claiming they "test carefully": https://x.com/bcherny/status/2024152178273989085
Once OpenAI added the $100 plan, it was kind of a no-brainer.
Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.
It makes sense, in a way. It means the subscription deal is something along the lines of fixed / predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quotas consumptions), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or us if, again, we’re feeling generous) / makes the deal sustainable for them.
It’s a trade-off
It may be (but I wouldn't know) that some of the other changes not covered here reduced costs on their side without impacting users, improving the viability of their subscription model. Or maybe they even improved things for users.
I’d really appreciate more transparency on this, and not just when things fail.
But I’ve learned my lesson. I’ve been weaning off Claude for a few weeks: I cancelled my subscription three weeks ago, let it expire yesterday, and moved to both another provider and a third-party open-source harness.
If you worry about "degraded" experience, then let people choose. People won't be using other wrappers if they turn out to be bad. People ain't stupid.
> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7
They can pick the default reasoning effort:
> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode
They can decide what to keep and what to throw out (beyond simple token caching):
> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6
It literally is all in the post.
I don't worry about anything though. It's not my product. I don't work for Anthropic, so I really couldn't care less about anyone else's degraded (or not) experience.
They control the default system prompt. You can change it if you want to.
> They can pick the default reasoning effort
I don't see how it's an obstacle to allowing third-party wrappers.
> They can decide what to keep and what to throw out
That's actually a good point. However, I still don't think it's an obstacle. If third-party wrappers were bad, people simply wouldn't use them.
Defaults matter. A large share of people never change them (status quo bias, psychological inertia). Having control over them (and usage quotas) means Anthropic can control and fine-tune what this fixed subscription costs them.
And evidently (re, the original article), they tried to do so.
Allowing third party wrappers doesn't mean Claude Code would cease to exist. The opposite actually, Claude Code would be the default.
People dissatisfied with Code would simply use other wrappers. I call it a win-win. Don't see how Anthropic would be on a lose here, they would still retain the ability to control the defaults.
I have no idea what the share of OpenClaw instances running on pi was, or third-party wrappers in general, but it was obviously large enough that Anthropic decided they had to put an end to it.
Conversely, from the latest developments, it would seem they are perfectly fine with people running OpenClaw with Claude models through Claude Code’s programmatic interface using subscriptions.
But in the end, this, my take, your take, is all conjecture. We are both on the outside looking in.
Only the people who work at Anthropic know.
I don't know, their desktop app felt really laggy and even switching Code sessions took a few seconds of nothing happening. Since the latest redesign, however, it's way better, snappy and just more usable in most respects.
I just think we notice disruptive negative things more. Even with the desktop app, the remaining flaws jump out: for example, the Chat / Cowork / Code switcher only shows the label for the currently selected mode while the others are (not very big) icons; a colleague literally didn't notice those modes exist in the desktop app (or at least that that's where you switch to them).
The AI hype is dying, at least outside the silicon valley bubble which hackernews is very much a part of.
That and all the dogfooding by slop coding their user facing application(s).
Error: claude-opus-4-7[1m] is temporarily unavailable, so auto mode cannot determine the safety of Bash right now. Wait briefly and then try this action again. If it keeps failing, continue with other tasks that don't require this action and come back to it later. Note: reading files, searching code, and other read-only operations do not require the classifier and can still be used.
The only solution is to switch out of auto mode, which now seems to be the default every time I exit plan mode. Very annoying.

Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT.) So is this just the normal weekly reset? Except now my reset says it will come Saturday? This is super confusing!
This sounds fishy. It's easy to show users that Claude is making progress by either printing the reasoning tokens or printing some kind of progress report. Besides, "very long" is such a weasel phrase.
If a message will trigger a cache recreation, the cost of that should be viewable.
2. Old sessions had their thinking tokens stripped; resuming the session made Claude stupid (took 15 days to notice and remediate)
3. A system prompt change to make Claude less verbose reduced coding quality (4 days -- better)
All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights its user base ("we never degrade model performance") is frustrating.
Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.
However, you are obligated to communicate honestly with your users to manage expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.
Doing this proactively would certainly match expectations for a fast-moving product like this.
This one was egregious: after a one-hour user pause, apparently they cleared the old thinking and then kept applying the “forgetting” every turn for the rest of the session after the resume!
Seems like a very basic software engineering error that would be caught by normal unit testing.
To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.
They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).
However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.
Sure, they didn't change the GPUs they're running or the quantization, but if valuable information is removed and the model performs worse, performance was degraded.
In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.
Curious about this section on the system prompt change: >> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.
Curious what helped the later evals catch it vs. the initial ones. Was it that the initial testing was an online A/B comparison of aggregate metrics, or that the dataset was not broad enough?
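For readers unfamiliar with the term, the ablation loop they describe is conceptually simple. A sketch, with `run_eval` as a hypothetical scorer (the scoring rule here is invented purely for illustration):

```python
def run_eval(system_prompt):
    # Stand-in scorer: pretend one specific line is worth 3 points of quality.
    return 90.0 + (3.0 if "be thorough" in system_prompt else 0.0)

def ablate(lines):
    """Drop each line in turn and report the score delta attributable to it."""
    full = run_eval("\n".join(lines))
    deltas = {}
    for i, line in enumerate(lines):
        without = lines[:i] + lines[i + 1:]
        deltas[line] = run_eval("\n".join(without)) - full
    return full, deltas

prompt_lines = ["be concise", "be thorough", "prefer small diffs"]
full, deltas = ablate(prompt_lines)
print(full, deltas)
```

The question is whether `run_eval` (the evaluation set) is broad enough that a 3% drop from one line actually shows up, which is presumably what the initial evals missed.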
Do researchers know correlation between various aspects of a prompt and the response?
An LLM, to me at least, appears to be a wildly random function that's difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system arrived at its output. This doesn't appear to be the case for LLMs, where inputs and outputs are arbitrary text.
Anecdotally, I had a difficult time working with open-source models at a social media firm, where something as simple as wrapping an example JSON structure in ```, adding a newline, or the wording I used wildly changed accuracy.
The artificial creation of demand is also a concerning sign.
I don't have trust in it right now. More regressions, more oversights; it's pedantic in weird ways. Ironically, it requires more handholding.
Not saying it's a bad model; it's just not simple to work with.
for now: `/model claude-opus-4-6[1m]` (you'll get different behavior around compaction without [1m])
Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.
Many of these things have bitten me too: firing off a request that is slow because it was kicked out of cache, then getting zero cache hits, which makes everything way more expensive; so it makes sense they would try this. I tried skipping tool calls and thinking as well, and it made the agent much stupider. These all seem like natural things to try. Pity.
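Back-of-the-envelope on why the cold resume hurts, with illustrative per-megatoken prices (not Anthropic's actual rates; the only assumption is that cache writes cost far more than cache reads):

```python
def resume_cost(context_tokens, cache_write_per_mtok, cache_read_per_mtok):
    """Compare resuming a warm (cached) session vs. a cold one after idle."""
    warm = context_tokens / 1e6 * cache_read_per_mtok   # everything hits cache
    cold = context_tokens / 1e6 * cache_write_per_mtok  # full cache rewrite
    return warm, cold

# Illustrative prices only: say cache reads are $1.50/Mtok and writes $18.75/Mtok.
warm, cold = resume_cost(900_000, 18.75, 1.50)
print(f"warm resume ≈ ${warm:.2f}, cold resume ≈ ${cold:.2f}")
```

With a 900k-token context, the cold path is an order of magnitude pricier, which is roughly the corner case described upthread.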
Also, it may just be a coincidence that the article was published right before the GPT 5.5 launch, and that they then restored the original model while releasing a PR statement claiming it was due to bugs.
But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
The real lesson is that an internal message-queuing experiment masked the symptoms in their own dogfooding. Dogfooding only works when the eaten food is the shipped food.
But if we're vibing... this is the kind of bug that should make it back into a review agent/skill's instructions in a more generic form: essentially, if something is done to the message history, check that there are tests confirming subsequent turns work as expected.
But yeah, you'd have to piss off a bunch of users in prod first to discover the blind spot.
now it's back to regular slop and just to check otherwise i have to spend at least $100
At the same time, personally I find prioritizing quality over quantity of output to be a better personal strategy. Ten partially buggy features really aren't as good as three quality ones.
You have too many and the wrong benchmarks
What verbosity? Most of the time I don’t know what it’s doing.
how do you just do that to millions of users building prod code with your shit
But if a tool is better, it's better.
We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?
If you have a good product, you are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value ratio went down. But Opus 4.5 was noticeably better and only came out in November.
There was no price increase at that time, so for the same money we got better models. Opus 4.6 again feels noticeably better, though.
Also, moving fast-ish means getting more and better models sooner.
I do know plenty of people, though, who use opencode or pi with OpenRouter and switch models a lot more often.
Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.
I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.
That said, there is now much better competition with Codex, so there's only so much rope they have now.
I never understood why people cheered for Anthropic then when they happily work together with Palantir.
They don't actually pay the bill or see it.
Idiots keep throwing money at real-time enshittification and "I am changing the terms. Pray I do not change them further."
And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' as idiots.
And Anthropic has proved that people will pay for less and less. So, why not fuck them over and make more company money?
The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?
It's interesting to look at how many bugs are filed on the various coding agent repos. Hard to say how many are real/unique, but the quantities feel very high, and it's not hard to run into real bugs rapidly as you exercise the various features and slash commands.
I think an apology for that incident would go a long way.
LLMs over edit and it's a known problem.
The other thing is when Anthropic turns on lazy Claude... (I want to coin the term Claudez here for the version of Claude that's lazy: Claude zzZZzz = Claudez). That thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...
YES... DO IT... FRICKING MACHINE..
> Next steps are to run `cat /path/to/file` to see what the contents are
Makes me want to pull my hair out. I've specifically told it to go do all the read-only operations it wants on this dev server, yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).
That and "Auto" mode really are grinding my gears recently. Now, after a planning session, my only option is to use Auto mode, and I have to manually change it back to "Dangerously skip permissions". I think these are related, since the times I've let it run in "Auto" mode are when it gives up/gets stuck more often.
Just the other day it was in Auto mode (by accident) and I told it:
> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.
And it got stuck in some loop/dead-end, telling me I should do it myself because it didn't want to run commands on a "shared dev server" (which I had specifically told it this was not).
The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.
If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a Max plan? Or $100 per 1M output tokens? (Playing Numberwang here, but the point stands.)
If I have to guess they are trying to get balance sheet in order for an IPO and they basically have 3 ways of achieving that:
1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that
2. Dumb the models down (basically decreasing their cost per token)
3. Send fewer tokens (i.e., capping thinking budgets aggressively).
2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin on each.
I'm not a heavy LLM user, and I've never come anywhere near the limits of the $200/month plan I'm already subscribed to. But when I do use it, I want the smartest, most relentless model available, operating at the highest performance level possible.
Charge what it takes to deliver that, and I'll probably pay it. But you can damned well run your A/B tests on somebody else.
There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.
One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user b) dumbs down output for that specific prompt
Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.
Enough that the prompt is different at a token level, but not enough that the meaning changes.
It would be very difficult for them to catch that, especially if the prompts were not made public.
Run the variations enough times per day, and you'd get some statistical significance.
I guess the fuzzy part is judging the output.
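Judging aside, the pass/fail statistics are standard. A sketch of a two-proportion z-test comparing pooled solve counts across two days (the counts below are made up):

```python
import math

def two_proportion_z(wins1, n1, wins2, n2):
    """z-statistic for the difference between two observed solve rates."""
    p1, p2 = wins1 / n1, wins2 / n2
    pooled = (wins1 + wins2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Day 1: 160/200 runs solved. Day 2: 130/200 solved across the same variants.
z = two_proportion_z(160, 200, 130, 200)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests a real change at ~95% confidence
```

Run the prompt variants enough times per day and a genuine degradation should push |z| well past the noise floor.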
how often do these changes happen?
Somehow, three times makes me not feel confident in this response.
Also, if this is all true and correct, how the heck do they validate quality before shipping anything?
Shipping software without quality is a pretty easy job, even without AI. Just saying...
Again goes back to the "intern" analogy people like to make.
And it also tells us why we shouldn’t use their harness anyway: they constantly fiddle with it in ways that can seriously impact outcomes without even a warning.
What I notice: after 300k there's some slight quality drop, but I just make sure to compact before that threshold.
The latest in Homebrew is 2.1.108, so not fixed, and I don't see Opus 4.7 in the models list... Is Homebrew a second-class citizen, or am I in the B group?
Why should we ever trust what they say again, or trust that they won't be rug-pulling again once this blows over?
So then, there must have been an explicit internal guidance/policy that allowed this tradeoff to happen.
Did they fix just the bug or the deeper policy issue?
lua plugins WIP
Translation: To reduce the load on our servers.
Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?
Apparently they are using another version internally.
Notably missing from the postmortem
My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in wasted tokens, and the response is "we... sorta messed up, but just a little bit here and there, and it added up to a big mess up", bro, get fuckin' real.
translation: we ignored this and our various vibe coders were busy gaslighting everyone saying this could not be happening
They are all doing it because OpenAI is snatching their customers. And their employees have been gaslighting people [1] for ages. I hope open-source models will provide fierce competition so we do not have to rely on an Anthropic monopoly. [1] https://www.reddit.com/r/claude/comments/1satc4f/the_biggest...
People complain so much, and the conspiracy theories are tiring.