I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
Perhaps if we were willing to pay more for our subscriptions Anthropic would be able to have longer cache windows but IDK one hour seems like a reasonable amount of time given the context and is a limitation I'm happy to work around (it's not that hard to work around) to pay just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro as well in addition to my Claude Max sub so not really biased one way or the other. I just want a quality LLM that's affordable.
Kinda like when restaurants make me pay for ketchup or a takeaway box, i get annoyed, just increase the compiled price.
If they ignored this then all users who don’t do this much would have to subsidize the people who do.
I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.
They have a limited number of resources and can’t keep everyone’s VM running forever.
note: I picked the values from a blog and they may be innacurate, but in pretty much all model the KV cache is very large, it's probably even larger in Claude.
Total VRAM: 16GB
Model: ~12GB
128k context size: ~3.9GB
At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models).A sibling comment explains:
The UI could indicate this by showing a timer before context is dumped.
No need to gamify it. It's just UI.
But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).
Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.
Disclosure: work on AI at Microsoft. Above is just common industry info (see work happening in vLLM for example)
The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.
Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic and the usual intuitions about caching you might have from e.g. web development don't apply: a single cache blob for a single request is in the 10s of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache is internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it because it makes inference faster and cheaper. If you want to read up on this subject, be careful as a lot of blogs will tell you about the KV cache as it is used within inference for an single request (a critical detail concept in how LLMs work) but they will gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.
That's a bad estimate. Claude Code is explicitly a developer shaped tool, we're not talking generically ChatGPT here, so my guess is probably closer to 75% of those users do understand what caching is, with maybe 30% being able to explain prompt caching actually is. Of course, those users that don't understand have access to Claude and can have it explain what caching is to them if they're interested.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:
You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.
Which is true of this issue to.
and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,
>then:
You would expect to be led to understand, like… a 1997 Prius.
“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)
Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.
It's not at all hard to find documentation on this topic. It could be made more prominent in the U/I but that's true of lots of things, and hammering on "AI 101" topics would clutter the U/I for actual decision points the user may want to take action upon that you can't assume the user already knows about in the way you (should) be able to assume about how LLMs eat up tokens in the first place.
So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.
Have you ever talked with users?
> this is an endless job
Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.
I believe if one were to read my post it'd have been clear that I *am* a user.
This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.
And then their vibe-coders tell us that we are to blame for using the product exactly as advertised: https://x.com/lydiahallie/status/2039800718371307603 while silently changing how the product works.
Please stop defending hapless innocent corporations.
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
that is what caching is doing. the llm inference state is being reused. (attention vectors is internal artefact in this level of abstraction, effectively at this level of abstraction its a the prompt).
The part of the prompt that has already been inferred no longer needs to be a part of the input, to be replaced by the inference subset. And none of this is tokens.
If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?
Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.
Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equiv, linearly scales the expensive bits.
This is the operation that is basically done for each message in an LLM chat in the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on server-side (in KV cache). KV caches can be very large, e.g. tens of gigabytes.
With this much cheaper setup backed by disks, they can offer much better caching experience:
> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.
The cache is what makes your journey from 1k prompt to 1million token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."
I'm really beginning to feel the lack of control when it's comes to context if I'm being honest
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral because to keep all that cache active is computationally expensive because ...
You're still just running text through a extremely complex process, and adding to that text and to avoid re-calculation of the entire chain, you need the cache.
I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
Especially in the analysis part of my work I don‘t care about the actual text output itself most of the time but try to make the model „understand“ the topic.
In the first phase the actual text output itself is worthless it just serves as an indicator that the context was processed correctly and the future actual analysis work can depend on it. And they‘re… just throwing most the relevant stuff out all out without any notice when I resume my session after a few days?
This is insane, Claude literally became useless to me and I didn’t even know it until now, wasting a lot of my time building up good session context.
There would be nothing lost if they said „If you click yes, we will prune your old thinking making Claude faster and saving you tons of tokens“. Most people would say yes probably so why not ask them… make it an env variable (that is announced not a secretly introduced one to opt out of something new!) or at least write it in a change log if they really don’t want to allow people to use it like before, so there‘d be chance to cancel the subscription in time instead of wasting tons of time on work patterns that not longer work
1. Specifically, this suite was about price increases without clear consideration for both parties - but the same justifications apply to service restrictions without corresponding price decreases.
https://fortune.com/2026/04/20/italian-court-netflix-refunds...
> Our systems will smartly ignore any reasoning items that aren’t relevant to your functions, and only retain those in context that are relevant. You can pass reasoning items from previous responses either using the previous_response_id parameter, or by manually passing in all the output items from a past response into the input of a new one.
https://developers.openai.com/api/docs/guides/reasoning
Disclosure - work on AI@msft
The issue is that if they send the full trace back, it will have to be processed from the start if the cache expired, and doing that will cause a huge one-time hit against your token limit if the session has grown large.
So what Boris talked about is stripping things out of the trace that goes back to regenerate the session if the cache expires. Doing this would help avert burning up the token limit, but it is technically a different conversation, so if CC chooses poorly on stripping parts of the context then it would lead to Claude getting all scatter-brained.
Anthropic already profited from generating those tokens. They can afford subsidize reloading context.
And it’s part of a larger problem of unannounced changes it‘s just like when they introduced adaptive thinking to 4.6 a few weeks ago without notice.
Also they seem to be completely unaware that some users might only use Claude code because they are used to it not stripping thinking in contrast to codex.
Anyway I‘m happy that they saw it as a valid refund reason
The irony is that Claude Design does this. I did a big test building a design system, and when I came back to it, it had in the chat window "Do you need all this history for your next block of work? Save 120K tokens and start a new chat. Claude will still be able to use the design system." Or words to that effect.
Granted, the "memory" can be available across session, as can docs...
The only issue is that it didn't hit the cache so it was expensive if you resume later.
Still on Opus 4.6 with no adaptive thinking, so didn't really notice anything worse in the past weeks, but who knows.
Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.
I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.
You're acquiring users as a recurring revenue source. Consider stability and transparency of implementation details cost of doing business, or hemorrhage users as a result.
See also the difference between eg. MacOS (with large M, the older good versions) and waiting for "Year of linux on desktop".
I don't think the issue is making decisions for users, but trying to switch off the soup tap in the all-you-can-eat soup bar. Or, wrong business model setting wrong incentives to both sides.
It would be nice to be able to summarize/cut into a new leaner conversation vs having to coax all the context back into a fresh one. Something like keep the last 100,000 tokens.
I believe /compact achieves something like this? It just takes so long to summarize that it creates friction.
The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.
// If this notion of sufficient context as fine tune seems surprising, the research is out there.)
Approaches tried need to deal with both of these:
1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.
2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.
I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate a bit more (i.e. not nice sending lawyers afters various devs without asking nicely first, banning accounts without notice, etc etc). Appreciate it's not easy to scale.
OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.
I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.
(In understand under the hood that llms are n^2 by default but it's very counter intuitive - and given how popular cc is becoming outside of nerd circles, probably smaller and smaller fraction of users is aware of it)
I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.
Maybe the UI could do that for sessions that the user hasn't left yet, when the deadline comes near.
You ideally want to compact before the conversation is evicted from cache. If you knew you were going to use the conversation again later after cache expiry, you might do this deliberately before leaving a session.
Anthropic could do this automatically before cache expiry, though it would be hard to get right - they'd be wasting a lot of compute compacting conversations that were never going to be resumed anyway.
This feature has been live for a few days/weeks now, and with that knowledge I try remember to a least get a process report written when I'm for example close to the quota limit and the context is reasonably large. Or continue with a /compact, but that tends to lead to be having to repeat some things that didn't get included in the summary. Context management is just hard.
It's a little concerning that it's number 1 in your list.
Two questions if you see this:
1) if this isn't best practice, what is the best way to preserve highly specific contexts?
2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?
Edit: If you message me I can share some of my toolchain, it's probably similar to what a lot of other people here use but I've done some polishing recently.
So then it comes to what you're talking about, which is processing the entire text chain which is a different kind of cache, and generating the equivelent tokens are what's being costed.
But once you realize the efficiency of the product in extended sessions is cached in the immediate GPU hardware, then it's obvious that the oversold product can't just idle the GPU when sessions idle.
Thank you.
I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
All the while all the official channels refused to acknowledge any problems.
Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
https://x.com/bcherny/status/2044291036860874901 https://x.com/bcherny/status/2044299431294759355
No mention of anything like "hey, we just fixed two big issues, one that lasted over a month." Just casual replies to everybody like nothing is wrong and "oh there's an issue? just let us know we had no idea!"
My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.
This cache information should probably get displayed somewhere within Claude Code
Love the product. <3
Mislead, gaslight, misdirect is the name of the game
Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message end the next one?
Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.
The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.
-- This answer directly contradict your post. It seems like the biggest problem is a total lack of documentation for expected behavior.A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.
Then Claude told me the only difference was that with plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work, and present it in a total different way. It is not just a "I will ask before applying changes" mode.
I have project folders/files and memory stored for each session, when I come back to my projects the context is drawn from the memory files and the status that were saved in my project md files.
Create a better workflow for your self and your teams and do it the right way. Quick expect the prompt to store everything for you.
For the Claude team. If you havent already, I'd recommend you create some best practices for people that don't know any better, otherwise people are going to expect things to be a certain way and its going to cause a lot of friction when people cant do what the expect to be able to do.
It’s hard to do it without killing performance and requires engineering in the DC to have fast access to SSDs etc.
Disclosure: work on ai@msft. Opinions my own.
Let's see what Boris Cherny himself and other Anthropic vibe-coders say about this:
https://x.com/bcherny/status/2044847849662505288
Opus 4.7 loves doing complex, long-running tasks like deep research, refactoring code, building complex features, iterating until it hits a performance benchmark.
https://x.com/bcherny/status/2007179858435281082
For very long-running tasks, I will either (a) prompt Claude to verify its work with a background agent when it's done... so Claude can cook without being blocked on me.
https://x.com/trq212/status/2033097354560393727
Opus 4.6 is incredibly reliable at long running tasks
https://x.com/trq212/status/2032518424375734646
The long context window means fewer compactions and longer-running sessions. I've found myself starting new sessions much less frequently with 1 million context.
https://x.com/trq212/status/2032245598754324968
I used to be a religious /clear user, but doing much less now, imo 4.6 is quite good across long context windows
---
I could go on
Yeah it's called lunch!
The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.
Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?
As an informed user who understands his tools, I of course expect large uncached conversations to massively eat into my token budget, since that's how all of the big LLM providers work. I also understand these providers are businesses trying to make money and they aren't going to hold every conversation in their caches indefinitely.
Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.
So it would probably be a quite a long transfer to perform in these cases, probably not very feasible to implement at scale.
The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!
I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?
To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.
1) Is it okay to leave Claude Code CLI open for days?
2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?
I'm curious why 1 hour was chosen?
Is increasing it a significant expense?
Ever since I heard about this behaviour I've been trying to figure out how to handle long running Claude sessions and so far every approach I've tried is suboptimal
It takes time to create a good context which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs something adjustable
how about acknowledging that you fucked up your own customers’ money and making a full refund for the affected period?
> Educating users on X/social
that is beyond me
ты не Борис, ты максимум борька
Why not use tired cache?
Obviously storage is waaay cheaper than recalculation of embeddings all the way from the very beginning of the session.
No matter how to put this explanation — it still sounds strange. Hell — you can even store the cache on the client if you must.
Please, tell me I’m not understanding what is going on..
otherwise you really need to hire someone to look at this!)
I still don't understand it, yes it's a lot of data and presumably they're already shunting it to cpu ram instead of keeping it on precious vram, but they could go further and put it on SSD at which point it's no longer in the hotpath for their inference.
But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.
What would be an interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
It's kept for long enough that it's expensive to store in RAM, but short enough that the writes are frequent and will wear down SSD storage
But even if it’s not — I can’t build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier
whats driving the hour cache? shouldnt people be able to have lunch, then come back and continue?
are you expecting claude code users to not attend meetings?
I think product-wise you might need a better story on who uses claude-code, when and why.
Same thing with session logs actually - i know folks who are definitely going to try to write a yearly RnD report and monthly timesheets based on text analysis of their claude code session files, and they're going to be incredibly unhappy when they find out its all been silently deleted
Why did you lie 11 days ago, 3 days after the fix went in, about the cause of excess token usage?
It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.
Either swallow the cost or be transparent to the user and offer both options each time.
You guys really need to communicate that better in the CLI for people not on social
I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.
In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthrophic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.
It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.
No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X
There's a reason live service games have splash banners at every login. No matter what you pick as an official e-coms channel, most of your users aren't there!
* To be fair, of all these firms, ANTHROP\C tries the hardest to remember, and deliver like, some people aren't the same. Starting with normals doing normals' jobs.
I dont agree with this being characterized as a "corner case".
Isn't this how most long running work will happen across all serious users?
I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.
Dont CC users take lunch breaks?
How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?
Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.
Not sure if it's already done, shouldn't there be a check somewhere to alert on if an outrageous number of tokens are getting written, then it's not right ?
You lost huge trust with the A/B sham test. You lost trust with enshittification of the tokenizer on 4.6 to 4.7. Why not just say "hey, due to huge input prices in energy, GPU demand and compute constraints we've had to increase Pro from $20 to $30." You might lose 5% of customers. But the shady A/B thing and dodgy tokenizer increasing burn rate tells everyone inc. enterprise that you don't care about honesty and integrity in your product.
I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.
or even, let the user control the cache expiry on a per request basis. with a /cache command
that way they decide if they want to drop the cache right away , or extend it for 20 hours etc
it would cost tokens even if the underlying resource is memory/SSD space, not compute
/loop 5m say "ok".
Will that keep the cache fresh?
Silently degrading intelligence ought to be something you never do, but especially not for use-cases like this.
I’m looking back at my past few weeks of work and realizing that these few regressions literally wasted 10s of hours of my time, and hundreds of dollars in extra usage fees. I ran out of my entire weekly quota four days ago, and had to pause the personal project I was working on.
I was running the exact same pipeline I’ve run repeatedly before, on the same models, and yet this time I somehow ate a week’s worth of quota in less than 24h. I spent $400 just to finish the pipeline pass that got stuck halfway through.
I’m sorry to be harsh, but your engineering culture must change. There are some types of software you can yolo. This isn’t one of them. The downstream cost of stupid mistakes is way, way too high, and far too many entirely avoidable bugs — and poor design choices — are shipping to customers way too often.
Hard agree, would like to see a response to this.
how does this help me as a customer? if i have to redo the context from scratch, i will pay both the high token cost again, but also pay my own time to fill it.
the cost of reloading the window didnt go away, it just went up even more
I have to imagine this isn't helped by working somewhere where you effectively have infinite tokens and usage of the product that people are paying for, sometimes a lot.
Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.
Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?