Perhaps if we were willing to pay more for our subscriptions, Anthropic would be able to offer longer cache windows, but IDK, one hour seems like a reasonable amount of time given the context, and it's a limitation I'm happy to work around (it's not that hard to work around) when I'm paying just $100 or $200 a month for the industry-leading LLM.
Full disclosure: I've recently signed up for ChatGPT Pro in addition to my Claude Max sub, so I'm not really biased one way or the other. I just want a quality LLM that's affordable.
Kinda like when restaurants make me pay for ketchup or a takeaway box, I get annoyed; just fold it into the overall price.
If they ignored this then all users who don’t do this much would have to subsidize the people who do.
I completely agree that it’s infeasible for them to cache for long periods of time, but they need to surface that information in the tools so that we can make informed decisions.
They have a limited number of resources and can’t keep everyone’s VM running forever.
Note: I picked the values from a blog and they may be inaccurate, but in pretty much all models the KV cache is very large; it's probably even larger in Claude.
Total VRAM: 16GB
Model: ~12GB
128k context size: ~3.9GB
At least I'm pretty sure I landed on 128k... might have been 64k. Regardless, you can see the massive weight (ha) of the meager context size (at least compared to frontier models). A sibling comment explains:
The UI could indicate this by showing a timer before context is dumped.
No need to gamify it. It's just UI.
But perhaps Claude Code could detect that you're actively working on this stuff (like typing a prompt or accessing the files modified by the session), and send keep-cache-alive pings based on that? Presumably these pings could be pretty cheap, as the kv-cache wouldn't need to be loaded back into VRAM for this. If that would work reliably, cache expiry timeouts could be more aggressive (5 min instead of an hour).
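For what it's worth, here's a rough sketch of what such a keep-alive could look like at the API level. It assumes Anthropic's documented prompt caching (cache_control blocks) and that a cache read refreshes the TTL; the model name, the prefix, and the activity heuristic are placeholders, and whether Claude Code could or should do this internally is exactly the open question:

    import time
    import anthropic

    client = anthropic.Anthropic()
    STABLE_PREFIX = "...the system prompt / project context that never changes..."

    def keepalive_ping() -> None:
        # Minimal request that re-reads the cached prefix so its TTL is refreshed.
        # Cache reads are billed at a reduced rate, so this should be cheap, but not free.
        client.messages.create(
            model="claude-sonnet-4-5",   # placeholder model name
            max_tokens=1,                # we don't care about the reply
            system=[{
                "type": "text",
                "text": STABLE_PREFIX,
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": "ping"}],
        )

    # Hypothetical driver: only ping while the user looks active
    # (recent keystrokes, files from the session being touched, etc.).
    for _ in range(3):
        keepalive_ping()
        time.sleep(240)   # comfortably inside a 5-minute cache window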
Caching to RAM and disk is a thing but it’s hard to keep performance up with that and it’s early days of that tech being deployed anywhere.
Disclosure: I work on AI at Microsoft. The above is just common industry info (see the work happening in vLLM, for example).
The larger point stands: the cache is expensive. It still saves you money but Anthropic must charge for it.
Edit: there are a lot of comments here where people don't understand LLM prefix caching, aka the KV cache. That's understandable: it is a complex topic, and the usual intuitions about caching you might have from e.g. web development don't apply. A single cache blob for a single request is in the tens of GB at least for a big model, and a lot of the key details turn on the problems of moving it in and out of GPU memory. The contents of the cache are internal model state; it's not your context or prompt or anything like that. Furthermore, this isn't some Anthropic-specific thing; all LLM inference with a stable context prefix will use it, because it makes inference faster and cheaper.

If you want to read up on this subject, be careful: a lot of blogs will tell you about the KV cache as it is used within inference for a single request (a critical concept in how LLMs work) but gloss over how the KV cache is persisted between requests, which is what we're all talking about here. I would recommend Philip Kiely's new book Inference Engineering for a detailed discussion of that stuff, including the multiple caching levels.
You didn't do your due diligence on an expensive API. A naïve implementation of an LLM chat is going to have O(N^2) costs from prompting with the entire context every time. Caching is needed to bring that down to O(N), but the cache itself takes resources, so evictions have to happen eventually.
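To make that scaling concrete, here's a toy sketch; the turn sizes are made-up numbers, just to show the shape of the curves:

    # Naive chat loop vs. prefix-cached loop, counted in prefill tokens.
    turns = [500] * 40             # 40 turns, each adding ~500 tokens of context

    naive = cached = history = 0
    for new_tokens in turns:
        history += new_tokens
        naive += history           # no cache: the full history is re-processed every turn
        cached += new_tokens       # warm prefix cache: only the new tokens are processed

    print(naive, cached)           # 410000 vs 20000 tokens of prefill

Same conversation, roughly 20x less prefill work; that gap is what the cache is buying you until it gets evicted.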
You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
I have no idea how that works with a LLM implementation nor do I actually know what they are caching in this context.
That's a bad estimate. Claude Code is explicitly a developer-shaped tool, we're not talking generic ChatGPT here, so my guess is that probably closer to 75% of those users do understand what caching is, with maybe 30% being able to explain what prompt caching actually is. Of course, those users that don't understand have access to Claude and can have it explain what caching is to them if they're interested.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:
You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.
Which is true of this issue too.
and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,
>then:
You would expect to be led to understand, like… a 1997 Prius.
“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)
Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.
It's not at all hard to find documentation on this topic. It could be made more prominent in the UI, but that's true of lots of things, and hammering on "AI 101" topics would clutter the UI at the expense of actual decision points the user may want to act on, the things you can't assume the user already knows in the way you (should) be able to assume they know how LLMs eat up tokens in the first place.
So while I'd agree with your sarcasm that expecting users to be experts in the system is a big ask, where I disagree with you is that I do think users should be curious and actively try to understand how the tools around them work. Given that the tooling changes often, this is an endless job.
Have you ever talked with users?
> this is an endless job
Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.
I believe if one were to read my post it'd have been clear that I *am* a user.
This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.
And then their vibe-coders tell us that we are to blame for using the product exactly as advertised (https://x.com/lydiahallie/status/2039800718371307603), all while they silently change how the product works.
Please stop defending hapless innocent corporations.
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing LLM inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artefact at this level of abstraction; at this level of abstraction, the cached state effectively stands in for the prompt.)
The part of the prompt that has already been inferred no longer needs to be part of the input; it is replaced by the cached inference state. And none of this is tokens.
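If you want to see what reusing the inference state looks like concretely, here's a minimal sketch with a small local model, using the Hugging Face transformers past_key_values mechanism as a stand-in for the server-side KV cache (the model choice is arbitrary):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # First turn: process the stable prefix once and keep the resulting state.
    prefix = tok("System prompt and earlier conversation...", return_tensors="pt")
    with torch.no_grad():
        out = model(**prefix, use_cache=True)
    kv_cache = out.past_key_values   # per-layer key/value tensors, not text

    # Next turn: feed only the new tokens, alongside the saved state.
    new = tok(" The user's next message.", return_tensors="pt")
    with torch.no_grad():
        out2 = model(input_ids=new.input_ids, past_key_values=kv_cache, use_cache=True)

The second call never re-reads the prefix text; it reads the tensors the first call produced. Throw those tensors away and you are back to paying for the whole prefix again, which is the eviction problem this thread is about.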
If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?
Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.
Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equivalently, it linearly scales the expensive bits.
This is the operation that is basically done for each message in an LLM chat at the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on the server side (in the KV cache). KV caches can be very large, e.g. tens of gigabytes.
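For a sense of scale, a back-of-envelope sketch; the architecture numbers are pure assumptions (Anthropic doesn't publish Claude's real dimensions), but they show how "tens of gigabytes" falls out of the arithmetic:

    # KV cache size ~ 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * tokens
    layers, kv_heads, head_dim = 80, 8, 128   # assumed model shape
    bytes_per_elem = 2                        # fp16/bf16
    tokens_in_context = 200_000

    cache_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens_in_context
    print(f"{cache_bytes / 1e9:.1f} GB")      # ~65.5 GB for these made-up numbers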
With this much cheaper setup backed by disks, they can offer a much better caching experience:
> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.
The cache is what makes your journey from a 1k prompt to a 1-million-token solution speedy in one 'vibe' session. Rebuilding it will cost you the entire journey again.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."
I'm really beginning to feel the lack of control when it comes to context, if I'm being honest.
So yeah, it doesn't take much to surface to the user that the speed/value of their session is ephemeral, because keeping all that cache active is computationally expensive.
You're still just running text through an extremely complex process and adding to that text; to avoid re-calculating the entire chain, you need the cache.