You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
1. Compute scaling with the length of the sequence is applicable to transformer models in general, i.e. every frontier LLM since ChatGPT's initial release.
2. As undocumented changes happen frequently, users should be even more incentivized to at least try to have a basic understanding of the product's cost structure.
> You're also talking about internal technical implementations of a chat bot. 99.99% of users won't even understand the words that are being used.
I think "internal technical implementation" is a stretch. Users don't need to know what a "transformer" is to understand the trade-off. It's not trivial but it's not something incomprehensible to laypersons.
I have no idea how that works with an LLM implementation nor do I actually know what they are caching in this context.
That's a bad estimate. Claude Code is explicitly a developer-shaped tool; we're not talking generically about ChatGPT here, so my guess is that closer to 75% of those users do understand what caching is, with maybe 30% being able to explain what prompt caching actually is. Of course, those users who don't understand have access to Claude and can have it explain what caching is to them if they're interested.
Does mmap(2) educate the developer on how disk I/O works?
At some point you have to know something about the technology you're using, or accept that you're a consumer of the ever-shifting general best practice, shifting with it as the best practice shifts.
If you were being charged per character, or running down character limits, and printing on printers that were shared and had economic costs for stalled and started print runs, then:
You wouldn’t “need” to understand. The prints would complete regardless. But you might want to. Personal preference.
Which is true of this issue too.
and the system was being run by some of the planet’s brightest people whose famous creation is well known to disseminate complex information succinctly,
>then:
You would expect to be led to understand, like… a 1997 Prius.
“This feature showed the vehicle operation regarding the interplay between gasoline engine, battery pack, and electric motors and could also show a bar-graph of fuel economy results.” https://en.wikipedia.org/wiki/Toyota_Prius_(XW10)
Seems like every month someone writes up a brilliant article on how to build an LLM from scratch or similar that hits the HN page, usually with fancy animated blocks and everything.
It's not at all hard to find documentation on this topic. It could be made more prominent in the U/I but that's true of lots of things, and hammering on "AI 101" topics would clutter the U/I for actual decision points the user may want to take action upon that you can't assume the user already knows about in the way you (should) be able to assume about how LLMs eat up tokens in the first place.
So while I'd agree with your sarcasm that expecting users to be experts of the system is a big ask, where I disagree with you is that users should be curious and actively attempting to understand how it works around them. Given that the tooling changes often, this is an endless job.
Have you ever talked with users?
> this is an endless job
Indeed. If we spend all our time learning what changed with all our tooling when it changes without proper documentation then we spend all our working lives keeping up instead of doing our actual jobs.
I believe if one were to read my post it'd have been clear that I *am* a user.
This *is* "hacker" news after all. I think it's a safe assumption that people sitting here discussing CC are an inquisitive sort who want to understand what's under the hood of their tools and are likely to put in some extra time to figure it out.
And then their vibe-coders tell us that we are to blame for using the product exactly as advertised: https://x.com/lydiahallie/status/2039800718371307603 while silently changing how the product works.
Please stop defending hapless innocent corporations.
It's not about "constructing a prompt" in the sense of building the prompt string. That of course wouldn't be costly.
It is about reusing llm inference state already in GPU memory (for the older part of the prompt that remains the same) instead of rerunning the prompt and rebuilding those attention tensors from scratch.
That is what caching is doing: the LLM inference state is being reused. (Attention vectors are an internal artifact at this level of abstraction; effectively, at this level of abstraction, it's the prompt.)
The part of the prompt that has already been inferred no longer needs to be a part of the input, to be replaced by the inference subset. And none of this is tokens.
If I'm running a database keeping track of a conversation, and each time it writes the entire history of the conversation instead of appending a message, are we calling that O(N^2) now?
Try this out using a local LLM. You'll see that as the conversation grows, your prompts take longer to execute. It's not exponential but it's significant. This is in fact how all autoregressive LLMs work.
Also by the way, caching does not make LLM inference linear. It's still quadratic, but the constant in front of the quadratic term becomes a lot smaller.
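To put rough numbers on the scaling argument, here is a minimal sketch under a deliberately simplified cost model: it counts only causal attention query-key pairs and ignores everything else (MLP layers, memory bandwidth, output tokens). The function names and the ten-turn conversation are made up for illustration.

```python
def attention_pairs(start: int, end: int) -> int:
    """Query-key pairs for processing tokens start+1..end with causal attention.

    Token i attends to itself and every earlier token, so it costs i pairs.
    """
    return end * (end + 1) // 2 - start * (start + 1) // 2


def conversation_cost(turn_lengths: list[int], cached: bool) -> int:
    """Total attention pairs over a whole conversation.

    cached=True:  each turn only processes its new tokens (earlier state is reused).
    cached=False: each turn reprocesses the full history from token 1.
    """
    total = 0
    context = 0
    for new_tokens in turn_lengths:
        new_context = context + new_tokens
        start = context if cached else 0
        total += attention_pairs(start, new_context)
        context = new_context
    return total


turns = [100] * 10  # hypothetical: ten turns of 100 tokens each
print(conversation_cost(turns, cached=True))   # 500500
print(conversation_cost(turns, cached=False))  # 1927750
```

In this toy model the cached total still grows quadratically with the final context length; what caching removes is the repeated reprocessing of earlier turns.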
Touché. Still, to a reasonable approximation, caching makes the dominant term linear, or equivalently, linearly scales the expensive bits.
This is basically the operation done for each message in an LLM chat at the logical level: the complete context/history is sent in to be processed. If you wish to process only the additions, you must preserve the processed state on the server side (in the KV cache). KV caches can be very large, e.g. tens of gigabytes.
With this much cheaper setup backed by disks, they can offer a much better caching experience:
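For a sense of where "tens of gigabytes" can come from, here is a back-of-the-envelope sketch. The model dimensions below are hypothetical and not taken from any specific provider's model; the formula just counts one key vector and one value vector per token, per layer, per KV head.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit storage (fp16/bf16)
) -> int:
    """Approximate KV cache size for a single sequence."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    per_token = keys_and_values * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens


# Hypothetical configuration: 80 layers, 8 KV heads of dimension 128,
# and a 100k-token conversation.
size = kv_cache_bytes(tokens=100_000, layers=80, kv_heads=8, head_dim=128)
print(size)                   # 32768000000
print(round(size / 1e9, 1))   # 32.8 (GB)
```

Under these assumptions each token costs about 320 KiB of cache, which is why keeping idle sessions resident in GPU memory gets expensive quickly.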
> Cache construction takes seconds. Once the cache is no longer in use, it will be automatically cleared, usually within a few hours to a few days.
You already have the data on your own machine, and that 'upload and restore' process is exactly what is happening when you restart an idle session. The issue is that it takes time, and it counts as token usage because you have to send the data for the GPU to load, and that data is the 'tokens'.
The data is the conversation (along with the thinking tokens).
There is no download - you already have it.
The issue is that it gets expunged from the (very expensive, very limited) GPU cache and to reload the cache you have to reprocess the whole conversation.
That is doable, but as Boris notes it costs lots of tokens.
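A toy sketch of why resuming an idle session is expensive, assuming a cache that is simply dropped after a fixed idle period. The `Turn` type, the function name, and the 300-second TTL are all hypothetical; real providers have their own expiry rules and bill cached reads separately, and this sketch also ignores the model's output tokens.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    timestamp: float  # seconds since the session started
    new_tokens: int   # tokens added to the conversation by this turn


def tokens_processed_per_turn(turns: list[Turn], ttl_seconds: float) -> list[int]:
    """How many input tokens must be (re)processed on each turn.

    If the previous request was within the TTL, the cached state is reused
    and only the new tokens are processed. Otherwise the whole conversation
    has to be reprocessed to rebuild the cache.
    """
    processed = []
    context = 0
    last_seen = None
    for turn in turns:
        cache_alive = (
            last_seen is not None
            and turn.timestamp - last_seen <= ttl_seconds
        )
        if cache_alive:
            processed.append(turn.new_tokens)
        else:
            processed.append(context + turn.new_tokens)
        context += turn.new_tokens
        last_seen = turn.timestamp
    return processed


session = [Turn(0, 1000), Turn(60, 500), Turn(4000, 200)]
print(tokens_processed_per_turn(session, ttl_seconds=300))  # [1000, 500, 1700]
```

The third turn adds only 200 tokens, but because it arrives after the cache has expired, all 1,700 tokens of the conversation are processed again.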
The kv-cache is the internal LLM state after having processed the tokens. It's big, and you do not have it locally.
The cache is what makes your journey from a 1k-token prompt to a 1-million-token solution speedy in one 'vibe' session. Loading that again will cost the entire journey.
Sure, the exact choice on the trade-off, changing that choice, and having a pretty product-breaking bug as a result, are much more opaque. But I was responding to somebody who was surprised there's any trade-off at all. Computers don't give you infinite resources, whether or not they're "servers," "in the cloud," or "AI."