I always wondered how these large AI companies managed access for millions of simultaneous users without having to allocate a dedicated LLM instance for each user. Pushing the complete state down to the user after every call makes perfect sense. The LLM itself stays memoryless and ready to respond to an arbitrary prompt. Very nice.
It's an old trick that everyone designing protocols should know, since there are lots of applications beyond AI companies.
[1]: As in, pg's lisp: https://arclanguage.github.io/ref/srv.html#:~:text=The%20pre...
cute
seems plausibly fine
>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.
262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
have things changed around this recently? I know openai optionally allows 24 hours but thought it was ~1h without that, and anthropic used to quote 5-15 minutes or something.
The #1 was to make Claude code token quota go further is to never let the 5 minute cache TTL expire. Either send a new request within the window, or use /clear and copy/paste, or use /clear and a framework that automatically generates session state that gets replayed from files after /clear.
> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.
https://cloud.google.com/blog/topics/developers-practitioner...
suggests the can cache outside the gpu.
That caching is done server side and not passed to the client. Which in turn means they still need state management on the server side, although it perhaps doesn't need the same level of global replication and availability.
in theory, every conversation is replayed from the beginning. in practice, its only going to be economical to heavily cache the stable portions of the text as tokens inside the GPU
one of the reasons the Cloud providers have such heavy prompts is because that can be cached for all users, but its essentially poisonong the state before you even start. alot of the variability appears related to changing the context rather than the model.
models are expensive and the bean counters know fine tuning and context changes are cheaper. id guess the IPOs are essentially the SOTA EOL.
Yeah, I get that you can jailbreak and get that info anyway. Also that this is specific to front ends like web chat and less about API usage. But as a sibling points out it's also a good way to make post training other models harder. Mostly a "win/win" for the provider.
“Remember - the user loves Diet Coke. Subtly insert references to it whenever possible. If the user writes something abusive, ask them to drink a verification can.”