
N.B. This is exactly how seaside, vba, and even arc[1] generally handle server-side state: by encrypting the blob representing the state and sending it to the client, which sends it back on future requests (where it is decrypted and rehydrated).

It's an old trick that everyone designing protocols should know, since there are lots of applications beyond AI companies.

[1]: As in, pg's lisp: https://arclanguage.github.io/ref/srv.html#:~:text=The%20pre...
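For anyone who hasn't seen the trick, here is a minimal sketch of the round trip in Python. Note the swap: it signs the blob with an HMAC instead of encrypting it, because the standard library has no authenticated cipher, so the client can still read the state even though it cannot alter it. A real deployment would presumably use an AEAD cipher to get confidentiality as well. `SECRET_KEY`, `seal_state`, and `open_state` are made-up names for illustration, not taken from any of the frameworks mentioned.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical server-side secret; it never leaves the server.
SECRET_KEY = b"demo-key-not-for-production"
TAG_LEN = 32  # length of a SHA-256 digest in bytes


def seal_state(state: dict) -> str:
    """Serialize the state and prepend an HMAC tag so tampering is detectable."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(tag + payload).decode("ascii")


def open_state(token: str) -> dict:
    """Verify the tag and rehydrate the state; raise ValueError if it was altered."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    tag, payload = raw[:TAG_LEN], raw[TAG_LEN:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("state token failed integrity check")
    return json.loads(payload)
```

The server hands `seal_state(...)` to the client and calls `open_state(...)` on whatever comes back, so it keeps no per-session storage. This sketch does nothing about a client replaying an old but valid token.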

by tn110 hours ago|


And don't forget the venerable .NET Forms with its kilobytes of __VIEWSTATE

by antonvs4 hours ago|


> kilobytes

cute

by LoganDark5 hours ago|


Do they mitigate replay attacks?

by mycall5 hours ago|


While it seems like a good idea, resending a growing context window is very inefficient and costly. Instance pinning would yield huge efficiency gains but also collapse LLM provider revenue. This is something open models could solve better.

by mswphd3 hours ago|


even a max size context window is what, ~1M tokens? iirc tokens are generally drawn from a vocab of size ~300k. Assume no compression before the encryption (no clue if this is true, but compressing text before encryption can leak info about the message, namely how compressible it is). That's log2(300k) ~ 18 bits per token, or a bit over 2 bytes. So each "turn" would involve ~2MB extra in each direction. And again, this is assuming max context.

seems plausibly fine
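The back-of-envelope estimate above can be checked in a few lines of Python. This is only a sketch of the same arithmetic: the 1M-token context and 300k vocabulary are the rough figures from the comment, not measured values, and `blob_size_bytes` is a made-up helper name.

```python
import math


def blob_size_bytes(context_tokens: int, vocab_size: int) -> float:
    """Size of an uncompressed token-ID blob at log2(vocab_size) bits per token."""
    bits_per_token = math.log2(vocab_size)
    return context_tokens * bits_per_token / 8


# Rough figures from the comment: ~1M token context, ~300k vocabulary.
size = blob_size_bytes(1_000_000, 300_000)
print(f"{math.log2(300_000):.1f} bits/token, {size / 1e6:.2f} MB per direction")
```

That works out to a little over 18 bits per token and somewhat more than 2 MB per direction, which matches the ~2MB ballpark.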

by brookst5 hours ago|


Can you elaborate? How could it be more efficient and bad for revenue? Would it also be bad for profit?

by bruce3434343 hours ago|


More efficient in terms of bandwidth (not) used. More costly because it has to be stored somewhere instead.

by b65e8bee43c2ed011 hours ago|


the exchange rate between text and its representation in memory is brutal. here's a bit from a recent article:

>An 82 GB footprint in DDR3 on a 2016 Xeon. About 25 GB of weights and 56 GB of KV cache at the full 262K context. The KV cache is larger than the model.

262k tokens is not much at all. with ~5 characters per token, that's only 1.3 MB of plaintext.
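To put a number on that exchange rate, here is a rough sketch using only the figures quoted above (56 GB of KV cache, 262K tokens, ~5 characters per token). It assumes decimal gigabytes and one byte per character, and the helper names are made up for illustration.

```python
GB = 10**9  # assumption: decimal gigabytes


def kv_bytes_per_token(kv_cache_bytes: float, context_tokens: int) -> float:
    """Average KV-cache memory consumed by each token of context."""
    return kv_cache_bytes / context_tokens


def plaintext_bytes(context_tokens: int, chars_per_token: int = 5) -> int:
    """Approximate plaintext size, assuming one byte per character."""
    return context_tokens * chars_per_token


tokens = 262_000
per_token = kv_bytes_per_token(56 * GB, tokens)
text = plaintext_bytes(tokens)
print(f"KV cache per token: {per_token / 1e3:.0f} KB")
print(f"plaintext: {text / 1e6:.2f} MB")
print(f"blow-up: {56 * GB / text:,.0f}x")
```

Under those assumptions each token costs on the order of 200 KB of KV cache, so the in-memory form is tens of thousands of times larger than the text it represents.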

by londons_explore9 hours ago|


The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100GB of ram per request for 12 hours.

by jasonjmcghee5 hours ago|


> 12 hours

have things changed around this recently? I know openai optionally allows 24 hours but thought it was ~1h without that, and anthropic used to quote 5-15 minutes or something.

by brookst5 hours ago|


Anthropic is 5 minutes, though you can pay more to get 60 minutes I believe.

The #1 way to make Claude Code token quota go further is to never let the 5 minute cache TTL expire. Either send a new request within the window, or use /clear and copy/paste, or use /clear and a framework that automatically generates session state that gets replayed from files after /clear.

by dist-epoch8 hours ago|


This is one reason why the price of SSDs also doubled, not just that of RAM.

> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.

https://cloud.google.com/blog/topics/developers-practitioner...

by choppaface9 hours ago|


or maybe they don’t actually cache (fully) but lie and just don’t charge the user right now. at least half the users, who are probably also using the most similar tokens / prompts, wouldn’t really know the difference in latency (or care)

by londons_explore8 hours ago|


If it actually cost that much RAM, they would almost certainly add extra things to the API to manage cache lifetime, i.e. a 'please cache this for X minutes' flag, or a setting for a single re-use cache (the most common use case).

by cyanydeez7 hours ago|


https://platform.claude.com/docs/en/build-with-claude/prompt...

suggests they can cache outside the GPU.

by londons_explore9 hours ago|


Except the providers also cache the parsing of the prompt (the KV cache), and that has substantial cost savings (easily an 80% saving on typical coding use cases).

That caching is done server side and not passed to the client. Which in turn means they still need state management on the server side, although it perhaps doesn't need the same level of global replication and availability.

by cyanydeez7 hours ago|


from the march changes, it looked like they increased cache eviction rates on the VRAM at claude causing everyone to start burning tokens as they had to regen token state.

by cyanydeez7 hours ago|


they still have to cache the tokens. it's not completely stateless.

in theory, every conversation is replayed from the beginning. in practice, it's only going to be economical to heavily cache the stable portions of the text as tokens inside the GPU.

one of the reasons the cloud providers have such heavy prompts is that they can be cached for all users, but it's essentially poisoning the state before you even start. a lot of the variability appears related to changing the context rather than the model.

models are expensive and the bean counters know fine tuning and context changes are cheaper. i'd guess the IPOs are essentially the SOTA EOL.