undefined

points

[-]

The providers must have a more efficient approach. Most cache every request for 12+ hours, and they certainly can't spare 100GB of ram per request for 12 hours.

by jasonjmcghee5 hours ago|

parent|

[-]

> 12 hours

have things changed around this recently? I know openai optionally allows 24 hours but thought it was ~1h without that, and anthropic used to quote 5-15 minutes or something.

by brookst5 hours ago|

parent|

[-]

Anthropic is 5 minutes, though you can pay more to get 60 minutes I believe.

The #1 was to make Claude code token quota go further is to never let the 5 minute cache TTL expire. Either send a new request within the window, or use /clear and copy/paste, or use /clear and a framework that automatically generates session state that gets replayed from files after /clear.

by dist-epoch9 hours ago|

parent|

prev|

[-]

This is one reason why price of SSDs also doubled, not just of RAM.

> LMCache extends the KV Cache from the NVIDIA GPU's fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs.

https://cloud.google.com/blog/topics/developers-practitioner...

by choppaface9 hours ago|

parent|

prev|

[-]

or maybe they don’t actually cache (fully) but lie and just don’t charge the user right now. at least half the users, who are probably also using the most similar tokens / prompts, wouldn’t really know the difference in latency (or care)

by londons_explore8 hours ago|

parent|

[-]

If it actually cost that much RAM, they would almost certainly add extra things to the API to manage cache lifetime. Ie. A 'please cache this for X minutes' flag, or a setting for a single re-use cache (the most common use case)

by cyanydeez7 hours ago|

parent|

[-]

https://platform.claude.com/docs/en/build-with-claude/prompt...

suggests the can cache outside the gpu.