Same question I had in https://news.ycombinator.com/item?id=47819914

I still don't understand it. Yes, it's a lot of data, and presumably they're already shunting it to CPU RAM instead of keeping it in precious VRAM, but they could go further and put it on SSD, at which point it's no longer in the hot path for their inference.

by rkuska 19 hours ago

I don't think you can store the cache on the client, given that the thinking is server-side and you only get summaries in your client (and even those are disabled by default).

by sargunv 19 hours ago

If they really need to guard the thinking output, they could encrypt it and store it client-side. Later it would be sent back and decrypted on their server.

But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.

by solarkraft 19 hours ago

I assume they are already storing the cache on flash storage instead of keeping it all in VRAM. KV caches are huge, which is why it's impractical to transfer them to/from the client. Sending the cache to the client would also allow figuring out a lot about the underlying model, though I guess you could encrypt it.

An interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.

by tonyarkles 19 hours ago

Just to contextualize this: https://lmcache.ai/kv_cache_calculator.html. They only have smaller open models, but for Qwen3-32B with 50k tokens it comes up with 7.62 GB for the KV cache. Imagining a 900k-token session with, say, Opus, I think it'd be pretty unreasonable to flush that to the client after being idle for an hour.
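For a rough sense of where numbers like this come from, here is a minimal sketch of the usual back-of-the-envelope formula: one key and one value vector per token, per layer, per KV head. The layer count, KV head count, head dimension, and 2-byte precision below are illustrative assumptions, not values taken from the calculator or from any model card, so the output will not necessarily match the calculator's figure.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both a key and a value
    vector for every token, in every layer, for every KV head.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128

for tokens in (50_000, 900_000):
    size = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{tokens:>7} tokens -> {size / 2**30:.2f} GiB")
```

The size grows linearly with context length, so whatever the real per-token cost is, a 900k-token session costs 18 times as much as a 50k-token one.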

by 2001zhaozhao 16 hours ago

I wonder whether prompt caches would be the perfect use case for something like Optane.

It's kept long enough that it's expensive to store in RAM, but briefly enough that the writes are frequent and will wear down SSD storage.
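To put the wear concern in rough numbers, here is a small sketch that estimates how long a drive would last under a given cache-flush workload. The endurance rating, cache size, and flush rate are all made-up inputs for illustration, and the model ignores write amplification.

```python
def days_to_rated_endurance(endurance_tbw, cache_gb, flushes_per_day):
    """Days until a drive reaches its rated write endurance.

    endurance_tbw: rated terabytes written (hypothetical figure).
    cache_gb: size of one flushed cache, in gigabytes.
    flushes_per_day: how many caches are written to the drive per day.
    """
    written_tb_per_day = cache_gb * flushes_per_day / 1000
    return endurance_tbw / written_tb_per_day


# Hypothetical workload: 10 GB per cache, 1000 flushes a day, 600 TBW drive.
print(days_to_rated_endurance(endurance_tbw=600, cache_gb=10, flushes_per_day=1000))
```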

by ohcmon 19 hours ago

Yes, encryption is the solution for client-side caching.

But even if it's not, I can't build a scenario in my head where recalculating it on real GPUs is cheaper/faster than retrieving it from some kind of slower cache tier.
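The comparison can be sketched as two simple time estimates: reloading is cache size divided by the tier's read bandwidth, and recomputing is token count divided by prefill throughput. Every number below (cache size, SSD bandwidth, prefill speed) is a hypothetical placeholder, not a measurement of any real deployment, and the sketch ignores cost and GPU contention.

```python
def reload_seconds(cache_bytes, tier_bytes_per_sec):
    """Time to read a stored KV cache back from a slower tier."""
    return cache_bytes / tier_bytes_per_sec


def recompute_seconds(tokens, prefill_tokens_per_sec):
    """Time to rebuild the KV cache by re-running prefill on the GPU."""
    return tokens / prefill_tokens_per_sec


# Hypothetical inputs: an 8 GB cache for 50k tokens, a 3 GB/s SSD,
# and a prefill rate of 5,000 tokens/s.
reload_t = reload_seconds(8e9, 3e9)
recompute_t = recompute_seconds(50_000, 5_000)
print(f"reload: {reload_t:.1f}s, recompute: {recompute_t:.1f}s")
print("reload is faster" if reload_t < recompute_t else "recompute is faster")
```

Under these assumed inputs the reload wins, but the outcome depends entirely on the real bandwidth and prefill figures, which the thread does not provide.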