I still don't understand it. Yes, it's a lot of data, and presumably they're already shunting it to CPU RAM instead of keeping it in precious VRAM, but they could go further and put it on SSD, at which point it's no longer in the hot path for their inference.
But they used to return thinking output directly in the API, and that was _the_ reason I liked Claude over OpenAI's reasoning models.
An interesting option would be to let the user pay more for longer caching, but if the base length is 1 hour I assume that would become expensive very quickly.
It's kept long enough that it's expensive to store in RAM, but short enough that writes are frequent and will wear down SSD storage.
But even if it's not, I can't build a scenario in my head where recalculating it on real GPUs is cheaper or faster than retrieving it from some kind of slower cache tier.
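
To put rough numbers on that, here's a back-of-envelope sketch in Python. Everything in it is an assumption for illustration, not anyone's real setup: the model shape (70B parameters, 80 layers, 8 KV heads, head dimension 128, fp16), the 5 GB/s SSD read bandwidth, the 1 PFLOP/s GPU at 50% utilization, and the function names themselves. It also uses the usual ~2 FLOPs per parameter per token estimate for prefill and ignores attention's quadratic term, which would only make recomputing look worse.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache: one key and one value vector per layer per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


def load_seconds(cache_bytes, bandwidth_bytes_per_s):
    """Time to stream the cache back from a slower tier at a given bandwidth."""
    return cache_bytes / bandwidth_bytes_per_s


def recompute_seconds(n_tokens, n_params, flops_per_s, utilization=0.5):
    """Time to redo prefill, using ~2 FLOPs per parameter per token
    (attention's quadratic cost is ignored, so this is a lower bound)."""
    return 2 * n_params * n_tokens / (flops_per_s * utilization)


# Hypothetical numbers, chosen only to make the comparison concrete.
tokens = 100_000
cache = kv_cache_bytes(tokens, n_layers=80, n_kv_heads=8, head_dim=128)
from_ssd = load_seconds(cache, bandwidth_bytes_per_s=5e9)
on_gpu = recompute_seconds(tokens, n_params=70e9, flops_per_s=1e15)

print(f"cache size: {cache / 1e9:.1f} GB")
print(f"load from SSD: {from_ssd:.1f} s")
print(f"recompute on GPU: {on_gpu:.1f} s")
```

With these made-up inputs the cache is about 33 GB, so reading it back takes several seconds while recomputing takes a few times longer and ties up a GPU the whole time. Plug in different numbers and the gap moves, but I'd expect the slow tier to win in most plausible configurations.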