It’s a shitload of data, and it only works if the prefix tokens are 100% identical, i.e. the cached key/value activations are exactly the ones the new request would produce.

Typically it’s cached for about 5 minutes; you can pay extra for a longer cache lifetime.
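Roughly, the lookup behaves like an exact-match index over the token prefix with a short TTL. This is just a hedged sketch of the idea (the class, hashing scheme, and 5-minute TTL here are illustrative assumptions, not any provider's actual implementation; real serving stacks typically cache at block granularity):

```python
import hashlib
import time

TTL_SECONDS = 5 * 60  # ~5 minutes, as providers typically advertise

class PrefixKVCache:
    """Toy KV-cache index keyed by an exact hash of the token prefix."""

    def __init__(self):
        self._store = {}  # prefix hash -> (timestamp, kv tensors)

    @staticmethod
    def _key(token_ids):
        # Any difference in any token changes the hash, so only a
        # byte-identical prefix can ever hit the cache.
        return hashlib.sha256(repr(list(token_ids)).encode()).hexdigest()

    def put(self, token_ids, kv_tensors):
        self._store[self._key(token_ids)] = (time.time(), kv_tensors)

    def get(self, token_ids):
        entry = self._store.get(self._key(token_ids))
        if entry is None:
            return None
        ts, kv = entry
        if time.time() - ts > TTL_SECONDS:
            return None  # expired: prefill has to be recomputed
        return kv
```

Change one token anywhere in the prefix and the hash changes, so you pay full prefill again.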

Probably because the costly operation is loading it onto the GPU; it doesn't matter much whether that data comes from disk or from your request.
The point of prompt caching is to save on prefill, which for large contexts (common in agentic workloads) is quite expensive per token. So there is a context length beyond which storing the KV cache is worth it, because loading it back in is cheaper than recomputing it. For larger SOTA models, the per-token KV-cache size is also much smaller relative to the compute cost of prefill, so caching becomes worthwhile even at shorter context lengths.
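Back-of-envelope numbers make the tradeoff concrete. All the model parameters below are assumptions for illustration (a hypothetical 70B-class dense transformer with GQA), not any specific model's real configuration:

```python
# Assumed (hypothetical) model shape: 80 layers, 8 KV heads of
# dim 128 (grouped-query attention), fp16 KV entries.
layers = 80
kv_heads = 8
head_dim = 128
bytes_per_elem = 2          # fp16
params = 70e9               # 70B parameters

def kv_cache_bytes(tokens):
    # keys + values, for every layer
    return tokens * layers * kv_heads * head_dim * 2 * bytes_per_elem

def prefill_flops(tokens):
    # rough rule of thumb: ~2 * params FLOPs per token of prefill
    return tokens * 2 * params

ctx = 100_000
print(f"KV cache for {ctx} tokens: {kv_cache_bytes(ctx) / 1e9:.1f} GB")
print(f"Prefill compute: {prefill_flops(ctx) / 1e15:.0f} PFLOPs")
```

Under these assumptions a 100k-token prefix is ~33 GB of KV cache versus ~14 PFLOPs of prefill; streaming tens of GB back onto the GPU is generally much faster than redoing petaflops of compute, which is why the cache pays off.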