upvote
BTW, I forgot to mention that you can make this work in a way, but only if your model architecture generalizes the context and attention mechanism such that it's no longer a pure sequence. So you could have a large amount of distinct "early" token sequences, with each being self-contained and not depending on any other tokens, e.g. your source code files might be such. Then later parts of the context would of course depend on all of those files as usual. This makes prefill for the earlier context both reusable and cheaply recomputable throughout, at the cost of losing some dependencies that would've been previously accounted for: your model becomes faster and more efficient, but perhaps not quite as smart.
reply
Sure, but any classical attention mechanism is quadratic in context length.
reply
But text generation is quadratic after the KV cache optimization. If every decode step now has to recompute KV cache including its latest and most expensive tokens (even with a quick, "draft" model) that's even worse.
reply