Let's say you consider the 3 most-recent tokens. The first insight is that you can use a Taylor approximation: At token position 3 you compute A_3 = ((q1, q2, q3) . (k1, k2, k3))^1, B_3 = ((q1, q2, q3) . (k1, k2, k3))^2, C_3 = ((q1, q2, q3) . (k1, k2, k3))^3, etc. [1] [2]
The second insight is that you can compute e.g. B_{i+1} incrementally from B_i, with far fewer FLOPs than computing B_{i+1} from scratch. [3]
[1] I'd buy that it's empirically "good enough" that you don't need to go beyond D_3 (fourth degree polynomial).
[2] I'd also buy that it's empirically "good enough" to assume the inputs aren't extreme enough for E_3, F_3 etc. to matter. I agree with other posters that radius of convergence worries aren't addressed. I find it plausible that these issues don't sink the paper. I'd not be surprised to learn that either it doesn't matter in practice, or workarounds can be implemented without much performance impact.
[3] The author's decision to bury this insight in an appendix rather than putting it front and center is pedagogically baffling, but it's a small issue in the grand scheme of things. Perhaps that second insight is prior work (possibly by others) that experts in the latest LLM linear algebra could reasonably be expected to be familiar with, but is included as an appendix because it's not universally known in e.g. HN comment sections?
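To make the "incremental" part concrete, here is a minimal sketch for the second-degree term only. It relies on the identity (q . k)^2 = (q ⊗ q) . (k ⊗ k), so the sum over past tokens of (k_j ⊗ k_j) ⊗ v_j can be kept as a running state and updated once per token. The function names (`update_state`, `readout`, `from_scratch`) and the toy numbers are my own for illustration; this is not the paper's code and it skips the 1/p! factors and the normalization.

```python
# Illustrative sketch only: helper names and toy data are hypothetical.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def outer2(x):
    """Flattened x (outer) x: every product x[a] * x[b]."""
    return [xa * xb for xa in x for xb in x]

def update_state(state, k, v):
    """Add one token's contribution (k (outer) k) (outer) v to the state, in place."""
    for i, kk_i in enumerate(outer2(k)):
        for j, v_j in enumerate(v):
            state[i][j] += kk_i * v_j

def readout(state, q):
    """Contract the state with q (outer) q; equals sum_j (q . k_j)^2 * v_j."""
    qq = outer2(q)
    d_v = len(state[0])
    return [sum(qq[i] * state[i][j] for i in range(len(qq))) for j in range(d_v)]

def from_scratch(q, keys, values):
    """Reference: loop over every past token for this one query."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        w = dot(q, k) ** 2
        for j, v_j in enumerate(v):
            out[j] += w * v_j
    return out

keys = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]                      # d_k = 2
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]      # d_v = 3
q = [0.5, 1.5]

# State has d_k * d_k rows and d_v columns, regardless of how many tokens arrive.
state = [[0.0] * 3 for _ in range(4)]
for k, v in zip(keys, values):
    update_state(state, k, v)

incremental = readout(state, q)
direct = from_scratch(q, keys, values)
```

The per-token update cost depends only on d_k and d_v, not on how many tokens came before, which is the point of the second insight as I read it.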
They defer it to the appendix because it's a standard construction: (Q'K)V = Q'(KV). Q'K is an n×n matrix and takes O(n²) time to compute, but KV has a constant size (independent of n) and can be computed in O(n) time, and the multiplication with Q' can also be done in O(n) time.
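A quick way to convince yourself of the reordering is to run both groupings on toy matrices. This is a sketch under my own conventions: rows are tokens, so I write the product as Q K^T V, with no causal mask and no normalization. The helpers `matmul` and `transpose` are hypothetical and written in plain Python to keep the example self-contained.

```python
# Illustrative sketch only: plain-Python helpers, toy data, no masking or softmax.

def matmul(a, b):
    """Naive matrix product of a (r x m) and b (m x c)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# n = 4 tokens, d_k = 2, d_v = 3
Q = [[1.0, 0.0], [0.5, 2.0], [1.0, 1.0], [0.0, 3.0]]
K = [[2.0, 1.0], [0.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0], [1.0, 1.0, 1.0]]

# Grouping 1: build the n x n score matrix first (cost grows as n^2).
scores = matmul(Q, transpose(K))          # 4 x 4
quadratic = matmul(scores, V)             # 4 x 3

# Grouping 2: build the d_k x d_v summary first (cost grows as n).
summary = matmul(transpose(K), V)         # 2 x 3, size independent of n
linear = matmul(Q, summary)               # 4 x 3
```

Both groupings give the same n × d_v output; only the size of the intermediate changes, from n × n to d_k × d_v.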
> Section 2.6 gives the hidden state size per token, which, on first read, is strictly larger than the hidden state in normal attention (in normal attention it's d_v * d_k -- I'm not sure where their +1 comes from).
Actually, their hidden state has a (large) constant size, so strike the words "per token" from section 2.6. In normal attention, the total state is n(d_v + d_k), but their state is basically (d_v + 1)D_k, where D_k is much larger than d_k, but independent of n. The +1 is because they also need to compute the normalization factor for the softmax.
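For a rough feel of the trade-off, here is a back-of-the-envelope sketch comparing the two state sizes. The formulas n(d_v + d_k) and (d_v + 1)D_k are from the comment above. The formula for D_k is my assumption: I take it to be the number of monomials of degree at most P in d_k variables, C(d_k + P, P), which is what you'd get if repeated symmetric terms are stored once. The paper's actual D_k may differ, and the function names are hypothetical.

```python
# Illustrative sketch only: D_k below is an assumed formula, not taken from the paper.
from math import comb

def kv_cache_size(n, d_k, d_v):
    """Numbers stored by standard attention after n tokens: one key and one value each."""
    return n * (d_k + d_v)

def taylor_state_size(d_k, d_v, degree):
    """(d_v + 1) * D_k, with D_k assumed to be the count of monomials up to `degree`."""
    D_k = comb(d_k + degree, degree)
    return (d_v + 1) * D_k

def break_even_tokens(d_k, d_v, degree):
    """Sequence length beyond which the constant state is smaller than the KV cache."""
    return taylor_state_size(d_k, d_v, degree) / (d_k + d_v)

small_state = taylor_state_size(d_k=2, d_v=3, degree=2)
small_cache = kv_cache_size(n=10, d_k=2, d_v=3)
```

Under this assumption the constant state is large for realistic head sizes, so the break-even sequence length is long, but past that point the KV cache keeps growing and the fixed state doesn't.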
It's true that a constant state size implies that you cannot use it to losslessly store arbitrarily large databases, but LLMs in practice cannot do this either, so there's no loss of capability in that sense. (In fact, if you use enough terms in the Taylor expansion to get the same result as standard attention to within machine precision, the resulting constant state size should give you an upper bound for the amount of data the LLM can effectively retrieve from its context.)
This is where you’ve gone off track. The “hidden state” for their model is a fixed-size thing, like in an RNN, not per token. For a transformer, the “hidden state” is called the KV cache, and it grows with sequence length. This is why their method is linear, not quadratic.
The Taylor Series they derive isn’t just for softmax (after all, real implementations of softmax will likely already use the Taylor series!), it’s for the entire tensor-level softmax(QK) computation.