"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"
I think this does soften, but not linearly. That is to say, the fixed state size means that it softens more the further it gets into the past.
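
For a rough feel for the quoted claim, here is a minimal sketch that truncates the Taylor series of exp(x) and measures the worst-case error over a range of inputs. It assumes that "P = 4" means the terms of order 0 through 3, and the input ranges are my own choice for illustration; the paper's actual approximation and input scaling may differ.

```python
import math

FLOAT16_EPS = 2.0 ** -10  # spacing of float16 values just above 1.0


def taylor_exp(x: float, num_terms: int) -> float:
    """Sum the first `num_terms` terms of the Taylor series of exp(x) around 0."""
    total, term = 0.0, 1.0
    for p in range(num_terms):
        total += term          # add x**p / p!
        term *= x / (p + 1)    # advance to x**(p+1) / (p+1)!
    return total


def max_abs_error(num_terms: int, radius: float, samples: int = 201) -> float:
    """Worst-case |taylor_exp - exp| on an evenly spaced grid over [-radius, radius]."""
    xs = [-radius + 2 * radius * i / (samples - 1) for i in range(samples)]
    return max(abs(taylor_exp(x, num_terms) - math.exp(x)) for x in xs)


if __name__ == "__main__":
    # The radii are illustrative assumptions, not values taken from the paper.
    for radius in (0.25, 0.5, 1.0):
        err = max_abs_error(4, radius)
        print(f"radius={radius}: max error={err:.2e}, float16 eps={FLOAT16_EPS:.2e}")
```

The error of a truncated series grows quickly with the size of the input, so whether four terms are enough depends heavily on how the attention logits are scaled before the expansion.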