> this is where the Taylor expansion would fail to represent the values well.

"In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution"

reply
I read that too, but I wondered whether elementwise error is the right metric. Surely the actual error metric should be to evaluate model performance for a conventional transformer model and then the same model with the attention mechanism replaced by this 4th order Taylor approximation?
reply
A bound on the elementwise error is, by definition, a stricter evaluation criterion than "performance" metrics obtained by running the model.
reply
To spell it out for myself and others: approaching equivalent calculations for each individual attention block means we also approach equivalent performance for the combination of them. And with an error bar approaching floating point accuracy, the performance should be practically identical to regular attention. Elementwise errors of this magnitude can't lead to any noteworthy changes in the overall result, especially given how robust LLM networks seem to be to small deviations.
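A toy sketch of that argument (the sizes, the downstream linear head, and the noise model are all made up for illustration): perturb an attention output elementwise at roughly Float16 relative resolution (~1e-3) and check that downstream predictions barely move.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Plain single-head softmax attention at toy sizes.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

d = 16
q, k, v = (rng.standard_normal((8, d)) for _ in range(3))
out = attention(q, k, v)

# Elementwise relative perturbation of ~1e-3, standing in for an
# approximation error on the order of Float16 resolution.
noisy = out * (1.0 + 1e-3 * rng.standard_normal(out.shape))

# Hypothetical downstream head: linear projection plus softmax.
W = rng.standard_normal((d, 10))
p_clean = softmax(out @ W)
p_noisy = softmax(noisy @ W)
print(np.abs(p_noisy - p_clean).max())  # tiny shift in probabilities
```

A real transformer stacks many such blocks, so errors can compound, but with per-block errors this small the compounding stays far below anything that would change behavior.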
reply
> This just... softens attention's ability to attend?

I think this does soften, but not uniformly. That is to say, the fixed state size means it softens more the further back into the past you go.

reply
Right, and when they compare to floating point accuracy they seem to be counting the significant digits supported by the mantissa, but the exponent matters too, no?
reply
When someone says the error is of a certain magnitude, they mean the absolute value of the difference between the two things. So what they're saying is that the values produced by their approximation are about as wrong as the difference between the exact values and those same values cast to Float16. The exponent is most definitely important, and it is included in that measurement.
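A quick NumPy check of that point: the absolute cast error |x − float16(x)| scales with the exponent of x, while the relative error stays near the mantissa resolution (below 2^-10), so the exponent is automatically accounted for when errors are quoted as magnitudes.

```python
import math
import numpy as np

# Same value at three different scales (pi chosen arbitrarily so the
# cast is never exact).
xs = [math.pi * s for s in (1e-2, 1.0, 1e2)]

# Absolute error of the round-trip through Float16.
abs_errs = [abs(x - float(np.float16(x))) for x in xs]

# Relative error: bounded by the mantissa resolution regardless of scale.
rel_errs = [e / x for e, x in zip(abs_errs, xs)]

for x, a, r in zip(xs, abs_errs, rel_errs):
    print(f"x={x:10.4f}  abs_err={a:.3e}  rel_err={r:.3e}")
```

The absolute error grows by roughly two orders of magnitude per factor of 100 in x, while the relative error stays put.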
reply