
The approach here is very bad for training, though. Unlike softmax attention, average-hard attention is not differentiable with respect to the keys and queries, and if you try to fix that, e.g. with straight-through estimation, the backward pass cannot be sped up in the same way as the forward pass.
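
To make the differentiability point concrete, here is a rough sketch in plain Python. It assumes "average-hard attention" means averaging the values at all positions that tie for the maximum score; the toy query, keys, values, and the helper names are made up for illustration and come from neither the article nor any library. Nudging the query changes the softmax output smoothly, while the average-hard output does not move at all as long as the winning key stays the same.

```python
import math

def scores(q, keys):
    # Dot-product score between the query and each key.
    return [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

def softmax_attention(q, keys, values):
    s = scores(q, keys)
    m = max(s)  # subtract the max for numerical stability
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return sum(wi / z * v for wi, v in zip(w, values))

def average_hard_attention(q, keys, values):
    # Assumed definition: uniform average over all max-scoring positions.
    s = scores(q, keys)
    m = max(s)
    top = [v for x, v in zip(s, values) if x == m]
    return sum(top) / len(top)

def finite_diff(attn, q, keys, values, eps=1e-4):
    # Forward-difference slope of the output w.r.t. the first query coordinate.
    q_shift = [q[0] + eps] + q[1:]
    return (attn(q_shift, keys, values) - attn(q, keys, values)) / eps

# Hypothetical toy inputs (scalar values to keep the arithmetic simple).
q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [1.0, 2.0, 3.0]

soft_slope = finite_diff(softmax_attention, q, keys, values)
hard_slope = finite_diff(average_hard_attention, q, keys, values)

print("softmax slope:     ", soft_slope)  # nonzero: useful gradient signal
print("average-hard slope:", hard_slope)  # zero: the argmax did not change
```

The hard variant is piecewise constant in the query (and likewise in the keys), so its gradient is zero wherever it exists and undefined at ties, which is why a surrogate such as a straight-through estimator is needed to train through it.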

by refulgentis 3 hours ago


Training is ruled out (see peer comment). However, you may find this fascinating, as it somewhat rhymes: https://arxiv.org/abs/2603.10055