The approach here is a poor fit for training, though. Unlike softmax attention, average-hard attention is not usefully differentiable with respect to the keys and queries: the output is piecewise constant in them, so the gradient is zero almost everywhere. And if you try to fix that with, e.g., straight-through estimation, the backward pass cannot be sped up in the same way as the forward pass.
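To make the differentiability point concrete, here is a minimal sketch in plain Python. The keys, values, query, and all function names are made up for illustration, and "average-hard attention" is taken to mean a uniform average over the values whose keys tie for the maximum score. It compares central finite-difference gradients with respect to one query coordinate; it does not attempt to show the straight-through variant or anything about speed.

```python
import math

def scores(q, keys):
    # Dot-product score of the query against each key.
    return [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]

def softmax_attention(q, keys, values):
    s = scores(q, keys)
    m = max(s)  # subtract the max for numerical stability
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return sum(wi / z * v for wi, v in zip(w, values))

def average_hard_attention(q, keys, values):
    # Uniform average over the values whose scores tie for the maximum.
    s = scores(q, keys)
    m = max(s)
    winners = [v for x, v in zip(s, values) if x == m]
    return sum(winners) / len(winners)

def finite_diff(f, q, i, eps=1e-6):
    # Central finite difference of f with respect to q[i].
    q_plus = list(q)
    q_plus[i] += eps
    q_minus = list(q)
    q_minus[i] -= eps
    return (f(q_plus) - f(q_minus)) / (2 * eps)

# Toy data (assumed, not from the thread): three 2-d keys, scalar values.
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [1.0, 2.0, 3.0]
q = [0.3, 0.7]

soft_grad = finite_diff(lambda x: softmax_attention(x, keys, values), q, 0)
hard_grad = finite_diff(lambda x: average_hard_attention(x, keys, values), q, 0)

print("softmax d(out)/d(q[0]):     ", soft_grad)
# The perturbation is too small to change which key wins, so the
# hard-attention output is identical on both sides and the estimate is zero.
print("average-hard d(out)/d(q[0]):", hard_grad)
```

With these toy numbers the second key wins by a clear margin, so the hard-attention output stays constant under small changes to the query, while the softmax output responds smoothly. That zero signal is what a straight-through estimator would be papering over.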
Training is ruled out (see peer comment), but you may find this fascinating; it somewhat rhymes: https://arxiv.org/abs/2603.10055