While we do not yet know whether exact softmax attention
can be maintained with the same efficiency, it is easy to
approximate it with k-sparse softmax attention: retrieve
the top-k keys and perform the softmax only over those.
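For concreteness, here is a minimal sketch of that idea in PyTorch for a single query: compute exact dot-product scores, keep only the top-k, and normalize over just those. The function name is illustrative, and a dense `torch.topk` stands in for whatever retrieval mechanism actually fetches the top-k keys.

```python
import torch
import torch.nn.functional as F

def k_sparse_attention(q, K, V, k):
    """k-sparse softmax attention for one query.
    q: (d,), K: (n, d), V: (n, d_v). Softmax is taken over the top-k scores only."""
    scores = K @ q / K.shape[-1] ** 0.5       # scaled dot-product scores, shape (n,)
    top_scores, idx = torch.topk(scores, k)   # keep only the k largest scores
    weights = F.softmax(top_scores, dim=-1)   # normalize over the selected keys only
    return weights @ V[idx]                   # weighted sum of the k selected values

q = torch.randn(64)
K, V = torch.randn(1024, 64), torch.randn(1024, 64)
out = k_sparse_attention(q, K, V, k=32)       # (64,) approximation of full attention
```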
But if you have played around with training models that use top-k or other hard thresholding operations in, say, PyTorch (or just think about how many gradients become exactly zero under such an operation), you know that these tend to work only in very limited, specific cases, and they make training even more finicky than it already is. So the issue isn't whether there are ways to approximate their approach effectively from a strictly numerical standpoint; it's that other factors matter much more for optimization when training on actual data.
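To see the gradient issue concretely, here is a tiny illustration (my own, not anything from the paper): only the scores that survive the top-k selection receive gradient, so keys that are never retrieved get exactly zero signal and have no way to learn to become retrievable.

```python
import torch
import torch.nn.functional as F

values = torch.randn(8)                       # stand-in per-key values
scores = torch.randn(8, requires_grad=True)   # stand-in query-key scores
top_scores, idx = torch.topk(scores, k=3)     # hard selection of 3 of the 8 keys
out = (F.softmax(top_scores, dim=-1) * values[idx]).sum()
out.backward()
print(scores.grad)   # nonzero only at the 3 selected positions, exactly zero elsewhere
```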