undefined

points

[-]

This has been my thought for a long time. I think all that matters from attention is that there is crosswise comparison going on.

You need some amount of parallel compute and some amount of global comparison.

And the rest is basically a ways to parameters and scale.

(This is in theory, in practice you can get a lot of small % stability and efficiency improvements that really compound in algorithmic details of model architecture)