You need some amount of parallel compute and some amount of global comparison.
And the rest is basically a way to add parameters and scale.
(That's in theory. In practice, you can get a lot of small-percentage stability and efficiency improvements from the algorithmic details of the model architecture, and those really compound.)
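
To make the two ingredients concrete, here is a rough sketch in plain Python. It assumes "parallel compute" means a per-position transform that never looks at other positions, and "global comparison" means something attention-like, where every position is scored against every other. Both readings are my assumptions, and the function names (`parallel_compute`, `global_comparison`, `block`) are made up for illustration, not taken from any library.

```python
import math

def parallel_compute(xs, w):
    """Apply the same linear map + ReLU to every position independently.

    xs: list of positions, each a list of d_in floats.
    w:  d_in x d_out weight matrix as a list of rows (hypothetical parameters).
    """
    cols = list(zip(*w))  # columns of w, one per output feature
    # No position reads any other position, so this step parallelizes trivially.
    return [[max(0.0, sum(xi * wi for xi, wi in zip(x, col))) for col in cols]
            for x in xs]

def global_comparison(xs):
    """Score every position against every other, then mix by softmax weights."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot product of this position with all positions.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [wt / total for wt in weights]
        # Each output is a weighted average of all positions.
        out.append([sum(wt * v[j] for wt, v in zip(weights, xs)) for j in range(d)])
    return out

def block(xs, w):
    """One 'layer': global comparison followed by per-position compute."""
    return parallel_compute(global_comparison(xs), w)

if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(block(xs, identity))
```

Under this reading, "the rest" is mostly stacking more of these blocks and making `w` bigger.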