Every architecture improvement is essentially a way to achieve the capability of a single fully-connected hidden layer network n wide, but with fewer parameters.
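To put a rough number on "fewer parameters", here is a quick count for one illustrative case. The convolution example, the 32x32 image size, and the function names are all my own assumptions for the sake of the arithmetic, not anything specific from a real model.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases for one fully-connected layer."""
    return n_in * n_out + n_out


def conv2d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights plus biases for one 2D convolution (weights shared across positions)."""
    return kernel * kernel * c_in * c_out + c_out


# Hypothetical setup: map a 32x32 RGB image to a 32x32 map with 16 channels.
height, width, c_in, c_out = 32, 32, 3, 16

dense = dense_params(height * width * c_in, height * width * c_out)
conv = conv2d_params(3, c_in, c_out)

print(f"dense layer: {dense:,} parameters")
print(f"3x3 conv:    {conv:,} parameters")
print(f"ratio:       {dense / conv:,.0f}x")
```

The conv layer can only express a subset of what the dense layer can, which is the trade: you give up generality you mostly didn't need in exchange for a much smaller parameter budget.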
Given that these architectures usually still contain fully connected layers, they should still be able to represent anything if you make the entire thing large enough, unless they've done something really wrong.
That means a large enough [insert model architecture] will be able to approximate any continuous function on a bounded domain to arbitrary precision. As long as the efficiency gains of the architecture are retained as scale increases, it should be able to get there quicker, i.e. with fewer parameters.
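The "wide enough single hidden layer" part is easy to see with a toy construction. This is a minimal sketch, not a trained network: I'm assuming a 1D target on [0, 1] and setting the weights by hand so that a one-hidden-layer ReLU net reproduces the piecewise-linear interpolant of the target. The helper names and the sine target are made up for illustration.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


def build_relu_net(f, width: int):
    """Hand-set weights for a one-hidden-layer ReLU net with `width` units on [0, 1].

    The net is bias + sum_i coeffs[i] * relu(x - knots[i]), which equals the
    piecewise-linear interpolant of f on a uniform grid with `width` segments.
    """
    knots = [i / width for i in range(width)]
    values = [f(i / width) for i in range(width + 1)]
    slopes = [(values[i + 1] - values[i]) * width for i in range(width)]
    # Each unit adds the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, width)]
    return values[0], knots, coeffs


def predict(net, x: float) -> float:
    bias, knots, coeffs = net
    return bias + sum(c * relu(x - k) for k, c in zip(knots, coeffs))


def max_error(f, width: int, samples: int = 1000) -> float:
    """Largest absolute error on an evenly spaced sample of [0, 1]."""
    net = build_relu_net(f, width)
    return max(abs(f(i / samples) - predict(net, i / samples))
               for i in range(samples + 1))


def target(x: float) -> float:
    return math.sin(2 * math.pi * x)


errors = {w: max_error(target, w) for w in (4, 16, 64)}
for w, err in errors.items():
    print(f"width {w:3d}: max error {err:.5f}")
```

The error keeps shrinking as the width grows, which is the whole argument in miniature. The catch is that the width needed can blow up badly for harder functions and higher dimensions, which is where better architectures earn their keep.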
All the foundation model breakthroughs are hoarded by the labs doing the pretraining. That being said, RL reasoning training is the obvious and largest breakthrough for intelligence in recent years.
The most important one in that timeframe was clearly reasoning/RLVR (reinforcement learning with verifiable rewards), which was pioneered by OpenAI's o1, reportedly codenamed Q* and later Strawberry during development.