Similarly, I’m always surprised that we don’t start by training a small set of layers, stack them, and then continue training.
One of the main issues is that we don't know how to generate useful computational structure for LLMs, or how to transfer existing structure neatly across architectural variations.
What you describe sounds more like a "progressive growing" approach, which isn't the same, but draws from some similar ideas.
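To make the "train shallow, then stack" idea concrete, here is a minimal toy sketch, assuming dense layers represented as NumPy weight matrices (the `grow_depth` helper and duplication-based initialization are illustrative assumptions, not a method from this thread):

```python
import numpy as np

def grow_depth(layers, copies=2):
    """Toy progressive-growing step: initialize a deeper stack by
    duplicating each already-trained layer's weights. The copies are
    independent arrays, so they can diverge under further training."""
    grown = []
    for W in layers:
        for _ in range(copies):
            grown.append(W.copy())  # copy, not share: each clone trains on its own
    return grown

# Toy example: a trained 2-layer stack grown into a 4-layer stack.
rng = np.random.default_rng(0)
small = [rng.standard_normal((8, 8)) for _ in range(2)]
deep = grow_depth(small, copies=2)
```

In this sketch the deeper model starts from the shallow model's weights rather than from scratch, which is the intuition behind progressive-growing schemes; real variants differ in how they interleave new layers and whether they warm up the copies.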