MoE was also pretty straightforward, just a bit surprising how well it worked (that you can get away with just 1/32 active parameters), but most researchers would have come up with it on their own probably.
The true ground breaking papers are the first two you mentioned (transformers and gpt2), and InstructGPT was also very surprising that it worked so well.
Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165
I also enjoyed the papers for DeepSeek and GLM for an overview of all the tricks you need to make these things work
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models https://arxiv.org/abs/2512.02556
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models https://arxiv.org/abs/2508.06471