Particular architectures don't matter so much yet. It's quite possible that S3-Mamba or xLSTM could be used in lieu of transformers and we would still have LLMs.
2012 fundamentally changed everything for the AI community, I’d argue because TensorFlow/Keras/PyTorch followed and made the infrastructure for distributed training accessible.
I disagree. More critically, I'd argue it's the legacy of the PDP project that led to today's foundation models.
One interesting thing to note in the PDP handbook is that both LeCun and Hinton mention what would later be called CNNs, which LeCun claims to have invented. It seems that Hinton deserves just as much credit as LeCun, and in any case these are discussed there simply as locally connected models that use shared weights as an optimization.
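
To make the "locally connected with shared weights" framing concrete, here is a minimal pure-Python sketch. It is my own illustration, not code or notation from the PDP volumes; the function names `locally_connected_1d` and `shared_weight_1d` and the toy input are assumptions. It shows a 1-D layer where each output only sees a small window of the input, first with a separate filter per position, then with one filter reused everywhere.

```python
def locally_connected_1d(x, weights):
    """Each output i sees only x[i:i+k] and has its OWN k weights."""
    k = len(weights[0])
    return [
        sum(w * v for w, v in zip(weights[i], x[i:i + k]))
        for i in range(len(weights))
    ]


def shared_weight_1d(x, kernel):
    """Same local windows, but every position reuses ONE k-weight kernel."""
    k = len(kernel)
    return [
        sum(w * v for w, v in zip(kernel, x[i:i + k]))
        for i in range(len(x) - k + 1)
    ]


# Toy input and a 3-tap kernel (illustrative values only).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
kernel = [1.0, 0.0, -1.0]
n_out = len(x) - len(kernel) + 1

# Tying every position's weights to the same kernel turns the locally
# connected layer into the shared-weight (convolution-style) layer.
tied = [list(kernel) for _ in range(n_out)]

y_local = locally_connected_1d(x, tied)
y_shared = shared_weight_1d(x, kernel)

params_local = sum(len(row) for row in tied)  # n_out * k free parameters
params_shared = len(kernel)                   # k free parameters
```

With tied weights the two layers compute the same outputs, but the shared version stores only `k` parameters instead of `n_out * k`, which is the sense in which weight sharing reads as an optimization on top of local connectivity.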