If we look beyond written languages, which are late inventions of human civilization, oral languages are continuous and built from blocks, not words.
The Chomskyan school misled the entire field of linguistics for decades by ignoring spoken languages.
Example: a programming language's capability to produce complex software does not come from some inherent quality of the language. It comes from binary: 0s and 1s representing basic logic, with an abstract "tool" called a language built on top of that. If the binary logic didn't work, the language wouldn't do anything.
Dolphins can make sounds, and technically have a language, but (as far as we can tell) they can't manipulate or recursively compound concepts in order to create modified ideas. If they could, they probably would have come up with vastly more advanced fishing methods than the (admittedly novel) ones they have now.
Since this description is so abstract, an exercise for the reader is to try to work through an explanation of how, say, a river delta comes to "learn" about its environment by "reacting" to the influences at its borders, and how it "encodes" whatever it learns in the substrate that it inhabits.
So you're missing a lot of the building blocks that make LLMs. It's not a matter of just having the compute.
Like the best leaps in thinking, once it is made, it is immediately obvious and intuitive.
Residual connections are so simple, so obvious and so vital. Yet nobody came up with them until 2015?
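To show just how simple the idea is, here is a minimal sketch in plain Python. The names `residual_block`, `zero_branch` and `halve` are made up for illustration, and real implementations work on tensors with learned layers rather than lists, but the core is just "output = input + f(input)":

```python
def residual_block(x, f):
    """Return x + f(x) elementwise; f must preserve the length of x."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("the branch f must return a vector of the same length as x")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    # A branch that has learned nothing: the block reduces to the identity.
    return [0.0 for _ in x]


def halve(x):
    # Hypothetical stand-in for a learned transformation.
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 3.0]
out_identity = residual_block(x, zero_branch)  # input passes through unchanged
out_halved = residual_block(x, halve)          # input plus the branch output
```

The point of the sketch: when the branch contributes nothing, the block is an identity map, so stacking more blocks does not have to make things worse.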
I think as time went on and hardware got better, it seemed more reasonable to actually think about a viable implementation of what I suspect was a widespread intuition among people in ML: that everything's context is everything.
It just seemed like a theoretical thing until hardware caught up. Maybe. Perhaps I'm applying a retrospective excuse for why it took so long.
I don't think it was intuitive to anyone back then; the vanishing gradient problem had been a big deal since the dawn of NNs. I'm not sure what you mean by sheer computation: residuals allow you to have deep networks instead of shallow and wide ones, and you can do that at an equivalent parameter count.
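A toy illustration of the vanishing gradient point, under heavy simplifying assumptions: each "layer" is a scalar function w * tanh(x) with the same fixed weight, and `gradient_through_stack` is a hypothetical helper, not anything from a real framework. It multiplies the local derivatives along the stack (the chain rule), with and without the skip connection:

```python
import math


def gradient_through_stack(x, w, depth, residual):
    """d(output)/d(input) for `depth` stacked scalar layers, via the chain rule.

    Plain layer:    x_next = w * tanh(x)
    Residual layer: x_next = x + w * tanh(x)
    """
    grad = 1.0
    for _ in range(depth):
        local = w * (1.0 - math.tanh(x) ** 2)  # derivative of w * tanh(x)
        if residual:
            grad *= 1.0 + local  # the skip path contributes the constant 1
            x = x + w * math.tanh(x)
        else:
            grad *= local
            x = w * math.tanh(x)
    return grad


plain = gradient_through_stack(0.5, 0.5, 50, residual=False)
resid = gradient_through_stack(0.5, 0.5, 50, residual=True)
print(f"plain: {plain:.3e}, residual: {resid:.3e}")
```

With w = 0.5, every plain factor is at most 0.5, so 50 layers push the gradient below 1e-10, while every residual factor is at least 1, so the gradient never shrinks. Both stacks use the same number of weights; only the wiring differs.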
No, it's not. There are many organisms with literally zero neurons that show extremely complex and even learned behaviour.
Clearly "neurons" is an oversimplified just-so story, not a scientific theory.