Or at least, this is the case if we mean LLM in the classic sense, where the "language" of the middle L refers to natural language. Also note the GP carefully mentioned the importance of multimodality, which, if you include e.g. images, audio, and video, starts to look much closer to the majority of the kinds of inputs humans learn from. Text-only LLMs can't go much farther, sure, but VLMs could conceivably go much, much farther.
So right now the limitation is that an LMM is probably not trained on any images or audio that are going to be helpful outside specific tasks. E.g. I'm sure years of recorded customer-service calls might make LMMs good at replacing a lot of call-centre work, but the relative absence of e.g. unedited videos of people cooking means that LMMs just fall back to mostly text when it comes to providing cooking advice (and this is why they so often fail here).
But yes, that's why the modality caveat is so important. We're still nowhere close to the ceiling for LMMs.
Absolutely. There is only one kind of model that can consistently produce novel sentences that aren't absurd, and that is a world model.
> People won’t even admit their sexual desires to themselves and yet they keep shaping the world
How do you know about other people's sexual desires then, if not through language? (excluding very limited first-hand experience)
Sure. Just like any other information. The system makes a prediction; if the prediction does not use sexual desires as a factor, it's more likely to be wrong. Backpropagation deals with the rest (see the sketch below).
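To make that concrete, here's a minimal sketch in numpy, with toy data and weights that are entirely made up for illustration. It's a logistic regression trained by gradient descent, i.e. the single-layer case of the same weight update backpropagation applies through deeper networks. One feature stands in for a factor nobody labels explicitly; the optimiser still ends up using it, simply because predictions are worse without it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 observed features. Feature 2 is a never-explicitly-named
# signal (a stand-in for a factor people don't talk about) that
# nevertheless influences the behaviour we want to predict.
n = 5000
X = rng.normal(size=(n, 3))
true_w = np.array([1.0, 0.0, 2.0])   # feature 2 matters most
p = 1 / (1 + np.exp(-(X @ true_w)))  # ground-truth probabilities
y = rng.binomial(1, p)               # observed binary outcomes

# Logistic regression trained by plain gradient descent.
w = np.zeros(3)
lr = 0.1
for step in range(2000):
    pred = 1 / (1 + np.exp(-(X @ w)))
    grad = X.T @ (pred - y) / n      # gradient of mean cross-entropy loss
    w -= lr * grad

# The weight on feature 2 grows large: the optimiser latches onto the
# unspoken factor because ignoring it leaves prediction error on the table.
print(w)
```

Nothing here is labeled "desire"; the weight emerges purely from the pressure to reduce prediction error, which is the point being made above.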
Math is language, and we've modelled a lot of the universe with math. I think there's still a lot of synthesis needed to bridge visual, auditory and linguistic modalities though.