But I do think that there is further underlying structure that can be exploited. A lot of recent work on geometric and latent interpretations of reasoning, on geometric approaches to accelerating grokking, and on linear replacements for attention points in promising directions, and multimodal training will further improve semantic synthesis.
People won’t even admit their sexual desires to themselves and yet they keep shaping the world. Can ChatGPT access that information somehow?
Or at least, this is the case if we mean LLM in the classic sense, where the "language" in the middle L refers to natural language. Also note GP carefully mentioned the importance of multimodality, which, if you include e.g. images, audio, and video, starts to look much closer to the majority of the kinds of inputs humans learn from. LLMs can't go too far, for sure, but VLMs could conceivably go much, much farther.
So right now the limitation is that an LMM is probably not trained on any images or audio that are going to be helpful for stuff outside specific tasks. E.g. I'm sure years of recorded customer service calls might make LMMs good at replacing a lot of call-centre work, but the relative absence of e.g. unedited videos of people cooking means that LMMs just fall back to mostly text when it comes to providing cooking advice (and this is why they so often fail here).
But yes, that's why the modality caveat is so important. We're still nowhere close to the ceiling for LMMs.
Absolutely. There is only one model that can consistently produce novel sentences that aren't absurd, and that is a world model.
> People won’t even admit their sexual desires to themselves and yet they keep shaping the world
How do you know about other people’s sexual desires then, if not through language? (excluding very limited first-hand experience)
Sure. Just like any other information. The system makes a prediction. If the prediction does not use sexual desires as a factor, it's more likely to be wrong. Backpropagation deals with it.
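A toy sketch of what I mean (purely illustrative, with made-up data, nothing a real LLM actually trains on): a predictor optimised only to reduce prediction error ends up putting weight on any input factor that helps, whether or not anyone ever names that factor out loud.

```python
# Toy illustration (assumed/synthetic data): gradient descent assigns weight to an
# "unadmitted" factor simply because ignoring it makes predictions worse.
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# x[:, 0]: an openly stated preference; x[:, 1]: an unstated drive nobody admits to.
# Both shape the observed behaviour y, and the unstated one matters more.
x = rng.normal(size=(n, 2))
logits = 1.0 * x[:, 0] + 2.0 * x[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(float)

# Plain logistic regression trained by gradient descent (backprop in its simplest form).
w = np.zeros(2)
lr = 0.5
for _ in range(5000):
    p = 1 / (1 + np.exp(-(x @ w)))       # predicted probability of the behaviour
    grad = x.T @ (p - y) / n             # gradient of the cross-entropy loss
    w -= lr * grad

print("learned weights:", w)  # both weights end up non-zero, and the unstated
                              # factor gets the larger one: the loss forces it
```

That's all "backpropagation deals with it" means here: the gradient pulls in whatever signal reduces the error, admitted or not.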
Math is language, and we've modelled a lot of the universe with math. I think there's still a lot of synthesis needed to bridge visual, auditory and linguistic modalities though.
E.g. syllogistic arguments based on linguistic semantics can lead you deeply astray if those arguments don't properly measure and quantify at each step.
I ran into this in a somewhat trivial case recently, trying to get ChatGPT to tell me whether washing mushrooms ever actually matters in practical cooking (anyone who cooks and has tested it knows that a quick wash has basically no impact for any conceivable cooking method, except if you wash e.g. after cutting and are immediately serving them raw).
Until I forced it to cite respectable sources, it just repeated the usual (false) advice about not washing (i.e. most of the training data is wrong and repeats a myth), and it even pushed back with absolute nonsense arguments about water percentages and the thermal energy required to evaporate even small amounts of surface water (i.e. theory that just isn't relevant once you actually quantify things). It also made up stuff about surface moisture interfering with breading (when all competent breading has a dredging step that actually won't work if the surface is bone dry anyway...). Only after a lot of prompts, and demands that it only make claims supported by reputable sources, did it finally find McGee's and Kenji López-Alt's actual empirical tests showing that washing just doesn't matter in practice.
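For a rough sense of why the thermal-energy pushback doesn't survive quantification, here's a back-of-envelope calculation (my own assumed figures for the amount of clinging water and the burner output, not numbers from ChatGPT or the sources):

```python
# Back-of-envelope (assumed figures): energy needed to boil off the water a quick rinse leaves.
water_g = 3.0                    # rough guess for clinging surface water after a rinse
latent_heat_j_per_g = 2260.0     # heat of vaporisation of water near 100 °C
burner_w = 1500.0                # rough effective output of a home burner

energy_j = water_g * latent_heat_j_per_g
print(f"{energy_j / 1000:.1f} kJ, about {energy_j / burner_w:.0f} s of burner output")
# ~6.8 kJ, i.e. roughly 4-5 seconds of heat: negligible over a several-minute sauté.
```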
So because the training data for cooking is utterly polluted, because the model has no ACTUAL understanding or model of how cooking works, and because first-principles physics and chemistry aren't much use against the messy reality of cooking, LLMs really fail quite horribly at producing useful cooking advice.