Hey, I know what the article wanted to say; see the last two-ish sentences of my previous response. My point is that the article might be misinterpreting the causes of, and solutions for, the problems it sees. Relying on the brain as an example of how to improve might be a mistaken premise, because maybe the brain isn't doing what the article thinks it's doing. So we're in agreement there, that the brain and LLMs are incomparable, but maybe the parts where they are comparable are more informative about the nature of hallucinations than the author may think.
reply
I think you can confidently say that the brain does the following and LLMs don't:

* Continuously updates its state based on sensory data

* Retrieves/gathers information that correlates strongly with historic sensory input

* Is able to associate propositions with specific instances of historic sensory input

* Uses the above two points to verify/validate its belief in said propositions

Describing how memories "feel" may confuse the matter, I agree. But I don't think we should be quick to dismiss the argument.

reply
But the thing is that humans don't hallucinate as much as LLMs do, so it's the differences, not the similarities (such as they are), that matter for understanding why that is.

It's pretty obvious that an LLM not knowing what it does or does not know is a major part of why it hallucinates, while humans generally do know the limits of their own knowledge.

reply
> An LLM was only ever meant to be a linguistics model, not a brain or cognitive architecture.

See https://gwern.net/doc/cs/algorithm/information/compression/1... from 1999.

Answering questions in the Turing test (What are roses?) seems to require the same type of real-world knowledge that people use in predicting characters in a stream of natural language text (Roses are ___?), or equivalently, estimating L(x) [the probability of x when written by a human] for compression.
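
The prediction/compression equivalence is concrete: a model's average negative log2 probability per character is the code length an ideal arithmetic coder driven by that model would produce. A minimal sketch in Python, assuming a toy character-bigram model with add-one smoothing (the function names are illustrative, not from the paper):

    import math
    from collections import defaultdict

    def train_bigram(text):
        # Count character bigrams to estimate P(next_char | prev_char).
        counts = defaultdict(lambda: defaultdict(int))
        for prev, nxt in zip(text, text[1:]):
            counts[prev][nxt] += 1
        return counts

    def bits_to_encode(text, counts, alphabet_size=256):
        # Cross-entropy of the model on the text, in bits: the (ideal) length
        # an arithmetic coder driven by this model would output, so better
        # prediction means better compression.
        total_bits = 0.0
        for prev, nxt in zip(text, text[1:]):
            c = counts.get(prev, {})
            total = sum(c.values()) + alphabet_size       # add-one smoothing
            p = (c.get(nxt, 0) + 1) / total
            total_bits += -math.log2(p)
        return total_bits

    model = train_bigram("roses are red, violets are blue, " * 50)
    print(bits_to_encode("roses are red", model))    # few bits: predictable text
    print(bits_to_encode("qxzjkw vvpqth", model))    # many bits: surprising text

Predictable text costs few bits per character under the model, while random-looking text costs close to the full log2(256), which is the sense in which better prediction and better compression are the same problem.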

reply
I'm not sure what your point is?

Perhaps in 1999 it seemed reasonable to think that passing the Turing Test, or maximally compressing/predicting human text, made for a good AI/AGI test, but I'd say we now know better, and, more to the point, that does not appear to have been the motivation for designing the Transformer or the other language models that preceded it.

The recent history leading to the Transformer was the development of first RNN-based and then LSTM-based language models, then the addition of attention, with the primary practical application being machine translation (but more generally any sequence-to-sequence mapping task). The motivation for the Transformer was to build a more efficient and scalable language model by using parallel rather than sequential (RNN/LSTM) processing, to take advantage of GPU/TPU acceleration.

The conceptual design of what would become the Transformer came from Google employee Jakob Uszkoreit, who has been interviewed about this - we don't need to guess the motivation. There were two key ideas, both originating from the way linguists use syntax trees to represent the hierarchical/grammatical structure of a sentence.

1) Language is as much parallel as sequential, as can be seen by multiple independent branches of the syntax tree, which only join together at the next level up the tree

2) Language is hierarchical, as indicated by the multiple levels of a branching syntax tree

Put together, these two considerations suggest processing the entire sentence in parallel, taking advantage of GPU parallelism (rather than sequentially, as an LSTM does), and having multiple layers of such parallel processing to handle the sentence hierarchically. This eventually led to the stack of parallel-processing Transformer layers, a design that retained the successful idea of attention (hence the paper title "Attention is all you need [not RNNs/LSTMs]").
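
To make the "parallel + hierarchical" idea concrete, here is a stripped-down, purely illustrative sketch of single-head self-attention stacked into layers; it omits the real Transformer's multi-head attention, residual connections, layer norm, feed-forward sublayers and positional encoding. The point is just that each layer transforms every position of the sentence at once with a few matrix multiplies, and the layers are stacked on top of each other:

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model). All positions are transformed together;
        # there is no token-by-token recurrence as in an RNN/LSTM.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # every pair of positions at once
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
        return weights @ V                               # each position attends to all others

    rng = np.random.default_rng(0)
    d = 16
    X = rng.normal(size=(10, d))                         # a 10-token "sentence"
    layers = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(4)]

    # "Hierarchical": stack several such layers, each one seeing the whole
    # sequence in parallel rather than stepping through it sequentially.
    H = X
    for Wq, Wk, Wv in layers:
        H = self_attention(H, Wq, Wk, Wv)
    print(H.shape)                                       # (10, 16)

Nothing in that loop requires processing token t before token t+1, which is exactly what makes it map well onto GPU/TPU hardware, unlike an LSTM's step-by-step recurrence.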

As far as the functional capability of this new architecture goes, the initial goal was just to be as good as the LSTM + attention language models it aimed to replace (while being more efficient to train and scale). The first realization of the "parallel + hierarchical" ideas by Uszkoreit was actually less capable than its predecessors, but then another Google employee, Noam Shazeer, got involved and eventually (after a process of experimentation and ablation) arrived at the Transformer design, which did perform well on the language modelling task.

Even at this stage, nobody was saying "if we scale this up it'll be AGI-like". It took multiple steps of scaling, from Google's early Muppet-themed BERT (following the LSTM-based ELMo) to OpenAI's GPT-1, GPT-2 and GPT-3, for there to be a growing realization of how good a next-word predictor, with corresponding capabilities, this architecture was when scaled up. You can read the early GPT papers and see the growing level of realization - they were not expecting it to be this capable.

Note also that when Shazeer left Google, disappointed that they were not making better use of his Transformer baby, he did not go off and form an AGI company - he went and created Character.ai, making fantasy-themed chatbots (much as Google had experimented with chatbot use and then abandoned it, since without OpenAI's innovation of RLHF, Transformer-based chatbots were unpredictable and a corporate liability).

reply