Well, if in all situations you can predict which word Einstein would probably say next, then I think you're in a good spot.
This "most probable" stuff is just absurd handwaving. Every prompt of even a few words is unique; there simply is no trivially "most probable" continuation. Probable given what? What these machines learn to do is predict what intelligence would do, which is the same as being intelligent.
The training data.
>predicting what intelligence would do
No, it just predicts what the next word would be if an intelligent entity translated its thoughts into words, because it is trained on text written by intelligent entities.
If it was trained on text written by someone who loves to rhyme, you would be getting all rhyming responses.
It imitates the behavior -- in text -- of whatever entity generated the training data. Here the training data was made by intelligent humans, so we get an imitation of the same.
It is a clever party trick that works often enough.
If the prompt is unique, it is not in the training data. True for basically every prompt. So how is this probability calculated?
Type "owejdpowejdojweodmwepiodnoiwendoinw welidn owindoiwendo nwoeidnweoind oiwnedoin" into ChatGPT and the response is "The text you sent appears to be random or corrupted and doesn’t form a clear question." because the prompt doesn't correlate to training data.
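To make "probable given what?" concrete, here's a toy sketch: a character-bigram model trained on a tiny made-up corpus. The corpus and the test strings are purely illustrative, but the point survives scaling -- a keyboard mash scores far lower than in-distribution text, even though neither string appears verbatim in the training data:

```python
import math
from collections import Counter

# Toy "probability given the training data": a character-bigram model.
corpus = "the quick brown fox jumps over the lazy dog " * 50

counts = Counter(zip(corpus, corpus[1:]))  # bigram counts
unigrams = Counter(corpus)                 # single-character counts
vocab = len(set(corpus))                   # distinct characters

def log_prob(text):
    """Add-one-smoothed log probability of a string under the bigram model."""
    total = 0.0
    for a, b in zip(text, text[1:]):
        total += math.log((counts[(a, b)] + 1) / (unigrams[a] + vocab))
    return total

english = "the lazy dog jumps"
gibberish = "owejdpowejdojweodm"

# Neither string is in the corpus verbatim, but the in-distribution one
# scores far higher (less negative) than the keyboard mash.
print(log_prob(english), log_prob(gibberish))
```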
But the human brain (or any other intelligent brain) does not work by generating a probability distribution over the next word. Even beings that do not have a language can think and act intelligently.
Wait, what? So a robot that accurately copies the actions of an intelligent human is intelligent?
If it's just basically being a puppet, then no. You tell me what Claude Code is more like: a puppet, or a person?
But that is the key insight, how can you tell when an imitation of intelligence becomes the real thing?
If the idea is that something cannot accurately replicate the entirety of intelligence without being intelligent itself, then perhaps. But that isn't really what people talk about with LLMs given their obvious limitations.
(And even then it's kind of overly dismissive and underspecified. The "most probable word" is defined over some training data set. So imagine if you train on e.g. mathematicians solving problems... To do a good job at predicting [w/o overfitting], your model will have to in fact get good at thinking like a mathematician. In general, "to be able to predict what is likely to happen next" is probably one pretty good definition of intelligence.)
It just changes the probability distribution that it is approximating.
To the extent that thinking is making a series of deductions from prior facts, it seems to me that thinking can be reduced to "pick the next most probable token from the correct probability distribution"...
(With this perspective, I can feel my own brain subtly offering up a panoply of possible responses in a similar way. I can even turn up the temperature on my own brain, making it more likely to decide to say the less-obvious words in response, by having a drink or two.)
(Similarly, mimicry is, in humans too, a very good learning technique to get started -- kids learning to speak are little parrots, artists just starting out will often copy existing works, etc., before going on to develop their own style.)
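The temperature knob from the analogy above can be sketched concretely. This is a toy example -- the candidate words and logit values are made up, not taken from any real model:

```python
import math

# Hypothetical next-token logits for three candidate words.
logits = {"obvious": 4.0, "plausible": 2.0, "weird": 0.5}

def softmax(logits, temperature):
    """Convert logits to probabilities; temperature flattens or sharpens."""
    scaled = {w: l / temperature for w, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {w: math.exp(v) / z for w, v in scaled.items()}

cold = softmax(logits, temperature=0.5)
hot = softmax(logits, temperature=2.0)

# At low temperature "obvious" dominates; at high temperature the
# probability mass spreads toward "weird" -- the drink-or-two effect.
print(cold["weird"], hot["weird"])
```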
I've never seen any evidence that thinking requires such a thing.
And honestly I think theoretical computational classes are irrelevant to analysing what AI can or cannot do. Physical computers are only equivalent to finite state machines (ignoring the internet).
But the truth is that if something is equivalent to a finite state machine, with an absurd number of states, it doesn't really matter.
As typically deployed [1], LLMs are not Turing complete. They're closer to linear bounded automata, but because transformers have a strict maximum input size they're actually a subset of the weaker class of deterministic finite automata [2]. These aren't like Python programs or something that can work on as much memory as you supply them; their architecture works on a fixed maximum amount of memory.
I'm not particularly convinced Turing completeness is the relevant property though. I'm rather convinced that I'm not Turing complete either... my head is only so big after all.
[1] i.e. in a loop that appends output tokens to the input and has some form of sliding context window (perhaps with some inserted instructions to "compact" and then sliding the context window right to after those instructions once the LLM emits some special "done compacting" tokens).
[2] Common sampling procedures make them mildly non-deterministic, but I don't believe they do so in a way that changes the theoretical class of these machines from DFAs.
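The deployment in [1] can be sketched as a loop whose entire state is a bounded context window, which is why it lands in a finite-state class. Everything below (the window size, the `llm_step` stub) is a hypothetical stand-in, not a real model:

```python
# Sketch of the "LLM in a loop" deployment: the system's entire state is
# the context window, which has a fixed maximum size. With finitely many
# possible windows, there are finitely many states -- a (very large) DFA.
CONTEXT_WINDOW = 8  # tokens; real models use tens or hundreds of thousands

def llm_step(context):
    """Stand-in for a deterministic forward pass: maps a bounded context
    to one output token. Here: a trivial rule based on context length."""
    return f"t{len(context) % 3}"

def run(prompt_tokens, steps):
    context = list(prompt_tokens)[-CONTEXT_WINDOW:]
    for _ in range(steps):
        token = llm_step(context)
        context.append(token)
        # Sliding window: drop the oldest tokens once the window is full.
        context = context[-CONTEXT_WINDOW:]
    return context

out = run(["a", "b", "c"], steps=20)
# The context never exceeds the window, no matter how many steps run.
print(len(out) <= CONTEXT_WINDOW)
```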
You can remain unconvinced that Turing completeness is relevant all you want -- we don't know of any more expansive category of computable functions, and given that an LLM in the setup described is Turing complete, the fact that they aren't typically deployed that way is irrelevant.
They trivially can be, and that is enough to make the shallow dismissal of pointing out they're "just" predicting the next token meaningless.
Also, people definitely talk about them as "thinking" in contexts where they haven't put a harness capable of this around them. And in the common contexts where people do put a harness theoretically capable of this around the LLM (e.g. giving the LLM access to bash), the LLM basically never uses that theoretical capability as the extra memory it would need to actually emulate a Turing machine.
And meanwhile I can use external memory myself in a similar way (e.g. writing things down), but I think I'm perfectly capable of thinking without doing so.
So I persist in my stance that Turing completeness is not the relevant property, and isn't really there anyway.
But it is trivially possible to give systems-including-LLMs external storage that is accessible on demand.
The base models are trained to do this. If a web page contains a problem, and then the word "Answer: ", it is statistically very likely that what follows on that web page is an answer. If the base model wants to be good at predicting text, at some point learning the answers to common questions becomes a good strategy, so that it can complete text that contains these.
NN training tries to push models to generalize instead of memorizing the training set, so this creates an incentive for the model to learn a computation pattern that can answer many questions, instead of just memorizing. Whether they actually generalize in practice... it depends. Sometimes you still get output that was clearly pulled verbatim from the training set.
But that's only base models. The actual production LLMs you chat with don't predict the most probable word according to the raw statistical distribution. They output the words that RLHF has rewarded them to output, which includes acting as an assistant that answers questions instead of just predicting text. RLHF is also the reason there are so many AI SIGNS [1] like "you're absolutely right" and way more use of the word "delve" than is common in western English.
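A rough sketch of the contrast between the two training signals. The numbers and token names are invented for illustration; real pretraining and RLHF objectives are far more involved than this:

```python
import math

def base_model_loss(predicted_probs, actual_next_token):
    """Pretraining: cross-entropy against the token that really came
    next in the web text. Low loss = good next-token prediction."""
    return -math.log(predicted_probs[actual_next_token])

def rlhf_objective(response_reward):
    """RLHF: maximize a learned reward for the whole response, e.g. a
    reward model that scores 'helpful assistant answer' highly."""
    return response_reward

# A base model is pushed toward whatever token actually followed...
probs = {"Answer": 0.6, "banana": 0.1, "the": 0.3}
loss = base_model_loss(probs, "Answer")

# ...while RLHF pushes toward responses the reward model likes,
# regardless of what raw web text would statistically continue with.
print(loss, rlhf_objective(0.9))
```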
"just the most probable word" is a pretty powerful mechanism when you have all of human knowledge at your fingertips.
I say that people "reduce it" that way because it neatly packs in the assumption that general intelligence is something other than next-token prediction. I'm not saying we've arrived at AGI; in fact, I do not believe we have. But it feels like people who use that framing are snarkily writing off something that they themselves do not fully comprehend, behind the guise of being "technically correct."
I'm not saying all people do this. But I've noticed many do.
Further, some solutions are like running a maze. If you know all the wrong turns/next words and can just brute-force the right ones, you might find a solution like a mouse running through the maze, never seeing the whole picture.
Whether this is thinking is more philosophical. To me this demonstrates more that we are closer to bio computers than an LLM is to having some sort of divine soul.
The power of LLMs is that by only selecting sequences of words that fit a statistical model, they avoid a lot of dead ends.[1]
I would not call that, by itself, thinking. However, if you start with an extrapolation engine and add the ability to try multiple times and build on previous results, you get something that's kind of like thinking.
[1]: Like, a lot of dead ends. There are an unfathomable number of dead ends in generating 500 characters of code, and it is a miracle of technology that Claude only hit 30.
But that does not mean that the results cannot be dramatic. Just like stacking pixels can result in a beautiful image.
These models actually learn distributed representations of nontrivial search algorithms.
A whole field of theorem proving, after decades of refinement, couldn't even win a medal, yet 8B-param models are doing it very well.
The attention mechanism, a brute-force quadratic approach, combined with gradient descent, is actually discovering very efficient distributed representations of algorithms. I don't think they can even be extracted and made into an imperative program.
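For reference, the quadratic part is easy to see in a minimal scaled dot-product attention written out by hand (toy vectors, no claim about any real model): every query is scored against every key, so the score matrix is n x n in the sequence length:

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention in pure Python, for illustration."""
    d = len(queries[0])
    out = []
    for q in queries:                      # n queries...
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]           # ...times n keys: O(n^2) scores
        z = sum(math.exp(s) for s in scores)
        weights = [math.exp(s) / z for s in scores]
        # Each output row is a softmax-weighted mix of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
print(out)
```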
Great! It will now correctly structure chess games, but we've created no incentive for it to create a game where white wins or to make the next move be "good".
Ok, so now you change the objective. Now let's say "we don't just want valid games, we want you to predict the next move that will help that color win"
And we train towards that objective and it starts picking better moves (note: the moves are still valid)
You might imagine more sophisticated ways to optimize picking good moves. You continue adjusting the objective function, you might train a pool of models all based off of the initial model and each of them gets a slightly different curriculum and then you have a tournament and pick the winningest model. Great!
Now you might have a skilled chess-playing-model.
It is no longer correct to say it just produces valid chess games, because the objective function changed several times throughout this process.
This is exactly how you should think about LLMs, except the ways the objective function has changed are significantly more complicated than for our chess bot.
So to answer your first question: no, that is not what they do. That is a deep oversimplification that was accurate for the first two generations of models, and sort of accurate for the "pretraining" step of modern LLMs (except not even that accurate, because pretraining instills other objectives, almost like swapping our first step "predict valid chess moves" with "predict Stockfish outputs").
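The staged-objective story above can be sketched with toy stand-ins. `is_legal` and `move_quality` are hypothetical oracles invented for illustration, not real chess code:

```python
def is_legal(move):
    """Hypothetical legality oracle over a tiny set of opening moves."""
    return move in {"e4", "d4", "Nf3", "a3"}

def move_quality(move):
    """Hypothetical evaluation: strong openings score higher."""
    return {"e4": 0.9, "d4": 0.85, "Nf3": 0.8, "a3": 0.2}.get(move, 0.0)

def objective_v1(move):
    # First objective: just produce valid games.
    return 1.0 if is_legal(move) else 0.0

def objective_v2(move):
    # Revised objective: valid AND likely to help the side to move win.
    return objective_v1(move) * move_quality(move)

# Under v1, "a3" and "e4" are rewarded equally; under v2 they are not.
# A model trained through both stages is shaped by both objectives.
print(objective_v1("a3") == objective_v1("e4"))
print(objective_v2("e4") > objective_v2("a3"))
```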
All your brain is doing is bouncing atoms off each other, with some occasionally sticking together, how can it be really thinking?
See how silly it sounds?
Be on the lookout for folks who tell you these machines are limited because they are "just predicting the next word." They may not know what they're talking about.