If you obtained exactly the same output for a given prompt regardless of context, that would mean the context is being ignored, which is indistinguishable from the session maintaining no context at all, with each prompt starting in a brand-new, empty context.
Now, what some people want are requirements like:
- Different wordings of a prompt with exactly the same meaning should not change anything in the output; e.g. whether you ask "What is the capital of France" or "What is France's capital", the answer should be verbatim identical.
- Prior context should not change responses in ways that have no interaction with that context. For instance, if the prompt is "what is 2 + 2", the answer should always be the same, unless the context instructs the LLM that 2 + 2 is to be five.
These kinds of requirements betray a misunderstanding of what these LLMs are.
"The context is the input" betrays a misunderstanding of what (artificial) intelligence systems are aiming for.
We have observed situations where agentic LLM traces on verifiable problems, with deterministic (greedy) decoding, end up either completely correct or completely wrong depending on the minutes on the clock, which happen to be printed as incidental output by some tool the LLM used.
I think there may be some mild fixes available for current models. For example, it is worrying that the attention mechanism can never fully disregard any token in the input, because the softmax always assigns a weight > 0 everywhere (and the network has no way of setting a logit to -infinity). This makes it extremely difficult for the LLM to reliably ignore any part of the context.
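A minimal sketch of that point in NumPy (toy logits of my own choosing, not from any real model): for any finite row of attention logits, the softmax weights are all strictly positive, and only an explicit -infinity logit, as used in masking, yields an exactly-zero weight.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over one row of attention logits.
    shifted = logits - np.max(logits)
    weights = np.exp(shifted)
    return weights / weights.sum()

# Toy attention logits for one query over five context tokens.
# The last token is scored very low, but the score is still finite.
logits = np.array([4.0, 2.0, 0.5, -3.0, -30.0])
weights = softmax(logits)
print(weights)        # every weight is strictly > 0
print(weights[-1])    # tiny (around 1e-15 here), but never exactly 0

# Only an explicit -inf logit, as used in causal masking, produces an
# exactly-zero weight; a network emitting finite logits cannot do this.
masked = np.array([4.0, 2.0, 0.5, -3.0, -np.inf])
print(softmax(masked))  # last weight is exactly 0.0
```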
However, Yann LeCun actually offers some persuasive arguments that autoregressive decoding has limitations and that we may need something better.
I see this a lot. I kinda doubt the "simple" part, but even beyond that, is there any evidence that a statistical predictor can't be a universal answering machine? I think there's plenty of evidence that our thinking is at least partially a statistical predictor (e.g. when you see a black sheep you don't think "at least one side of this sheep is black"; you fully expect it to be black on both sides).
I'm not saying that LLMs _are_ universal answering machines. I'm wondering why people question that they are, or can become, one based on the argument that "fundamentally they are statistical predictors". So they are. So what?
If it does, statistical predictors can't help you because they're not always correct or even meaningful (correlation does not imply causation).
If it doesn't, then by all means enjoy your infinite monkeys.
They do not. Refusing to bend your requirements to a system that can't satisfy them is not evidence of misunderstanding the system.
And if you tack on "with X 9s of reliability" then it is something LLMs can do. And in the real world every system has a reliability factor like that.
There are going to be false positives: text that is subtly different from a previous response is misidentified as a duplicate, so the previous response is substituted for it, frustrating the user.
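As a toy illustration of that failure mode (my own construction, not anything described above): a naive cache that treats "similar enough" prompts as duplicates will happily return a stale response for a prompt whose small difference actually matters.

```python
from difflib import SequenceMatcher

# Hypothetical response cache keyed by previously seen prompts.
cache = {"Summarize invoice #1042": "Invoice #1042: 3 items, total $210."}

def lookup(prompt, threshold=0.9):
    # Treat any stored prompt above the similarity threshold as a duplicate.
    for cached_prompt, response in cache.items():
        if SequenceMatcher(None, prompt, cached_prompt).ratio() >= threshold:
            return response
    return None

# One character differs, but the similarity score clears the threshold,
# so the user asking about invoice #1043 gets the answer for #1042.
print(lookup("Summarize invoice #1043"))  # -> "Invoice #1042: 3 items, total $210."
```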
Why and how is this a problem?
If 'preceding context' doesn't cause different results, it means you can simply discard the context. Why would I want that? It's not how I expect a tool to work (I expect vim to respond differently to my input after I switch to insert mode). It's absolutely not how I expect intelligence to work either. It sounds like the most extreme form of confirmation bias.
This is a common AI benchmark, and was for years before GPT-2 even existed. LLMs need to not get distracted by irrelevant facts, and there are tests that measure this. It's the motivation for attention mechanisms, which are the breakthrough that enabled LLMs to scale up.