> I've never heard the caveat that it can't be attributable to misinformation in the pre-training corpus.

If the LLM is accurately reflecting the training corpus, it wouldn’t be considered a hallucination. The LLM is operating as designed.

Matters of access to the training corpus are a separate issue.

reply
I believe it was a Super Bowl ad for Gemini last year that had a "hallucination" in the ad itself. One of the screenshots of Gemini being used showed this "hallucination", which made the rounds in the news as expected.

I want to say it was some fact about cheese that was indeed wrong. However, you could also see the source Gemini cited in the ad, and when you went to that source, it was some local farm's 1998-style HTML homepage, and that page contained the incorrect factoid about the cheese.

reply
> If the LLM is accurately reflecting the training corpus, it wouldn’t be considered a hallucination. The LLM is operating as designed.

That would mean that there is never any hallucination.

The point of the original comment was distinguishing between fact and fiction, which an LLM just cannot do. (It's an unsolved problem among humans, which spills over into the training data.)

reply
> That would mean that there is never any hallucination.

No, it wouldn't. If the LLM produces an output that does not match the training data, or claims things that are not in the training data due to pseudorandom statistical processes, then that's a hallucination. If it accurately represents the training data or the context content, it's not a hallucination.

Similarly, if you ask an LLM to tell you something false and the information it provides is false, that's not a hallucination.
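
To make the "pseudorandom statistical process" part concrete, here's a minimal sketch (the token list and probabilities are made up for illustration, not taken from any real model; it assumes numpy). Greedy decoding always returns the training-supported answer, while sampling occasionally emits a token the training data doesn't support, which is the kind of output being called a hallucination here.

    import numpy as np

    # Toy next-token distribution a model might assign after a prompt like
    # "The capital of France is" (made-up numbers, not from a real model).
    tokens = ["Paris", "Lyon", "London", "Rome"]
    probs = np.array([0.90, 0.05, 0.03, 0.02])

    rng = np.random.default_rng(seed=7)

    # Greedy decoding: always picks the highest-probability token,
    # i.e. the answer best supported by the training data.
    greedy = tokens[int(np.argmax(probs))]

    # Sampling from the distribution is the pseudorandom statistical process:
    # it sometimes picks a low-probability token the training data
    # does not support.
    samples = rng.choice(tokens, size=1000, p=probs)

    print("greedy:", greedy)
    print("sampled non-'Paris' rate:", np.mean(samples != "Paris"))
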

> The point of original comment was distinguishing between fact and fiction,

In the context of LLMs, "fact" means something represented in the training set, not factual in an absolute, philosophical sense.

If you put a lot of categorically false information into the training corpus and train an LLM on it, those pieces of information are “factual” in the context of the LLM output.

The key part of the parent comment:

> caused by the use of statistical process (the pseudo random number generator

reply
OK, if everyone else agrees with your semantics, then I agree.
reply
The LLM is always operating as designed. All LLM outputs are "hallucinations".
reply
The LLM is always operating as designed, but humans call its outputs "hallucinations" when they don't align with factual reality, regardless of why that happens or whether it should be considered a bug or a feature. (I don't like the term much, by the way, but at this point it's a de facto standard.)
reply
Not that the internet contained any misinformation or FUD when the training data was collected.

Also, statements made with certainty about fictitious "honey pot prompts" are a problem; plausibly extrapolating from the data should be governed more by internal confidence. Luckily there are benchmarks for that now, I believe.

reply