This counts only incorrect answers though. A model can get 0% hallucination rate just by refusing to answer all questions.
reply
Isn't that precisely the reason why we introduced the term hallucination? Because LLMs have historically always made up bullshit if they cannot answer directly... If they've now nailed this so that the model declines to respond instead of responding incorrectly, then a lot of previously unusable use cases would become feasible.

So I feel like that's exactly the right metric and the way to track it wrt hallucinations.

reply
I had a buddy in high school who was notorious for doing the same thing. (He's now a senior director at a Big 4 consultancy. :) )
reply
I think that's what the Omniscience Index is for:

https://artificialanalysis.ai/evaluations/omniscience#aa-omn...

It rewards correct answers, penalizes hallucinations, and gives no reward for refusing to answer.
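
To make the scoring idea concrete, here is a rough sketch. The weights (+1 correct, -1 incorrect, 0 for declining) and the scaling to -100..100 are my assumptions for illustration, and `omniscience_style_index` is a made-up name, not the benchmark's actual code.

```python
def omniscience_style_index(outcomes):
    """Score a list of outcomes, each 'correct', 'incorrect', or 'abstain'.

    Assumed weights (not taken from the benchmark's source):
    +1 for a correct answer, -1 for a wrong one, 0 for declining.
    The result is scaled to the range -100..100.
    """
    weights = {"correct": 1, "incorrect": -1, "abstain": 0}
    if not outcomes:
        raise ValueError("need at least one outcome")
    return 100 * sum(weights[o] for o in outcomes) / len(outcomes)


# Three hypothetical models answering ten questions each.
refuser = ["abstain"] * 10                       # never answers
guesser = ["correct"] * 6 + ["incorrect"] * 4    # always answers
careful = ["correct"] * 5 + ["abstain"] * 5      # answers only when sure

for name, outcomes in [("refuser", refuser), ("guesser", guesser), ("careful", careful)]:
    print(name, omniscience_style_index(outcomes))
```

Under these assumed weights the always-refusing model lands at 0 rather than at the top, and the careful model beats the guesser even though it gets fewer answers right.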

It's interesting just how poorly some popular Chinese models fare in this regard, like GLM 5.1 or DeepSeek 4 Pro.

Gemini 3.x has truly remarkable knowledge given how it leads in this benchmark despite being (quite a bit) more prone to hallucinate than Claude Opus.

reply
> by refusing to answer all questions.

Cool, precisely the thing other AIs are too stupid to do when they don't have the necessary knowledge.

reply
Yes. A model that can answer "I don't know" would be much more trustworthy than the used car salesman we have now.
reply
It's very annoying, because this has been within the capability of models since the very beginning. A model could check how probable its output tokens are and, if those probabilities fall below a certain threshold, either say "I don't know", or output the most probable (well, more like least improbable) tokens but give a very clear, very strong warning that it is a shot in the dark and likely to contain hallucinations.
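
Something like this is what I have in mind, as a minimal sketch. It assumes you already have per-token log-probabilities for the generated answer; `answer_or_abstain` and the 0.5 threshold are made up for illustration. It only measures how confident the model was in its wording, which is not necessarily the same as being factually right.

```python
import math


def answer_or_abstain(tokens, logprobs, threshold=0.5):
    """Return (text, confidence); the text becomes "I don't know" when confidence is low.

    tokens    : generated token strings
    logprobs  : natural-log probability the model assigned to each token
    threshold : hypothetical cutoff on the geometric-mean token probability
    """
    if not tokens or len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must be non-empty and the same length")
    # exp(mean log-probability) is the geometric mean of the token probabilities.
    confidence = math.exp(sum(logprobs) / len(logprobs))
    if confidence < threshold:
        return "I don't know", confidence
    return "".join(tokens), confidence


# Made-up numbers: one confident answer, one shot in the dark.
print(answer_or_abstain(["Par", "is"], [math.log(0.9), math.log(0.8)]))
print(answer_or_abstain(["Ly", "on"], [math.log(0.2), math.log(0.1)]))
```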

But no, Google and OpenAI would rather always have an answer ready and tell you to mix glue into your pizza toppings :)

reply
It can't, because top-n token probabilities aren't always a reliable signal.

Hallucination detection is an open problem. If it were that simple, people would indeed "just" do it.

Basically the problem is that LLMs aren't trained on things they don't know; an alternative way of saying this is that they're not trained on things they're not trained on, which is obviously true.

When you RL a model and it answers incorrectly, you don't teach it to answer "I don't know", you teach it to answer correctly instead. This makes it very hard for it to realize when it doesn't know things.

reply
Yeah, I never understood why the top n statistics weren't included in the chat interfaces, to color the text!
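
A toy version for a terminal, assuming the interface has a probability for each token. The `colorize` name and the 0.8 / 0.4 cutoffs are arbitrary choices of mine; the colors are standard ANSI escape codes.

```python
GREEN, YELLOW, RED, RESET = "\033[32m", "\033[33m", "\033[31m", "\033[0m"


def colorize(tokens, probs, high=0.8, low=0.4):
    """Wrap each token in an ANSI color based on its probability.

    high / low are arbitrary illustrative cutoffs:
    green at or above `high`, yellow at or above `low`, red otherwise.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must be the same length")

    def color(p):
        if p >= high:
            return GREEN
        if p >= low:
            return YELLOW
        return RED

    return "".join(f"{color(p)}{t}{RESET}" for t, p in zip(tokens, probs))


# Made-up probabilities for a short completion.
print(colorize(["The ", "capital ", "is ", "Lyon"], [0.95, 0.9, 0.85, 0.15]))
```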
reply
I don't have much to add other than the observation that we seem to have moved away from eating one small rock per day for nutritional value, and adding gasoline to spaghetti.

The glue on pizza reference brought back memories :)

reply