So I feel like that's exactly the right metric, and the right way to track it with respect to hallucinations.
https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
It rewards correct answers, penalizes hallucinations, and gives no reward for refusing to answer.
It's interesting just how poorly some popular Chinese models fare in this regard, like GLM 5.1 or DeepSeek 4 Pro.
Gemini 3.x has truly remarkable knowledge given how it leads in this benchmark despite being (quite a bit) more prone to hallucinate than Claude Opus.
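To make the scoring idea concrete, here is a minimal sketch of a metric with that shape. The +1 / -1 / 0 weights, the scaling to 100, and the hallucination-rate definition are my assumptions for illustration; the linked page defines the actual index. The counts for the two models are made up, not benchmark results.

```python
def omniscience_style_score(correct: int, incorrect: int, abstained: int) -> float:
    """Score in [-100, 100]: +1 per correct answer, -1 per wrong answer,
    0 per refusal. The weights are an assumption, not the official formula."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("need at least one question")
    return 100.0 * (correct - incorrect) / total


def hallucination_rate(incorrect: int, abstained: int) -> float:
    """Share of not-correct responses that were wrong answers rather than
    refusals. This definition is also an assumption."""
    missed = incorrect + abstained
    if missed == 0:
        return 0.0
    return incorrect / missed


# Hypothetical counts out of 100 questions each (illustrative only).
broad = {"correct": 60, "incorrect": 30, "abstained": 10}     # knows more, guesses more
cautious = {"correct": 40, "incorrect": 15, "abstained": 45}  # knows less, refuses more

for name, m in [("broad", broad), ("cautious", cautious)]:
    score = omniscience_style_score(**m)
    rate = hallucination_rate(m["incorrect"], m["abstained"])
    print(f"{name}: score={score:.1f}, hallucination rate={rate:.2f}")
```

With these made-up numbers the "broad" model scores higher even though it hallucinates on a larger share of the questions it gets wrong, which is the kind of situation described above: enough extra knowledge can outweigh a higher tendency to guess.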
Cool, precisely the thing other AIs are too stupid to do when they don't have the necessary knowledge.
But no, Google and OpenAI would rather always have an answer ready and tell you to mix glue into your pizza toppings :)
Hallucination detection is an open problem. If it were that simple, people would indeed "just" do it.
Basically the problem is that LLMs aren't trained on things they don't know; an alternative way of saying this is that they're not trained on things they're not trained on, which is obviously true.
When you RL a model and it answers incorrectly, you don't teach it to answer "I don't know", you teach it to answer correctly instead. This makes it very hard for it to realize when it doesn't know things.
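A toy way to see the incentive problem, as a sketch only: compare the expected reward of answering against the reward for abstaining. The function names and reward values here are hypothetical and do not describe any lab's actual training setup. The sketch also assumes the probability of being correct is known, which is exactly the part a real model does not have.

```python
def expected_answer_reward(p_correct: float, r_correct: float, r_wrong: float) -> float:
    """Expected reward for attempting an answer that is right with
    probability p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def best_action(p_correct: float, r_correct: float = 1.0,
                r_wrong: float = 0.0, r_abstain: float = 0.0) -> str:
    """Pick the action with the higher expected reward (ties go to abstaining)."""
    answer = expected_answer_reward(p_correct, r_correct, r_wrong)
    return "answer" if answer > r_abstain else "abstain"


# Correctness-only reward: a wrong answer costs nothing, so guessing wins
# even at 10% confidence.
print(best_action(0.1))                 # answer

# Penalize wrong answers: abstaining wins at low confidence,
# answering still wins at high confidence.
print(best_action(0.1, r_wrong=-1.0))   # abstain
print(best_action(0.7, r_wrong=-1.0))   # answer
```

If wrong answers are never penalized relative to "I don't know", there is no reward signal that favors abstaining. Adding a penalty changes the incentive, but the model still has to estimate its own chance of being right, and that estimate is the hard part.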
The glue on pizza reference brought back memories :)