https://artificialanalysis.ai/evaluations/omniscience?models...
(Had to add it to the chart; it wasn't displayed by default. Is it the lowest rate in the dataset or not?)
So I feel like that's exactly the right metric and the way to track it wrt hallucinations.
https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
It rewards correct answers, penalizes hallucinations, and gives no reward for refusing to answer.
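To make that concrete, here's a rough sketch of how a score like this plays out. The +1 / -1 / 0 weights, the scaling to [-100, 100], the function name, and the two made-up models are all my assumptions for illustration, not AA's published formula or data.

```python
def omniscience_style_score(outcomes):
    """Score a list of per-question outcomes.

    Assumed weights (hypothetical, for illustration only):
    +1 for 'correct', -1 for 'hallucinated', 0 for 'refused'.
    The result is scaled to the range [-100, 100].
    """
    weights = {"correct": 1, "hallucinated": -1, "refused": 0}
    if not outcomes:
        raise ValueError("need at least one outcome")
    total = sum(weights[outcome] for outcome in outcomes)
    return 100 * total / len(outcomes)


# Two made-up models answering 100 questions each.
# The first always answers; the second knows less but refuses when unsure.
always_answers = ["correct"] * 60 + ["hallucinated"] * 40
abstains = ["correct"] * 50 + ["hallucinated"] * 5 + ["refused"] * 45

score_always = omniscience_style_score(always_answers)
score_abstains = omniscience_style_score(abstains)
print(score_always, score_abstains)
```

Under these assumed weights, the model that refuses when unsure scores higher even though it gets fewer questions right, which is the behavior the metric is meant to reward.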
It's interesting just how poorly some popular Chinese models fare in this regard, like GLM 5.1 or DeepSeek 4 Pro.
Gemini 3.x has truly remarkable knowledge given how it leads in this benchmark despite being (quite a bit) more prone to hallucinate than Claude Opus.
But no, Google and OpenAI would rather always have an answer ready and tell you to mix glue into your pizza toppings :)
The glue on pizza reference brought back memories :)
Running Step 3.5 Flash locally, for example: it's an amazingly capable model all things considered, but its token efficiency is so bad that it gets outperformed by most others in wall-clock time (even with my MTP support for it hacked into llama.cpp: despite being trained on three heads, MTP 2 is the sweet spot, and it only gets it from 20tk/s to 30tk/s on my Spark).
The DeepSeek models and Qwen 3.5 Plus are also good examples of this: compared to Opus, and especially GPT 5.5, they use many more tokens to get to the same answers.
I'm really hoping that Qwen 3.7 is better in this regard; can't wait to try it out.
(ps. running DeepSeek v4 Flash on my Spark is absolutely wild, thanks antirez if you see this haha)
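On the token efficiency vs. wall-clock point, the arithmetic is simple enough to sketch. The 20 and 30 tk/s figures are the ones from my Spark above; the token counts per answer are made-up numbers just to show the effect, and the function name is mine.

```python
def wall_clock_seconds(output_tokens, tokens_per_second):
    """Time to generate an answer, ignoring prompt processing."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical: a verbose model needs 9000 tokens for an answer,
# a terse one needs 3000 tokens for the same answer.
verbose_with_mtp = wall_clock_seconds(9000, 30)  # faster decoding, more tokens
terse_no_mtp = wall_clock_seconds(3000, 20)      # slower decoding, fewer tokens

# Speedup from MTP alone, using the 20 -> 30 tk/s figures above.
mtp_speedup = 30 / 20

print(verbose_with_mtp, terse_no_mtp, mtp_speedup)
```

So in this made-up case a 1.5x decoding speedup doesn't come close to making up for a 3x token overhead: the verbose model still takes twice as long.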
Nvidia models are even worse than Qwen! https://sql-benchmark.nicklothian.com/#token-efficiency-and-... (mouse over the cells for token counts and click for traces)
Gemma 4 is good for this, as AA notes:
> Gemma 4 31B is notably token efficient, using 39M output tokens to run the Intelligence Index vs 98M for Qwen3.5 27B (Reasoning). This is ~2.5x fewer output tokens for a model scoring 3 points lower. For context, the other models at the 42-point intelligence level also use significantly more tokens: MiniMax-M2.5 (56M), DeepSeek V3.2 (Reasoning, 61M), and GLM-4.7 (Reasoning, 167M)
https://artificialanalysis.ai/articles/gemma-4-everything-yo...
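For a quick sanity check on the numbers in that quote, here's the same comparison as a small script. The token counts (in millions) are copied from the quote; the dictionary layout and variable names are mine.

```python
# Output tokens (millions) to run the Intelligence Index, per the AA quote.
output_tokens_m = {
    "Gemma 4 31B": 39,
    "Qwen3.5 27B (Reasoning)": 98,
    "MiniMax-M2.5": 56,
    "DeepSeek V3.2 (Reasoning)": 61,
    "GLM-4.7 (Reasoning)": 167,
}

baseline = output_tokens_m["Gemma 4 31B"]

# How many times more output tokens each model uses than Gemma 4 31B.
relative_usage = {
    name: round(tokens / baseline, 2)
    for name, tokens in output_tokens_m.items()
}

for name, ratio in sorted(relative_usage.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio}x")
```

That reproduces the "~2.5x" figure for Qwen3.5 27B and shows GLM-4.7 using over 4x the output tokens of Gemma 4 31B.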
Note that a perfect "non-hallucination rate" is rather meaningless as such tests can contain human hallucinations.
It means the model aligns with the possibly-true, possibly-false beliefs of the group that made the test.
Or would you describe your methodology as more like picking a random sentence fragment as an input value then generating completions from your existing corpus without any post-input "learning" process related to the rest of the source material?