it said "the lower, the better." Eventually, I realized that the "non" reverses the scores. And indeed, the results are consistent.
I'd love to see a good hallucination benchmark, but this isn't one. There's no possibility that a 1B model hallucinates less than Fable 5.
Specifically, your model now has two "correct" classes, p(class=y|x) and p(class=⊥|x). This makes the results ambiguous. The way you resolve this is by adding in a cost of answering wrong and a cost of not answering.
L(y, y') =
  0      if y' = y
  l_err  if y' ≠ y and y' ≠ ⊥
  l_⊥    if y' = ⊥
You can then estimate the expected loss over your dataset. Notice that this now gives you additional degrees of freedom: depending on how expensive answering wrong is compared to not answering at all, your predictor might look really bad or really good.
This means when benchmarking with a "no answer" action, you are often not actually benchmarking whether the model works well or not, but rather are benchmarking how well the model _happens_ to agree with the class-error weight you (implicitly) chose in your model.
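To make the dependence on the cost weights concrete, here is a minimal sketch in Python. The two predictors (`bold`, `cautious`), their answers, and the cost values are all made up for illustration; `None` stands in for the "no answer" class ⊥.

```python
ABSTAIN = None  # stands in for the "no answer" class ⊥

def loss(y_true, y_pred, l_err, l_abstain):
    """Per-example loss: 0 if correct, l_err if wrong, l_abstain if no answer."""
    if y_pred is ABSTAIN:
        return l_abstain
    return 0.0 if y_pred == y_true else l_err

def expected_loss(truth, preds, l_err, l_abstain):
    """Average loss over a dataset."""
    total = sum(loss(t, p, l_err, l_abstain) for t, p in zip(truth, preds))
    return total / len(truth)

# Made-up data: 10 questions whose true label is always 1.
truth = [1] * 10
bold = [1] * 7 + [0] * 3            # answers everything: 7 right, 3 wrong
cautious = [1] * 5 + [ABSTAIN] * 5  # 5 right, 5 abstentions, never wrong

# Two hypothetical cost settings; only the abstention cost differs.
cheap_abstain = dict(l_err=1.0, l_abstain=0.1)
costly_abstain = dict(l_err=1.0, l_abstain=0.9)

for name, costs in [("cheap", cheap_abstain), ("costly", costly_abstain)]:
    print(name,
          "bold:", round(expected_loss(truth, bold, **costs), 3),
          "cautious:", round(expected_loss(truth, cautious, **costs), 3))
```

With a cheap abstention the cautious predictor has the lower expected loss; with an expensive one the bold predictor does. Same predictions, opposite ranking, purely from the choice of weights.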
So we have a situation where models that can solve challenging problems also tend to have problems with hallucinating, but those hallucinations seem to be the breeding ground for the solutions that got them their high "wow" factor intelligence.
Fable model being removed from Anthropic because of security concerns by the US government (or well, also partially because of the personal vendetta between US govt and Anthropic)
An LLM outputs tokens, one-by-one. It stops the loop if it outputs the end-of-text token. Which is, of course, statistically much rarer than any other kind of token.
(This is why you cannot, in general, prompt an LLM with something like "don't answer if the result is incorrect". It has to output something, by design.)
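The loop described above looks roughly like this. This is a toy sketch, not how any particular model or library does it: `toy_next_token_probs` is a hypothetical stand-in for a real model, with a fixed made-up distribution in which the end-of-text token is rare.

```python
import random

EOS = "<eos>"  # hypothetical end-of-text token

def toy_next_token_probs(context):
    """Stand-in for a real model: ignores the context and returns a fixed
    made-up distribution in which EOS is much rarer than other tokens."""
    return {"the": 0.4, "cat": 0.3, "sat": 0.25, EOS: 0.05}

def generate(prompt_tokens, max_tokens=50, seed=0):
    """Sample one token at a time until EOS is drawn or the budget runs out."""
    rng = random.Random(seed)
    output = []
    for _ in range(max_tokens):
        probs = toy_next_token_probs(prompt_tokens + output)
        token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == EOS:
            break  # the only way to "stop talking" is to emit this token
        output.append(token)
    return output

print(generate(["why", "is", "the", "sky", "blue", "?"]))
```

Note that "not answering" is not a separate action here. The closest thing is drawing EOS on the very first step, which is just another token the model has to assign probability to.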
This leads to answer bloat and/or hallucination if you benchmaxx on those.
Let's say there are 100 questions, with 4 answers each. A good answer is worth 1 point. By just guessing you get an average of 25/100, way more than 0/100 by not replying.
If instead a wrong answer is -1 point, by just guessing you get on average -50/100 (25 right minus 75 wrong), way worse than 0/100.
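A quick way to check the expected score of pure guessing, as a sketch: it assumes uniform random guessing, so each guess is right with probability 1/4 on a 4-choice question. The function name and parameters are made up for this example.

```python
from fractions import Fraction

def expected_guess_score(n_questions, n_choices, right=1, wrong=0):
    """Expected total score when guessing uniformly at random on every question."""
    p_right = Fraction(1, n_choices)
    return n_questions * (p_right * right + (1 - p_right) * wrong)

no_penalty = expected_guess_score(100, 4)               # wrong answers cost nothing
with_penalty = expected_guess_score(100, 4, wrong=-1)   # wrong answers cost a point
break_even = expected_guess_score(100, 4, wrong=Fraction(-1, 3))

print(no_penalty, with_penalty, break_even)  # 25 -50 0
```

Not replying scores 0 in all three cases, so guessing wins without a penalty, loses with a -1 penalty, and is a wash at a penalty of -1/3, which is the break-even point -1/(n-1) for n choices.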