It's even weirder to suggest that the disagreement is indicative of a problem. If you asked five very knowledgeable humans on this subject to select the correct answer on a multiple-choice questionnaire, their answers would almost certainly vary significantly more than those of these five LLMs.
Not to say that hallucination isn't a problem, but this is a lousy way to test it.
These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking" tokens as a concept are mostly there to convince people to use models that consume more tokens and produce more revenue. The output from reasoning models might be more accurate, but it's just a consequence of a longer inference runtime. There is no "reasoning" happening; reasoning is just sales/UX bullsh*t.
The prompt allowed for exactly four valid outputs and explicitly disallowed explanations and qualifiers.
> Output exactly one label: True, Mostly True, Misleading, or False.
> No explanations, no qualifiers.
How is that a nuanced response?
> These types of experiments prove to me that there is no real "reasoning" happening and "reasoning/thinking"
My suggestion is that five presumably reasoning and thinking humans would also have variation in their responses to the exact same prompt.
But "unknown or undecidable" should have been a category.
The space station, the Artemis capsule, microbes on interplanetary probes, etc.
It could technically be said in a sentence and be true, but it would be misleading to most people.
My implicit assumption is that if you fact-check the fact-check, any label other than "true" means the original fact-check is unacceptable.
I think you could come up with a reasonable argument for any of the responses, hence the problem with the methodology.
"Misleading" should be removed as a category and replaced with a better hedge like "not sure".
I mean, look at the other responses here from the HN commenters. There's lots of nuance in there.
Then again maybe that’s why I’m an atheist, not an agnostic?
Both statements would have to be interpreted as "false" under your criteria, as neither has any evidence to substantiate it. That leads us to a logical contradiction in which a proposition and its inverse are both regarded as false.
If the statement is being interpreted as "it has been proven that extraterrestrial life exists somewhere in the universe", then it's acceptable to say this statement is false, but making evaluations that depend on an implicit qualifier isn't usually a good approach.
A proposition and its logical inverse can both be unknown, and in fact, a proposition being unknown implies that its logical inverse must also be unknown.
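To make that point concrete, here is a minimal sketch of a three-valued logic in Python. The names `Truth`, `negate`, and `is_consistent` are hypothetical, made up for this illustration. It assumes the "no evidence means false" rule is applied separately to a statement and to its inverse, and shows why that pair of labels is contradictory while a pair of "unknown" labels is not.

```python
from enum import Enum


class Truth(Enum):
    """Three truth values: the usual two plus an explicit 'unknown'."""
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def negate(value: Truth) -> Truth:
    """Logical inverse: swaps TRUE and FALSE, leaves UNKNOWN as UNKNOWN."""
    if value is Truth.TRUE:
        return Truth.FALSE
    if value is Truth.FALSE:
        return Truth.TRUE
    return Truth.UNKNOWN


def is_consistent(p: Truth, not_p: Truth) -> bool:
    """A labeling is consistent if the inverse's label equals negate(p)."""
    return negate(p) is not_p


# Hypothetical labels for "extraterrestrial life exists" and its inverse.
# Rule "no evidence means false" labels both FALSE, which is contradictory.
print(is_consistent(Truth.FALSE, Truth.FALSE))      # False
# Labeling both as UNKNOWN is consistent.
print(is_consistent(Truth.UNKNOWN, Truth.UNKNOWN))  # True
```

With only True and False available, one of the two statements has to be labeled True, which is exactly what the lack of evidence does not support. An explicit "unknown" label avoids that.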