Do you not see how that creates extremely misleading and valueless results? You are coercing the results into what you want to see.
You don't even need politics for it: there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y".
Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point.
There's no point in asking "Is Paris in France" either, if you substitute the city and country with real data. An encyclopedia, or a manual check of different sources such as maps, is a better source, even if not an infallible one.
If you already know the country Paris belongs to, there's no point in asking, anyway.
Especially in niche subjects.
For factual claims, I've fared better with Wikipedia and looking up the sources linked there.
Anyway, as AI text and media generation erodes the credibility of all online sources, these questions about source checking matter less and less: what if the source itself is a long and convincing-sounding text with poor sources?
This problem already existed, but it boils down to a simple fact:
logic or maths alone cannot derive an authority that verifies claims about the real world; all they can do is weight texts.
The question "what is the current population of Paris" can be answered by LLMs, but basically only by weighting sources, and assigning some credibility to them.
There's no real point in getting some weighted average of sources on this question, but so far, it doesn't hurt either.
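To make "weighting sources" concrete, here is a minimal sketch of what such an answer amounts to. It is not how any particular LLM works internally; the function name `weighted_estimate`, the three reported values, and the credibility weights are all made up for illustration.

```python
def weighted_estimate(claims):
    """Credibility-weighted average of numeric claims.

    claims: list of (value, credibility) pairs, where credibility is a
    non-negative weight. Both are hypothetical inputs for this sketch.
    """
    total_weight = sum(weight for _, weight in claims)
    if total_weight == 0:
        raise ValueError("no source has any credibility")
    return sum(value * weight for value, weight in claims) / total_weight


# Three made-up sources disagreeing about the same quantity.
claims = [(100.0, 5), (110.0, 3), (400.0, 2)]
estimate = weighted_estimate(claims)
print(estimate)  # 163.0
```

Note that the result, 163.0, is a figure none of the three sources actually reported, which is the "blurry summary" problem in miniature.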
Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. I plan to human-label the 1,000 claims and publish follow-up research. I will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.
Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.
Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}
Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
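A rough sketch of how the good pattern could be requested and checked in Python. The prompt wording, `PROMPT_TEMPLATE`, `ALLOWED_ANSWERS`, and `parse_reply` are my own assumptions rather than anything from the original test, and the call to the model itself is left out.

```python
import json

# Answers accepted by this sketch (assumed, matching the pattern above).
ALLOWED_ANSWERS = {"true", "false", "i don't know"}

# Hypothetical prompt: the explanation key comes first so the model
# writes its rationale before committing to an answer.
PROMPT_TEMPLATE = (
    "Claim: {claim}\n"
    "Reply with JSON only, explanation first:\n"
    '{{"explanation": "<short explanation>", "answer": "<true|false|i don\'t know>"}}'
)


def parse_reply(raw):
    """Parse a model reply and return (explanation, answer).

    Raises ValueError if the keys are missing, out of order, or the
    answer is not one of ALLOWED_ANSWERS.
    """
    data = json.loads(raw)
    # json.loads keeps key order, so this detects answer-before-explanation.
    if not isinstance(data, dict) or list(data) != ["explanation", "answer"]:
        raise ValueError("expected keys 'explanation' then 'answer'")
    answer = str(data["answer"]).strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    return data["explanation"], answer


prompt = PROMPT_TEMPLATE.format(claim="Paris is in France")
explanation, answer = parse_reply('{"explanation": "Stored for later review.", "answer": "True"}')
```

Keeping the explanation next to the answer is what makes it possible to go back later and see why two models disagreed on an ambiguous claim.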
Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.
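One way an "I don't know" bucket could show up in scoring, as a sketch only: report accuracy on the claims the model actually answered, and the abstention rate separately, so that abstaining is not lumped in with being wrong. The `score` function and its field names are hypothetical, not part of the original test.

```python
ABSTAIN = "i don't know"


def score(predictions, labels):
    """Score predictions against labels, treating ABSTAIN as its own outcome."""
    answered = [(p, l) for p, l in zip(predictions, labels) if p != ABSTAIN]
    correct = sum(p == l for p, l in answered)
    abstained = len(predictions) - len(answered)
    return {
        # Accuracy over answered claims only; None if nothing was answered.
        "accuracy_when_answered": correct / len(answered) if answered else None,
        # Share of claims where the model declined to answer.
        "abstention_rate": abstained / len(predictions) if predictions else None,
        # Confident but wrong answers, the case that matters most for misinformation.
        "wrong": len(answered) - correct,
    }


result = score(
    ["true", "false", "i don't know", "true"],
    ["true", "true", "false", "true"],
)
```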
Teasing out the difference between "avoid" and "unknown" could be a different research question.