by wongarsu 4 hours ago

by daveguy 3 hours ago

Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That would also give an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "Unknown" answers and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.

by kostaj 3 hours ago

Tried initially with a fifth bucket, Abstain. It was actually heavily used by some of the models. But it felt as if they were using it to "avoid" some of the hard questions, so we dropped this bucket to force them to provide a verdict.

by john_strinlai 3 hours ago

>But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.

by moritzwarhier 2 hours ago

Exactly what people do when they use LLMs for "fact-checking" online, and any verbose explanation would be mostly ignored anyway, when people ask political, ethical, or simply ambiguous questions that they hold any stakes in.

Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y".

Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point.

There's no point in asking "Is Paris in France?" either, if you substitute city and country with real data. An encyclopedia or a manual check of different sources such as maps, while not infallible, is a better source.

If you already know the country Paris belongs to, there's no point in asking, anyway.

by marxplank 1 hour ago

ask the black box to search for the original source and verify it yourself?

by moritzwarhier 51 minutes ago

Sure, I like using LLMs in this way, and it often shows that it's very important to verify, because often a claim is "sourced" by what appears to be more of a fuzzy text or semantic match, sometimes even ignoring logical negations.

Especially in niche subjects.

For factual claims, I've fared better with Wikipedia and looking up the sources linked there.

Anyway, as AI text and media generation erodes the credibility of all online sources, these questions about source checking matter less and less: what if the source itself is a long and convincing-sounding text with poor sources?

This problem existed before already, but it boils down to a simple fact:

logic or maths alone cannot derive an authority that verifies claims about the real world other than weighting texts.

The question "what is the current population of Paris" can be answered by LLMs, but basically only by weighting sources, and assigning some credibility to them.

There's no real point in getting some weighted average of sources on this question, but so far, it doesn't hurt either.

by kostaj 3 hours ago

@john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/response anyway.

Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. Will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.

by simonw 3 hours ago

If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response.

Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.

by fumeux_fume 1 hour ago

This is a good pattern because it would allow all the models to "think" a bit before giving an answer even if they don't have reasoning or thinking turned on. Just make sure you have the reasoning output before the final answer. A mistake I see all the time is having the answer output first and the explanation after, which leaves more room for models to rationalize bad answers.

Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}

Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
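
A minimal sketch of how the "good pattern" could be enforced when reading replies back, assuming the model returns a single JSON object as raw text. The function name `parse_verdict` and the `ALLOWED_ANSWERS` set are illustrative, not something from the original study.

```python
import json

# Allowed final answers, taken from the "good pattern" above.
ALLOWED_ANSWERS = {"true", "false", "i don't know"}


def parse_verdict(raw: str) -> tuple[str, str]:
    """Parse a reasoning-first JSON reply and return (explanation, answer)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # json.loads keeps keys in the order they appear in the text, so this
    # check rejects replies that state the answer before the explanation.
    keys = list(data)
    if keys != ["explanation", "answer"]:
        raise ValueError(f"expected explanation then answer, got {keys}")
    answer = str(data["answer"]).strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    return str(data["explanation"]), answer
```

Rejecting answer-first replies outright is a strict choice; a softer variant could log them and re-prompt instead.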

by kostaj 2 hours ago

Good point. Processing the substance of the answers might be too labor-intensive (1,000 claims x 5 models), but "thinking out loud" might indeed improve the quality of the answers. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.

by airstrike 1 hour ago

If you have the model use a tool, you can define the schema as a free-text rationale field followed by one of the set of possible answers, so everything is nicely formatted as JSON.
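
A rough sketch of what such a schema could look like, written as a plain JSON Schema dict in Python. The names `VERDICT_TOOL_SCHEMA` and `check_arguments` are hypothetical, and the verdict labels are the ones listed earlier in the thread. Whether a given provider actually generates fields in the declared order is an assumption to verify for each model.

```python
# Verdict labels as listed earlier in the thread.
VERDICTS = ["true", "false", "misleading", "mostly-true", "abstain"]

# JSON Schema for the tool's arguments; "rationale" is declared first so the
# free-text reasoning comes before the verdict.
VERDICT_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "rationale": {"type": "string"},
        "verdict": {"type": "string", "enum": VERDICTS},
    },
    "required": ["rationale", "verdict"],
    "additionalProperties": False,
}


def check_arguments(args: dict) -> list[str]:
    """Hand-rolled check for this one schema (not a general validator).

    Returns a list of problems; an empty list means the call is usable.
    """
    problems = []
    for field in VERDICT_TOOL_SCHEMA["required"]:
        if not isinstance(args.get(field), str):
            problems.append(f"{field} must be a string")
    unknown = sorted(set(args) - set(VERDICT_TOOL_SCHEMA["properties"]))
    if unknown:
        problems.append(f"unexpected fields: {unknown}")
    verdict = args.get("verdict")
    if isinstance(verdict, str) and verdict not in VERDICTS:
        problems.append(f"verdict must be one of {VERDICTS}")
    return problems
```

Checking the returned arguments on your side is still worthwhile, since not every model honors the enum strictly.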

by kostaj 1 hour ago

Some models struggle to combine a JSON schema with web search capabilities.

by oofbey 1 hour ago

In many cases “I don’t know” is the correct answer. For questions about events that happened after the training cutoff, if the model doesn’t have web search, that is undeniably the correct answer. You’re forcing it to guess unnaturally. That really feels like you’re trying to prove a point (that your service can’t be replaced by AI) instead of actually performing research into how AI can be helpfully applied to this topic.

by RobotToaster 2 hours ago

I'm sorry, but many of the statements that you fed it are verifiably unknown, and you didn't give it an "unknown" option? This is the academic equivalent of clickbait.

by gcr 3 hours ago

Shouldn't that be part of the test?

Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.

Teasing out the difference between "avoid" and "unknown" could be a different research question.

by onceonceonce 2 hours ago

Teams I work with use the abstain rate to flag what goes to a human. Disagreement between models is the same idea. Your 67% is what makes "two cheap models, escalate when they fight" actually work. Without abstain it mostly looks like noise.
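
A small sketch of that escalation rule, assuming each claim has already been labeled by two models and that "abstain" is one of the possible labels. `needs_human` and `escalation_rate` are hypothetical helper names for illustration, not part of any team's actual pipeline.

```python
ABSTAIN = "abstain"  # assumed label for "I don't know" style answers


def needs_human(verdict_a: str, verdict_b: str) -> bool:
    """Flag a claim for human review if either model abstains or they disagree."""
    if ABSTAIN in (verdict_a, verdict_b):
        return True
    return verdict_a != verdict_b


def escalation_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of claims that would be sent to a human reviewer."""
    if not pairs:
        return 0.0
    flagged = sum(needs_human(a, b) for a, b in pairs)
    return flagged / len(pairs)
```

Treating two matching abstentions as an escalation, rather than as agreement, is a deliberate choice here: both models declining is itself a signal that the claim is hard.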

by fumeux_fume 2 hours ago

Do you understand how problematic this is?

by skybrian 2 hours ago

I wouldn’t expect opinions to go into “unknown.” Maybe have an “it’s complicated” bucket.
