I don't think current LLMs really need an abstain option; they'll give an answer regardless of whether they're confident or not. I hope that future LLMs will, and will know when to use it.

I understand why you prompted them to output exactly one label, but I'd bet that if you'd asked a parametric or parametric "thinking" model to answer, e.g., "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia." [1], many would say something to the effect of "May 18 is after my knowledge cutoff, so I don't know. But based on the state of the war, the distance from Moscow to Ukraine, and drone range, the best option might be... [TRUE]"

[1]: https://lenz.io/c/130f1005

by kriro 4 hours ago

I don't see it mentioned explicitly in the methods section, but I assume you prompted each model only once for each question? Did you consider prompting n times from a blank state to see if the models even agree with themselves?

Would also be interesting to add a virtual model that is simply the majority of all models and see how much the individual models differ from the "consensus".

Do you plan to add some sources to the related work section with baseline numbers for human expert disagreement in fact-checking tasks? (I'm assuming such studies exist.)
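
A minimal sketch of the "virtual majority model" idea, for illustration only. It assumes each model's output is a list of labels aligned by claim; the function names and the tie handling (tied claims are skipped) are assumptions, not the study's method. Note that each model's own vote is included in the majority here, which may differ from a "peer majority" that excludes it.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None if the top two labels tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def agreement_with_consensus(predictions):
    """Fraction of claims on which each model matches the all-model majority.

    predictions: {model_name: [label per claim]}, all lists the same length.
    Claims with no clear majority are left out of the score.
    """
    models = list(predictions)
    n_claims = len(predictions[models[0]])
    consensus = [
        majority_label([predictions[m][i] for m in models])
        for i in range(n_claims)
    ]
    scored = [i for i, label in enumerate(consensus) if label is not None]
    if not scored:
        return {m: None for m in models}
    return {
        m: sum(predictions[m][i] == consensus[i] for i in scored) / len(scored)
        for m in models
    }
```

With an odd number of models and two labels there are no ties; with more labels or an even number of models, the skipped-tie rule starts to matter and should be reported alongside the scores.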

by kostaj 3 hours ago

Indeed. I prompted each model once, plus one retry on errors. Very good point about measuring the inter-model disagreement! Will add in the next version.

Section "4.2 Agreement w/ peer majority" shows the level of agreement of each model with the majority.

Yes, I'm planning to have humans label the same corpus of 1,000 claims and to publish a second study measuring the models' performance against the human labels on a corpus that the models have not seen during training.

by airstrike 4 hours ago

Nice work. Sonar who?

by simonw 4 hours ago

It's one of Perplexity's search-tools-using models.

https://docs.perplexity.ai/docs/agent-api/models

by kostaj 4 hours ago

sonar-pro for the retrieval capabilities

by jiggawatts 4 hours ago

Many of the rows in that spreadsheet reference "current events", which models aren't expected to do much better at than a human making an educated guess! They all have cutoff dates either last year or early this year and know nothing about what happened in "April 2026".

This is doubly problematic because you evaluated earlier models like Gemini Pro 3 instead of 3.1, GPT 5.4 instead of 5.5, etc...

Given that it's only a thousand short questions, you should be able to re-run your test in about an hour with the latest models, so... why haven't you?

Similarly, LLM output is non-deterministic, so you could get more interesting stats from your data set by repeating each question 'n' times for each model.
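
A rough sketch of one such statistic, assuming you have collected the labels from n independent runs of one model on each claim. The function names are hypothetical, and "share of runs that match the modal label" is just one simple way to score self-agreement.

```python
from collections import Counter


def self_consistency(runs):
    """Share of repeated runs on one claim that match the most common label."""
    if not runs:
        raise ValueError("need at least one run")
    modal_count = Counter(runs).most_common(1)[0][1]
    return modal_count / len(runs)


def mean_self_consistency(runs_per_claim):
    """Average self-consistency over all claims for a single model."""
    if not runs_per_claim:
        raise ValueError("need at least one claim")
    return sum(self_consistency(r) for r in runs_per_claim) / len(runs_per_claim)
```

A score of 1.0 means the model gave the same label on every run; for two labels the floor is 0.5, so values near that indicate the model is close to flipping a coin on that claim.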

by kostaj 4 hours ago

Two of the models used have retrieval capabilities and have access to newer information through search. The other three are parametric.

by simonw 4 hours ago

Comparing models with search tools to models without - when there's no option for "I am unable to answer this question without access to search" - doesn't make sense to me.

by kostaj 2 hours ago

Agree about comparing models with and without search capabilities. Even the two models with search capabilities (Sonar Pro and Gemini) agree on only 58% of the claims.
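
For reference, a pairwise agreement figure like this can be computed as the share of claims on which two models give the same label. The sketch below assumes two label lists aligned by claim; the function name is hypothetical and this is not necessarily how the study computed its number.

```python
def pairwise_agreement(labels_a, labels_b):
    """Fraction of claims on which two models output the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be aligned by claim")
    if not labels_a:
        raise ValueError("need at least one claim")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

Raw agreement does not correct for agreement expected by chance, so with only two or three labels a chance-corrected statistic would be a useful companion number.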

by furyofantares 4 hours ago

Yes, so in that case you set them up to disagree and then measured disagreement.

by throw310822 4 hours ago

The title mentions "fact-checks", but "fact checking" is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game. So a more honest title for this research would be "Models answer differently to quiz questions".

by johnbarron 3 hours ago

Thanks for posting here. Keep expanding and improving your study. Correct where it deserves correction.

The fact that HN decided to downvote the author of the study shows how these people can't stay classy, and the mods staying silent... just shows what this is all about.