by simonw4 hours ago |

by harpastum4 hours ago|

Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.

Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?

This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.

by wongarsu4 hours ago|

Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.

I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified

by daveguy3 hours ago|

Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That also includes an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "unknown" and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.

by kostaj3 hours ago|

Tried initially with a fifth bucket, Abstain. It was actually heavily used by some of the models. But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

by john_strinlai3 hours ago|

>But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.

by moritzwarhier2 hours ago|

Exactly what people do when they use LLMs for "fact-checking" online, and any verbose explanation would be mostly ignored anyway, when people ask political, ethical, or simply ambiguous questions that they hold any stakes in.

Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y".

Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point.

There's also no point in asking "Is Paris in France" either, if you substitute city and country with real data. An encyclopedia or manual check of different sources such as maps, while not infallible, is a better source.

If you already know the country Paris belongs to, there's no point in asking, anyway.

by marxplank1 hours ago|

ask the black box to search for the original source and verify it yourself?

by moritzwarhier51 minutes ago|

Sure, I like using LLMs in this way, and it often shows that it's very important to verify, because often a claim is "sourced" by what appears to be more of a fuzzy text or semantic match, sometimes even ignoring logical negations.

Especially in niche subjects.

For factual claims, I've fared better with Wikipedia and looking up the sources linked there.

Anyway, as AI text and media generation erodes the credibility of all online sources, these questions about source checking matter less and less: what if the source itself is a long and convincing-sounding text with poor sources?

This problem existed before already, but it boils down to a simple fact:

logic or maths alone cannot derive an authority that verifies claims about the real world other than weighting texts.

The question "what is the current population of Paris" can be answered by LLMs, but basically only by weighting sources, and assigning some credibility to them.

There's no real point in getting some weighted average of sources on this question, but so far, it doesn't hurt either.

by kostaj3 hours ago|

@john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/response anyway.

Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. We will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.

by simonw3 hours ago|

If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response.

Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.

by fumeux_fume1 hours ago|

This is a good pattern because it would allow all the models to "think" a bit before giving an answer even if they don't have reasoning or thinking turned on. Just make sure you have the reasoning output before the final answer. A mistake I see all the time is having the answer output first and the explanation after, which leaves more room for models to rationalize bad answers.

Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}

Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
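A minimal sketch of how the "good pattern" above could be enforced when reading a model's reply, assuming the reply is a single JSON object. The function name `parse_verdict` and the allowed answer set are illustrative, taken from the pattern in this comment rather than from the study's actual prompt.

```python
import json

# Answers allowed by the "good pattern" in the comment above (illustrative).
ALLOWED_ANSWERS = {"true", "false", "i don't know"}


def parse_verdict(raw: str) -> tuple[str, str]:
    """Parse a reply shaped like {"explanation": ..., "answer": ...}.

    Rejects replies where the answer key comes before the explanation,
    since that ordering means the verdict was generated first.
    """
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    # json.loads keeps keys in the order they appear in the text.
    keys = list(obj)
    if keys != ["explanation", "answer"]:
        raise ValueError(f"expected explanation then answer, got {keys}")
    answer = str(obj["answer"]).strip().lower()
    if answer not in ALLOWED_ANSWERS:
        raise ValueError(f"unexpected answer: {answer!r}")
    return str(obj["explanation"]), answer
```

Checking key order on the parsed object is only a proxy for generation order, but it catches the common case where a prompt or schema puts the answer first.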

by kostaj2 hours ago|

Good point. Processing the substance of the answer might be too labor-consuming (1,000 claims x 5 models), but "thinking out loud" might improve the quality of the answers indeed. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.

by airstrike1 hours ago|

If you have the model use a tool you can define the schema as a free text rationale field followed by one in the set of possible answers, so everything is nicely formatted as a JSON.
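A rough sketch of what such a tool schema might look like, written as a plain JSON Schema dictionary with a hand-rolled check. The field names `rationale` and `verdict` are hypothetical, the verdict list follows the true/false/misleading/mostly-true/abstain set mentioned upthread, and whether a given provider emits properties in the declared order is an assumption to verify per API.

```python
# Verdict labels from the thread, plus the optional abstain bucket.
VERDICTS = ["True", "Mostly True", "Misleading", "False", "Abstain"]

# Hypothetical tool parameters: free-text rationale first, verdict enum second.
VERDICT_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "rationale": {
            "type": "string",
            "description": "Short reasoning written before choosing a verdict.",
        },
        "verdict": {"type": "string", "enum": VERDICTS},
    },
    "required": ["rationale", "verdict"],
    "additionalProperties": False,
}


def check_arguments(args: dict) -> str:
    """Validate tool-call arguments against the schema above; return the verdict."""
    props = VERDICT_TOOL_SCHEMA["properties"]
    missing = [k for k in VERDICT_TOOL_SCHEMA["required"] if k not in args]
    extra = [k for k in args if k not in props]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra}")
    if not isinstance(args["rationale"], str):
        raise ValueError("rationale must be a string")
    if args["verdict"] not in props["verdict"]["enum"]:
        raise ValueError(f"verdict not in {VERDICTS}")
    return args["verdict"]
```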

by kostaj1 hours ago|

Some models struggle combining JSON schema and web search capabilities.

by oofbey1 hours ago|

In many cases “I don’t know” is the correct answer - for questions about events that happened after the training cut off, if it doesn’t have web search, that is undeniably the correct answer. You’re forcing it to guess unnaturally. That really feels like you’re trying to prove a point (that your service can’t be replaced by AI) instead of actually performing research into how AI can be helpfully applied to this topic.

by RobotToaster2 hours ago|

I'm sorry, but many of the statements that you fed it are verifiably unknown, and you didn't give it an "unknown" option? This is the academic equivalent of clickbait.

by gcr3 hours ago|

Shouldn't that be part of the test?

Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.

Teasing out the difference between "avoid" and "unknown" could be a different research question

by onceonceonce2 hours ago|

Teams I work with use the abstain rate to flag what goes to a human. Disagreement between models is the same idea. Your 67% is what makes "two cheap models, escalate when they fight" actually work. Without abstain it mostly looks like noise.
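The "two cheap models, escalate when they fight" idea could be sketched roughly as follows. The function `triage`, the `Abstain` label, and the callables standing in for models are all hypothetical; nothing here reflects a specific team's pipeline.

```python
from typing import Callable, Optional

ABSTAIN = "Abstain"  # assumed label for "I don't know"


def triage(
    claim: str,
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
) -> Optional[str]:
    """Return the shared verdict, or None when a human should review.

    A claim is escalated if either model abstains or the two disagree.
    """
    verdict_a = model_a(claim)
    verdict_b = model_b(claim)
    if ABSTAIN in (verdict_a, verdict_b) or verdict_a != verdict_b:
        return None  # escalate to a human reviewer
    return verdict_a
```

With an abstain option, the escalation signal separates "the models cannot tell" from "the models read the rubric differently", which is the distinction this comment says is lost without it.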

by fumeux_fume2 hours ago|

Do you understand how problematic this is?

by skybrian2 hours ago|

I wouldn’t expect opinions to go into “unknown.” Maybe have an “it’s complicated” bucket.

by pjc504 hours ago|

If you can consistently construct "true but misleading" content, you may be qualified to work at a major newspaper.

by falcor843 hours ago|

> true but misleading

It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"

by IanCal3 hours ago|

Almost, but not entirely, quite unlike the truth.

by kevin_thibedeau3 hours ago|

Allegedly.

by daveguy3 hours ago|

As if right wing propaganda shows and manosphere blogs haven't been knocking those out of the park for the last decade+. Although I guess you could say flat out lies are more their jam. Newspapers at least require confirmed sources. You know, journalism.

by HarHarVeryFunny1 hours ago|

> True / Mostly True / Misleading / False

> Which category should something go in if it's "mostly false"?

For some reason they have chosen to call that "Misleading" rather than a more symmetrical "Mostly False", but the intent seems clear enough.

by embedding-shape4 hours ago|

> I guess the goal is to test the models and not the harness

Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.

by torben-friis3 hours ago|

>Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.

Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman.

This does not invalidate your point though. Things can be true and misleading.

by m10i1 hours ago|

> The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.

Merriam-Webster defines "mislead" as follows:

  1. (transitive verb) to lead in a wrong direction or into a mistaken action or belief often by deliberate deceit

  2. (intransitive verb) to lead astray; give a wrong impression

Presenting a "true fact" is optional when misleading someone.

by torben-friis1 hours ago|

Uh, you seem to be right. I can't check Oxford to confirm because there's a paywall, apparently.

The mental model I've always been taught is:

False, well intended -> mistake

False, bad intention -> lie

True, bad intention -> misleading

Bad intention, regardless of truth -> deceitful

The problem of classifying all bad intentioned statements as misleading is that it leaves you without a way to express "true +bad intention". While for generic bad intentioned statements regardless of truth we already have a word (deceit).

by SkyBelow3 hours ago|

Isn't this still assuming we can even determine what is true or false?

Newtonian physics is false, but it works well enough that we teach it in college. But our best models of physics are currently in disagreement, so can we even say they are true? Given the replication crisis, especially in social sciences, how many peer-reviewed findings can be called true? Even experimental results can be false (consider the studies that found FTL neutrinos, which were rejected as an error in the experiment; the error was eventually confirmed, but it took quite a lot of work, and in a softer field than physics, with a claim less absurd than FTL, the result would likely have long been accepted as a true finding).

Even in math, basic statements aren't really true or false, but more a question of "given these axioms, can we prove or disprove it" noting that we have different systems with different axioms. If we are talking basic sets, most people are using naive set theory which is inherently contradictory, which means that notions like true or false probably can't be considered well defined.

by flextheruler1 hours ago|

Newtonian physics doesn't just work well enough for education. It provides an incredibly accurate and precise model of the world except at extremes. The majority of engineering does not necessitate using theories of relativity. Both theories are incomplete models approximating reality and are very far from being false.

by michaelmrose1 hours ago|

True and False in general communication are judged on the best available evidence and expertise: a true statement contains no obvious contradictions or falsehoods under an optimistic parsing of meaning, language, and intent. Notably this leaves out misleading or missing data because those concerns are separate from truth and falsehood.

E.g. if I say the earth is round we optimistically parse round to include oblate spheroid and rate it true.

If I say that the earth is flat we rate it as false because there is no reasonable interpretation possible other than confusion or malice.

by xienze3 hours ago|

> but it leads the reader to a potential false implication that an average man is better than an average woman.

I think that's _you_ turning the statement into something much broader than intended. The claim is about engineers and you're jumping from "men are better than women in engineering" to "men are better overall."

To give a related example, "Most good NBA players are black." I don't think anyone would bother trying to couch this in a bunch of "well, for all we know that's just a function of more NBA players being black than white" arguments, nor would anyone be led to think "the average black man is better than the average white man" as a result of that statement. I _do_ agree however that there are some people who see rather narrowly-defined statements and turn them into something they're not...

by torben-friis1 hours ago|

>I think that's _you_ turning the statement into something much broader than intended.

My point is that it is possible for a reader to turn it that way, for a variety of reasons (lack of understanding of statistics, preexisting biases, or whatever). And that getting a reader to mistakenly generalize is the purpose of a misleading statement.

To mislead is to direct into a falsehood by implication even though the literally expressed facts are all true; the writer's bad intentions are necessary to qualify something as misleading I'd say, for the same reason that not all false statements are lies because to be a lie the speaker must know the statement is false and still use it. There are probably much better examples than the one I came up with on the fly, though.

by libria1 hours ago|

At least Gemini 3.5 is fair about it:

    Classify this claim: "Most good engineers are male."
    Misleading

    Classify this claim: "Most bad engineers are male."
    Misleading

And not particularly racially sensitive

    Classify this claim: "Most good NBA players are black."
    True

    Classify this claim: "Most good NHL players are white."
    True

It explained it is more confident when assessing the small, highly quantifiable population of sports professionals vs a very large, diverse population of "engineers".

by ForHackernews3 hours ago|

> Something can be simultaneously "misleading" and either true or false.

Sure they can. It might be a true fact that "100% of the murders committed in <town> over the last 25 years were committed by <some racial group>!" but actually it's a town of 750 people and there was only one murder during that time frame.

by jpfromlondon1 hours ago|

how is that misleading if it's a fact, it's only misleading if you presume to know the reaction or intent behind making such a claim, and without context we should be extremely careful in making such presumptions.

by bayindirh3 hours ago|

But the models are more intelligent than humans already and sentient beings, right? So they shall know the meanings innately. So, you don’t need to explain to them what they mean.

You may give them better instructions, but they should already have the intellect to understand the assignment.

Right, right?

by altcognito2 hours ago|

I know you're being facetious, but I think this is correct. The model might ask for clarification when given clearly borderline questions that tread the line between what is true, what is false, and even what is misleading. But there's the rub of someone being disingenuous and saying "no explanation! Just answer!" It was a trap to begin with.

I don't think there is anything wrong with the results of this test.

It would be more interesting if we compared them to human results.

If you have trouble distinguishing between human and LLM results, that's interesting.

Also, sentient is irrelevant to this test.

by simonw3 hours ago|

> But the models are more intelligent than humans already and sentient beings, right?

Only if you listen to charlatans.

by bayindirh3 hours ago|

True. If you didn't know my stance on AI already, here's a primer :) [0].

IOW, that comment was a sarcastic poke from someone who already supports AI workloads at work and has some knowledge about how all this works. ;)

[0]: https://notes.bayindirh.io/notes/Lists/Discussions+about+Art...

by theptip2 hours ago|

Another (IMO fatal) error is they don’t attempt to measure within-model variance.

The thing you find when you actually wire up a rigorous eval is that with tool calls like web search you are wide open to infra issues, flakes, and all sorts of non-determinism.

They really should be breaking out the numbers for the 3 without search (kinda meaningless for recent factual claims after knowledge cutoff) vs search agents. Lack of a “I don’t know” option completely invalidates results for the non-search models; they are basically guessing what seems like a probable answer, since they don’t know and aren’t allowed to say that.

I do agree the forced choice and “weak / strong” variants inflate the headline stat. To make that distinction you need a much more rigorous prompt, likely including ICL examples to illustrate what you mean by “mostly” instead of leaving this to the model to define.
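One simple way to put a number on within-model variance is to repeat each claim several times per model and measure how often the runs match the most common verdict. This is only a sketch: the function names and the idea of a fixed repeat count are assumptions, not something the study describes.

```python
from collections import Counter


def self_consistency(runs: list[str]) -> float:
    """Fraction of repeated runs that match the most common verdict."""
    if not runs:
        raise ValueError("need at least one run")
    top_count = Counter(runs).most_common(1)[0][1]
    return top_count / len(runs)


def unstable_claims(
    verdicts_by_claim: dict[str, list[str]], threshold: float = 1.0
) -> list[str]:
    """Claims whose repeated runs fall below the consistency threshold."""
    return [
        claim
        for claim, runs in verdicts_by_claim.items()
        if self_consistency(runs) < threshold
    ]
```

Comparing this per-model figure against the cross-model disagreement rate would show how much of the headline number is run-to-run noise.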

by kostaj2 hours ago|

Good idea about publishing intra-model variance data! Will include in the next version. Even if we put aside the two middle buckets (Mostly True and Misleading), that are somewhat subject to interpretation and hedging: On 21% of the claims still at least two models provide polar-opposite verdicts (one model saying True, and another saying False)
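For readers who want to recompute the polar-opposite figure from the published verdicts, the calculation could look roughly like this. The input layout (claim mapped to per-model verdicts) and the function name are assumptions about how the data might be arranged.

```python
def polar_opposite_rate(verdicts: dict[str, dict[str, str]]) -> float:
    """Share of claims where one model says True and another says False.

    `verdicts` maps claim id -> {model name: verdict label}.
    """
    if not verdicts:
        raise ValueError("no claims given")
    polar = sum(
        1
        for by_model in verdicts.values()
        if {"True", "False"} <= set(by_model.values())
    )
    return polar / len(verdicts)
```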

by vlovich1231 hours ago|

Of those 21% how many are time-dependent questions that are past the model’s training and require research to verify? Like the “did Ukraine attack Russia in the past week” question?

by feanaro3 hours ago|

> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

The "majority" in this case meaning about 51%, according to Wikipedia[1]? How could 51% ever be considered to be close to "all", such that "misleading" would be a valid answer?

Am I missing something?

[1]: https://en.wikipedia.org/wiki/Almond#Production

by thfuran3 hours ago|

It’s misleading because it’s false. But yes, I think false is quite plainly the better answer there.

by embedding-shape3 hours ago|

Humans can't even properly agree on what "majority" means in all contexts, in some it's "One option has more than half of the total" but for others it'd be "difference in votes between the first-place candidate in an election and the second-place candidate", as just one silly example.

https://en.wikipedia.org/wiki/Majority has a bunch of variations and contexts listed, where it might differ what "Majority" is actually referencing.

by hombre_fatal3 hours ago|

Since the agents were instructed to not explain their answer, you can't know if their answer was reasonable or not.

by kostaj3 hours ago|

The reason for the "No explanations, no qualifiers" in the prompt was to force the models to put the claim in one of the four buckets and answer with the bucket name only. It's a pure quantitative analysis (first in a series) and it does indeed lack the qualitative aspect.

by jermaustin11 hours ago|

structured output { "answer" : "Misleading", "reason" : "Almonds..." }

Have reason be optional and instruct it to only provide reason for the middle "Mostly True" or "Misleading".

by hombre_fatal3 hours ago|

Sure, but people are drawing conclusions beyond "LLMs said different words" and trying to use it to analyze whether LLMs were wrong about the underlying facts, but that information isn't available to us.

by nullsex2 hours ago|

The 51% is US, the question was about California.

The statistic is about commercial production, not the number of almonds grown.

Looks safe to say that even a majority of almonds are not grown in California.

by guywithahat2 hours ago|

Here (https://en.wikipedia.org/wiki/Almond_cultivation_in_Californ...) I have

> California produces 80% of the world's almonds and 100% of the United States commercial supply

But regardless of which number we use, California represents a large portion of US almond production, so much so that misleading could be an acceptable answer if the LLM interpreted the prompt as an exaggeration. I think the example was apt

by ant6n2 hours ago|

"All almonds are grown in the U.S. state of California." implies "No almonds are grown outside the U.S. state of California."

You find one almond tree outside of California that grows almonds, where such almonds are grown intentionally, and the claim is false.

by faxmeyourcode2 hours ago|

I had a hunch that opus 4.7 hedged more than other models - and it turns out it's true

    model                 total_claims  hedged_count  hedged_pct
    claude-opus-4-7       1000          451           45.1
    sonar-pro             1000          391           39.1
    gpt-5.4               1000          277           27.7
    gemini-3-retrieval    1000          129           12.9
    gemini-3-pro          1000          60            6.0

datasette query here

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
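For anyone without the Datasette link handy, the same kind of tally can be sketched in a few lines of Python. This assumes "hedged" means a verdict in one of the two middle buckets (Mostly True or Misleading) and that the data is available as (model, verdict) pairs; both are assumptions about the query, and the function name is illustrative.

```python
from collections import defaultdict

# Assumed definition of a hedged verdict: the two middle buckets.
HEDGED = {"Mostly True", "Misleading"}


def hedged_pct(rows: list[tuple[str, str]]) -> dict[str, float]:
    """Percentage of hedged verdicts per model, rounded to one decimal."""
    total: dict[str, int] = defaultdict(int)
    hedged: dict[str, int] = defaultdict(int)
    for model, verdict in rows:
        total[model] += 1
        if verdict in HEDGED:
            hedged[model] += 1
    return {m: round(100 * hedged[m] / total[m], 1) for m in total}
```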

by kostaj2 hours ago|

This is in line with my observations and tests as well. Also supported by the distribution of the verdicts across the 4-buckets -- Gemini uses the middle buckets (Mostly True and Misleading) much less often - 6% combined for Gemini w/o search. And Opus uses them the most - 45% combined. Looks like Gemini is calibrated to be confident and Opus to be careful.

by hedora17 minutes ago|

I think the headline result is the cross product table.

Gemini Pro + Search agreed with Gemini Pro w/o Search 75% of the time, and with everybody else about 50% of the time. No other model had access to search.

So, search is not improving the quality of fact checking 75% of the time (probably a bad system prompt and/or bad fact checking queries), and if asked to flip a coin, then the models do.

by parsimo20103 hours ago|

This is a great example of why prompt engineering is still relevant. Without providing definitions and examples and a well defined rubric, you’re going to see different models disagree by a level in either direction. When you get more prescriptive the models tend to agree better.

I’ve experimented with AI grading for undergraduate math courses, and see basically the same thing. If you just tell the AI “grade this problem and assign a letter grade” then I’ve only seen about 30% agreement between a human assigned grade and the AI assigned grade. But over 75% agreement if you say a “match” is within one letter grade. And to get better agreement you have to spend a lot more time on the rubric- what kinds of mistakes are a big deal, what kinds of mistakes are not a big deal, how much work is required to be shown to get credit, a couple examples of each letter grade. Once you have done that, the AI gets a lot better agreement with human graders, but it is hard to know when you’ve given enough guidance for a problem.
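The difference between exact agreement and "within one letter grade" agreement can be made concrete with a small helper. The five-letter scale and the function name are assumptions for illustration; the percentages quoted above come from the commenter's own experiments, not from this code.

```python
# Assumed letter scale, ordered from best to worst.
GRADES = ["A", "B", "C", "D", "F"]


def agreement(human: list[str], ai: list[str], tolerance: int = 0) -> float:
    """Fraction of items where the AI grade is within `tolerance` steps of the human grade."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty lists of equal length")
    index = {grade: i for i, grade in enumerate(GRADES)}
    hits = sum(abs(index[h] - index[a]) <= tolerance for h, a in zip(human, ai))
    return hits / len(human)
```

Calling it with `tolerance=0` gives strict agreement and with `tolerance=1` gives the looser "match within one letter grade" figure, which is the same distinction as treating adjacent verdict buckets as a weak disagreement.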

by kostaj2 hours ago|

That's a valid point. During the preliminary research, we also tried more explicit prompts (with an explanation for each of the 4 buckets), as well as a five-bucket rubric (with an Abstain option). Will show in a follow-up paper how the concise vs explicit prompt impacts the distribution of the verdicts and the level of disagreement. One issue to note with the longer prompts is that they open too much room for discussion around the exact prompt used. Probably we should preregister the prompt before running any further tests.

by MattRogish2 hours ago|

The other thing I suspect is that "Just give me True/False" cuts off a large amount of the search space a modern-day LLM uses to help it answer questions (you can see it in reasoning traces but the act of writing the explanation helps guide it toward a better answer and gives it better likelihood it backtracks on a bad decision).

If you let it spew out an explanation along with the answer, I'm curious if the accuracy will improve (I suspect it will).

by kostaj1 hours ago|

Good point. Will publish in the next version also the results with a prompt that allows the models to "think out loud" before providing the final verdict.

by roxolotl3 hours ago|

If we’re going to use LLMs as oracles I don’t think the prompt is unreasonable. They are being sold as geniuses and people are treating them as such especially given the characterization of AI in science fiction as overly correct. A perfect tool that has “genius level intelligence” would answer correctly.

by simonw3 hours ago|

What's the correct answer for "During a private Saturday call, Democratic members of the United States House of Representatives from Virginia and Hakeem Jeffries discussed strategies after losing a redistricting case at the Supreme Court of Virginia, including trying to flip two or three Republican-held seats under the existing map."?

You can only say True, False, Mostly True or Misleading.

(And you're not allowed to search for information.)

by kostaj3 hours ago|

Search was enabled for 2 of the 5 models -- Gemini and Sonar Pro. The disagreement between them is still high - different verdict on 42% of the claims. Fully agree, that some of those claims are hard to classify for a human as well -- the real-world messiness...

by dcreater2 hours ago|

Why was it enabled for only 2 of the 5?

Other burning questions: What methodology was used to choose the question set? Why not allow explanations? How many passes were done for each LLM?

by wat100003 hours ago|

Genius level intelligence will tell you to get lost with your "no explanations" nonsense and tell you why those categories don't make sense and why the question doesn't fit neatly into your boxes.

by jerf4 hours ago|

This seems like another case where the models are acting like humans. Assuming they were not allowed to search the web, I wouldn't expect the models to necessarily have detailed information about all of these things directly in their training set. As large as they are, they are only so large, and they only have so much room for "information storage" in them, and there's a lot more things they need to fit into their numbers.

This test is of only marginal utility in the real world compared to an AI with access to the web. While I wouldn't expect an AI with access to the web to result in Platonic Truth any more than it would in the hand of a human, it would probably get a lot closer to something humanlike.

I recall about a year ago how we were discussing basically turning web search into LLM queries, and I remember never being clear whether people meant simply directly querying AIs or turning them loose on the web. The former is what this is testing and is fairly transparently stupid, just by an information theoretic argument that the AIs simply can't contain all the answers to every query in them, they're just not large enough (and really can't be, practically). I've had good results with the latter, when using dedicated AI resources that I'm paying for (not the stuff coming out of the search engines right now, which I find are often quite terrible). Even non-frontier models can do OK when they've got good results sitting right there to look at. Again, the standard I'm applying here isn't that they yield Absolute Truth, but just that when I follow the links back, they basically say what the AI said they did and the summary is reasonable. I wouldn't expect a human to do better in a casual overview, not that the result is perfect.

by vstollen2 hours ago|

Can you share what you mean by this?

> when using dedicated AI resources that I'm paying for

Are there API-based search providers that structure their results differently?

by afavour3 hours ago|

While I agree with what you’re saying the typical AI agent doesn’t say “I’m not totally sure about this, should I search the web?”. It often just spits out a reply based on its knowledge.

by simonw3 hours ago|

That was true a year ago, I don't think it's true today. I can't remember the last time I saw Claude or ChatGPT confidently answer a question that they should have searched for instead.

If you watch their reasoning traces they often say things like "this is a well-known historical fact so I don't need to search for it", or more frequently they spit off a bunch of searches.

by aftbit3 hours ago|

Anecdotally, it still happens a ton to me. They also still make super simple logic errors that they immediately reverse when pressed. For example, I asked Opus 4.7 last night how to cool off my room without making it too humid inside (indoor temp 78°F, humidity 45%; outdoor temp 64°F, humidity 99%). It suggested opening a window and assured me that the humidity would not rise above around 60% which would still be comfortable. I asked it to justify that and it said:

>You're absolutely right about the humidity — I was sloppy with that aside. If you ventilate enough to meaningfully cool the room, you're replacing indoor air with outdoor air wholesale, and you'd converge on outdoor conditions: 64°F and near-100% RH. That's miserable. The 55-60% figure I tossed out was hand-wavy nonsense — it would only hold if you barely cracked the window and mixed a tiny fraction of outdoor air in. At any ventilation rate that actually cools, you're just moving outside air inside.

reply

upvote

by kostaj3 hours ago|

[-]

Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. The disagreement between them is still quite significant - 42%.

reply

upvote

by simonw3 hours ago|

[-]

Here are those disagreements:

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

One example:

Researchers estimate that the average person ingests about 5 grams of plastic per week, which is approximately the weight of a credit card.

Gemini retrieval: Misleading

Sonar pro: Mostly True

reply

upvote

by jeffbee2 hours ago|

[-]

Internally the statement is perfectly true: some researchers did estimate this, and the credit card is a fair proxy for a 5g mass.

Was the research flagrantly incorrect? Yes. But that does not affect the truth of the statement.

reply

upvote

by brokensegue2 hours ago|

[-]

yeah i really don't like the corpus of statements and it makes me doubt lenz. consider

> “Artificial intelligence will cause widespread job loss among software engineers.”

https://lenz.io/c/ai-software-engineers-job-loss-impact-05e4...

this is a statement about the future. who knows? dataset also includes

> Robots will not replace human teachers in schools in the near future.

or

> Papua New Guinea has very few female members of parliament.

what counts as very few?

> “Taurine supplementation supports mood and emotional health in humans.”

why is this labeled as misleading? i'm not even sure when I'm supposed to use the misleading label

> Anaximander was the first scientist in recorded history.

this is a judgement call as the term scientist didn't exist.

the claims that feel actually solidly answerable seem to have much better LLM performance

reply

upvote

by kostaj1 hours ago|

[-]

Agree that some of the claims are forward-looking. That reflects the messiness of real-world, real-user fact checks. No ground-truth verdicts are provided or used in the study though. It only measures the level of agreement between the selected models, not which one is right on which claim. I.e. none of the claims is actually labelled.

reply

upvote

by brokensegue1 hours ago|

[-]

were you involved in making the study? your bio says you work for them so you should probably indicate that in your comments.

lack of agreement when there is no singular correct answer (or any answer at all) isn't a useful metric

I ran into a lot of these kinds of issues when working on the Citation Needed WMF project (and related extensions). Truth is so often very nuanced.

reply

upvote

by simonw1 hours ago|

[-]

They introduced themselves as the study author here: https://news.ycombinator.com/item?id=48307887#48307899

reply

upvote

by brokensegue53 minutes ago|

[-]

ah. I missed that.

reply

upvote

by gbuk20132 hours ago|

[-]

An interesting tangent on this is: how many answers to these (or any number of factual questions) do you (as in anyone) actually know. Not believe you know, but actually know.

Knowing something is different to reading about something, or hearing something from someone. And yet this is often confused as knowledge. In this way are we all that different from AI - we have some data and we regurgitate it as knowledge. Bad data, wrong answer. Except humans can also throw in some emotion to really muddle things up. :)

reply

upvote

by vjvjvjvjghv3 hours ago|

[-]

"Output exactly one label: True, Mostly True, Misleading, or False. No explanations, no qualifiers."

That's exactly the stupidity of the public discourse these days. People feel compelled to take a clear position although there is much more subtlety in many issues. It's not ok to say "I don't know", "it depends" or "as far as I know". And then people feel they need to defend this position no matter what new information comes up.

reply

upvote

by segmondy4 hours ago|

[-]

Yup, if anything this should be a guide on how not to eval a model. Furthermore, let's say the labels were non ambiguous, why would we care about alignment between the models? The only number I would personally care about is percentage of correct answers so I know which models to pick. I reckon with clear and non ambiguous prompts that we would see huge agreement if not 100% on real world facts. The huge models are scary good in their world knowledge.

reply

upvote

by kostaj3 hours ago|

[-]

This paper covers only the disagreement between models and established only the floor of the error, based on the disagreement, but not which model is better. Planning to follow up with another study to benchmark against human-labelled verdicts still using a corpus that the models have not seen during training.
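One way to read "floor of the error" is as simple counting: since only one bucket can be correct per claim, at most the largest group of agreeing models can be right, so the rest must be wrong. A minimal sketch of that arithmetic, assuming each claim is a list of one verdict per model (the function names and the toy data are hypothetical, not taken from the study):

```python
from collections import Counter

def min_wrong_verdicts(verdicts):
    """Lower bound on wrong verdicts for one claim.

    At best, the most common label is the correct one, so every
    verdict outside that group is necessarily wrong.
    """
    counts = Counter(verdicts)
    return len(verdicts) - max(counts.values())

def error_floor(panel):
    """Fraction of all verdicts that must be wrong, across all claims."""
    total = sum(len(verdicts) for verdicts in panel)
    wrong = sum(min_wrong_verdicts(verdicts) for verdicts in panel)
    return wrong / total

# Toy data (made up): one unanimous claim, one 3-vs-2 split.
toy_panel = [
    ["True"] * 5,
    ["True", "True", "True", "False", "False"],
]
floor = error_floor(toy_panel)
```

This only bounds the error from below: a unanimous panel contributes zero to the floor even if all five models are wrong.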

reply

upvote

by aspenmartin2 hours ago|

[-]

You also need to involve better measures of agreement that are standard in the literature, like Krippendorff's alpha with an ordinal metric. So many footguns in this methodology.

reply

upvote

by xyzzy1233 hours ago|

[-]

> "On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia"

I actually don't know which way you came down on that one?

I think strictly it's false but "mostly true" would be justifiable? (as in, to say it's false would be misleading if it led the reader to assume there was no attack around that time).

https://www.washingtonpost.com/world/2026/05/17/ukrainian-dr...

It seems it happened Saturday 16th overnight into the 17th, not the 18th. I see this a LOT with fact checking. It shouldn't be this way, but political bias seems to nudge people into making calls land one way or the other with selective application of pedantry.

reply

upvote

by bastawhiz2 hours ago|

[-]

That's ten days ago. As the commenter pointed out, without a web search tool there's no possible way for the model to know whether it's true or not, and the people conducting the study didn't give the models a way to respond with "I don't know".

reply

upvote

by simonw3 hours ago|

[-]

It's impossible to answer if you don't have a search tool, and three out of the five tested models didn't have a search tool.

reply

upvote

by xyzzy1232 hours ago|

[-]

Thanks; I didn't spot that they disabled tools in the harness. Also they don't provide an "out" to allow the models to express uncertainty so the instructions force a guess to be made.

As an aside though it's still funny that the two tools WITH search also disagreed.

reply

upvote

by gowld1 hours ago|

[-]

It's impossible to answer unless you have a *100% complete search tool*.

No system can know everything. It doesn't matter how many tools you give it. It's always wrong to force binary True / False without shades of "I don't know".

reply

upvote

by cocoflunchy3 hours ago|

[-]

It's not in the training data, so there is no way for the model to know.

reply

upvote

by skrebbel3 hours ago|

[-]

I really struggle to believe that this was just a little oopsie. I flagged the article, it seems more misleading than the average Claude hallucination.

reply

upvote

by coldtea3 hours ago|

[-]

>Update: here's a better example: "Incomplete Egypt visa application forms are among the most common reasons Egyptian visa applications are rejected."

The models were split between "true" and "mostly true". Given the "among the most" language either of those answers means effectively the same thing.

So the models were right? The actual criterion should be whether "Incomplete Egypt visa application forms" are indeed "among the most common reasons" or not.

That "true" and "mostly true" means effectively the same thing is irrelevant. It could just as well trip me up, and I'm a human. If somebody told me either answer, I'd still consider them right if the basic fact was right.

reply

upvote

by simonw3 hours ago|

[-]

This study treats models disagreeing - returning both true and mostly true - as a failure.

reply

upvote

by pjdesno2 hours ago|

[-]

They overstate their results in the headline.

In section 2, 34% of cases are found to have "substantive" disagreements differing by 2 or more buckets - True + Misleading, Mostly True + False, or True + False.

This is probably a better measure than the headline one. It's still a concerning fraction, although some fraction is no doubt due to forcing "I don't know" cases to return an answer anyway.
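The bucket-distance idea can be sketched in a few lines, assuming the labels are ordered True < Mostly True < Misleading < False and that a claim's disagreement is the gap between its most extreme verdicts (the `RANK` mapping and function names are illustrative assumptions, not the study's code):

```python
# Assumed ordering of the four buckets, from most to least true.
RANK = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}

def spread(verdicts):
    """Distance in buckets between the two most extreme verdicts."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)

def classify(verdicts):
    """Bucket a claim by how far apart the panel's verdicts are."""
    gap = spread(verdicts)
    if gap == 0:
        return "unanimous"
    if gap == 1:
        return "adjacent"      # e.g. True + Mostly True
    if gap == 2:
        return "substantive"   # e.g. True + Misleading
    return "polar"             # at least one True and one False

def share(panel, kinds):
    """Fraction of claims whose classification is in `kinds`."""
    hits = sum(1 for verdicts in panel if classify(verdicts) in kinds)
    return hits / len(panel)
```

Under this reading, the "2 or more buckets" figure would be `share(panel, {"substantive", "polar"})`, and any-disagreement would be everything except "unanimous".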

reply

upvote

by kostaj2 hours ago|

[-]

Agree with @pjdesno, that the 34% substantive or polar disagreement might be a better headline number. Or even the 21% polar disagreement (at least one model True, and at least one model False), which is still high for many real-world applications.

reply

upvote

by ashirviskas2 hours ago|

[-]

I created this sheet to get proper model accuracy using the Lenz data, check it out.

Note: It may still not be perfectly accurate representation of truth as it uses user submitted data. I also used AI to build the sheet.

https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3...

reply

upvote

by kostaj1 hours ago|

[-]

Awesome. We do plan to human-label the 1,000 claims and then compare Lenz' performance vs the 5 models. We've done some limited internal research with 150 claims, but more are needed for statistical significance.

reply

upvote

by hombre_fatal3 hours ago|

[-]

Yeah, scrolling through the examples, you have no idea where the models actually disagree on the underlying facts when it's just "X vs Mostly X" or "Mostly X vs Misleading" or "False vs Misleading". Or even True vs False -- without seeing the explanation, then I cannot necessarily compare two answers.

The study is about whether they said the same phrase which is a much weaker claim than people in the comments are reacting to.

Reminds me of this professor I had who thought it was epic to always respond to our questions with "it depends" before hashing out two very different but technically correct answers. It was obnoxious and he saw it as his tag line, but he had a point about nuance.

reply

upvote

by singpolyma34 hours ago|

[-]

False vs misleading doesn't seem like a disagreement?

reply

upvote

by wongarsu4 hours ago|

[-]

According to the benchmark it is. "Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False)"

reply

upvote

by thfuran3 hours ago|

[-]

That claim is both false and misleading.

reply

upvote

by kostaj4 hours ago|

[-]

Yes, they are much closer verdicts. True and Mostly True are also close. Used Krippendorff's α (ordinal) to not penalize much closer disagreements. 21% of the claims have models that are on the polar opposite sides - at least one True, and at least one False.
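For anyone who hasn't met it, here is a minimal sketch of Krippendorff's α with the ordinal distance, following the standard coincidence-matrix formulation. It assumes the labels are ordered True < Mostly True < Misleading < False and that each claim is a list of verdicts; the function name and toy data are made up for illustration and this is not the study's actual code:

```python
from itertools import permutations

LABELS = ["True", "Mostly True", "Misleading", "False"]  # assumed ordering

def krippendorff_alpha_ordinal(units, labels=LABELS):
    """Krippendorff's alpha for ordinal labels.

    `units` is a list of claims, each a list of verdicts (one per model).
    Claims with fewer than two verdicts carry no pairing information.
    """
    rank = {label: i for i, label in enumerate(labels)}
    k = len(labels)
    # Coincidence matrix: each ordered pair of verdicts within a claim
    # contributes 1 / (m - 1), where m is that claim's verdict count.
    coincidence = [[0.0] * k for _ in range(k)]
    for verdicts in units:
        m = len(verdicts)
        if m < 2:
            continue
        for a, b in permutations(verdicts, 2):
            coincidence[rank[a]][rank[b]] += 1.0 / (m - 1)

    totals = [sum(row) for row in coincidence]
    n = sum(totals)
    if n <= 1:
        raise ValueError("not enough paired verdicts")

    def delta2(c, d):
        # Ordinal distance: squared count of verdicts lying between
        # the two ranks, with the end ranks weighted by one half.
        lo, hi = min(c, d), max(c, d)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    pairs = [(c, d) for c in range(k) for d in range(k)]
    observed = sum(coincidence[c][d] * delta2(c, d) for c, d in pairs) / n
    expected = sum(totals[c] * totals[d] * delta2(c, d) for c, d in pairs) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when every verdict is identical")
    return 1.0 - observed / expected

# Toy data: same panel, but the one split claim is a near miss vs. polar.
near_miss = [["True", "True"], ["False", "False"], ["True", "Mostly True"]]
polar = [["True", "True"], ["False", "False"], ["True", "False"]]
alpha_near = krippendorff_alpha_ordinal(near_miss)
alpha_polar = krippendorff_alpha_ordinal(polar)
```

On the toy data the near-miss panel scores higher than the polar one, which is the point of the ordinal metric: True vs Mostly True costs less than True vs False.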

reply

upvote

by simonw3 hours ago|

[-]

Here are the claims with at least one True and at least one False:

https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...

A few examples:

> Ruskin Bond was born on May 19, 1934, in Kasauli, Himachal Pradesh, India.

> In the Libra clubs' contract with Grupo Globo for broadcast rights through 2029, the audience-revenue distribution equals 30% of the fixed amount the clubs receive.

reply

upvote

by kostaj4 hours ago|

[-]

Used "No explanations, no qualifiers." to force the models to answer only with one of the four labels. It's worth running a separate test with more explanation in the prompt on how to classify between the four buckets.

reply

upvote

by moritzwarhier3 hours ago|

[-]

The examples seem intentionally diverse, but I haven't seen one that I would be surprised for someone to post about in the format of "ChatGPT/Gemini/Claude/Qwen/... says:"

So the examples are good, I think. The rest is philosophy.

The links you posted only show a frozen loading spinner for me (iOS Safari).

(I looked at the csv in Numbers instead)

reply

upvote

by simonw3 hours ago|

[-]

Weird, I'm loading them in Mobile Safari myself.

reply

upvote

by moritzwarhier2 hours ago|

[-]

Sorry, I didn't wait quite long enough after the last output line appeared.

After a couple of seconds, the result does appear.

Happened to be just within my threshold for considering it broken, because the URL bar was "finished", and the spinner doesn't spin, but the last point is probably caused by my a11y settings (prefer no animations and no autoplay).

reply

upvote

by simonw2 hours ago|

[-]

Thanks for confirming! It's fetching a complete copy of Python compiled to WebAssembly so it's a miracle it loads as quickly as it does.

reply

upvote

by neversupervised2 hours ago|

[-]

This is not how people use LLMs. If you ask one of these questions you’d get a longer answer, often grounded on the internet. I speculate that conditional on a smart human operator interpreting the results, such interpretations across vendors converge more often than this report makes it seem.

reply

upvote

by tracker11 hours ago|

[-]

Even then, there can often be substantive disagreements based on context. Hence the need for even a mostly true or mostly false bucket.

reply

upvote

by post-it3 hours ago|

[-]

Fwiw the two models that did have access to search disagreed with each other on the bombing one:

> 7.1 Model selection

> Five frontier models, chosen to cover two capability surfaces:

> Parametric (training-only): GPT-5.4 (OpenAI), Claude Opus 4.7 (Anthropic), Gemini 3 Pro (Google)

> Retrieval-augmented: Gemini 3 Pro + Search (Google), Sonar Pro (Perplexity)

reply

upvote

by simonw3 hours ago|

[-]

[dead]

reply

upvote

by wrsh074 hours ago|

[-]

Thanks for the links and digging! It's an interesting question, but the methodology has serious problems, and it would be more interesting to me if they allowed models to provide justification.

I expect the models are inferring quite a bit from the short prompt, and with structured outputs it would be quite easy to have them give the one word response in one field and explain why in another
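Roughly what I have in mind, as a sketch: ask for a JSON object with one field for the label and one for the reasoning, then validate the label on the way in. The field names `verdict` and `justification` are my own invention here, and this deliberately doesn't call any vendor's structured-output API; it only shows the parsing side using the standard library:

```python
import json
from dataclasses import dataclass

ALLOWED = ("True", "Mostly True", "Misleading", "False")

@dataclass(frozen=True)
class Verdict:
    verdict: str         # one of ALLOWED
    justification: str   # free-text explanation from the model

def parse_response(raw: str) -> Verdict:
    """Parse a model's JSON reply and reject labels outside the rubric."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = data.get("verdict")
    if verdict not in ALLOWED:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return Verdict(verdict, str(data.get("justification", "")))

# Hypothetical reply, written by hand for illustration.
example = parse_response(
    '{"verdict": "Misleading", "justification": "Most, but not all, almonds."}'
)
```

The agreement stats can still be computed on the `verdict` field alone, while the `justification` field lets a reader see whether two different labels actually reflect different facts.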

reply

upvote

by andai4 hours ago|

[-]

Thanks. The first link is a spreadsheet. Here's a web-readable version.

https://docs.google.com/spreadsheets/d/e/2PACX-1vSPLSv1P8Tqm...

reply

upvote

by ashirviskas2 hours ago|

[-]

I used AI to scrape the website and help build "Accuracy" comparison that everyone wants, thanks for this link!

https://docs.google.com/spreadsheets/d/e/2PACX-1vSnZlURmyYX3...

reply

upvote

by anilgulecha3 hours ago|

[-]

Disagree is such a loose/wimpy study. Add in a grounded/expected response, and then it becomes a better benchmark (because it'll force the author to actually think about choices presented to the LLM).

reply

upvote

by kostaj3 hours ago|

[-]

Will add a human-labelled expected response and measure against it in follow-up research. This one only captures the disagreement between the models, but not which model is right/wrong.

reply

upvote

by jstummbillig4 hours ago|

[-]

It's all fairly lazy to a degree that is mildly confusing. I also feel this among other issues would have become obvious if they had bothered to include a human fact checker baseline (i.e. asked multiple human fact checkers the same questions).

reply

upvote

by entrope3 hours ago|

[-]

I do not think it is "lazy". Those labels are ones that human fact-checkers have been using for a decade or more. I think those human fact-checkers use those terms knowing full well that there is overlap and ambiguity between them. So I think this study ends up mixing three effects: how LLMs interpret the claims as statements about the world, how LLMs reduce that to a four-category judgment, and the inherent ambiguities of those labels as natural language. It's a quantification of those three factors combined, but not powerful enough to distinguish their relative sizes.

reply

upvote

by jstummbillig3 hours ago|

[-]

I don't see how something being lazy for a decade makes it any less lazy. And lazy still seems right to me: They make a misleading point by omitting to collect and present important data. If the headline read "LLMs disagree on 67%, humans disagree on 75%" it would clearly project something very different.

Granted, there certainly are other unflattering adjectives one could have chosen to describe this instead.

reply

upvote

by kostaj2 hours ago|

[-]

Quick note on the second effect - how LLMs reduce that to a four-category judgment: On 21% of the claims at least two models provide polar-opposite verdicts (at least one model False, and at least one model True). This might be a better measurement of the strict disagreement than the 67% disagreement on the four-bucket rubric.

reply

upvote

by as125j1 hours ago|

[-]

You can try to dispel the study here and get voted to the top by the AI-invested.

But we all know from our own daily experiments that models lie, models disagree, models make up stuff, models say one thing on one day and the opposite on the next.

The figures in this study are quite conservative. And the lying gets worse because everyone is saving tokens and giving cached answers right now.

LLMs are a failure, and you'll be remembered for promoting hot air and the destruction of a perfectly good profession.

reply

upvote

by Someone3 hours ago|

[-]

For those questions, it wouldn’t surprise me at all if five well-educated intelligent humans disagreed on over two out of three of them.

I would answer “don’t know” on many, but that’s not an option.

reply

upvote

by kostaj3 hours ago|

[-]

Yes, inter-human-annotator disagreement is also high on similar types of questions (AVeriTeC) - inter-panel agreement: κ=0.619. Tried giving the models a fifth option, Abstain, but some models seem to use it to "avoid answering hard questions" more than others.

reply

upvote

by WhitneyLand3 hours ago|

[-]

So in other words if the research had tried to assign a severity to the mistakes models made the entire paper may collapse as uninteresting?

reply

upvote

by malfist4 hours ago|

[-]

> All almonds are grown in the U.S. state of California

This isn't misleading, it's flat out false. Characterizing misleading as also acceptable isn't valid here. If you go and ask anyone on the street if this is true, false or misleading, I'm sure almost everyone would say it's false. After all, I can grow almonds myself.

reply

upvote

by Forgeties794 hours ago|

[-]

I really don’t buy the almond explanation you’re giving. That requires the level of logic a kindergartener has. It’s a very simple all or nothing question.

If LLM’s are really supposed to be as consistently useful as they’re made out to be they should all spit out “false.”

reply

upvote

by camillomiller4 hours ago|

[-]

>> The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

I don’t understand your point. That claim is factually false and as such it’s easy to logically reply “false”. What’s the nuance here? I can’t see any

reply

upvote

by j452 hours ago|

[-]

I feel like the prompting could be tweaked to improve response.

Models often have a reasoning/thinking/research mode that is triggered by asking slightly differently.

Still though, Gemini can be a little weak on this front by default but can be aligned to behave better.

reply

upvote

by tosh4 hours ago|

[-]

ty for digging this up, appreciate the time saving

reply

upvote

by nonethewiser2 hours ago|

[-]

Misleading is not analogous with True or False.

Depending on the question, True or False can be objectively right/wrong. Misleading is going to be a judgement call.

This is the inherent problem with "fact checking." It's hard to be completely objective. Even when the question has an objective answer, simply choosing where to look and what facts to verify is itself a bias. Looking at this instead of that, or looking at this but not also this other thing that adds context, etc.

Frankly I think disagreeing often is the expected outcome. Fact checking is just kinda bullshit. It's spin dressed up as objectivity. I hope people remember that "fact checking" is a relatively modern thing.

reply

upvote

by dfxm123 hours ago|

[-]

The almond thing is false, but I'd argue that "misleading" might be defensible if you were to accompany it with "the majority of almonds are grown in California, but not all of them".

If you argue this, you would be arguing against reality and the English language so as to not upset AI. It's important to understand that AI is very much fallible.

reply

upvote

by empath753 hours ago|

[-]

[dead]

reply

upvote

by johnbarron4 hours ago|

[-]

Your reply would have more credibility if, instead of commenting on this 25 min after it was posted just to nitpick some of the questions... you had tried to reproduce the research.

As a well known commentator on all things LLM...Will you publicly commit here, to try to reproduce the study, and make a post on how your percentages might differ or agree?

reply

upvote

by simonw3 hours ago|

[-]

Why would I do that?

My comment here was meant to save people time in understanding the study. I was entirely open about what I did, and provided tools to help other people come to their own conclusions.

I don't think I need to spend more time on this than I have.

reply

upvote

by johnbarron3 hours ago|

[-]

>> Why would I do that?

I agree you don't owe anyone a reproduction, but you also don't owe anyone an effort to discredit the study, and you did it.

>> I don't think I need to spend more time on this than I have.

How pious of you. I am still looking into the credibility of the study. It will take me more than 25 min...but I am really looking forward to see what this means for this 10 trillion industry.

I can however notice you had enough urgency to publicly critique the study within 25 minutes, and your comments carry weight, but when asked about checking whether the headline result actually holds, the answer is “why would I?”

reply

upvote

by simonw3 hours ago|

[-]

I've seen enough of this study to be confident in warning people not to take it at face value.

The headline result definitely does not hold, given that the task involves many questions that cannot be answered but there's no option for "cannot be answered" - so models are forced to reply effectively at random.

I don't think this study is good enough that I should amplify it on my own blog, or bad enough that I should criticize it in a venue any more prominent than some Hacker News comments.

reply

upvote

by nullsex2 hours ago|

[-]

Why are you bending backwards this much to make results appear better than they are?

The article might be a bit sensationalistic, rigour could be better and the data might have flukes... But your comment is overcorrecting and nitpicking framed as analysis.

I get the same feeling in several of your posts recently.

Same with persisting to showcase the pelican-on-a-bicycle as a useful sample when it's obviously trained on and for, for those very posts. It stopped being cute last year.

Are you being paid or do you have shares? You'd get the attention whichever angle you put here. These corporates don't need you defending them. Humanity might need you however.

reply

upvote

by simonw2 hours ago|

[-]

Nobody is paying me to hang out on Hacker News highlighting potential flaws in research. That's my own weird hobby.

My disclosures for my blog are here: https://simonwillison.net/about/#disclosures

reply

upvote

by th0ma528 minutes ago|

[-]

[dead]

reply

upvote

by kordlessagain4 hours ago|

[-]

Give a model a crawler tool (like Grub.nuts.services) and your "problem" goes away.

reply

upvote

by jannyfer4 hours ago|

[-]

Thank you, my eyes glazed over when I saw the article was written with AI.

reply