Because nearly all benchmarks measure "accuracy" by giving you a point for a correct answer, and 0 points for everything else. If you have 100 questions you are 10% certain on, answering "I don't know" to all of those leads to 0 points, answering all of them as if you are confident leads to an expected value of 10 points. So that's what most AIs are trained to do
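
To make that arithmetic concrete, here is a minimal sketch. The function name and parameters are made up for illustration and do not come from any benchmark's tooling; it assumes 1 point for a correct answer and 0 for anything else, including "I don't know".

```python
def expected_accuracy_score(num_questions: int, p_correct: float, answer: bool) -> float:
    """Expected points under accuracy-only scoring: 1 for correct, 0 otherwise."""
    if not answer:
        # "I don't know" earns the same as a wrong answer: nothing.
        return 0.0
    # Each attempted question is worth p_correct points in expectation.
    return num_questions * p_correct


abstain_all = expected_accuracy_score(100, 0.10, answer=False)
guess_all = expected_accuracy_score(100, 0.10, answer=True)
print("abstain on all 100:", abstain_all)
print("guess on all 100:  ", guess_all)
```

Under this scoring, guessing can never do worse than abstaining, which is the incentive described above.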

AA-Omniscience is the only AI benchmark I know of where randomly guessing gets you a lower average score than answering all questions with "I don't know"

reply
AA-Omniscience Index gives +100 for correct, 0 for "I don't know" and -100 for incorrect.

For your scenario the answer-everything-confidently strategy will give an average of -80 (10 correct at +100 and 90 incorrect at -100, averaged over 100 questions). Saying "I don't know" to all will give 0.
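
A rough sketch of that calculation, assuming only the +100 / 0 / -100 scheme described in this comment. The function name is hypothetical and the real benchmark may aggregate differently. With 10 of 100 confident guesses correct, it works out to (10 - 90) / 100 * 100 = -80.

```python
def omniscience_style_index(correct: int, incorrect: int, abstained: int) -> float:
    """Average of +100 per correct, 0 per abstention, -100 per incorrect answer."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("need at least one question")
    return 100.0 * (correct - incorrect) / total


# The scenario from the parent comment: 100 questions, 10% chance of being right.
guess_all = omniscience_style_index(correct=10, incorrect=90, abstained=0)
abstain_all = omniscience_style_index(correct=0, incorrect=0, abstained=100)
print("answer everything:", guess_all)
print("abstain on all:   ", abstain_all)
```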

A lot of models have negative AA-Omniscience Index.

They also do have AA-Omniscience Accuracy and AA-Omniscience Hallucination Rate that handle "I don't knows" differently.

https://artificialanalysis.ai/evaluations/omniscience

reply
It should be 1 for correct, 0 for don't know and -1 for wrong.

They are much better incentives. In real life a wrong answer is much more damaging than a don't know.

reply
"AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct."

https://artificialanalysis.ai/evaluations/omniscience

reply
See, this, to me, seems obvious, but I’m sure it’s more challenging/complex than I can imagine (I am NOT an expert on AI in any way imaginable). But there has to be a solution. Just yesterday I was asking Gemini to tell me about a certain college professor, and it gave me a list of facts about them. And it was perfect. Then, out of curiosity, I followed up with “tell me more about him!” and it spit out several more bits of information about this person that were entirely hallucinated (e.g., gave them credit for writing papers they didn’t write, said they won awards that actually someone else won). I know this is all complex and certainly beyond my limited skill set, but goodness, we’ve got to get this figured out with so many people depending on and trusting these things nowadays. It’s quite scary.
reply
I bet most of these issues are essentially system prompt/harness issues.

If your example had "Validate any details before sharing them with the user, with multiple sources" as the system prompt, it was using a model that is strong at following system prompts precisely and had access to some basic tools, then it'd spend maybe minutes more, but the answer would have been way more accurate.

But no, Google wants "the new search results" (LLM hallucinations) to be on top, so we end up with "sounds plausible" answers instead of "collection of evidence from reliable/semi-reliable sources" or similar, which sucks. We could have quality, but it's too expensive/slow, so we get slop instead, just to maximize for speed and convenience.

reply
Errors multiply though, you might just get more plausible sounding errors than actual facts.

Like when agent 1 says X, agent 2 verifies it as Y and the original question ends up being some weird amalgamation of Z with additional ”this is really true” statements sprinkled on top.

I agree Google responses hurt more than help, but I’ve also gotten identical outcomes of 40min self-reasoning Opus threads (it’s less common obviously).

reply
> Like when agent 1 says X, agent 2 verifies it as Y and the original question ends up being some weird amalgamation of Z with additional ”this is really true” statements sprinkled on top.

Yeah, seems what grounds agents right now is quite literally human thoughts and text, so if you're doing something like that, you really need to pass the original user prompt through the entire way, for every "child" to keep in mind the final thing, otherwise it does seem to spiral out of control.

reply
Maybe some extra buckets could be added like depending on whether the answer ought to be known. Or, quality of the justification. “I don’t know and here’s a good reason why” is much better than “idk.” Correctly identifying that something is fundamentally unknown/unknowable is probably better than a simply-correct answer, even, right?
reply
It should be -1, -.1, 1 because I don't know is slightly negative.
reply
Interesting, I was about to say -1, 0.9, 1.0, because I don't know is almost as useful as the correct answer!
reply
And also because it creates "one neat trick" where it can answer "I don't know" for many/most things and still get credit.
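
One way to compare the scoring schemes proposed in this subthread is to ask how confident a model must be before answering beats abstaining in expectation. This is a sketch under a simple assumption (the model knows its own probability of being right, which real models may not); the function and scheme names are illustrative.

```python
def answer_threshold(r_wrong: float, r_idk: float, r_correct: float) -> float:
    """Confidence above which answering beats abstaining in expectation.

    Answering with confidence p is worth p * r_correct + (1 - p) * r_wrong;
    abstaining is worth r_idk. Setting the two equal and solving for p gives
    the break-even point returned here.
    """
    if r_correct <= r_wrong:
        raise ValueError("a correct answer must score higher than a wrong one")
    return (r_idk - r_wrong) / (r_correct - r_wrong)


# (wrong, "I don't know", correct) rewards for each scheme mentioned in the thread.
schemes = {
    "accuracy only (0, 0, 1)": (0.0, 0.0, 1.0),
    "symmetric (-1, 0, 1)": (-1.0, 0.0, 1.0),
    "slightly negative IDK (-1, -0.1, 1)": (-1.0, -0.1, 1.0),
    "generous IDK (-1, 0.9, 1)": (-1.0, 0.9, 1.0),
}
for name, (r_wrong, r_idk, r_correct) in schemes.items():
    threshold = answer_threshold(r_wrong, r_idk, r_correct)
    print(f"{name}: answer only when confidence > {threshold:.2f}")
```

The thresholds come out at 0.00, 0.50, 0.45 and 0.95 respectively: accuracy-only scoring rewards always guessing, while the 0.9 scheme makes "I don't know" the better bet unless the model is almost certain.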
reply
> In real life a wrong answer is much more damaging than a don't know.

I don't know. Is it?

reply
The main problem here is that hallucination suppression doesn’t generalise. We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations. With current architectures, hallucinations will likely persist on open-domain tasks forever.
reply
> We can penalise models for incorrect answers on a wide range of questions, but this doesn’t lead to the emergence of a coherent worldview, which, coupled with logical abilities, is the only true remedy against hallucinations

I don't think anyone is trying to add "a coherent worldview" by reducing hallucinations; I'm not sure how that could even realistically be the aim.

What people want, is for the models to stop giving confident answers that are clearly incorrect. Yes, it won't lead to "a coherent worldview", but it'll at least stop wasting people's time if the model said "You know what, what you said doesn't make sense / isn't clear, is what you mean .... ?" or even "I'm not sure" or "I don't know".

Currently, if you have the wrong starting point, ask the model to do something, they more often than not just go ahead and do that, misunderstandings or not. They seem optimized to never push back, unless you prompt for that, and most seem to favor "I'm just gonna assume X" rather than taking a step back and figuring out how to not assume. Again, unless you prompt against that behaviour/steering it into a different workflow.

reply
Model outputs don't have a confidence score.
reply
I don't think I claimed so either? Or maybe I misunderstand the point you're trying to make.
reply
Even if they did, it wouldn't be of much use because, correct or not, the output was the likely output 100% of the time.
reply
I think the trouble is in the outputs of the LLM and how it's interpreted by the tooling. The output is a distribution of probabilities of all possible next tokens. Even if the probability of every token is very low, the output gets normalized so that the sum of all probabilities is 1. So after that step, it's hard to see if the model was strongly preferring certain tokens or if you're just looking at amplified noise.
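
One reading of the normalization point, sketched with a plain softmax: the result depends only on the differences between the raw scores (logits), so two very different sets of absolute scores can produce the same probabilities. The logit values below are invented for illustration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


high_scores = [10.0, 9.0, 8.0]      # every candidate token scored highly
low_scores = [-90.0, -91.0, -92.0]  # every candidate token scored very low

p_high = softmax(high_scores)
p_low = softmax(low_scores)
print(p_high)
print(p_low)
# The two distributions match: the absolute level of the scores is gone
# after normalization, only the gaps between candidates survive.
```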

Training an extra "don't know" token means you have to build a moat between every other token. Between "yes" and "no", you don't have a muddled noisy area where both "yes" and "no" have relatively high probabilities, you need a new peak where "don't know" is higher. Then you just have new muddled areas between "yes" and "don't know", and "don't know" and "no". That requires even more finesse to train another answer in between.

Instead, you could check whether multiple options are about equally likely. But then you have to check if they are actually synonyms, like are the top two choices "Genève" and "Geneva", which is a good sign that the model knows the answer? Or are the top two "yes" and "no"?
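
A minimal sketch of that check, assuming you already have next-token probabilities and a hand-written synonym table (both hypothetical here; the probabilities are made up). Synonyms are merged first, then the gap between the top two remaining options is measured.

```python
def top_two_margin(token_probs: dict[str, float], canonical: dict[str, str]) -> float:
    """Gap between the two most likely answers after merging synonyms."""
    if not token_probs:
        raise ValueError("need at least one candidate token")
    merged: dict[str, float] = {}
    for token, prob in token_probs.items():
        key = canonical.get(token, token)  # map a synonym to its canonical form
        merged[key] = merged.get(key, 0.0) + prob
    ranked = sorted(merged.values(), reverse=True)
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up


synonyms = {"Genève": "Geneva"}  # hypothetical synonym table

city = top_two_margin({"Genève": 0.45, "Geneva": 0.44, "Zurich": 0.05}, synonyms)
yes_no = top_two_margin({"yes": 0.46, "no": 0.44}, synonyms)
print("city question margin:  ", round(city, 2))
print("yes/no question margin:", round(yes_no, 2))
```

The city question ends up with a large margin once the two spellings are merged, while the yes/no question stays close to a coin flip. Building a reliable synonym table is the hard part and is not addressed here.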

reply
It’s not as simple. I trained an LLM before on exactly this, to scratch the itch of this question.

The task was simple, using the MS-MARCO[0] dataset which contains queries, search results, answers, I made a training set that has:

1. Questions paired with real results supporting them (mixed with some irrelevant results), and a correct answer

2. Questions paired only with irrelevant results, with the answer “No answer present”
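
The two kinds of training sample above could be built roughly like this. It is a sketch only: the `Sample` class, the helper names and the toy passages are all hypothetical, and it does not reflect the actual MS-MARCO schema or the exact pipeline used.

```python
import random
from dataclasses import dataclass

NO_ANSWER = "No answer present"


@dataclass
class Sample:
    query: str
    passages: list[str]
    target: str


def make_answerable(query: str, supporting: list[str], distractors: list[str],
                    answer: str, rng: random.Random) -> Sample:
    """Type 1: real supporting results mixed with irrelevant ones, plus the answer."""
    passages = supporting + distractors
    rng.shuffle(passages)  # hide the position of the supporting passages
    return Sample(query, passages, answer)


def make_unanswerable(query: str, distractors: list[str]) -> Sample:
    """Type 2: only irrelevant results, so the target is the refusal string."""
    return Sample(query, list(distractors), NO_ANSWER)


rng = random.Random(0)
answerable = make_answerable(
    "capital of France?",
    supporting=["Paris is the capital of France."],
    distractors=["Bananas are rich in potassium."],
    answer="Paris",
    rng=rng,
)
unanswerable = make_unanswerable("capital of France?", ["Bananas are rich in potassium."])
print(answerable.target, "|", unanswerable.target)
```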

The dataset was huge (close to 1M samples), and I trained using different techniques, from SFT (just mimicking the dataset) to DPO (good answer contrasted with a bad answer for the same user query) to GRPO (verifier that checks my annotations whether an answer was present or not)

Lo and behold, this didn’t reduce hallucination, rather made it much worse. Now the model started claiming “No answer present” even when it is, or even when the question didn’t need search results in the first place (simple stuff like what is X+Y).

Now you could argue that my training was basic compared to what frontier labs could do. Yet I think it hints at a more profound limitation. LLMs are finicky and don’t have a neat understanding of things from first principles (list of search results, check relevance of result to user query, if answers are below a certain threshold of relevance then don’t consider them to answer …).

tl;dr: not as simple as one might think, perhaps not attainable at all.

0: https://huggingface.co/datasets/microsoft/ms_marco

reply
Thank you for sharing! Based on your experience, do you think a two-model system might fare better? For example, two models in serial where the second model is trained to "sniff out" potential hallucinations and fact check them (and possibly iterate with the first model)?
reply
If you could write that reward function you wouldn't need an LLM, you'd just query the reward function to answer any question. You can create a benchmark and check that automatically, but you can't solve this in the general case. The model can do well on the benchmark but still give overconfident answers in areas the benchmark doesn't cover.

You can definitely tune a model to say "I don't know" more often but it will cost you performance, the model will reject some questions that it could answer meaningfully. In the degenerate case the model could collapse predicting that sequence always or almost always.
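
The trade-off can be illustrated with a toy abstention threshold. This assumes each answer comes with some usable confidence estimate, which is itself a big assumption; the data points and names below are invented for the example.

```python
def abstention_tradeoff(items: list[tuple[float, bool]], threshold: float) -> dict[str, int]:
    """Count outcomes when the model only answers at or above a confidence threshold.

    Each item is (confidence, would_have_been_correct).
    """
    answered = [(conf, ok) for conf, ok in items if conf >= threshold]
    return {
        "answered": len(answered),
        "wrong": sum(1 for _, ok in answered if not ok),
        # Questions the model refused even though it would have been right.
        "rejected_but_correct": sum(1 for conf, ok in items if conf < threshold and ok),
    }


# Made-up evaluation set: (confidence, would the answer have been correct?)
toy = [(0.95, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False), (0.3, True)]

for threshold in (0.0, 0.65, 1.01):
    print(threshold, abstention_tradeoff(toy, threshold))
```

Raising the threshold removes wrong answers but also throws away correct ones, and at the extreme (a threshold above 1) the model answers nothing at all, which is the degenerate collapse described above.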

reply
I guess so. Just to be clear, I was talking about post-training methods for reasoning models here, not pre-training. I think "model as a judge" should actually do okay as a "sentiment analysis" style reward for expressing uncertainty. So if none of the thousands of reasoning traces you generate reach the validated answer, you run a judge to rate uncertainty and put those reasoning traces back into the training pool.

But I guess my logic breaks down here a bit, because if there is such a thing as a validated answer, then the correct answer is in fact never uncertainty. The correct answer is to continue post training until the model gets it right. So perhaps the real answer is to create RLVR tasks where the valid answer is "I don't know" and nothing else, like this benchmark does. Or maybe that doesn't work either, no matter how many you create.

I feel as though there is some kind of philosophical lesson to be had from how hard hallucinations are to get rid of. Maybe, similarly to humans, successful models are often "arrogant" in a sense. Perhaps you just never solve an Erdős problem without some degree of self deception that it's possible for you to do so. In this line of thinking, greatness in humans is actually not related to humility, but just being so good that you actually get things right when you try. Expressing humility is of course something great people tend to do, but I'm referring to what happens under the hood.

If you squint a bit, that's kinda the trend with models. The useful ones are not that much less likely to hallucinate, they are just good enough that they tend to get it right. This comparison is of course probably not even remotely correct, but at least it's fun to anthropomorphize a bit.

reply
If we had a theoretical technique to identify the true and objective reality we'd use it in the courts and laboratories. There is no such technique, but what we do have is 2 techniques that seem to work:

1) Has a certain standard of evidence been met?

2) Are the related arguments free of logical inconsistencies?

We can train the LLMs to do 2, and maybe even 1 to some extent (exactly what quality of evidence a computer can practically gather is limited). But that isn't going to get rid of hallucinations, for the same reason courts are hit-and-miss or the conclusions of studies often aren't very reliable. These techniques help, but sometimes they still get people to say things that, on close inspection, turn out to be nonsense. And those best-effort approaches are too much to expect for most questions an LLM will be handed which are informal, low stakes and don't need strong supporting evidence or logical rigour.

I think it is underestimated how many LLM-style hallucinations people themselves have. It just isn't obvious because most humans have a strategy of only repeating what the herd says after it has been socially vetted, which makes their individual eccentricities less obvious.

TLDR; I don't think it looks like an easy problem for RLVR, it looks technically unsolvable. Even making progress requires a philosophical breakthrough on the nature of truth so that the objective function can be established.

reply
Well, I'd argue that this depends on the field you're investigating. Sometimes you have a way to identify objective reality and sometimes you don't. In mathematics the majority of the field is verifiable in this way. Coding a bit less, as it's intersubjective and the ideal methodology is subject to taste.

But even in muddy fields of reality like medicine, there are objective facts to be found. When someone comes into an ER with chest pain, you often find a true, undeniable reason for why that is happening. If their lung has collapsed, a coronary artery is clogged or the aorta is dissecting, then even if you don't find that out at the time, it tends to be clear in retrospect. The area of reality that becomes muddy is when we use proxy signals to try to figure out who gets promoted to expensive/harmful examinations we can make final conclusions from, or the cases that don't fit cleanly into one bucket or the other. But very often, the gold standard truly is golden.

Of course, many realms of reality cannot be verified in this way. But I'd argue that there are quite a few that can.

reply
> In mathematics the majority of the field is verifiable in this way.

Does mathematics count as not a hallucination though? Particularly in pure mathematics they take a certain pride coming up with wild concepts as unrooted as possible in anything relevant to human existence. The name of the game is purely about maintaining internal logical consistency - which is something an AI can do while hallucinating.

AI hallucinations in maths might be logically consistent or not be. But in that particular case it starts to get a bit iffy what we call it when someone imagines something that doesn't exist. This gets back to the thing where we can train AIs to be logically consistent, but we can't force that consistency to be grounded in any particular universe. Ie, it'll hallucinate but in a very well rationalised way - coincidentally mimicking how a number of mathematicians seem to approach life.

This is the central issue; there is a very real trade-off between facts and verifiability. Mathematics is perfectly verifiable because it is fact free. We don't have a reliable general system to verify facts. We do have reliable systems for checking arguments (logic).

reply
Mmmm, not sure I agree with this, although this is a topic where we would have to do a lot of groundwork to formulate our positions precisely in order to ensure we're actually discussing the same thing. My counterargument is that verified mathematics does exist. A lot of mathematical models of physics predicted the existence of stuff that experiments later verified; the Higgs boson, antimatter and gravitational waves come to mind. Terence Tao did in fact make MRI machines go faster simply by finding better maths, and the tumors those machines see can be cut out and touched.

Yes, there are mathematical concepts that seem to exist purely in the realm of mathematics, but maths often touches reality in a consistent way that reflects experimental results. This seems to imply that there is more to mathematics than just internal consistency. And the parts that do not correspond to any observation right now might just reach out and touch reality in the future. It is possible to create logically consistent systems that have nothing to do with reality, but this is not the mathematics that most mathematicians are thinking about.

Observation is the final arbiter of fact. Maybe we don't have a general system to verify ALL facts, but many facts are 100% verifiable, although not most of them. "Beyond reasonable doubt" is of course the highest level of fact as far as the scientific method is concerned, but some facts are so far beyond reasonable doubt that you might as well just call them true. In the average living human body, there is a particular clump of tissues that consistently corresponds to a concept most experts would describe as a "heart", and it does in fact pump blood. True fact.

reply
But if an LLM says "I don't know" should you pay for the tokens?
reply
Why not? It did the work. Why should you expect it to be omniscient?

We can rank them based on how much they know and people will gravitate towards those that do know more.

It's a market after all.

reply
If it’s a market, wouldn’t the incentive be to lie about knowing and thus to keep the hallucinations?
reply
If you had an llm that could accurately predict when a claim is uncertain it would be very popular, I think. I would pay for that kind of reliability tbh
reply
This would break reality. There’s some underlying physical law that prevents the existence of any algorithm of truth.
reply
> There’s some underlying physical law that prevents the existence of any algorithm of truth

Haven't heard about that law, but it seems unlikely we can come up with ("discover") any sort of law that uses a concept ("truth") humans can't even agree on the meaning of, and that's not for lack of trying; we've been trying to figure it out for millennia already with no end in sight.

reply
If you accept certain axioms a priori, it’s fine. If you simply let the machine intelligence take it for granted that induction works because nature is uniform and give it some way to test its predictions, it would have all the building blocks it needs to reason out a lot of very useful information. Which as the parent comment points out, people would absolutely pay a lot of money for.
reply
Up to the point where consumers notice and decide to stop using these models because of it.

Might be why we're already rarely seeing models output an "I don't know".

reply
According to your logic the market will produce an LLM that consists only of 'PRINT "I don't know."'.
reply
"I don't know" has positive value, presumably you could prompt further to learn more about where it got stuck. It also increases the value of correct answers, by improving confidence that answers are actually correct.

"Confidently incorrect" has negative value. At best, a human realizes the answer is wrong. At worst, the incorrect information is not identified and can cause untold damage. By having the potential to be so severely wrong, it lessens the value of correct answers because there is a lower confidence value on their output.

reply
Depends on what your understanding of the product is.

If someone sold you a "Solved all your problems" machine, and it suddenly doesn't solve all your problems, then probably no, you shouldn't pay.

But the way I'm being sold LLMs is basically "A text generator that gives you plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input", so regardless of what the outcome is, I still made use of the "Input > Output" part, which is what I bought into, so I should still pay for that.

Now of course a bunch of people will say they've been sold the former, but the companies themselves seem to be selling the latter. That's my perspective as a person who doesn't follow "influencers" and whatnot though, who seem to be selling the public on the former rather than the latter.

reply
Let's pretend I am someone who has heard people talk about ChatGPT, but has no idea what it actually is. I go to the website and am not presented with any information, just a prompt. So I ask it what it is and what it can do for me.

My ask:

> In a couple sentences, explain to me the product I'm being sold with ChatGPT. What does it do for me?

The Reply from ChatGPT:

> ChatGPT is a conversational AI that helps you think, create, learn, analyze, and get things done faster. You can use it to answer questions, draft and edit writing, summarize information, brainstorm ideas, learn new topics, write code, plan projects, and increasingly act as an assistant that can search for information, work with documents, generate images, and help complete tasks.

> In simple terms: you're buying access to an AI that turns natural language into useful work—saving time, expanding your capabilities, and giving you an always-available collaborator for both everyday tasks and specialized knowledge work.

This sounds much more like the former, a "solve all your problems" machine.... not a plausible-sounding text generation machine.

Only two weeks ago Sam Altman said their new data center "could" be where cancer gets cured[0]. It is only the people who deeply understand AI who see it as a text generator of plausible-sounding text. That isn't what the marketing department, the CEO, or the product itself seem to be saying. I'm using OpenAI as the example here, but the others don't seem much different.

[0] https://www.youtube.com/watch?v=9-tOtbDDrJA

reply
In this hypothetical case of us being new users, you now know it's a conversational AI, so you continue asking:

> Can I trust the output you give me?

And I assume it explains what to trust VS not.

I think at the bottom you should also see something like "Any text can contain mistakes" or similar too, which I know is a far cry from what some people push in the press in regards to capabilities, but I still don't see the platforms themselves as lying about this, while I do see a bunch of people constantly over-hyping the possibilities.

reply
I don't think coming at it from the perspective of a new user is that hypothetical. All current users were new users in just the last 3 years. There are still a significant number of people who have heard of it, but haven't used it, or are still very new to it.

I'm not sure why "can I trust the output you give me?" would be a logical followup to the first response it gave me, seeing as its response didn't say anything about hallucinations or mistakes. It said it could do "useful work" with all kinds of examples, including "specialized knowledge work".

The note under the text field, in gray so as not to draw the user's attention, feels more like a CYA line from the lawyers, rather than an instruction they really want users to take to heart. That line also doesn't appear on the main home page. It only shows up after the first prompt is submitted and focus shifts to the conversation. I don't think a CYA line in gray fine print is enough to make users understand it's a plausible-sounding text generation machine instead of an answer machine. Even if I ask that point blank it gives a wordy... yes, but not really, it's being debated by philosophers... response.

reply
The marketing materials are very much the former though. From claude.com:

> If you can dream it, Claude can help you do it. Claude can process large amounts of information, brainstorm ideas, generate text and code, help you understand subjects, coach you through difficult situations, simplify your busywork so you can focus on what matters most, and so much more.

What marketing copy have you read for LLMs that is like you mentioned?

> But the way I'm being sold LLMs is basically "A text generator that gives you plausible-sounding human text that sometimes hallucinates and gets things wrong, based on your input"

reply
They are selling the former to investors, while selling the latter to us.
reply
I would be very willing to pay more! The choice between “you may get a correct answer, or you may get lied to, without a clear way to distinguish between the two” and “you may get a correct answer, or a clear indication that the answer was not found” is pretty clear. One is a much more useful tool than the other. I don’t see any real incentives for companies making LLMs to keep their AI factually unreliable. (Full disclosure: I work for one, but I’m definitely not in the rooms where such decisions would be made.)
reply
Would you rather pay for a nonsensical explanation?
reply
'I don't know' is the correct answer for infinitely more questions than those that can be answered.
reply
The problem is that the null answer will stop the "Markov" chain.

So, that's all.

reply
You don't have to literally send a null token. Train it to generate text that summarizes the evidence that is there but also conveys the uncertainty of the final answer to a prompt.
reply
Transformers are not Markovian; their whole point is arguably to be the reverse of Markovian, efficiently making each new token a function of all previous tokens.
reply