undefined

points

[-]

This isn't quite the point. When comparing two different models' hallucination rates, the denominator is different. The evaluation works more or less like this: for each question, the model has the option to answer or abstain, so there are three possible outcomes: the model answers and gets it right, the model answers and gets it wrong (hallucination), or the model abstains. The hallucination rate is (model answers wrong) / (model answers wrong or abstains). So if a model A has 50 correct answers, 20 incorrect answers, and 30 abstentions, its hallucination rate is 40%, while a model with 20 correct answers, 20 incorrect answers, and 60 abstentions has a hallucination rate of 25%, even though it hallucinated exactly the same number of times. This is why hallucination rate is incomplete as a metric: it says nothing about the accuracy rate.

by grayhatter1 hours ago|

parent|

[-]

The way you define the evaluation criteria seems very problematic[^1].

I don't understand the point of describing it as 3 possible outcomes. I objected to it because the only reason I would do something like that would be to obscure the severity of the model defects. I'm sure I'm missing something, but the reason I suspect that's how it's done, is to [intentionally] obscure the actual meaningful metrics.

I would expect any engineer to evaluate any model using accuracy, (error rate), and usefulness (definitive answer rate), as strictly independent metrics. Did it answer, and if it answered, did emit incorrect or misleading information and how many quantifiable bits of each.

The false negative rate (model confirmed to contain the requested output/information via other method but was unable to for the given test) is significant, but given a non-definitive answer is significantly different from a definitive and incorrect answer. Why would you want to group hallucinations?

Number/rate of useful answers (correct and incorrect) and error rate (given any answer how often will that answer be defective in some way).

To be clear, I'm differentiating hallucination rate from eagerness to answer, even though they're obviously linked because I believe presenting 20 correct answers, 20 incorrect answers, and 60 abstentions as a hallucination rate of 25% as obviously malicious. If I give you 40 answers, 20 correct and 20 incorrect. the error rate is 50% and if it refused an additional 60 times, it's usefulness rate would be 40%... arguably 20% depending on how strict you choose to be about the definition of useful. The matrix we should be using is a 2x2 true positive, false positive, true negative, false negative. But being that honest that might make the model look bad!

[^1]: just in case it's unclear, I'm using you exclusively rhetorically. I don't think you personally are being misleading, only that you're explaining how it's done... but that's the problem isn't it.

by jpalomaki9 hours ago|

prev|

[-]

As human I also give wrong answers if if I know the right one. Sometimes I also give answers even when I don’t really know them.

When pushed, I then start thinking and realise my mistake. System 1 vs 2?

by big_paps3 hours ago|

parent|

[-]

I realized that people from india often show this kind of behaviour in my experience . They superconfidently give you a wrong answer and walk away or even help by making things worse and then dissappear shrugging .. Are you from india ?

by grayhatter5 hours ago|

parent|

prev|

[-]

That's weird, why do you do that?

When someone asks a question, if I don't know the answer; I say I don't know.

System 1 vs 2 doesn't really matter... I won't use an LLM that's willing to make up random shit. Equally I also won't work with a human who does that. Trust and confidence a system will function correctly is an important quality, in both humans and genai

by sgc16 hours ago|

prev|

[-]

Since models just output the the most probable tokens and you can never accuse them of doing anything other than making it all up, I would like to see these tests run with a prompt that attempts to mitigate hallucination and finishes with something like: "Telling me that you don't have the relevant information or that the task is impossible is extremely useful to me and a valid answer", and see how much that changes the scoring - as well as the usefulness of the answers. There are so many skills like context7 that can be tweaked to improve these results as well.

In other words, you shouldn't choose the model that hallucinates the least without detailed prompting, since a well-crafted agents.md clause should go a long way to improving output, and almost certainly the top scoring order will be different. To the point that I don't find this type of raw comparison useful beyond maybe 'make sure you test that one with more explicit prompts'.

by grayhatter15 hours ago|

parent|

[-]

> In other words, you shouldn't choose the model that hallucinates the least without detailed prompting

You're prompting it wrong is quickly becoming the new, you're holding it wrong.

It's wild how willing software engineers are to blame the user when the actual problem is their own defective design.

Ideally we all, as an industry, will stop accepting this as reasonable excuse for the demonstrated incompetence

by ordersofmag7 hours ago|

parent|

[-]

It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant--no matter how much we want it to be and no matter how much the fluency of the output tricks us into thinking there's a human-like mind behind it.

Now granted, if the boat salesmen were pushing hard on the idea that the boat would fly and even put little wings on the side and I bought the boat I might get really angry when I found out that it didn't fly. And I might angrily storm into the salesroom yelling about how the design is defective. But if someone pointed out 'hey, it's a boat perhaps you should stick to sailing around in it and stop getting your undies in a bundle about it not flying' the correct response is probably to take a closer look, ignore the salesmen, and cruise around the lake. LLM's are quite handy at some things and have some weird limits. Learn the limits, enjoy your time at sea.

by grayhatter5 hours ago|

parent|

[-]

> It's not that you're prompting it wrong. It's that you're judging the output against a standard (human intelligence) that just isn't relevant

It's not that you're holding it wrong, you're just wrong for expecting it to work the way it's described (able to one shot most problems these days).

by epihelix11 hours ago|

parent|

prev|

[-]

[dead]

by luuundonjk12 hours ago|

prev|

[-]

there is a difference between a human knowingly bullshitting and being confident because he misremembers something

by master-lincoln10 hours ago|

parent|

[-]

there is a difference in their intent, but not necessarily in the effect.