"AA-Omniscience Index (higher is better) measures knowledge reliability and hallucination. It rewards correct answers, penalizes hallucinations, and has no penalty for refusing to answer. Scores range from -100 to 100, where 0 means as many correct as incorrect answers, and negative scores mean more incorrect than correct."

https://artificialanalysis.ai/evaluations/omniscience
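
For anyone who wants to see the arithmetic, here is a minimal sketch of a score with the properties quoted above. The formula (correct minus incorrect, divided by all questions, scaled by 100) is my assumption from that description, not something taken from the linked page, and the function name is made up.

```python
def omniscience_style_index(correct: int, incorrect: int, abstained: int) -> float:
    """Assumed scoring: +1 per correct answer, -1 per incorrect answer,
    0 per refusal, averaged over all questions and scaled to [-100, 100]."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("need at least one question")
    return 100.0 * (correct - incorrect) / total


# As many correct as incorrect answers gives 0, regardless of refusals.
print(omniscience_style_index(correct=40, incorrect=40, abstained=20))  # 0.0
# Refusing instead of guessing wrong raises the score.
print(omniscience_style_index(correct=40, incorrect=10, abstained=50))  # 30.0
# More incorrect than correct goes negative.
print(omniscience_style_index(correct=20, incorrect=60, abstained=20))  # -40.0
```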

reply
See, this, to me, seems obvious, but I’m sure it’s more challenging/complex than I can imagine (I am NOT an expert on AI in any way imaginable). But there has to be a solution. Just yesterday I was asking Gemini to tell me about a certain college professor, and it gave me a list of facts about them. And it was perfect. Then, out of curiosity, I followed up with “tell me more about him!” and it spit out several more bits of information about this person that were entirely hallucinated (e.g., gave them credit for writing papers they didn’t write, said they won awards that actually someone else won). I know this is all complex and certainly beyond my limited skill set, but goodness, we’ve got to get this figured out with so many people depending on and trusting these things nowadays. It’s quite scary.
reply
I bet most of these issues are essentially system prompt/harness issues.

If your example had "Validate any details before sharing them with the user, with multiple sources" as the system prompt, used a model that is strong at following system prompts precisely, and had access to some basic tools, then it would spend maybe a few minutes more, but the answer would be far more accurate.

But no, Google wants "the new search results" (LLM hallucinations) to be on top, so we end up with "sounds plausible" answers instead of a "collection of evidence from reliable/semi-reliable sources" or similar, which sucks. We could have quality, but it's too expensive/slow, so we get slop instead, just to maximize speed and convenience.

reply
Errors multiply, though; you might just get more plausible-sounding errors than actual facts.

Like when agent 1 says X, agent 2 verifies it as Y, and the original question ends up being some weird amalgamation of Z with additional "this is really true" statements sprinkled on top.

I agree Google's responses hurt more than they help, but I've also gotten the same kind of outcome from 40-minute self-reasoning Opus threads (it's less common, obviously).
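
To put a rough number on "errors multiply", here is a toy calculation. It assumes every step in a chain is right with the same probability and that the steps fail independently. Neither assumption holds for real agent chains, so treat it as an illustration only. The 0.9 figure and the function name are made up.

```python
def chain_accuracy(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct, assuming
    independent steps with identical accuracy (a simplification)."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return per_step_accuracy ** steps


# With 90% per-step accuracy, longer chains get unreliable quickly.
for steps in (1, 2, 5, 10):
    print(steps, round(chain_accuracy(0.9, steps), 3))
```

Under those assumptions, a ten-step chain is fully correct only about a third of the time.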

reply
> Like when agent 1 says X, agent 2 verifies it as Y, and the original question ends up being some weird amalgamation of Z with additional "this is really true" statements sprinkled on top.

Yeah, it seems that what grounds agents right now is quite literally human thoughts and text, so if you're doing something like that, you really need to pass the original user prompt all the way through, so that every "child" keeps the final goal in mind; otherwise it does seem to spiral out of control.
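
A rough sketch of what I mean by passing the original prompt through. Everything here (`SubTask`, `build_child_prompt`, the prompt wording) is hypothetical and not tied to any real agent framework. The only point is that each child sees the user's untouched question next to its own step, and that earlier output is labeled as unverified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    original_prompt: str  # the user's question, never rewritten by any agent
    instruction: str      # what this particular child agent should do


def build_child_prompt(task: SubTask, upstream_claims: list[str]) -> str:
    """Assemble the text a child agent would receive (hypothetical format)."""
    claims = "\n".join(f"- {claim}" for claim in upstream_claims) or "- (none yet)"
    return (
        f"Original user question (this is what must be answered):\n"
        f"{task.original_prompt}\n\n"
        f"Your step: {task.instruction}\n\n"
        f"Unverified claims from earlier steps:\n{claims}\n"
    )


task = SubTask(
    original_prompt="Tell me about Professor X's published papers.",
    instruction="Check each claim against at least two sources.",
)
print(build_child_prompt(task, ["Wrote paper A", "Won award B"]))
```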

reply
Maybe some extra buckets could be added, e.g. depending on whether the answer ought to be known, or on the quality of the justification. “I don’t know, and here’s a good reason why” is much better than “idk.” Correctly identifying that something is fundamentally unknown/unknowable is probably even better than a simply correct answer, right?
reply
It should be -1, -0.1, 1 (wrong, "I don't know", correct), because "I don't know" is slightly negative.
reply
Interesting, I was about to say -1, 0.9, 1.0, because "I don't know" is almost as useful as the correct answer!
reply
And also because it creates "one neat trick" where a model can answer "I don't know" for many/most things and still get credit.
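
Here is a quick way to compare the weightings floated in this thread. The two example models and their numbers are invented, and I'm assuming the weights are ordered (wrong, "I don't know", correct) and averaged per question.

```python
def mean_score(correct: int, wrong: int, idk: int,
               weights: tuple[float, float, float]) -> float:
    """Average per-question score; weights are (wrong, idk, correct)."""
    w_wrong, w_idk, w_correct = weights
    total = correct + wrong + idk
    if total == 0:
        raise ValueError("need at least one question")
    return (w_wrong * wrong + w_idk * idk + w_correct * correct) / total


SCHEMES = {
    "(-1, 0, 1)": (-1.0, 0.0, 1.0),
    "(-1, -0.1, 1)": (-1.0, -0.1, 1.0),
    "(-1, 0.9, 1)": (-1.0, 0.9, 1.0),
}

# Hypothetical models as (correct, wrong, idk) counts out of 100 questions.
MODELS = {
    "always says 'I don't know'": (0, 0, 100),
    "answers everything, 70% right": (70, 30, 0),
}

for scheme_name, weights in SCHEMES.items():
    for model_name, (correct, wrong, idk) in MODELS.items():
        value = mean_score(correct, wrong, idk, weights)
        print(f"{scheme_name:>14} | {model_name:<30} | {value:+.2f}")
```

With the (-1, 0.9, 1) weights, the model that never answers beats the one that is right 70% of the time, which is the "one neat trick" problem. With (-1, -0.1, 1), it ends up slightly below zero instead.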
reply
> In real life a wrong answer is much more damaging than a don't know.

I don't know. Is it?
