It's a gibberish-input detection benchmark; it does not measure output hallucinations.