Wasn’t there a discussion recently about some new-ish benchmark that _punishes_ hallucinated answers more than not replying at all? Maybe in the not-so-distant future, this “spam replies until one’s correct” strategy won’t be able to game a benchmark much anymore.
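
To make the idea concrete, here is a rough sketch of what such a scoring rule could look like. The point values are my own assumption (+1 for a correct answer, 0 for abstaining, a negative penalty for a wrong answer), and the names `score`, `expected_guess_score`, and `break_even_confidence` are hypothetical; I'm not claiming this matches any particular benchmark.

```python
# Hypothetical scoring rule: the point values below are assumptions for
# illustration, not taken from any specific benchmark.

def score(outcomes, wrong_penalty=1.0):
    """Total score for a list of outcomes: 'correct', 'wrong', or 'abstain'."""
    points = {"correct": 1.0, "wrong": -wrong_penalty, "abstain": 0.0}
    return sum(points[o] for o in outcomes)


def expected_guess_score(p_correct, wrong_penalty=1.0):
    """Expected score of answering when the answer is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def break_even_confidence(wrong_penalty=1.0):
    """Confidence above which answering beats abstaining (which always scores 0)."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    # A guess that is right only a quarter of the time loses points on average.
    print(expected_guess_score(0.25))
    # With a penalty of 1, answering only pays off above 50% confidence.
    print(break_even_confidence(1.0))
```

Under these assumed numbers, a model that answers every question at low confidence ends up below one that simply abstains, which is the effect the comment above is describing.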