> Note: GPT 5.5 Pro is at the top of the leaderboard only because it blew through its $100 budget after completing just four cases, so its score is 2/4, or 50%. A couple of other results, both Qwen models, are also skewed upward in the detect % ranking because they failed to complete all cases.
It does feel wrong that an unrealistically expensive model, one that no one in their right mind would use for anything but the most critical tasks (and even then, a committee of ten of the best alternatives would cost half as much), is at the top. But GPT 5.5 Pro did find a bug nobody else found among the four cases it got to, which hints at some real difference. It may be closer to Mythos than the others, but at an absurd price. It'd cost tens of thousands of dollars to audit all the files in a large codebase, versus maybe fifty bucks for MiMo or DeepSeek.
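
To make the skew concrete, here is a rough sketch of how differently a 2-out-of-4 result reads depending on the denominator. The function names are my own, and `total_cases=20` is a made-up placeholder, not the actual size of the benchmark. The Wilson interval is a standard way to show how little four cases pin down; it is not something the leaderboard itself computes.

```python
import math

def rate_over_completed(found, completed):
    """Detect % as ranked here: bugs found / cases the model finished."""
    return found / completed

def rate_over_all(found, total_cases):
    """Stricter view: unfinished cases count as misses."""
    return found / total_cases

def wilson_interval(found, completed, z=1.96):
    """Approximate 95% Wilson score interval for the detection rate."""
    p = found / completed
    denom = 1 + z**2 / completed
    center = (p + z**2 / (2 * completed)) / denom
    half = z * math.sqrt(
        p * (1 - p) / completed + z**2 / (4 * completed**2)
    ) / denom
    return center - half, center + half

if __name__ == "__main__":
    found, completed = 2, 4      # GPT 5.5 Pro's numbers from the note above
    total_cases = 20             # HYPOTHETICAL: real case count not stated here

    print(f"over completed cases: {rate_over_completed(found, completed):.0%}")
    print(f"over all cases:       {rate_over_all(found, total_cases):.0%}")
    lo, hi = wilson_interval(found, completed)
    print(f"95% interval on 2/4:  {lo:.0%} to {hi:.0%}")
```

With only four completed cases the interval spans roughly 15% to 85%, so the 50% at the top of the table says very little about where the model would land on a full run.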