Mentioned directly under the table:

> Note GPT 5.5 Pro is at the top of the leaderboard only because it blew through $100 budget after only completing four cases, so 2/4 is 50%. And, a couple of other results, both Qwen models, are skewed upward in the detect % ranking because of failure to complete all cases.

by SwellJoe 5 hours ago


Yeah, I'm not super happy with the chart sorting order, but trying to balance all the information is challenging. I chose not to include partials (right place but inaccurate bug description: the model smelled something funny but didn't quite understand it) in the sort order, but maybe I should.
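To make the sorting trade-off concrete, here's a rough sketch, not the benchmark's actual code. The 2-of-4 figure is the one quoted above; `model_b`, the total of 20 cases, and the half-credit weight for partials are all made-up numbers for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    detected: int   # cases with a correct bug description
    partial: int    # right place, inaccurate bug description
    completed: int  # cases finished before the budget ran out


def rate_over_completed(r: Result) -> float:
    """Detect % over only the cases the model got to (rewards early exits)."""
    return r.detected / r.completed if r.completed else 0.0


def rate_over_total(r: Result, total_cases: int, partial_weight: float = 0.0) -> float:
    """Detect % over every case, optionally giving partials some credit."""
    return (r.detected + partial_weight * r.partial) / total_cases


TOTAL_CASES = 20  # hypothetical; the real case count isn't stated here

results = [
    Result("GPT 5.5 Pro", detected=2, partial=0, completed=4),  # 2/4 from the note
    Result("model_b", detected=9, partial=4, completed=20),     # hypothetical
]

# Sorting by completed-only rate puts the 2/4 run on top.
by_completed = sorted(results, key=rate_over_completed, reverse=True)

# Sorting over all cases, with half credit for partials, flips the order.
by_total = sorted(
    results,
    key=lambda r: rate_over_total(r, TOTAL_CASES, partial_weight=0.5),
    reverse=True,
)

print([r.model for r in by_completed])
print([r.model for r in by_total])
```

Neither key is obviously right: dividing by all cases punishes a model for hitting the budget cap rather than for missing bugs, which is its own kind of distortion.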

And, it does feel wrong that the unrealistically expensive model that no one in their right mind would use for anything but the most critical tasks (and even then, a committee of ten of the best alternatives would cost half as much) is at the top. But, GPT 5.5 Pro did find a bug nobody else found among the four cases it got to, hinting at some real difference. It may be closer to Mythos than others, but at an absurd price. It'd cost tens of thousands of dollars to audit all the files in a large codebase, versus maybe fifty bucks for MiMo or DeepSeek.