Mythos is the 100% against which the other models are compared.
Although the benchmark had a $100 budget cap and rudimentary tooling, so it's probably a bit less than 100%.
GPT-5.5-pro attempted only 4 problems out of 9 before the budget ran out and got 2 of them right.
It's a shame that the author didn't try GPT-5.5-pro on all 9 just for completeness, perhaps on a subscription to save money.
I've been doing more benchmarks with additional tools, with no silver bullet revealing itself thus far.
If anyone wants to fund the other five cases (~$125), I'll run them. I find that an unrealistic cost, though...simply not useful data. I'm certainly not going to spend $23 per file to audit a project with hundreds or thousands of files. I don't know anyone who would.
Also note that it was a $100 cap per model, and the next most expensive model was GPT 5.5 at a 20th of the price per case, about ten bucks for the whole batch.
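For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope sketch in Python. It assumes the roughly $23 per case quoted above for GPT-5.5-pro, 9 cases in total, and that "a 20th" is measured against GPT-5.5-pro's per-case cost. The variable names are made up for illustration and are not from the benchmark.

```python
# Back-of-the-envelope check of the costs quoted in this thread.
# All figures are approximations taken from the comments; names are illustrative.
TOTAL_CASES = 9
BUDGET_CAP = 100.0         # USD cap per model
PRO_COST_PER_CASE = 23.0   # USD, approximate per-case cost for GPT-5.5-pro

# Whole cases that fit under the cap, and what is left over.
pro_cases_within_cap = int(BUDGET_CAP // PRO_COST_PER_CASE)
remaining_cases = TOTAL_CASES - pro_cases_within_cap
remaining_cost = remaining_cases * PRO_COST_PER_CASE

# Assumption: GPT 5.5 costs a 20th of GPT-5.5-pro's per-case price.
gpt55_cost_per_case = PRO_COST_PER_CASE / 20
gpt55_batch_cost = gpt55_cost_per_case * TOTAL_CASES

print(f"GPT-5.5-pro cases within cap: {pro_cases_within_cap}")
print(f"Remaining cases: {remaining_cases}, est. cost ${remaining_cost:.2f}")
print(f"GPT 5.5 whole batch: ${gpt55_batch_cost:.2f}")
```

With these inputs, four cases fit under the cap, the remaining five come out a little under the ~$125 estimate, and the GPT 5.5 batch lands at about ten dollars, so the quoted figures are at least mutually consistent.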
I think on a subscription, tokens might be 100 times cheaper.
The quota is also generous in my opinion. I can vibecode a lot most days of the week and not run out.