But Simon says he runs these through the API without tool access, specifically to prevent that sort of "cheating". I.e., it's an LLM benchmark, not an LLM+harness benchmark.
Not really a criticism, but it's an interesting point: you would never expect a human to make that mistake, even in a bad drawing.