upvote
I just realized this also means the benchmark is in practice unverified by third parties: no one has confirmed that every task is solvable through the JSON interface. Essentially, there is no guarantee it is even possible to work out how to complete every task optimally through the JSON interface alone.

I assume you did not develop the puzzles by visualizing JSON yourselves, so there might be non-obvious information that is lost in translation to JSON. Until humans optimally solve all the puzzles without ever having seen the visual version, there is no guarantee this is even possible.

I think the only viable solution here is to release a version of the benchmark with a vision-only harness. Otherwise it is impossible to interpret what LLM progress on this benchmark actually means.

reply
Oookay. I actually tried the harness myself, and there is a visual option. It's unclear to me whether that is what the models use on the official benchmark, but it probably is, which means much of my critique is probably invalid. However, while fiddling with the harness, building a live viewer to watch what was happening, and playing through the agent API myself, I may have found 3-4 bugs in the default harness/API. Dunno where to post them, so of all places I am documenting the process on HN.
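The live viewer is nothing fancy, by the way. A stripped-down sketch of the polling loop (`get_frame` and `render` are stand-ins for the real API call and drawing code, which I'm not reproducing here):

```python
import time

def live_view(get_frame, render, interval=0.5, max_polls=None):
    """Poll the harness for the current frame and re-render on change.

    `get_frame` and `render` are placeholders for whatever the real
    API exposes; this is just the loop skeleton.
    """
    last = None
    polls = 0
    while max_polls is None or polls < max_polls:
        frame = get_frame()
        if frame != last:       # only redraw when something changed
            render(frame)
            last = frame
        polls += 1
        time.sleep(interval)
```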

Bug 1: The visual-mode "diff" image is always black, even when the model clicked an interactive element and something changed. Codex fixed it in one shot; the problem was in the main session loop in agent.py (line 458).
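For context, this is a classic failure mode: if the "previous" frame has already been overwritten with the new one by the time the diff runs, you diff a frame against itself and get all black. A stdlib-only sketch of what a working diff looks like (the pixel format is my own assumption, not the harness's):

```python
def frame_diff(prev, curr):
    """Return an image that is black everywhere except pixels that changed.

    Frames are rows of (r, g, b) tuples. If `prev` and `curr` are the
    same frame (the bug class here), the result is always all black.
    """
    return [
        [px if px != prev[y][x] else (0, 0, 0)
         for x, px in enumerate(row)]
        for y, row in enumerate(curr)
    ]
```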

Bug 2: Claude and ChatGPT can't see the 128x128 pixel images clearly, and cannot accurately place clicks on them either. Scaling up the images to 1024x1024 pixels gave the best results; Claude dropped off hard at 2048 for some reason. Here are the full test results when models were asked to hit specific (manually labeled) elements on "vc 33" level 1 (upper blue square, lower blue square, upper yellow rectangle, lower yellow rectangle):

Model                  | 128   | 256   | 512   | 1024  | 2048
claude-opus-4-6        | 1/10  | 1/10  | 9/10  | 10/10 | 0/10
gemini-3-1-pro-preview | 10/10 | 10/10 | 10/10 | 10/10 | 10/10
gpt-5.4-medium         | 4/10  | 8/10  | 9/10  | 10/10 | 8/10
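If you want to reproduce the scaling, use nearest-neighbor: anything with smoothing (bilinear etc.) blurs the hard grid edges the models need. A stdlib-only sketch of the idea; in practice you'd just use PIL's `Image.resize(..., Image.NEAREST)`:

```python
def upscale_nearest(img, factor):
    """Nearest-neighbor upscale: each source pixel becomes a
    factor x factor block, so the hard edges of the game grid
    stay crisp instead of being smoothed away."""
    out = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out
```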

Bug 3: "vc 33" level 4 is impossible to complete via the API. At least it was when I made a web-viewer to navigate the games from the API side. The "canal lock" required two clicks instead of one to transfer the "boat" when water level were equilibriated, and after that any action whatsoever would spontaneously pop the boat back to the first column, so you could never progress.

"Bug" 4: This is more of a complaint on the models behalf. A major issue is that the models never get to know where they clicked. This is truly a bit unfair since humans get a live update of the position of their cursor at no extra cost (even a preview of the square their cursor highlights in the human version), but models if models fuck up on the coordinates they often think they hit their intended targets even though they whiffed the coordinates. So if that happens they note down "I hit the blue square but I guess nothing happened", and for the rest of the run they are fucked because they conclude the element is not interactive even though they got it right on the first try. The combination of an intermediary harness layer that let the models "preview" their cursor position before the "confirmed" their action and the 1024x1024 resolution caused a major improvement in their intended action "I want to click the blue square" actually resulting in that action. However, even then unintended miss-clicks often spell the end of a run (Claude 4.6 made it the furthest, which means level 2 of the "vc 33" stages, and got stuck when it missed a button and spent too much time hitting other things)

After I tried to fix all of the above and set up an optimal environment so the models got a fair shake, they still mostly did very badly even when they identified the right interactive elements... except for Claude 4.6 Opus! Claude had at least one run where it made it to level 4 on "vc 33", but then got stuck because the blue squares it had to hit became too small, and it just couldn't place the cursor in the right spot even with the cursor-preview functionality (the guiding pixel probably became too small for it to see clearly). Reading through its reasoning for the earlier stages, though, it never fully understood the underlying logic of the game, although it was almost there.

reply