ARC-AGI seems to test that as well. Every game is a rectangular grid to make it as easy as possible, yet the AIs still fail.
I'm fairly certain the way forward isn't agents directly interfacing with UIs, but agents using scripts and other tools to interact with the interface. That's why harnesses are so critical to performance on tasks like this.
I would like a version of Arc AGI that tests the agent's ability to dynamically create these harnesses.
Meanwhile AI agents are expected to guess pixels and fail each time.
It's not a crazy idea. Have the older model interview the newer one, then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of runs in which both sides agree the newer model won is the score.
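That scoring loop is simple enough to sketch. Everything below is hypothetical: `interview` and `judge_says_new_won` are stubs standing in for real LLM API calls, so only the tournament/agreement logic is real.

```python
import random

def interview(old_model, new_model, seed):
    """Stub for one seeded interview between the two models.
    In practice this would call both models via an API and return the
    transcript; here we just sample a fake 'quality gap' per seed."""
    rng = random.Random(seed)
    return {"seed": seed, "quality_gap": rng.uniform(-0.2, 0.8)}

def judge_says_new_won(judge, transcript):
    """Stub judge: votes that the newer model won whenever the sampled
    quality gap is positive. A real judge would read the transcript."""
    return transcript["quality_gap"] > 0

def score_new_model(old_model, new_model, judges=("old", "new"), rounds=100):
    """Fraction of rounds in which *every* judge agrees the new model won."""
    unanimous_wins = 0
    for seed in range(rounds):
        transcript = interview(old_model, new_model, seed)
        if all(judge_says_new_won(j, transcript) for j in judges):
            unanimous_wins += 1
    return unanimous_wins / rounds
```

Because each round is keyed by its seed, the score is reproducible across runs, e.g. `score_new_model("old-model", "new-model")` returns the same value in [0, 1] every time; a third referee is just one more entry in `judges`.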
Hehe