upvote
> However, if it can't figure out to render the json to a visual on its own does it really qualify as AGI? I'd still say the benchmark is doing its job here.

Can you render serialized JSON text blob to a visual with your brain only? The model can't do anything better than this - no harness means no tool at all, no way to e.g. implement a visualizer in whatever programming language and run it.

Why don't human testers receive the same JSON text blob and no visualizers? It's like giving human testers a harness (a playable visualizer), but deliberately cripples it for the model.

reply
Huh. I thought it wasn't supposed to receive any instructions tailored to the task but I didn't understand it to be restricted from accessing truly general tools such as programming languages. To do otherwise is to require pointless hoop jumping as frontier models inevitably get retrained to play games using a json (or other arbitrary) representation at which point it will be natural for them and the real test will begin.
reply
This is my understanding as well, I thought tools where allowed.
reply