Theoretically, you can’t benchmaxx ARC-AGI, but I too am suspicious of such a large improvement, especially since the improvements on other benchmarks are not of the same order.

by moffkalast 2 hours ago

https://arcprize.org/arc-agi/1/

It's a sort of arbitrary pattern-matching thing that can't be trained on in the sense that MMLU can be, but you can definitely generate billions of examples of this kind of task and train on them, and it will not make the model better at any other task. So in that sense, it absolutely can be benchmaxxed.

I think it's been harder to solve because it's a visual puzzle, and we know how well today's vision encoders actually work https://arxiv.org/html/2407.06581v1

by boplicity 7 hours ago

Couldn't benchmark maxing be interpreted as using benchmarks as a design framework? I'm sure there are pitfalls to this, but it's not necessarily bad either.

by energy123 5 hours ago

Francois Chollet accuses the big labs of targeting the benchmark, yes. It is benchmaxxed.

by tasuki 5 hours ago

Didn't the same Francois Chollet claim that this was the Real Test of Intelligence? If they target it, perhaps they target... real intelligence?

by ainch 3 hours ago

He's always said ARC is a necessary but not sufficient condition for testing intelligence afaik

by CamperBob2 5 hours ago

I don't know what he could mean by that, as the whole idea behind ARC-AGI is to "target the benchmark." Got any links that explain further?

by layer8 5 hours ago

The fact that ARC-AGI has public and semi-private in addition to private datasets might explain it: https://arcprize.org/arc-agi/2/#dataset-structure

by blinding-streak 7 hours ago

I assume all the frontier models are benchmaxxing, so it would make sense