They would score much worse on the private set than the public set. And they haven't done this for any of the other ARC-AGI benchmarks, so why would they do it for this one?
Wrong question. I suggest:

1) Do models generalize?

2) If they do, and they generalize from this, is that a win?

Chollet was one of the first “they do not generalize” evangelists. I’d be curious to hear what he thinks now, because a) most disagree with him, and b) this test seems designed to push models toward generalizing better at visual, long-context problem solving and agency, which is exactly where the bleeding edge of agentic systems is right now.

Yeah, so you are agreeing that the benchmarks are useless because they don't answer those questions.
Can AI models generalize+ to any long-context problem solving and agency task, regardless of modality? I think the answer is no, and this is why they are not yet AGI.

+ generalize being the key word.
