Once the AIs solve this, there will be another ARC-AGI. And so on until we can't find any more problems that can be solved by humans and not AI. And that's when we'll know we have AGI.
Kinda crazy that Yudkowsky and all those rationalists and enthusiasts spent over a decade obsessing over this stuff, and we've had almost 80 years of elite academics pondering it, and none of them could come up with a meaningful, operational theory of intelligence. The best we can do is "closer to AGI" as a measurement, and even then, it's not 100% certain, because a model might have some cheap tricks implicit in the architecture that don't actually map to a meaningful difference in capabilities.
Gotta love the field of AI.
It doesn't prove anything of the sort. ARC-AGI has always been nothing special in that regard, but this one really takes the cake. A 'human baseline' that isn't really a baseline, and a scoring scheme so convoluted a model could beat every game in reasonable time and still score well below 100. Really, what are we doing here?
That Francois had to do all this nonsense should tell you the state of where we are right now.
It's a "let's find a task humans are decent at, but modern AIs are still very bad at" kind of adversarial benchmark.
The exact coverage of this one is: spatial reasoning across multiple turns, agentic explore/exploit with rule inference and preplanning. Directly targeted against the current generation of LLMs.
Anyway, from the article:
> As long as there is a gap between AI and human learning, we do not have AGI.
This seems like a reasonable requirement. Something I think about a lot with vibe coding is that unlike humans, individual models do not get better within a codebase over time, they get worse.
But vibe coding also tends to produce somewhat poor architecture, with lots of redundant and intermingled bits that should be refactored. I think the model does worse the worse the code it has to work with, which I presume is only in part because it's fundamentally harder to work with bad code, and also in part because its context is filled with bad code.
By updating the tests specifically in areas AI has trouble with, it creates a progressive feedback loop against which AI development can be moved forward. There's no known threshold or well-defined capability or particular skill that anyone can point to and say "that! That's AGI!". The best we can do right now is a direction. Solving an ARC-AGI test moves the capabilities of that AI some increment closer to the AGI threshold. There's no good indication as to whether solving a particular test means it's 15% closer to AGI or 0.000015%.
It's more or less a best-effort empiricist approach, since we lack a theory of intelligence that provides useful direction (as opposed to a formalization like AIXI, which is way too broad to be useful in the context of developing AGI).
It used to be easy to build these tests. I suspect it’s getting harder and harder.
But if we run out of ideas for tests that are easy for humans but impossible for models, it doesn’t mean none exist. Perhaps that’s when we turn to models to design candidate tests, and have humans be the subjects to try them out ad nauseam until no more are ever uncovered? That sounds like a lovely future…
That's not intelligence though, even if it may appear to be. Does it matter? That's another question. But it certainly is not a representation of intelligence.
I think one major disconnect is that for most people, AGI is when interacting with an AI is basically in every way like interacting with a human, including in failure modes. And likely, this human would be the smartest, most knowledgeable human you can imagine, like the top expert in all domains, with the utmost charisma and humor, etc.
This is why the "goal post" appears to be always moving: the non-commoners who are involved with making AGI and whatnot never want to accept that definition, which to be fair seems too subjective. Instead they like to approach AGI as something different: it can solve some problems humans can't, when it doesn't fail it behaves like an expert human, etc.
Even if an AI could do any intellectual task about as well as a highly competent human could, I believe most people would not consider it AGI if it lacks the inherent opinions, personality, character, inquiries, and failure patterns of a human.
And I think that goes so far that a text-only model can never meet this bar. It fails if it cannot react in equal time to subtle facial cues and sounds, or if answering you and the flow of conversation is slower than it would be with a human, etc. All of these are also required for what I consider the commoner accepting AGI as having been achieved.