These 'tests' are not labeled AGI by magic but because they are designed specifically to test things a question-and-answer test can't.
Gemini and OpenAI are at 80-90% on ARC-AGI-2, and it's quite interesting to see the difference in challenge between 2 and 3.
The 'G' in AGI, by the way, means general. So every additional dimension an agent can solve pushes that agent toward being more general.
When you run out of such tests, that's evidence you have reached AGI. The point of these tests is to define AGI objectively as the point at which we can no longer devise tests on which humans still hold the advantage.
I believe the CEO of ARC has said they expect us to get to ARC-AGI-7 before declaring AGI.
They'll specifically work to pass the next version of ARC-AGI by evaluating what kind of dataset is missing, i.e. what they would need to train on for their model to pass the new version.
Ideally they don't train directly on ARC-AGI itself, but they can train on similar problems/datasets in the hope of learning skills that then transfer to solving the real ARC-AGI.
The point is that a new version of ARC-AGI should help make the next model smarter.
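To make that concrete, here's a minimal, purely hypothetical sketch (Python, not any lab's actual pipeline) of what "training on similar problems" could look like: procedurally generating ARC-style grid tasks that share a hidden transformation rule, so a model practices the kind of abstraction ARC-AGI tests without ever seeing the benchmark itself.

```python
# Hypothetical sketch only: generate ARC-style input/output grid pairs so a
# model can practice grid abstraction without training on the benchmark tasks.
import random

def random_grid(h, w, colors=4):
    """A small grid of random color indices, like an ARC task input."""
    return [[random.randrange(colors) for _ in range(w)] for _ in range(h)]

def mirror_lr(grid):
    """One example 'rule': mirror the grid left-to-right."""
    return [list(reversed(row)) for row in grid]

def recolor(grid, mapping):
    """Another example 'rule': apply a fixed color substitution."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def make_task(rule, n_demos=3):
    """An ARC-style task: a few demo pairs sharing a hidden rule, plus a test pair."""
    demos = []
    for _ in range(n_demos):
        g = random_grid(random.randint(3, 6), random.randint(3, 6))
        demos.append({"input": g, "output": rule(g)})
    test_input = random_grid(4, 4)
    return {"train": demos, "test": {"input": test_input, "output": rule(test_input)}}

if __name__ == "__main__":
    rules = [mirror_lr, lambda g: recolor(g, {0: 3, 3: 0})]
    dataset = [make_task(random.choice(rules)) for _ in range(1000)]
    print(len(dataset), "synthetic tasks generated")
```

The rules here (mirroring, recoloring) are placeholder stand-ins; the open question is whether skills learned on synthetic rules like these actually transfer to the unseen ones in the real benchmark.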
LLMs weren't supposed to solve 1; they did, so we got 2, and that one really wasn't supposed to be solvable by LLMs. It was, and as soon as scores started creeping up we started hearing about 3: It's Really AGI This Time.
I don't know what Francois' underlying story is, other than that he hasn't told it yet.
One of a few moments that confirmed it for me was when he was Just Asking Questions re: whether Anthropic still used SaaS a month ago, which was an odd conflation of a hyperbolic reading of a hyperbolic stonk-market-bro narrative (SaaS is dead), low information on LLMs (Claude isn't the only one that can code), and addressing the wrong audience (if you follow Francois, you're likely at neither of those poles).
At this point I'd be more interested in a write-up from Francois about where he is intellectually than in an LLM that gets 100% on this. It's like when Yann would repeat endlessly that LLMs are definitionally dumber than housecats. Maybe, in some specific way that makes sense to you. You're brilliant. But there's a translation gap between Mount Olympus and us plebes, and you're brilliant enough to know that too. So it comes across as trolling, and boring.