These 'tests' are not labeled AGI by magic; they are designed specifically to probe capabilities that a plain question-and-answer test can't measure.
Gemini and OpenAI models are at 80-90% on ARC-AGI-2, and it's quite interesting to see the difference in challenge between 2 and 3.
AGI, by the way, stands for artificial *general* intelligence, so every additional dimension an agent can solve makes that agent more general.
When you run out of such tests, that's evidence you have reached AGI. The point of these tests is to define AGI objectively: AGI is reached when we can no longer devise tests on which humans retain superiority.