upvote
> which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

So, I'd agree if this was on the true fully private set, but Google themselves says they test on only the semi-private:

> ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)

This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.

> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)

So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically just public, you just have to jump through some hoops to get access. This is clearly an advance, but it seems to me reasonable to conclude this could be driven by some amount of benchmaxing.

EDIT: Hmm, okay, it seems their policy and wording is a bit contradictory. They do say (https://arcprize.org/policy):

"To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."

But it surely is still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess is just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.

reply