I tell this as a person who really enjoys AI by the way.
As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first order patterns which can be learned to solve a different ARC-AGI problem.
The ARC non-profit foundation has private versions of their tests which are never released and only the ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.
IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.
So, I'd agree if this was on the true fully private set, but Google themselves says they test on only the semi-private:
> ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)
This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.
> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)
So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically just public, you just have to jump through some hoops to get access. This is clearly an advance, but it seems to me reasonable to conclude this could be driven by some amount of benchmaxing.
EDIT: Hmm, okay, it seems their policy and wording is a bit contradictory. They do say (https://arcprize.org/policy):
"To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."
But it surely is still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess is just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.
The pelican benchmark is a good example, because it's been representative of models ability to generate SVGs, not just pelicans on bikes.
This may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.
I think the right take away is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.
And obviously what actually matters is performance on real-world tasks.
No, the proof is in the pudding.
After AI we're having higher prices, higher deficits and lower standard of living. Electricity, computers and everything else costs more. "Doing better" can only be justified by that real benchmark.
If Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels.
Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.
This is from the BLS consumer survey report released in dec[1]
[1]https://www.bls.gov/news.release/cesan.nr0.htm
[2]https://www.bls.gov/opub/reports/consumer-expenditures/2019/
Prices are never going back to 2019 numbers though
First off, it's dollar-averaging every category, so it's not "% of income", which varies based on unit income.
Second, I could commit to spending my entire life with constant spending (optionally inflation adjusted, optionally as a % of income), by adusting quality of goods and service I purchase. So the total spending % is not a measure of affordability.
This part of a wider trend too, where economic stats don't align with what people are saying. Which is most likley explained by the economic anomaly of the pandemic skewing peoples perceptions.