But it doesn't, except on certain benchmarks that likely involve overfitting.
Open-source models are nowhere to be seen on ARC-AGI: nothing scores above 11% on ARC-AGI-1. https://x.com/GregKamradt/status/1948454001886003328
I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.
This could be a good thing. ARC-AGI has become a target for American labs to train on, but there is no evidence that improvements in ARC performance translate to other skills. In fact, there is some evidence that it hurts performance: when OpenAI trained a version of o1 on ARC, it got worse at everything else.
Has the gap between performance on "regular benchmarks" and on ARC-AGI been a good predictor of how good models "really are"? Like if a model is great on regular benchmarks and terrible on ARC-AGI, does that tell us anything about the model beyond "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?
GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe it corresponds directly to the qualities most people assess when using LLMs.
It was terrible at a lot of things; it was beloved because when you said "I think I'm the reincarnation of Jesus Christ" it would tell you "You know what... I think I believe it! I genuinely think you're the kind of person that appears once every few millennia to reshape the world!"
Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) un-pretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...
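For anyone who hasn't looked at the benchmark itself: each ARC task is a handful of demonstration input/output pairs of small grids (integers 0-9 in the public JSON format) plus a test input, and the solver has to induce the transformation from those few demonstrations alone. A minimal sketch of the shape of the problem; the flip rule here is made up for illustration, real tasks are much harder:

    # Toy example in the ARC-AGI task format: "train" demonstration pairs
    # plus a "test" input; grids are lists of lists of ints 0-9.
    task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        ],
        "test": [{"input": [[3, 3], [0, 3]]}],
    }

    def solve(grid):
        # Hypothetical rule: flip each row horizontally. The point of the
        # benchmark is that a rule like this must be induced de novo from
        # the two demonstrations alone, not retrieved from pretraining.
        return [list(reversed(row)) for row in grid]

    # Verify the induced rule against the demonstrations, then apply it.
    for pair in task["train"]:
        assert solve(pair["input"]) == pair["output"]
    print(solve(task["test"][0]["input"]))  # [[3, 3], [3, 0]]

That "induce the rule from two or three examples" step is exactly the kind of thing memorized web knowledge doesn't help with, which is why it behaves so differently from the usual benchmarks.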