Also people use "saturated" too liberally. The top left corner 1 cent per task is saturated IMO. Since there are billions of people who would perfer to solve arc 1 tasks at 52 cents per task. Arc 2 a human would make thousands of dollars a day with 99.99% accuracy
I'd say it's a combination of
A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many Months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup, and in just a month get a better model using the same base model.
B) There's more compute online
C) Competition is more fierce.
so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long
A couple of western models have dropped around the same time too but I don't think the "strides on benchmarks" are that impressive when you consider how much tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1% buuut their "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements and similar for Claude model improvements.
I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage
and I'm sure others I've missed...
Happy to learn more about this if anyone has more information.
But scaling pre-training is still worth it if you can afford it.
Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.
If we are benchmaxing this works well because its only being tested early on during the life cycle. By middle of the cycle, people are testing other models. By the end, people are not testing them, and if they did it would barely shake the last months of data.
It's performance in antigravity has also actually improved since launch day where it was giving non-stop typescript errors (not sure if that was antigravity itself).