undefined

upvote

points

by esafak7 hours ago |

upvote

by emp173447 hours ago|

[-]

Remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of the sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced these benchmark improvements aren’t data leakage.

reply

upvote

by culi2 hours ago|

[-]

Look at the ARC site. The scores of these models is plotted against their "cost per task". All of these huge jumps come along with massive increases in cost per task. Including Gemini 3.1 Pro which increased by 4.2x

reply

upvote

by casey23 hours ago|

[-]

ARC 2 was made specifically to artificially lower contemporary LLM scores, therefore any kind of model improvements will have outsized effects

Also people use "saturated" too liberally. The top left corner 1 cent per task is saturated IMO. Since there are billions of people who would perfer to solve arc 1 tasks at 52 cents per task. Arc 2 a human would make thousands of dollars a day with 99.99% accuracy

reply

upvote

by z3t42 hours ago|

[-]

How much do I get if I solve this? :D

https://arcprize.org/play

reply

upvote

by alisonkisk3 hours ago|

[-]

You are saying something interesting but too esoteric. Can you explain for beginners?

reply

upvote

by redox997 hours ago|

[-]

I don't think there's much recursive improvement yet.

I'd say it's a combination of

A) Before, new model releases were mostly a new base model trained from scratch, with more parameters and more tokens. This takes many Months. Now that RL is used so heavily, you can make infinitely many tweaks to the RL setup, and in just a month get a better model using the same base model.

B) There's more compute online

C) Competition is more fierce.

reply

upvote

by m_ke6 hours ago|

[-]

this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long

reply

upvote

by culi2 hours ago|

[-]

I feel like they're actually dropping slower. Chinese models are dropping right before lunar new year as seems to be an emerging tradition.

A couple of western models have dropped around the same time too but I don't think the "strides on benchmarks" are that impressive when you consider how much tokens are being spent to make those "improvements". E.g. Gemini 3.1 Pro's ARC-AGI-2 score went from 33.6% to 77.1% buuut their "cost per task" also increased by 4.2x. It seems to be the same story for most of these benchmark improvements and similar for Claude model improvements.

I'm not convinced there's been any substantial jump in capabilities. More likely these companies have scaled their datacenters to allow for more token usage

reply

upvote

by ankit2195 hours ago|

[-]

not much to do with self improvement as such. openai has increased its pace, others are pretty much consistent. Google last year had three versions of gemini-2.5-pro each within a month of each other. Anthropic released claude 3 in march 24, sonnet 3.5 in june 24, 3.5 new in oct 24, and then 3.7 in feb 25, where they went to 4 series in May 25. then followed by opus 4.1 in august, sonnet 4.5 in oct, opus 4.5 in nov, 4.6 in feb, sonnet 4.6 in feb itself. Yes, they released both within weeks of each other, but originally they only released it together. This staggered release is what creates the impression of fast releases. its as much a function of training as a function of available compute, and they have ramped up in that regard.

reply

upvote

by oliveiracwb6 hours ago|

[-]

With the advent of MoEs, efficiency gains became possible. However, MoEs still operate far from the balance and stability of dense models. My view is that most progress comes from router tuning based on good and bad outcomes, with only marginal gains in real intelligence

reply

upvote

by ainch3 hours ago|

[-]

It's becoming impossible to keep up - in the last week or so we've had: Gemini 3 Deep Think, Gemini 3.1 Pro, Claude Sonnet 4.6, GPT-5.3-Codex Spark, GLM-5, Minimax-2.5, Step 3.5 Flash, Qwen 3.5 and Grok 4.20.

and I'm sure others I've missed...

reply

upvote

by nikcub6 hours ago|

[-]

and anyone notice that the pace has broken xAI and they were just dropped behind? The frontier improvement release loop is now ant -> openai -> google

reply

upvote

by gavinray4 hours ago|

[-]

xAI just released Grok 4.20 beta yesterday or day before?

reply

upvote

by dist-epoch5 hours ago|

[-]

Musk said Grok 5 is currently being trained, and it has 7 trillion params (Grok 4 had 3)

reply

upvote

by svara4 hours ago|

[-]

My understanding is that all recent gains are from post training and no one (publicly) knows how much scaling pretraining will still help at this point.

Happy to learn more about this if anyone has more information.

reply

upvote

by dist-epoch3 hours ago|

[-]

You gain more benefit spending compute on post-training than on pre-training.

But scaling pre-training is still worth it if you can afford it.

reply

upvote

by gmerc5 hours ago|

[-]

That's what scaling compute depth to respond to the competition look like, lighting those dollars on fire.

reply

upvote

by toephu25 hours ago|

[-]

This is what competition looks like.

reply

upvote

by 6 hours ago|

[-]

deleted

reply

upvote

by PlatoIsADisease7 hours ago|

[-]

Only using my historical experience and not Gemini 3.1 Pro, I think we see benchmark chasing then a grand release of a model that gets press attention...

Then a few days later, the model/settings are degraded to save money. Then this gets repeated until the last day before the release of the new model.

If we are benchmaxing this works well because its only being tested early on during the life cycle. By middle of the cycle, people are testing other models. By the end, people are not testing them, and if they did it would barely shake the last months of data.

reply

upvote

by KoolKat232 hours ago|

[-]

I have a relatively consistent task that it completed with new information on weekdays at the edge of its intelligence. Interestingly 3.0 flash was good when it came out, took a nose dive a month back and is now excellent, I actually can't fault it it's so good.

It's performance in antigravity has also actually improved since launch day where it was giving non-stop typescript errors (not sure if that was antigravity itself).

reply

upvote

by boxingdog6 hours ago|

[-]

[dead]

reply