undefined

points

[-]

"It is a bit arbitrary, but I think this is what they're tracking."

I don't know if they can get their numbers right this way, but this seems a way more useful metric, than theoretic capabilities.

by cyanydeez5 hours ago|

parent|

[-]

ok, but arn't you just measuring efficiency and not the big I in AGI improvements.

by Leynos2 hours ago|

parent|

[-]

It also measures task coherence—ability to plan, form contingencies, recover from errors, mitigate accumulation of errors, and reconcile findings across a long context window.

by jsnell4 hours ago|

parent|

prev|

[-]

No? I think you're misunderstanding what is being measured.

It is purely a test of capabilities (can it do a thing that takes a human $X hours), not efficiency (how fast will it do it).

by lukan4 hours ago|

parent|

prev|

[-]

Yes, but this study was not about that and "just efficiency" is actually what most people are after.

At least I want AI to solve my problems, not score high on a academic leaderboard.

by jrumbut4 hours ago|

prev|

[-]

Without knowing more about their methodology, it seems like a lot of the recent improvements have involved the AI itself taking time to complete the task.

At first the models turned a 5 minute task into a 5 second task (by 5 seconds I mean a very short amount of time, not precisely 5 seconds). Then they turned a 15 minute task into a 5 second task.

Opus 4.6 completes 8 hour tasks all the time but (at least in my experience) it isn't spitting the answer out in 5 seconds anymore. It's using chain of thought and tools and the time to completion is measured in minutes or maybe hours.

In my experiments with local LLMs, a substantial part of the gap between frontier and local (for everyday use) is in tooling and infrastructure.

That is why I am sympathetic to the idea we are leveling off. But to bring in the air speed example from the article, I don't think we've reached the equivalent of the ramjet yet. I suspect in the coming years there will be new architectures, new hardware, and new ways to get even more capable models.

by Leynos2 hours ago|

parent|

[-]

It measures ability to complete (with a given success rate) a task with a known human benchmark time to complete. I.e., they set the task to human volunteers and timed how long they took the complete that task.

by MadxX794 hours ago|

prev|

[-]

I don't know why people are so impressed by 8h.

I trained an LLM to write the whole Harry Potter series, and that took JK Rowling like 17 years.

For my next point on the graph, I'll train the LLM to write the Bible, something that took humans >1500 years.

by Leynos2 hours ago|

parent|

[-]

Look at the tasks in the benchmark (see §2 https://arxiv.org/html/2503.14499v3)