They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.
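For anyone who hasn't seen it, the trick is literally just appending the cue to the prompt before sending it to the model. A minimal sketch (the function name and question are made up for illustration):

```python
# Zero-shot chain-of-thought: tack "Let's think step by step."
# onto the end of the user's question before sending it to the model.

def build_prompt(question: str, cot: bool = True) -> str:
    suffix = "\n\nLet's think step by step." if cot else ""
    return question + suffix

print(build_prompt("If I have 3 apples and eat 1, how many remain?"))
```

The later reasoning models effectively bake this into training instead of relying on the prompt.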
* this time last year they couldn't write compilable source code for a compiler for a toy language, I know because I tried
I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.
Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low hanging fruit and hacking on everything that makes up current frontier models.
It'll be interesting if there's any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).
https://arxiv.org/abs/2602.02828 https://arxiv.org/abs/2503.16419 https://arxiv.org/abs/2508.05988
Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.
I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.
I tried using a custom instruction in ChatGPT to make responses shorter, but I found the output was often nonsensical when I did this
I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as badly.
They are not just an LLM answer, they are an (often cached) LLM summary of web results.
This is why they were often skewed by nonsensical Reddit responses [0].
Depending on the type of input it can lean more toward web summary or LLM answer.
So I imagine that it can just grab the description of the "car wash" test from web results and then get it right because of that.
[1] e.g. trained on traces of a reasoning process
We're still in the early stages of "reversing natural intelligence"; we don't have much data on actual "reasoning processes". We want lean4 formalization, but we need traces (formalizations) of lean4 formalizations. You can call the bottleneck "capitalism", but I'll just call it lack of motivation (in making compute cheaper and more efficient, so that a significant portion can be redirected to productive ends --as opposed to consumerist ends[1]-- like lean4 formalization-formalization research)
Rail will eventually become too cheap to meter, but meanwhile we'll have to wait for this generation of robber barons to "kill one another off" AND the coming Rockefellers to "disappear into the sunset"
[1] where "enterprise" should also be regarded as a mass of uninformed consumers. a supply side vs demand side ideological dichotomy in techno-economic policy.. grok this and you'll read less Economist (d-side) and more CPC/"Elon"[2] (s-side) propaganda
[2] an idealized Elon who is able to formalize his own thought processes
Silas: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Gemini:
….
That is a classic “efficiency vs. logic” dilemma.
Strictly speaking, you should drive. Here is the breakdown of why driving wins this specific round, despite the short distance:
...
* The “Post-Wash” Logic: If you walk there, you’ll eventually have to walk back, get the car, and drive it there anyway. You’re essentially suggesting a pre-wash stroll.
When should you walk?
…
3. You’ve decided the car is too dirty to be seen in public and you’re going to buy a tarp to cover your shame.

What you've proven is that LLMs leverage web search, which I think we've known about for a while.