Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.
You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems.
Frontier models are still nowhere near solving it, but progress has been rapid:
* o3 (high), less than 1.5 years ago: 1.4%
* GPT-5.4 (xhigh): 23.4%
* GPT-5.5 (xhigh): 27.1%
* GPT-5.5 Pro (xhigh): 30.6%
Wrong. Every advancement has followed an S-curve. Where we are on that curve is anyone's guess. Or maybe "this time it's different".
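A rough sketch of why the position on the curve is so hard to call, using made-up parameters (`ceiling`, `rate`, and `midpoint` are arbitrary assumptions, not fitted to any benchmark): well before the midpoint, a logistic curve and a pure exponential with the same growth rate are almost indistinguishable.

```python
import math


def logistic(t, ceiling=1.0, rate=1.0, midpoint=10.0):
    """S-curve: grows roughly exponentially at first, then saturates at `ceiling`."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


def exponential(t, ceiling=1.0, rate=1.0, midpoint=10.0):
    """Pure exponential that matches the logistic curve's early growth."""
    return ceiling * math.exp(rate * (t - midpoint))


def relative_gap(t, ceiling=1.0, rate=1.0, midpoint=10.0):
    """Relative difference between the two curves at time t.

    Algebraically this equals exp(rate * (t - midpoint)), so it is tiny
    long before the midpoint and blows up after it.
    """
    s = logistic(t, ceiling, rate, midpoint)
    e = exponential(t, ceiling, rate, midpoint)
    return abs(e - s) / s


if __name__ == "__main__":
    for t in (2.0, 5.0, 8.0, 10.0, 12.0):
        print(f"t={t:5.1f}  logistic={logistic(t):.5f}  gap={relative_gap(t):.4f}")
```

With only early data points, both curves fit about equally well, so the data alone can't settle whether the ceiling is close or far away.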
I think a better question for AI is “is it more like a network effect, liquidity effect, or a biological/physical effect”?
So if instead of text we come up with a different representation for mathematical or physical problems, that could both improve the quality of the output and reduce the number of transformers needed for encoding and decoding I/O and for internal reasoning.
There are also different inference methods, like autoregressive and diffusion, and maybe others we haven't discovered yet.
You combine those variables, along with the internal disposition of layers, parameter size and the actual dataset, and you have such a large search space for different models that no one can reliably tell if LLM performance is going to flatline or continue to improve exponentially.
Now back to the point, what reason do you have to believe progress will stop soon? If you have no reason, then it sounds like you agree with OP.
Which makes the patronizing sarcasm all the more nauseating.
"Reasoning" and now "Agentic" AI systems are not some fundamental improvement on LLMs; they're just running roughly the same prior-gen LLMs multiple times.
Hence the conclusion that LLM improvement has slowed down, if not stagnated entirely, and that we should not expect the improvements of switching to these "reasoning" systems to keep happening.
There is a 50/50 chance that it turns out to be right or lets you jump off the cliff.
Either way, the trip itself stays the same beautiful five-star-plus journey.
Also, spotting an error and telling the LLM about it makes things worse in most cases, because the LLM wants to please you and goes on to apologize and change course.
The moment I find myself in such a situation I save or cancel the session and start from scratch in most cases or pivot with drastic measures.
Gemini to me is the most unpredictable LLM while GPT works best overall for me.
Gemini recently gave me two different answers to the same question. This was an intentional test: I was bored and wanted to see what happens if you simply open a new chat and paste the same prompt, everything else being the same.
Reasoning doesn’t help much in the coding domain for me, because what the LLM comes up with as an explanation is very high-level and formally right.
I google more because of LLMs than I did before, because essentially what I'm witnessing is someone producing something that I have to check first before I hit the button it comes with. However, you only find out shortly afterwards whether the polished button worked or gave you a warm welcome to hell.
In one case, it made a thoroughly convincing argument that an approach was justified. The second time it made exactly the opposite argument, which was equally compelling.
I now see LLMs as persuasion machines.
I was using Copilot and asked it a question about a PDF file (a concept search). It turned out the file was images of text. I was anticipating that and had the text ready to paste in.
Instead, it started writing an OCR program in Python.
I stopped it after several minutes.
Often Copilot says it can't do something (sometimes it's even correct). That's preferable to the try-hard behaviour here.
This nails an important thing IMHO. I've absolutely noticed this, for better or worse. Gemini can produce surprisingly excellent things, but its unpredictability makes me go for GPT when I only want to ask once.
A scientific approach here is to look to falsify the statement. You start asking questions, running tests, experiments, etc. to prove the notion that it is done wrong. And at some point you run out of such tests and it's probably done for some useful notion of done-ness.
I've built some larger components and things with AI. It's never a one shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process kind of runs itself, almost. Mostly I just nudge it along. "Did you think about X? What about Y? Let's test Z"
Exactly - you need to constantly have your sceptic's glasses on and you need to be exacting about the structure you want things to follow. Having and enforcing "taste" is important, and you need to be willing to spend time on that phase because the quality of the payoff depends entirely on it.
I recently planned a major refactor. The discussion with Claude went on for almost two days. The actual implementation was done in 10 minutes. It has probably made some mistakes that I will have to check for during the review, but given the level of detail in that plan document, it is certainly 90-95% there. After pouring in that much opinion, it is a fairly good representation of what I would have written, while still being faster than me doing everything by hand.
It’s also because it is so annoying to have to manage the memory of the LLM with custom prompts/instructions manually.
I have not yet played with the long-term memory feature, but I fear it will be even less reliable than prompts, simply because in a year or two so much will have changed again that this “memory” will have to be redone multiple times by then.
However, I think it's important to remember that LLMs are embedded in larger systems, and those larger systems do learn.
We do also have training on synthetic data. It might compound.
I think this is a bit pedantic. Obviously the parent you’re replying to is referring to the concept of “in-context learning”, which is the actual industry / academic term for this. So you feed it a paper, and then it can use that info, and it needs steering / “mentoring” to be guided in the right direction.
Heck, the whole name of “machine learning” suggests these things can actually learn. “Reasoning” suggests that these things can reason, instead of being fancy, directed autocomplete. Etc.
In other news: data hydration doesn’t actually make your data wet. People use / misuse words all the time, and that causes their meaning to evolve.
And that can be very hard to do, given that the UI we most often interact with them in is a chat session.
In case you don't want to disclose your name, my email is northzen@gmail.com
What I do to mitigate this is that I have fact-checking agents configured to be extremely critical and unbiased on Opus, Gemini and GPT, which are then handed the entire conversation to review. Then it's handed off to an Opus agent which is set up to assume everything is wrong. After this, and if I'm convinced something is correct, I'll hand the entire thing off to a Sonnet agent, which is set up to go through the source material and give me a compiled list of exactly what I'll need to verify.
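The chain described above could be sketched roughly like this. Everything here is a hypothetical stand-in: `Agent`, `ModelFn`, `review_chain`, and `human_is_convinced` are illustrative names, and `ModelFn` is a placeholder for whatever call actually reaches a given model, not a real vendor API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a real model call: (system_prompt, text) -> reply.
ModelFn = Callable[[str, str], str]


@dataclass
class Agent:
    name: str
    system_prompt: str
    model: ModelFn

    def run(self, text: str) -> str:
        return self.model(self.system_prompt, text)


def review_chain(
    conversation: str,
    critics: List[Agent],
    skeptic: Agent,
    compiler: Agent,
    human_is_convinced: Callable[[str], bool],
) -> List[str]:
    """Return a checklist of items to verify, or [] if the human rejects the result."""
    # Stage 1: each critic independently reviews the entire conversation.
    critiques = [f"[{c.name}] {c.run(conversation)}" for c in critics]

    # Stage 2: a skeptic set up to assume everything is wrong sees all of it.
    dossier = "\n\n".join([conversation, *critiques])
    dossier += f"\n\n[{skeptic.name}] {skeptic.run(dossier)}"

    # Human gate: only continue if the person is convinced something holds up.
    if not human_is_convinced(dossier):
        return []

    # Stage 3: compile the list of things to verify against the source material.
    reply = compiler.run(dossier)
    return [line.strip() for line in reply.splitlines() if line.strip()]
```

The human gate sits between the skeptic and the compiler on purpose: the final checklist is only worth producing for claims a person with domain knowledge already finds plausible.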
It's ridiculously effective, but I do wonder how it would work for someone who couldn't challenge the analytic agent on the domain knowledge it gets wrong. Because despite knowing our architecture and needs, it'll often make conceptual errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better though, and with the image generation tools, "drawing" the architecture for presentations to everyone from C-level to nerds is ridiculously easy.
Anthropomorphizing these systems is dangerous, whether coming from the bullish or bearish perspective. The output is statistically generated by a machine lacking the capability to be smug.
That ship has sailed. Humans will anthropomorphize a rock if you put googly eyes on it.
You deserve opinions shaped by interactions with the best tools that are out there.
But a regular reminder: all LLMs can be wrong all the time. I only work with LLMs in domains where I'm an expert or where I have other sources to verify their output with utmost certainty.
When I'm cooking meatballs with sauce and the recipe calls for frying them, I'll have an LLM guesstimate how long and which program to use in an air fryer to mimic the frying pan, based on a picture of the balls in a Pyrex. So I can just move on with the sauce, instead of spending time browsing websites and stressing about getting it perfect.
I used to hate these non-deterministic instructions; now I treat them as their own game. When I publish my first recipe, I'll have an LLM randomize the ingredient amounts, round them to some imprecise units and also randomize the times. Psychologists say we artists need to participate and I WILL participate.
This. Should become a general rule for any non-trivial use of LLMs in a professional setting.
Claude has been utterly useless with most math problems in my experience because, much like less capable students, it tends to get overly bogged down in tedious details before it gets to the big picture. That's great for programming, not so much for frontier math. If you're giving it little lemmas, then sure it's great, but otherwise you're just burning tokens.
I put my stuff through several SOTA models and round-robin them in adversarial collaboration, and they are all useful even though, fundamentally, they don’t “understand” anything. But they are super useful delegates as long as deciding on the problem, the approach and the solution all sits safely in your head so you can challenge them and steer them.
So I know the article is about one particular new model acing something, and each vendor wants these stories to position their model as now good enough to replace humans and all other models. But working somewhere where I am lucky enough to be able to use all the SOTA models all the time, I can say that they all keep making obvious mistakes, and using all of them adversarially is way better than trusting just one.
I look forward to the day a small open model that we can run ourselves outperforms the sum of all today’s models. That’s when enough is enough and we can let things plateau.
I have no idea what any of those words even mean. I'm sure LLMs make similar obvious-to-professors mistakes in all the domains. Not long ago, we didn't even have chatbots capable of basic conversation...
Bivectors and pseudoscalars (in a 3D context) are "just" signed areas and volumes. Easy!
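For anyone who wants to see the "signed areas and volumes" claim in numbers, here is a minimal sketch in plain Python. The helper names `wedge2` and `wedge3` are made up for illustration and the basis ordering (e12, e13, e23) is an assumed convention, not taken from any particular Clifford algebra library.

```python
def wedge2(a, b):
    """Components of the bivector a ^ b on the basis (e12, e13, e23).

    Each component is the signed area of the parallelogram spanned by a and b,
    projected onto the corresponding coordinate plane.
    """
    return (
        a[0] * b[1] - a[1] * b[0],
        a[0] * b[2] - a[2] * b[0],
        a[1] * b[2] - a[2] * b[1],
    )


def wedge3(a, b, c):
    """Coefficient of the pseudoscalar e123 in a ^ b ^ c.

    This is the 3x3 determinant of (a, b, c): the signed volume of the
    parallelepiped they span.
    """
    return (
        a[0] * (b[1] * c[2] - b[2] * c[1])
        - a[1] * (b[0] * c[2] - b[2] * c[0])
        + a[2] * (b[0] * c[1] - b[1] * c[0])
    )


if __name__ == "__main__":
    e1, e2, e3 = (1, 0, 0), (0, 1, 0), (0, 0, 1)
    print(wedge2(e1, e2))      # unit area in the e12 plane
    print(wedge2(e2, e1))      # same area, opposite orientation
    print(wedge3(e1, e2, e3))  # unit volume
    print(wedge3(e2, e1, e3))  # swapping two vectors flips the sign
```

The sign is the whole point: swapping two vectors flips the orientation, which is exactly what an ordinary (unsigned) area or volume throws away.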
Back around the GPT 3, 3.5, and 4.0 era I used to ask the bots to explain "counterfactual determinism", which is one of the most complex topics I personally understand.
Then I would lie to the bot about it, and see if it corrected me or not.
This test is useless now; the frontier models can't be fooled any longer on such "basic" concepts.
Conversely, LLMs are basically useless at anything that has too little (or no) public information for their training. Think: obscure proprietary product config files and the like, even if the concepts involved are trivial.
Similarly, Clifford Algebra is a relatively niche (even "alternative") area of mathematics and physics, with vastly less written material about it than the competing linear algebra. Hence, the AIs are bad at it.
Right now, we have a lot of smart people who have trained for decades to understand where these things go wrong and how to nudge them back, but that pool of people is slowly going to be replaced by less knowledgeable ones.
At some point, a Rubicon will be crossed where these systems can't fall back to a human operator and will fail spectacularly.
It is troubling. It suggests a plateauing of human understanding.