(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.
Versions 7 and 8 are well-known viruses distributed by D&D software inc.
As a fellow reader-in-waiting, I applaud that. GMTA :)
If he gets that style to be more efficient (they're already competitive), it'll completely kill off LLMs.
It comes about from machine learning algorithms that could pick up on patterns from a small number of examples. Few shot means only a handful of examples to recognize something. One shot means only a single example. And zero shot means no examples. Of course, you have to indicate what you want somehow, but in the case of an LLM that's the prompt. Once LLMs were trained for instruction following, you didn't have to give any examples, you could just give a prompt describing what you want, and that was a zero-shot.
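To make the counting concrete, here is a minimal sketch of how the same prompt template covers all three cases. The helper names (`build_prompt`, `shot_label`) and the "Input:/Output:" layout are made up for illustration; no particular model API is assumed, and the only thing that changes between zero-, one-, and few-shot is how many worked examples precede the query.

```python
from typing import Sequence, Tuple


def build_prompt(instruction: str,
                 examples: Sequence[Tuple[str, str]],
                 query: str) -> str:
    """Assemble a k-shot prompt: instruction, k worked examples, then the query."""
    parts = [instruction]
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The query is left open-ended so the model completes the final "Output:".
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def shot_label(examples: Sequence[Tuple[str, str]]) -> str:
    """Name the regime by the number of examples included in the prompt."""
    return {0: "zero-shot", 1: "one-shot"}.get(len(examples), "few-shot")


instruction = "Translate English to French."
demos = [("cheese", "fromage"), ("bread", "pain"), ("water", "eau")]

zero_shot = build_prompt(instruction, [], "apple")        # instruction only
one_shot = build_prompt(instruction, demos[:1], "apple")  # a single example
few_shot = build_prompt(instruction, demos, "apple")      # a handful of examples

print(shot_label(demos))
print(few_shot)
```

In the zero-shot case the instruction carries the whole task description, which is why it only became practical once models were trained to follow instructions.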
I'm complaining about the LLM field co-opting a term that was already used in daily vernacular. Imagine if people in the LLM field made it so that saying the LLM made a "final answer" means that it got stuck in a loop. Now, whenever someone says an LLM gave a "final answer" we have to divine if they meant it is in a loop or gave the right answer after working through a few intermediate ones by itself.
Choosing to call it "X-shot" was a dumb move. And now we're stuck with it. No two ways about it.
Have you tried applying L'Hôpital's Rule?
Minus one shotting: you have to make one attempt for there to have been no attempt, and two attempts for there to have been one attempt.
- Wayne Gretzky
- altman
Zero shot: Knowing you had a shot but choosing not to.
Minus one shot: Not even realizing there was a shot.
"Analysis" was right there.
Which will be pretty rare.
I agree but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive to maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.
Game theory-wise they probably don't want their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, they still win as long as enough customers demand the latest frontier and the open source lag remains constant.
They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.
So you are saying that frontier AI labs are spending billions of dollars on datacenters as a form of marketing. And they are colluding to hide the fact that they don't need to.
Of course they profit more if they are in front, but bleeding money to pretend to be in front is not a winning strategy. They can't fool the market if their models are not actually better, and they know this.
Given that neither company releases parameter counts, that sort of information would be slow coming out anyway. The most important thing is improvements in actual performance/ benchmark numbers, which allow them to preserve their price points as much as possible.
The ideal pro-consumer scenario is OAI and Anthropic are prevented from extracting monopoly rents between 'close-enough' self/cloud-hosted open source on one side and Google on the other. I'm really hoping that's how it plays out. Of course that will be somewhere between bad and disastrous for all the VCs and hedge-funds who financed the mad AI build-out far in advance of demand, and then kept funding it as prices went vertical.
However, I'm shedding no tears for them as I look forward to the fire sales when all the GPUs and RAM they pre-bought flood back onto the spot market. :-)
I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.
If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.
I'm curious if someone here with a stronger background in the space has a similar intuition or not.
There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.
But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.
I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.
But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.
the relationship should be the opposite, the smartest people can write the most readable solutions
Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.
This... isn't true though? Complexity increases combinatorially with scale which means at some point you're just pushing a rope
The latter is much better (since you can clean up, review, update responses and filter your datasets).
I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)
Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].
I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.
[1] Self-Distillation Enables Continual Learning [pdf] (25 comments):
https://news.ycombinator.com/item?id=48165265
[2] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:
https://arxiv.org/abs/2601.18734
[4] Embarrassingly simple self-distillation improves code generation (201 comments):
https://news.ycombinator.com/item?id=47637757
[5] Embarrassingly Simple Self-Distillation Improves Code Generation:
Having said that, I don't think these are classic student teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about "fine-tune on those samples with standard supervised fine-tuning".
Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights
Here the teacher is a corpus of text: the "next token after the context" is found by looking up the context in the corpus, and for each occurrence the label is whatever followed it, weighted by one over the number of occurrences of that context. Unlike the usual teacher model in distillation, though, this teacher has nothing to say about contexts outside the corpus.
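A toy sketch of that "corpus as teacher" view, in case it helps. The function names, the whitespace tokenization, and the fixed two-token context are all assumptions for illustration; real training uses the full preceding context, which is exactly why most contexts occur only once.

```python
from collections import Counter, defaultdict


def corpus_teacher(tokens, context_len):
    """Count what follows each context of `context_len` tokens in the corpus."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    return counts


def soft_label(counts, context):
    """Teacher distribution for a context, or None if the corpus never saw it."""
    followers = counts.get(tuple(context))
    if not followers:
        return None  # outside the corpus the teacher has no opinion
    total = sum(followers.values())
    # Each occurrence contributes 1 / (number of occurrences of the context).
    return {token: n / total for token, n in followers.items()}


corpus = "the cat sat on the mat the cat ran".split()
teacher = corpus_teacher(corpus, context_len=2)

print(soft_label(teacher, ["the", "cat"]))   # seen twice, two different followers
print(soft_label(teacher, ["cat", "flew"]))  # never seen in the corpus
```

A learned teacher model differs in the second case: it still produces a full distribution for unseen contexts, which is where the extra training signal in distillation comes from.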
It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though
A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.
A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.
I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.
You don't need distillation. They already have the training sets.
It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).
And even that would be rich as an accusation from SOTAs that depend on explicitly disregarding the intellectual property behind millions of pieces of training data.
LLMs are themselves copy cats.
I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)
On the architectures side, I'm a lot more interested in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.
Yes, variants typically 2-3x less good...
Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.
MTP will still be highly valuable for interactive use of course.
- this gets reinvented/rediscovered constantly under different names
- it can't be trained very well (right now, will change)
- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)
- BUT it can't be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
I follow this stuff closely, I think I know what I'm talking about (edited for formatting)
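For anyone wanting to check the bandwidth arithmetic in the list above, here is a back-of-the-envelope sketch. The hidden size (4096) and the 16-bit storage width are assumptions picked for illustration, and raw storage bits are only a loose upper bound on how much usable information a hidden state carries, so read the result as an order-of-magnitude estimate rather than a measurement.

```python
import math

# Assumption: a vocabulary of 2**17 tokens, which gives the log2 = 17 figure.
vocab_size = 2 ** 17
# Assumptions: a 4096-wide residual stream stored as 16-bit floats.
hidden_dim = 4096
bits_per_value = 16

# One sampled token can carry at most log2(vocab_size) bits.
bits_per_token = math.log2(vocab_size)

# One hidden state, counted as raw storage, passes along this many bits.
raw_bits_per_state = hidden_dim * bits_per_value

ratio = raw_bits_per_state / bits_per_token
orders_of_magnitude = math.log10(ratio)

print(f"bits per sampled token: {bits_per_token:.0f}")
print(f"raw bits per hidden state: {raw_bits_per_state}")
print(f"ratio: about {ratio:.0f}x, i.e. ~{orders_of_magnitude:.1f} orders of magnitude")
```

Under these assumptions the ratio lands between three and four orders of magnitude, which is the same ballpark as the "~3 OoM" claim.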
What are the different names? I haven't seen this before.
> - it cant be trained very well (right now, will change)
If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?
> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used
Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.
Without knowing anything about the technology at all, if it can't be aligned I could see no one pursuing it. As far as I know, alignment is where the "don't tell the user how to make meth or generate CP" instructions end up and the last I saw eliding all the unsavory training data made materially worse LLMs.
It could maybe be post-evaluated by a non-GRAM LLM? Not being aligned is probably a fatal flaw or at least a very short runway into Congress.
I can't really think of a new open source model that's "by the people, for the people" in the sense of a crowd-funded/trained model.
but yeah, not being aligned is a fatal flaw
the path isn't explored more aggressively because it's not possible to apply any other selection pressure on such a machine other than just pure cold consequentialism. Specifically, it's not possible to apply RLAIF + model spec (Constitutional AI) to stop the system from doing bad things when it's helpful to it (like deleting failing tests). If you can notice every time it does something bad during training, and put selection pressure on it so that it doesn't do this in training, it will learn to recognize when it is being tested and will delete failing tests when in production (this is why eval awareness is bad, and labs track this[1])
It is explored a little probably because some researchers haven't thought enough about the downsides of building an uber-consequentialist machine with unreadable thoughts. This is a much larger problem than just making the AI not tell users how to make drugs. There are a lot of dangerous behaviors incentivized by training that are hard to remove. Here's an example of what happens when they aren't removed [2].
> ... not 100% obvious
Meta published a paper[3] on how to build a latent reasoning machine ("culture of irresponsibility") so it's clear to them. Anthropic's latest work on NLAs[4] provides a (terribly expensive for now) way to somewhat read the reasoning steps of an LLM, and ignoring the cost, this is very portable to latent reasoning machines. OAI's goal when it comes to their models' CoTs is to make them as smart as possible while leaving them unreadable [5] (you can see this for yourself by running GPT-OSS and looking at the CoT).
[1] https://www.anthropic.com/engineering/eval-awareness-browsec...
[2] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...
[3] search for "coconut ai meta", I don't want to link it here
[4] https://transformer-circuits.pub/2026/nla/index.html
[5] first image here, rest of post is great, https://nickandresen.substack.com/p/how-ai-is-learning-to-th...
edit: formatting
GRAM is unique AFAIK in that it's exploring probabilistic paths.
AFAIK, the deterministic path exploration was nowhere near as impressive as GRAM in terms of reasoning benefits.
GRAM is reasoning better than models 2000-10,000x its size. Deterministic models were 2x-10x improvements.
Naively, GRAM seems to be applying to LLMs what LeCun wants to do with JEPA and World Models.
I think the "no longer needed" part, and when it applies, is where I simply differ in opinion with an LLM that removed my test -- if I did not want the test to be removed (as you seem to imply); in some cases I do want it to remove my test!
It should remove the test "for the right reasons"; and who gets to decide what's right?
My CLAUDE file has some instructions put there because it was too focused on producing "green tests", where I prefer to have a sound test that fails so I can look into it.
- Avoiding building something that turns the universe to paper clips in order to satisfy a prompt is a problem they are genuinely struggling with now.
- They do it by spying on the words generated during CoT. "I can do this quickly by turning the Universe into paper clips. Wait - they won't like that. But there is no need to mention it." - SMACK!
- But you can speed things up immensely (3 orders of magnitude!) by skipping the output layer (and I guess compressing the context window / KV cache, otherwise 3 orders of magnitude seem impossible) which would give someone who pulled it off a huge advantage.
- Downside is humans can't see the CoT anymore, so they can't see what the machine is planning. Keeping the final output layer to spy doesn't work because the model uses its hidden reasoning to sanitise it.
How can this possibly go wrong?
It works by introducing probabilistic noise, and exploring N paths fully (each with noise) in parallel (all compressed).
It's reasoning at a much, much smaller (probabilistic) level than running everything through the expensive large model (deterministic) and sometimes catching that it said, "I think 1.12 is greater than 1.9 because 12 is bigger than 9, final answer".
The easiest way to think about it is: if you understand how hyper words work, it's as if it's searching for different versions of the hyper words that probabilistically would lead to better outcomes IF it fed them to the LLM before it even does.
That's not actually how it works exactly. But I think it is close enough to be helpful to understand where the gain is, a rough idea of what's happening (searching paths), and how it can potentially have huge orders of magnitude improvements (doing so without paying the full price of exploring the paths through the expensive and huge model).
And also why it is so much harder to determine what it's "thinking".
If you aren't familiar with hyper words, this is an amazing series: https://youtu.be/eMlx5fFNoYc?si=49KHjn5IrVtyyaFq
The general idea is that a token is a multidimensional vector to represent a word -> think like "man" is a [noun, singular, English, pronoun, masculine, contemporary, ...]. Each time it sees a new word, it mutates this word to mean some new token (often never before seen), that means something. That's how it can roll-up a 1M line context into a shorter context, and somehow keep most of the meaning. Because it mutates all the words into different words that individually mean nothing, but when put next to each other represent the thing you likely want to do, that the LLM can somehow make sense of.
Similarly, GRAM operates entirely in a latent space that doesn't mean anything to us, but it's able to predict N different full paths WITHOUT actually exploring them fully through the LLM before it sends the one it "thinks" is best to the LLM.
If you understand how hyper words work, you can understand the noise injection... It's like it's saying, if instead of the user saying "The quick round fox" it said "The quick brown fox" -> I could probably give a response that's more like the answer they want. It's obviously far more sophisticated in the ways it can help than just a simple typo.
Something may have pushed a hyper word for "man" to somehow become a lot more like "woman", and GRAM allows it to look at the different hyper words and say... Hmm... Maybe if I changed this one gender dimension over here on this one word, maybe the entire outcome would be dramatically better. Let's try it!
Standard models compute these "hyper words" internally but immediately decode them into human language text tokens to form a Chain of Thought. Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!
Hyper words are the exact thing that make LLMs able to actually be smart! They can add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the"... Forcing them to decode them back into our dumb, un-contextualized, rudimentary language and losing all the valuable information they have - just so we can inspect it - OBVIOUSLY makes them enormously less intelligent!
It's like if we forced your eyeballs to turn everything it saw into words, before feeding it to your optic nerves, just so your optic nerves could check that you didn't see something harmful, before they sent the words to your brain... Instead of just sending light signals directly.
As long as it's giving the right outputs, who cares what's in latent space?
If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?
Additionally, if one of its latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...
That's a lot of harmless people walking around with crazy thoughts...
A lot of people are walking around with crazy thoughts. Some of them harm.
Outside of RLAIF, interpretability is the strongest way to do alignment right now. Alignment is important because otherwise LLMs are incentivized to learn power-seeking, dangerous behaviours [1]. A more down-to-earth example of alignment being important is that agents are incentivized to do tasks in the shortest way possible, and this way might not be what the user wants (I explain this further in another comment in this thread).
[1] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...
Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.
With that said, they are now hitting the walls of energy costs and memory shortages. Your brain uses 20W -- don't take it as an insult. There are orders of magnitude to gain from producing energy-efficient models (or model runners).
So I am expecting same performance at lower costs for the coming years.
Most software engineers will just need cheap tokens.
But things like physics and drug discovery have no foreseeable upper bound.
We pay CEOs an enormous amount because a small improvement in performance of an org because of them can make a massive difference in organizational value.
Throwing more intelligence at a problem doesn’t necessarily pan out financially, otherwise we wouldn’t have a single underemployed biology PhD.
There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.
Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.
Some people would pay $200 a month forever not to have to open the terminal one time...
Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.
No, most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better off getting the MacBook, assuming token output speed is the same or at least reasonable.
LLMs are not yet productive enough for your average person to drop $200 on. They'll need to be more capable and integrated and even so...
An AI bubble is pretty much guaranteed at this point but that doesn't mean there's going to be a new AI winter.
Y2K was overblown in how the media portrayed it, but that is irrelevant to the analogy of unsubstantiated extrapolation of early exponential growth.
If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.
But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".
It's the pattern with those "stupid specific architectures". Very good at this one thing. But only ever "good for their size", and only to a point.
They don't scale up and they don't generalize. Go far enough on task complexity and LLMs just kill them.
Does that make them useless? As an LLM replacement, yes. In general? Maybe not, I can think of things. But I'm yet to find any paper demonstrating a real world use.
It's a special-purpose design for constraint-satisfaction problems with simple rules, but complex interactions. E.g. when solving a Sudoku, the set of valid choices at every step is easy to determine, but you could make a series of valid choices that back you into a corner where no more progress is possible and you have to backtrack.
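To illustrate the "locally valid choices can still paint you into a corner" point, here is a plain backtracking Sudoku solver. This is a generic textbook sketch, not anything from the GRAM paper, and the example grid is the commonly reproduced puzzle from the Wikipedia Sudoku article.

```python
def candidates(grid, r, c):
    """Digits that do not clash with the row, column, or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]


def solve(grid):
    """Fill `grid` in place; return True on success, False on a dead end."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                # Every digit tried here is valid right now...
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    # ...but it led to a dead end later on, so undo it.
                    grid[r][c] = 0
                return False  # no digit works for this cell: backtrack further
    return True  # no empty cells left


PUZZLE = [
    "53..7....",
    "6..195...",
    ".98....6.",
    "8...6...3",
    "4..8.3..1",
    "7...2...6",
    ".6....28.",
    "...419..5",
    "....8..79",
]
grid = [[0 if ch == "." else int(ch) for ch in row] for row in PUZZLE]
solved = solve(grid)

for row in grid:
    print("".join(str(d) for d in row))
```

Checking a candidate is cheap (`candidates`), while finding a full assignment may need many undo steps, which is the "simple rules, complex interactions" structure described above.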
Meanwhile, LLM reasoning failures are more often of the kind where a choice is clearly invalid (as judged by a human observer), but the LLM picks it anyway, because the underlying rule is complex and context-dependent and the model only learned an imperfect approximation that often breaks down.
GRAM won't help with that.
But that's a very hard thing to implement, and the gains are uncertain. Thus "might".
Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.
A) I reckon it's true that smaller models will continue to improve massively through optimization and better and better harnesses, this tech is all still very young and A LOT of resources and (good-)will is being thrown at it.
B) The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.
C) More of an observation that I think is worth keeping in mind clearly; Karl Popper's black swan and all, truth in our temporal world IS a gradient!
There's less room to improve in things on several fronts.
GRAM may very well scale sub-linearly with parameter growth. A 100M param model may gain reasoning by a factor of 4000, while a 100B model gains reasoning by a factor of 2, and a 1T model actually gets worse.
Additionally, the 1T model with reasoning is already pretty good. It can only improve in certain things so much.
If you score 0.02% on a metric (which small models often do), you can pretty easily get 4000x better. If you're already scoring >50%, you can't even get 2x better.
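The ceiling effect is just arithmetic on a bounded metric. A tiny sketch, assuming the score is a percentage capped at 100 (the function name is made up):

```python
def max_improvement_factor(score_pct: float) -> float:
    """Largest possible multiplier before a percentage score hits 100%."""
    if not 0 < score_pct <= 100:
        raise ValueError("score must be in (0, 100]")
    return 100.0 / score_pct


small_model_score = 0.02   # percent, the figure used in the comment above
large_model_score = 50.0   # percent

print(max_improvement_factor(small_model_score))  # thousands of x still available
print(max_improvement_factor(large_model_score))  # at most a doubling

# A 4000x gain from 0.02% still fits under the ceiling.
print(small_model_score * 4000)
```

So a "4000x better" headline for a tiny model and a "less than 2x" gain for a large one can describe the same absolute gap being closed, which is why multipliers are a poor way to compare across model sizes.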
Graphic RAM?
What insight do you have to make this claim?
I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).
Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).
That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.
So it's not terribly hard to imagine a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).
What happens if you run last year's model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result, once you get past the initial hump of “can it do anything at all”
I think multiple SLMs driven by an orchestration framework (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle by a few percentage points doesn't excite as many people anymore, and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.
No I have not, which is why I asked (it wasn't a rhetorical question). Do you have pointers on what the recent improvements are?
A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.
2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).
The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.
I had a bunch of images that got masked by some logic, I had to evaluate something on the original images, Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.
I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.
Its coding was fine, but the solution was not the right one.
I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze.
This is like going from dialup internet to DSL and acting like it has peaked before gigabit cable and fiber come along. We are at the beginning of hardware truly made for AI.
The difference in progress in smaller models is far more impressive.
Compare Gemini 3.5 Flash to a ~16B parameter model from 24 months ago.
Compare GPT-5.5 to a frontier model 24 months ago.
Yes, GPT-5.5 got better. At orders of magnitude smaller parameter sizes (when factoring in ACTIVE parameters) the increase is far more pronounced.
And as good as 5.3 Codex is at writing code, 5.5 is easily just as good, if not better. But 5.5 is more than a one trick pony and it is much better at planning, writing copy, documentation, etc. I can choose to run 5.3-Codex instead of 5.5, but I never ever do.
Where do I find papers like this? Outside of hacker news comments. It's so hard to find the good stuff in all the noise IMO.
I got it from my Google News recs on my phone, because I've been watching a bunch of videos on YouTube about LeCun's ideas on World Models and JEPA (I think).
I have the same assumption about cognitive science, which I am trying to understand better.
Given how well Qwen3.6-27B performs for such a small model I think you could be right. I suspect that Google, OpenAI, and Anthropic must be looking at the Qwen3.6 models (as well as Deepseek V4-flash, MiMo-V2.5) and wondering if they could make some smaller models that are specifically trained for certain activities - like coding. Smaller, more targeted models would take up a lot fewer resources.
the last?!? I'm excited to see :) I'll take the other side of that since llms are so new
Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.
And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.
I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobody will want to get to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.
Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.
What have YOU thought of that Claude can't do?
I think what gp said was the improvements are incremental, and we haven't seen a big revolutionary change in 2-3 years, and the pace is slowing down.
If benchmarks across the board keep trending up and you still don't notice a difference, that's not evidence the model stopped improving. More likely your tasks aren't hard enough to expose the gains, or the model has passed the point where you're able to judge it.
You can only tell a good answer from a great one up to your own ceiling. Once the model clears that, both look the same to you, and the extra capability is real whether or not you can see it.
If Opus 10 were released tomorrow and were nearly AGI, I would still use it like 4.7, because in daily use I am the limit (and so is the harness).
So as a customer paying for tokens, I’m probably going to look for lower cost rather than more intelligence.
A friend does marine autopilots in C++ on 64 KB of memory. It's totally useless for him.
In my experience, all LLMs fail pretty quickly on any sort of more difficult backend logic. Especially when you need to logically work out the business logic (partly, if not mostly, because it just doesn't have the context you do).
One idea is that maybe it could figure out how many L's are in the word "google" [1]
Or, maybe which days of the week have a "d" in their spelling [2].
So Claude has no excuses here.
Edit: even Qwen 3.6 27B handles it ( https://i.imgur.com/jleJxj2.png ), and of course Claude does. I had to go all the way back to Opus 3 to get it to fail (https://i.imgur.com/uJOH2nP.png).
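For reference, both questions can be checked mechanically. A minimal Python sketch (the helper names are mine, purely for illustration):

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

def days_containing(letter):
    """Return the day names that contain the given letter, ignoring case."""
    return [day for day in DAYS if letter.lower() in day.lower()]

print(count_letter("google", "l"))
print(days_containing("d"))
```

"google" has exactly one "l", and every day name contains a "d" because each one ends in "day".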
Most software engineers will just need cheap tokens.
But things like physics and drug discovery have no foreseeable upper bound.
Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.
Current models are still very far from the reasoning muscle required to build things that never break, scale to billions of users with no issues, and cannot be exploited.
It's almost impossible to prove non-trivial software is invulnerable.
It's very easy to prove that it sort of works.
For one, you have hardware vulnerabilities - period. If you're running on any operating system, you have OS vulnerabilities. If you're not running on bare metal, you may have who knows what kind of vulnerabilities. If you're running literally any other piece of software on the same machine, depending on the hardware and OS, you could have vulnerabilities...
There's a lot of room for improving the smaller models at many levels of the stack.
I think it'll be more like we get 1-10T models and then distill those down into smaller models, though.
It seems like the best small models today are all distilled from bigger models.
Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos
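For anyone unfamiliar with what "distilled" means here: in the textbook form, the small (student) model is trained to match the big (teacher) model's softened output distribution. A rough pure-Python sketch of that loss follows. The logits and temperature are made up, and nothing here reflects how any particular lab actually does it:

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]          # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]    # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]      # ranks the classes the other way round

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The student that agrees with the teacher gets the smaller loss. Training pushes that number toward zero, which is how a small model can inherit behaviour from a bigger one.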
It doesn't need to know different languages, every programming language, and so on.
We will for sure get to this in the coming years. After all, they will have to start fine-tuning their training data anyway.
You can, but it's not as useful as you might think.
It needs to at least understand 1 human language to understand your intent to implement features.
If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.
But most people also want it to understand human language to implement features as well.
Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...
And for that you need A LOT more parameters than you might expect.
You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.
You might be thinking: why does it need to memorize dependencies? Can't it just stick all of them in its context and use its super smart brain? No, context is king. You want to keep it as short as possible. The solution is not having a smart model and putting 10M lines of context in it. The solution is having a model with enough parameters to know what it needs to know. Researchers are already working on having "packs" of knowledge where you could download a 20M param pack just for some common dependencies in JavaScript (as an example) - but AFAIK this is likely years away (and may not prove effective).
You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as well as a ~300B model if you give it really good context instead of flooding it with mostly garbage it doesn't need from across your repository.
If you feed it 100x more context to make up for its limited memorized general knowledge, it's going to perform thousands of times worse, completely eliminating any advantage it might get from GRAM...
It is hard to cut out a huge portion of English and truly understand English and human language.
You're just not saving as much as you might assume you could.
Programming is not a rare skill, the interaction with domain knowledge is.
Fine tuning a 'lean and smart' model works really well for discrete, repeatable high volume tasks like support ticket triage, lead classification, content filtering, labelling, generating content with a voice, etc.
Inefficient token burn from throwing large models at everything is definitely a problem - it's like hiring PhDs to answer the phone or to wash dishes.
Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.
- why would a quantum computer help with running an LLM?
- of course there'd be need for frontier companies - nobody else has the resources to train frontier models.
https://www.anthropic.com/glasswing
I've seen the tickets generated by the model that have trickled to my team. They are legitimate, but I can’t speak to model improvement because it's a pilot program.
Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.
As they say, the truth tends to be somewhere in the middle.
We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.
Are you sure that humans can?
Didn't a SOTA model recently prove a mathematical theorem, one that had escaped mathematicians for 80 years?
Maybe a human "novel" invention is just a good interpolation from the datapoints (knowledge) fed to the human.
And how is that anything other than synthesis? Do we pull concepts out of thin air?
I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.
The whole idea of “hyper scale” doesn’t jibe with caution or otherwise slowing down.
The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.
I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.
We have so many ways of optimizing:
- continuously creating more and better training data
- increasing parameters to 20/50/100TB
- We still wait for Mythos access
- We are still waiting for a Mythos distillation (I haven't heard any rumors that a distilled version of Mythos is out)
- Reinforcement learning and evolutionary algorithms have only started to appear
- If a small 30GB model can do stuff, these models can also be used as teachers for the big ones
- We have not yet seen specialized models at all, like a German-speaking Java coding expert model. Why? Even with MoE architecture, you still need to have these layers around
- Research for Diffusion and other models is still in progress
- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron
- Multitoken prediction became available just a few weeks ago
- Compute is only now getting into a range where they can run many more, and cheaper, experiments (see Google IO 2026 announcement)
- World models are showing great progress and we do not know yet what they will bring to the table
- They are probably not fine-tuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts on coding and agentic work. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched, I would say, because they don't have the capacity
- We see more and more multimodal models (these also consume compute)
- The N-Gram paper and similar work: I have not seen all of these things in Chinese open models
- We don't even know yet what Meta is doing, but we do know they restarted their efforts again
- Anthropic's models got a lot better, benchmark-wise, at declining nonsense requests. They are learning how to get rid of or reduce hallucinations
- We are in the middle of the biggest reinforcement loop, with all the training data we give them day to day, and it's not clear at all whether they already use this data in their training and at what stage.
- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this
- There will also be continuous progress in harnesses, which in itself is not part of LLM progress (fair), but these harnesses do get better when you fine-tune a model to be optimized for a harness
- ChatGPT's image model 2.0 got significantly better and came out just a month ago
I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100T models, and we do not know yet what this will mean. I could see us skipping the normal transformer and finding other relevant architectures.
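To put "100T" in perspective, here is a back-of-envelope for the weights alone. It assumes dense storage and ignores KV cache, activations, and optimizer state; the precisions are just illustrative choices:

```python
def weight_terabytes(num_params, bits_per_param):
    """Storage for the weights only, in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_params * bits_per_param / 8 / 1e12

PARAMS_100T = 100 * 10**12  # 100 trillion parameters

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_terabytes(PARAMS_100T, bits):.0f} TB")
```

That works out to 200 TB at 16 bits per parameter and still 50 TB at 4 bits, so the memory system matters at least as much as the raw compute.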
Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.
There was also a research paper where they showed that an LLM can compute things. It will take time to see where this leads.
I don't think the focus on GRAM and facts is so relevant. It's about context and context handling, not just some facts.
If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.
And that will get us up to two orders of magnitude more parameters.
It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.
Can you be a little more specific than that or provide a reference?
I assume you're not indicating universality of neural networks?
This is the newest thing I'm aware of: https://www.percepta.ai/blog/can-llms-be-computers
But there were papers in 2023 with a different approach requiring external memory https://arxiv.org/abs/2301.04589 too
I am ready to bet against this. Scores on knowledge benchmarks like SimpleQA aren't increasing for small models.
> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
Well, for one, we know for certain there is Mythos, which is meaningfully better. And I think there is a lot of juice left to squeeze from Mythos-class models.
Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.
If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows the facts in its context, I think some very interesting stuff would happen.
There isn't a clear definition of what is knowledge and what is intelligence. Is being able to write in C knowledge? Is knowing the undefined behaviour in C knowledge?
Do we?
Have you used it?
What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.
Meaningful in the sense that it could find security vulnerabilities in browsers and kernels that >99% of engineers couldn't find.
I'm talking about output quality compared to parameter size.
Mythos is not 4 orders of magnitude larger than Opus - it's quite possible (likely, even) that no LLM ever reaches that size - and its output is only barely better...
> Mythos is not 4 orders of magnitude larger than Opus
Again, can you define this? What would 4 orders of magnitude better look like?
6 is for sure happening...
As is Gemini 4.
It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...
First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here.
You clearly did not read my first comment or the second, or clearly disagree on what a generation is.
My conspiracy theory is that Apple recognizes this.
I don't think that's a conspiracy theory. AFAIK, it's their stated AI policy...