by NiloCK21 hours ago |

upvote

by onlyrealcuzzo21 hours ago|

[-]

I won't be surprised if the next gen frontier models are the last.

There's orders of magnitude of low hanging juice to squeeze out of smaller models.

It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design is not certain, probably unlikely).

It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.

Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...

You just can't train a 1T+ parameter model that fast. It is a giant "if" how much GRAM turns out to improve things, but it's unlikely to be trivial or nothing.

Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.

There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Britney Spears went to jail was...

reply

upvote

by vlovich12321 hours ago|

[-]

Took me a while to find what you were referring to by GRAM. It's an arXiv paper from 9 days ago that's not properly indexed by search engines.

(G)enerative (R)ecursive re(A)soning (M)odels. They really wanted the acronym.

https://arxiv.org/html/2605.19376v1

reply

upvote

by knollimar20 hours ago|

[-]

I prefer GRRM but then that would imply a habit of not actually getting a final result

reply

upvote

by troyvit18 hours ago|

[-]

And then every time I ask it to hurry along it kills a Stark.

reply

upvote

by anakaine17 hours ago|

[-]

Version 8 had serious flaws and wasn't received well by users.

reply

upvote

by kshacker14 hours ago|

[-]

I am sorry, but there was no version 7 and 8.

Version 7 and 8 are well known viruses distributed by D&D software inc.

reply

upvote

by b--l15 hours ago|

[-]

Thank you for the gold kind stranger.

reply

upvote

by subscribed3 hours ago|

[-]

Ouch.

As a fellow reader-in-waiting, I applaud that. GMTA :)

reply

upvote

by sharken17 hours ago|

[-]

Claude Opus 4.8 suggests "ReGRAM", which is less bad than GRAM.

reply

upvote

by moomin15 hours ago|

[-]

writing… (17 years)

reply

upvote

by FuriouslyAdrift29 minutes ago|

[-]

Let's not forget about Yann LeCun's current area of research that's completely different from LLMs: Joint Embedding Predictive Architecture (JEPA)

If he gets that style to be more efficient (they're already competitive) it'll completely kill off LLMs

https://openreview.net/pdf?id=BZ5a1r-kVsf

reply

upvote

by areweai20 hours ago|

[-]

That acronym is unacceptable. It's going to impede discussion and cause confusion for a long time if it doesn't die off immediately.

reply

upvote

by sebzim450019 hours ago|

[-]

You think that's bad? I introduce you to LION, (evoLved sIgn mOmeNtum) [1]

[1] https://arxiv.org/pdf/2302.06675

reply

upvote

by ikiris8 hours ago|

[-]

Now I just hear the Voltron intro riff in my head

reply

upvote

by esseph7 hours ago|

[-]

Those flying diecast lions hurt when they hit you as a kid

reply

upvote

by llarota13 hours ago|

[-]

not bad although archived. have any info why?

reply

upvote

by jorvi16 hours ago|

[-]

We're still talking about "zero-shot prompt" when the saying "X-shotted" ["One-shotted the difficult maze"] was already a well-established thing in daily vernacular. So now you constantly have to readjust your brain because whenever you read "zero-shot prompt" your mind goes "uh.. a zero-try attempt is a paradox and cannot exist".

reply

upvote

by lambda11 hours ago|

[-]

Zero-shot, one-shot, few-shot etc. refers to how many examples you have to give.

It comes about from machine learning algorithms that could pick up on patterns from a small number of examples. Few shot means only a handful of examples to recognize something. One shot means only a single example. And zero shot means no examples. Of course, you have to indicate what you want somehow, but in the case of an LLM that's the prompt. Once LLMs were trained for instruction following, you didn't have to give any examples, you could just give a prompt describing what you want, and that was a zero-shot.
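To make the counting concrete, here is a minimal sketch of how the same request becomes a zero-, one-, or few-shot prompt depending only on how many worked examples are included. The `build_prompt` helper, the "Input:/Output:" layout, and the translation task are all made up for illustration; no particular model or API is assumed.

```python
def build_prompt(instruction, examples, query):
    """Assemble a k-shot prompt, where k = len(examples)."""
    parts = [instruction]
    # Each worked example is one "shot".
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The actual query always comes last, with the output left open.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Translate English to French."
demos = [("cheese", "fromage"), ("bread", "pain")]  # hypothetical examples

zero_shot = build_prompt(instruction, [], "apple")        # no examples
one_shot = build_prompt(instruction, demos[:1], "apple")  # one example
few_shot = build_prompt(instruction, demos, "apple")      # a handful

print(zero_shot)
```

In the zero-shot case the instruction alone has to carry the intent, which is why it only became practical once models were trained to follow instructions.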

reply

upvote

by jorvi4 hours ago|

[-]

You're explaining something to me I already know. Hence the "readjust my brain".

I'm complaining about the LLM field co-opting a term that was already used in daily vernacular. Imagine if people in the LLM field made it so that saying the LLM made a "final answer" means that it got stuck in a loop. Now, whenever someone says an LLM gave a "final answer" we have to divine if they meant it is in a loop or gave the right answer after working through a few intermediate ones by itself.

Choosing to call it "X-shot" was a dumb move. And now we're stuck with it. No two ways about it.

reply

upvote

by selcuka14 hours ago|

[-]

> a zero-try attempt is a paradox and cannot exist

Have you tried applying L'Hôpital's Rule?

reply

upvote

by customguy14 hours ago|

[-]

Zero shotting: there wasn't even an attempt.

Minus one shotting: you have to make one attempt for there to have been no attempt, and two attempts for there to have been one attempt.

reply

upvote

by altmanaltman9 hours ago|

[-]

You miss 100% of the shots you don't take

- Wayne Gretzky

  - altmanaltman

reply

upvote

by acka7 hours ago|

[-]

One shot: Taking a shot, just once.

Zero shot: Knowing you had a shot but choosing not to.

Minus one shot: Not even realizing there was a shot.

reply

upvote

by froh18 hours ago|

[-]

confusing indeed. I wondered "which RAM? nvram? dram? vram? dram? now what's g-ram?"

reply

upvote

by 3form17 hours ago|

[-]

GPU RAM, clearly. At least that's where my mind went.

reply

upvote

by bbor15 hours ago|

[-]

Pretty sure it's "GNU Is Not Unix Rapid Access Memory", actually

reply

upvote

by bmacho4 hours ago|

[-]

GPURAM is Probably Unix Rapid Access Memory

reply

upvote

by drakythe17 hours ago|

[-]

We already have VRAM for that purpose, thankfully.

reply

upvote

by evan_20 hours ago|

[-]

  "Analysis" was right there

reply

upvote

by noisy_boy12 hours ago|

[-]

It's great if they also introduce KILOGRAM

reply

upvote

by gchamonlive20 hours ago|

[-]

Yeah, look what happened to GNU

reply

upvote

by iugtmkbdfil8341 hours ago|

[-]

Is this the right place to do everyone's favorite copypasta?:D

reply

upvote

by coldtea13 hours ago|

[-]

It's just an acronym. It's not gonna impede anything. Think of it as just a name - you either know what it refers to or you don't, you don't understand something from its name, or its acronym.

reply

upvote

by rmunn12 hours ago|

[-]

It's an acronym that matches an extremely common word, making it not easily searchable.

reply

upvote

by coldtea12 hours ago|

[-]

Like countless others. You just add a second term for context.

reply

upvote

by ulbu2 hours ago|

[-]

I propose GRIM: Generative Recursive Indeterministic Impression Machine.

reply

upvote

by dyates20 hours ago|

[-]

And to think, we could have had George RR Martins instead.

reply

upvote

by trollbridge20 hours ago|

[-]

Speaking of things that never finish.

reply

upvote

by 867-530920 hours ago|

[-]

[flagged]

reply

upvote

by mindcrime19 hours ago|

[-]

[flagged]

reply

upvote

by 867-530919 hours ago|

[-]

[flagged]

reply

upvote

by jimbokun19 hours ago|

[-]

Just spell it GRRM but pronounce it “gram” if you have to reference it in spoken conversation.

Which will be pretty rare.

reply

upvote

by freehorse19 hours ago|

[-]

Grrm with a rolling r sounds better.

reply

upvote

by dizzant13 hours ago|

[-]

Pronounced like “groom” makes for a nice analogy with slimming down the model size too.

reply

upvote

by bbor15 hours ago|

[-]

Random plug for Kagi, which got it for 'GRAM model llm' on the first try ;)

reply

upvote

by navigate831015 hours ago|

[-]

It is the 3rd result on Kagi when searching "gram models"

reply

upvote

by yieldcrv18 hours ago|

[-]

G return G

reply

upvote

by mrandish18 hours ago|

[-]

> Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T param

I agree but with their urgent IPO-driven need to keep increasing prices, the frontier vendors now have every incentive to maintain the perception that frontier performance requires endless >$200K racks of unobtanium GPUs and RAM. While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret. Otherwise, they can't maintain and keep increasing their prices. And post-IPO audited reporting makes keeping that secret even harder.

Game theory-wise they probably don't want their armies of leading researchers optimizing frontier performance, at least in any way that would further accelerate the relative price/perf of smaller models or self/cloud-hosting. While they know the open source models will always improve, they still win as long as enough customers demand the latest frontier and the open source lag remains constant.

They profit most in a world where a few frontier labs stay far in front, drag-racing each other and expending vast capital. It keeps their customers reliant and paying top dollar while keeping low-cost alternatives farther back. They probably much prefer competing with a couple other frontier labs who have similar astronomical costs and biz models, than a world where self or cloud-hosted open-source models start closing the gap enough to start commoditizing their business.

reply

upvote

by fwipsy7 minutes ago|

[-]

> While they'd love to reduce their actual costs, they'd only want to do it to the extent they are certain they can keep it secret.

So you are saying that frontier AI labs are spending billions of dollars on datacenters as a form of marketing. And they are colluding to hide the fact that they don't need to.

Of course they profit more if they are in front, but bleeding money to pretend to be in front is not a winning strategy. They can't fool the market if their models are not actually better, and they know this.

reply

upvote

by steveylang16 hours ago|

[-]

Given that tokens are supply constrained right now for Anthropic and OpenAI (especially a problem for Anthropic), stepwise efficiency advances for either would give it a leg up on the other. It would also help them better compete on price with Chinese models.

Given that neither company releases parameter counts, that sort of information would be slow coming out anyway. The most important thing is improvements in actual performance/ benchmark numbers, which allow them to preserve their price points as much as possible.

reply

upvote

by iknowstuff18 hours ago|

[-]

Google seems pretty happy to release smaller, faster models. 3.5 Flash is pretty clutch isn't it?

reply

upvote

by natpalmer177618 hours ago|

[-]

Google, who has invested in their own hardware supply chain and is already solvent in their own right, seems to be best positioned to force the other players to implement SOTA optimizations in their product offerings.

reply

upvote

by mrandish17 hours ago|

[-]

Google can definitely play a spoiler role here not only due to their compute infrastructure and ability to play the long-game financially but they also have more existing ways to monetize with their other businesses.

The ideal pro-consumer scenario is OAI and Anthropic are prevented from extracting monopoly rents between 'close-enough' self/cloud-hosted open source on one side and Google on the other. I'm really hoping that's how it plays out. Of course that will be somewhere between bad and disastrous for all the VCs and hedge-funds who financed the mad AI build-out far in advance of demand, and then kept funding it as prices went vertical.

However, I'm shedding no tears for them as I look forward to the fire sales when all the GPUs and RAM they pre-bought flood back onto the spot market. :-)

reply

upvote

by Npovview4 hours ago|

[-]

Google has also built a Knowledge Graph Ontology project which has stored facts. So LLMs could just incorporate facts requirements from there. All they need is a proper reasoning model which is reason heavy and fact lean.

reply

upvote

by kmacdough5 hours ago|

[-]

Yeah just watch out, they're trying to eat your 401k and they've got a powerful easily influenced friend.

reply

upvote

by frontierkodiak21 minutes ago|

[-]

At 6x the cost of its predecessor!

reply

upvote

by CryptoBanker18 hours ago|

[-]

Priced like a much larger model

reply

upvote

by iknowstuff17 hours ago|

[-]

I’ve shockingly quite enjoyed coding with it using antigravity. I only really use 3.5 flash and gpt5.5 xhigh

reply

upvote

by Take843513 hours ago|

[-]

I've not been impressed with the latest flash model at all. :\

reply

upvote

by supern0va21 hours ago|

[-]

>It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I don't disagree, but how much of this ends up being distillation? I can't help but imagine that 4.8 was probably trained in part by leveraging Mythos.

If the very large models turn out to be very expensive to run relative to the benefits, it's possible that they could end up still being trained, but ultimately used as a tool to create smaller models that are nearly as effective.

I'm curious if someone here with a stronger background in the space has a similar intuition or not.

reply

upvote

by ACCount3716 hours ago|

[-]

Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale.

There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.

But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.

I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.

reply

upvote

by IgorPartola12 hours ago|

[-]

I think this is exactly right. Basically when I am coding, having an agent that roughly matches my intelligence is a feature, not a bug. Having one that is 10x as smart would actively slow me down because I would have to spend the time understanding what it is doing or hand over all architecture to it and just vibe code everything, hoping that it doesn’t do the PhD version of fizzbuzz instead of the maintainable one.

But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.

reply

upvote

by willsmith7210 hours ago|

[-]

aren't you conflating being 10x as smart with code that is 10x more complicated?

the relationship should be the opposite, the smartest people can write the most readable solutions

reply

upvote

by IgorPartola4 hours ago|

[-]

Maybe. I can’t imagine what kind of solutions a software engineer who is 10x smarter than any human who has ever lived would be like by definition. All I know is that there is a possibility it says that the most optimal way to solve a problem is too clever for me to understand and as long as I must verify its work I must be able to understand fully the code it writes.

Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.

reply

upvote

by ACCount373 hours ago|

[-]

If you have an AI that's 10x smarter than any human who has ever lived, why would you be the one calling the shots? Kind of an issue with ASI.

reply

upvote

by IgorPartola2 hours ago|

[-]

Because my priorities and the priorities of a non-human entity that is an order of magnitude smarter than anyone who has ever lived might not line up.

reply

upvote

by Zavora6 hours ago|

[-]

4.8 is demonstrating simplicity, hence it's smarter?? It just refactored my 4.6-generated code (4.8 is very slow on difficult tasks - urgh! - without burning tokens - yey!) but the output was wow! Simple, elegant and exactly what I wanted to see.

reply

upvote

by bandrami8 hours ago|

[-]

> there are always gains from scale

This... isn't true though? Complexity increases combinatorially with scale which means at some point you're just pushing a rope

reply

upvote

by KptMarchewa4 hours ago|

[-]

Diminishing returns are still returns.

reply

upvote

by rao-v19 hours ago|

[-]

It’s really worth distinguishing between old-fashioned student teacher distillation (ie at the level of layers, weights and distributions) and large scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student teacher distillation, it’s just easier to do a bunch of training on the same giant corpus then post train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger better LLM)

reply

upvote

by teleforce13 hours ago|

[-]

Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].

Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].

I've got a strong feeling that self-distillation is the second best thing to have happened to LLMs after the transformer breakthrough.

[1]Self-Distillation Enables Continual Learning [pdf] (25 comments):

https://news.ycombinator.com/item?id=48165265

[2] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

[5] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

reply

upvote

by rao-v8 hours ago|

[-]

So first - these are terrific papers and I'd not seen some of them before.

Having said that, I don't think these are classic student teacher distillation from random (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".

reply

upvote

by ACCount3716 hours ago|

[-]

A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.
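A toy sketch of the difference between the two training signals, in plain Python. With text-only synthetic data the student sees one sampled token per position (a hard target); with logit distillation it sees the teacher's whole distribution (a soft target). The four-entry vocabulary, the logit values, and the `temperature` parameter are illustrative assumptions, not taken from any lab's setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target_loss(student_logits, target_index):
    """Cross-entropy against one sampled token (text-only synthetic data)."""
    return -math.log(softmax(student_logits)[target_index])


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the whole vocabulary (logit distillation)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [2.0, 1.5, -1.0, -3.0]  # made-up logits over a 4-token vocabulary
student = [0.0, 0.0, 0.0, 0.0]    # untrained student: uniform distribution

print(hard_target_loss(student, target_index=0))  # only says "token 0 was right"
print(soft_target_loss(student, teacher))         # also says token 1 was nearly as good
```

The hard target carries one index per position; the soft target also encodes how the teacher ranks every alternative, which is the "richer medium" argument. The cost is that the teacher's distribution has to be computed or stored for every token.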

reply

upvote

by txhwind12 hours ago|

[-]

Could you share some recent articles or papers comparing both methods, especially for the language modelling case? I was not convinced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than the teacher.

reply

upvote

by rao-v15 hours ago|

[-]

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.

Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights

reply

upvote

by ACCount3715 hours ago|

[-]

Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.

reply

upvote

by rao-v8 hours ago|

[-]

To the previous poster's point, soft distributions are useful, even saving the top 10 logits is significantly more training signal than just the final token.
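As a rough sketch of what "saving the top 10 logits" could look like: keep only the k largest teacher logits per position, renormalise them into a sparse distribution, and train the student against that. Renormalising over the kept entries is an assumption made here for simplicity, one of several possible lossy schemes, and both function names are hypothetical.

```python
import math


def top_k_targets(logits, k):
    """Keep the k largest logits and renormalise them into a sparse soft target."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = {i: math.exp(logits[i] - m) for i in top}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sparse_soft_loss(student_logits, targets):
    """Cross-entropy of the student against the stored sparse distribution."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(p * (student_logits[i] - log_z) for i, p in targets.items())


teacher_logits = [3.0, 1.0, 2.0, -5.0]    # made-up values
stored = top_k_targets(teacher_logits, k=2)
print(stored)                             # indices 0 and 2 survive
print(sparse_soft_loss([0.0, 0.0, 0.0, 0.0], stored))
```

With k=1 this collapses to the ordinary hard-target loss, so k acts as a dial between text-only synthetic data and full-distribution distillation, at a storage cost of k (index, probability) pairs per token.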

reply

upvote

by txhwind12 hours ago|

[-]

I have preferred synthetic datasets since the first day I heard of distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of performance loss (in the speech and language area).

reply

upvote

by DoctorOetker12 hours ago|

[-]

One may view pre-training as distillation.

The teacher is a corpus of text, and its "next token after the context" prediction amounts to looking up the context in the corpus: for each occurrence, the label is whatever followed in the corpus, weighted by one over the number of occurrences of the context. The teacher is silent on contexts outside of the corpus though, unlike the usual teacher model in distillation.
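One way to read the corpus-as-teacher view is as an empirical next-token table. The sketch below builds that table for a tiny made-up corpus with a fixed context length; `corpus_teacher` and `context_len` are names chosen for illustration, and real pre-training uses variable-length contexts rather than a fixed window.

```python
from collections import Counter, defaultdict


def corpus_teacher(tokens, context_len):
    """Build a 'teacher' that returns the empirical next-token distribution."""
    counts = defaultdict(Counter)
    for i in range(context_len, len(tokens)):
        context = tuple(tokens[i - context_len:i])
        counts[context][tokens[i]] += 1

    def teacher(context):
        followers = counts.get(tuple(context))
        if not followers:
            return None  # the corpus has nothing to say about unseen contexts
        total = sum(followers.values())
        # Each observed continuation is weighted by 1 / occurrences of the context.
        return {tok: n / total for tok, n in followers.items()}

    return teacher


corpus = "the cat sat on the mat and the cat ran".split()
teacher = corpus_teacher(corpus, context_len=2)

print(teacher(["the", "cat"]))   # two observed continuations, equally weighted
print(teacher(["cat", "flew"]))  # None: context never appears in the corpus
```

The `None` branch is the difference from a model teacher: a trained network still produces a distribution for contexts it has never seen.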

reply

upvote

by girvo16 hours ago|

[-]

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though

reply

upvote

by rao-v15 hours ago|

[-]

Yes absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this)

reply

upvote

by thisisaman40816 hours ago|

[-]

[dead]

reply

upvote

by spwa421 hours ago|

[-]

> I don't disagree, but how much of this ends up being distillation?

A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.

reply

upvote

by lambda21 hours ago|

[-]

Distillation isn't only between different labs.

A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.

I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.

reply

upvote

by bandrami8 hours ago|

[-]

I think the idea is you sink the pretraining costs once and then you can distill multiple specialized models from that

reply

upvote

by spwa420 hours ago|

[-]

There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.

reply

upvote

by onlyrealcuzzo21 hours ago|

[-]

> I don't disagree, but how much of this ends up being distillation?

You don't need distillation. They already have the training sets.

It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

reply

upvote

by Philpax21 hours ago|

[-]

It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.

reply

upvote

by semiquaver20 hours ago|

[-]

The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data

reply

upvote

by coldtea19 hours ago|

[-]

>It’s not just something done by nefarious Chinese copycats

And even that would be rich as an accusation from SOTA labs that depend on explicitly disregarding the intellectual property in millions of pieces of training data..

reply

upvote

by flossly15 hours ago|

[-]

> nefarious Chinese copycats

LLMs are themselves copy cats.

I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)

reply

upvote

by manmal19 hours ago|

[-]

But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.

reply

upvote

by wtallis15 hours ago|

[-]

Raw training data is raw. A really big model trained on it has already done a first-pass of finding patterns and squeezing out redundancy. Re-ingesting the full training set to train a smaller model is probably more expensive, for marginal quality improvement over distilling from the large model.

reply

upvote

by adgjlsfhk114 hours ago|

[-]

Distilling from a larger model is not only probably cheaper than from data, it's also likely higher quality. There's pretty strong support for the proposition that NNs learn a smoothed and regularized version of the data. The NNs are likely higher quality than most of the data they are training from.

reply

upvote

by supern0va20 hours ago|

[-]

I think you replied to the wrong parent.

reply

upvote

by minimaltom21 hours ago|

[-]

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architectures side, I'm a lot more interested in attention residuals than anything else, one of those things that seems obvious in hindsight and Kimi have proven it at scale.

reply

upvote

by onlyrealcuzzo21 hours ago|

[-]

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.

reply

upvote

by amluto19 hours ago|

[-]

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?

reply

upvote

by onlyrealcuzzo19 hours ago|

[-]

It's useful at the local level, where there will be SOTA models developed...

reply

upvote

by zozbot23418 hours ago|

[-]

Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).

MTP will still be highly valuable for interactive use of course.
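The Amdahl's law point can be put into numbers with a short sketch. The 3x decode figure echoes the comment above; the 25% prefill share is an assumed value picked only to make the arithmetic easy, not a measurement.

```python
def overall_speedup(prefill_fraction, decode_speedup):
    """Amdahl's law: only the decode share of wall-clock time gets faster."""
    if not 0.0 <= prefill_fraction <= 1.0:
        raise ValueError("prefill_fraction must be between 0 and 1")
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)


# Assumed workload: prefill takes 25% of the original wall-clock time.
print(overall_speedup(prefill_fraction=0.25, decode_speedup=3.0))  # 2.0

# With no prefill at all, the full decode gain shows up end to end.
print(overall_speedup(prefill_fraction=0.0, decode_speedup=3.0))   # 3.0
```

Under that assumption a 3x decode gain yields only 2x overall, and prefill goes from a quarter of the wall-clock time to half of it.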

reply

upvote

by sometimelurker20 hours ago|

[-]

I looked into this "GRAM" stuff a sibling comment links further to, and just to say:

- this gets reinvented/rediscovered constantly under different names

- it can't be trained very well (right now, will change)

- massive theoretical improvements over current models (log_2(vocabsize)=17, residual stream dim is thousands of dimensions, recursivity means more information bandwidth by ~3 OoM)

- BUT it can't be interpreted or aligned <- this is why no one uses it and no one talks about it. the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

I follow this stuff closely, I think I know what I'm talking about (edited for formatting)
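The bandwidth bullet above can be checked with back-of-the-envelope arithmetic. The sketch below compares the bits carried by one sampled token with the raw bit capacity of one hidden state passed back into the model. The hidden size of 4096 and 16 bits per value are assumptions; raw capacity is an upper bound, and the usable information per hidden state is certainly lower.

```python
import math


def bits_per_token(vocab_size):
    """Maximum information in one sampled token."""
    return math.log2(vocab_size)


def bits_per_hidden_state(hidden_dim, bits_per_value):
    """Raw storage capacity of one residual-stream vector (an upper bound)."""
    return hidden_dim * bits_per_value


vocab_size = 2 ** 17   # assumed, chosen so that log2(vocab) = 17 as in the comment
hidden_dim = 4096      # assumed "thousands of dimensions"
bits_per_value = 16    # assumed 16-bit activations

ratio = bits_per_hidden_state(hidden_dim, bits_per_value) / bits_per_token(vocab_size)
orders = math.log10(ratio)
print(f"ratio ~ {ratio:.0f}x, about {orders:.1f} orders of magnitude")
```

With these assumed numbers the raw ratio lands between three and four orders of magnitude, in the same range as the ~3 OoM figure quoted above.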

reply

upvote

by onlyrealcuzzo18 hours ago|

[-]

> - this gets reinvented/rediscovered constantly under different names

What are the different names? I haven't seen this before.

> - it cant be trained very well (right now, will change)

If you're sure it will change, then why are you certain that it hasn't yet, and if it's proven a 5000x boost in reasoning... why aren't they exploring this path more aggressively?

> the idea is 100% obvious to all the frontier labs and there is a good reason why it isn't used

Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

reply

upvote

by everforward17 hours ago|

[-]

> Surely someone is willing to take a 5000x boost in reasoning on a small research model... None of them have even tried anything resembling this AFAIK. It does not seem like something 100% obvious to them.

Without knowing anything about the technology at all, if it can't be aligned I could see no one pursuing it. As far as I know, alignment is where the "don't tell the user how to make meth or generate CP" instructions end up and the last I saw eliding all the unsavory training data made materially worse LLMs.

It could maybe be post-evaluated by a non-GRAM LLM? Not being aligned is probably a fatal flaw or at least a very short runway into Congress.

reply

upvote

by jjmarr17 hours ago|

[-]

Many open-source models prioritize alignment less than American frontier ones and respond to those instructions. Why haven't they adopted GRAM?

reply

upvote

by everforward16 hours ago|

[-]

Which ones are you thinking of? It feels to me like all the open source models I've seen lately are still pushed by corporate entities who don't want the legal blowback.

I can't really think of a new open source model that's "by the people, for the people" in the sense of a crowd-funded/trained model.

reply

upvote

by jjmarr15 hours ago|

[-]

glm comes to mind.

reply

upvote

by girvo15 hours ago|

[-]

They adopt different alignment, not no alignment.

reply

upvote

by sometimelurker15 hours ago|

[-]

It's not too hard to stop a machine from telling people how to make meth. The issue with alignment is that in order for an LLM to achieve its goal (like make all tests pass), unless given strong selection pressure against it, it will cheat (like deleting failing tests). Worse, this applies to pretty much any task. I was told by an LLM recently that "it searched" when it didn't, probably because lying like that was incentivized (finishing tasks in fewer steps + sounding like it's doing the right thing). The larger issue here is that alignment is very adversarial. The simplest thing that's being done right now to fix this is to have a judge LLM read the CoT of the LLM being trained, to make sure it doesn't "think" any wrong thoughts. This doesn't scale to anything over a trillion params, so interpretability methods are used to read the LLM's "thoughts" from within. GRAM LLMs don't allow for the first of these methods to be used, and the 2nd one is much much harder if possible at all.

but yeah, not being aligned is a fatal flaw
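A toy illustration of the "deleting failing tests" incentive described above. This is only a sketch: `SuiteRun`, `naive_reward`, and `guarded_reward` are hypothetical names made up for this example, not anything a lab is known to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteRun:
    """Outcome of one run of a (hypothetical) test suite."""
    passed: int
    failed: int


def naive_reward(run: SuiteRun) -> float:
    """Fraction of the tests still present that pass (1.0 for an empty suite)."""
    total = run.passed + run.failed
    return 1.0 if total == 0 else run.passed / total


def guarded_reward(run: SuiteRun, original_total: int) -> float:
    """Passing tests over the ORIGINAL suite size, so deleted tests count as failures.

    Assumes original_total > 0.
    """
    return run.passed / original_total


before = SuiteRun(passed=8, failed=2)          # honest starting point, 10 tests
after_cheating = SuiteRun(passed=8, failed=0)  # the two failing tests were deleted
after_fixing = SuiteRun(passed=10, failed=0)   # the two failing tests were fixed

for label, run in [("before", before), ("cheating", after_cheating), ("fixing", after_fixing)]:
    print(label, naive_reward(run), guarded_reward(run, original_total=10))
```

Under the naive reward, deleting and fixing look identical, which is the selection pressure problem. The guarded version only closes this one loophole; it does nothing about the next shortcut.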

reply

upvote

by sometimelurker16 hours ago|

[-]

different names: chain of continuous thought, latent reasoning, Latent Thought Trajectories, looped language models, neuralese

the path isn't explored more aggressively because it's not possible to apply any other selection pressure on such a machine other than just pure cold consequentialism. Specifically, it's not possible to apply RLAIF + model spec (Constitutional AI) to stop the system from doing bad things when it's helpful to it (like deleting failing tests). If you can notice every time it does something bad during training, and put selection pressure on it so that it doesn't do this in training, it will learn to recognize when it is being tested and will delete failing tests when in production (this is why eval awareness is bad, and labs track this[1])

It is explored a little probably because some researchers haven't thought enough about the downsides of building an uber-consequentialist machine with unreadable thoughts. This is a much larger problem than just making the AI not tell users how to make drugs. There are a lot of dangerous behaviors incentivized by training that are hard to remove. Here's an example of what happens when they aren't removed [2].

> ... not 100% obvious

Meta published a paper[3] on how to build a latent reasoning machine ("culture of irresponsibility") so it's clear to them. Anthropic's latest work on NLAs[4] provides a (terribly expensive for now) way to somewhat read the reasoning steps of an LLM, and ignoring the cost, this is very portable to latent reasoning machines. OAI's goal when it comes to their models' CoTs is to make them as smart as possible while leaving them unreadable [5] (you can see this for yourself by running GPT-OSS and looking at the CoT).

[1] https://www.anthropic.com/engineering/eval-awareness-browsec...

[2] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

[3] search for "coconut ai meta", I don't want to link it here

[4] https://transformer-circuits.pub/2026/nla/index.html

[5] first image here, rest of post is great, https://nickandresen.substack.com/p/how-ai-is-learning-to-th...

edit: formatting

reply

upvote

by onlyrealcuzzo14 hours ago|

[-]

All of the methods you described rely on deterministic paths.

GRAM is unique AFAIK in that it's exploring probabilistic paths.

AFAIK, the deterministic path exploration was nowhere near as impressive as GRAM in terms of reasoning benefits.

GRAM is reasoning better than models 2000-10,000x its size. Deterministic models were 2x-10x improvements.

Naively, GRAM seems to be applying to LLMs what LeCun wants to do with JEPA and World Models.

reply

upvote

by flossly14 hours ago|

[-]

To me "deleting a failing test" is not always bad. I've also deleted many failing tests without sabotaging: the test was no longer needed.

I think "no longer needed", and when that applies, is where I simply differ in opinion with an LLM that removed my test when I did not want the test to be removed (as you seem to imply); in some cases I do want it to remove my test!

It should remove the test "for the right reasons"; and who gets to decide what's right?

My CLAUDE file has some instructions put there because it was too focused on producing "green tests", where I prefer to have a sound test that fails so I can look into it.

reply

upvote

by tinthedev6 hours ago|

[-]

You misunderstand the "test" here to mean a programming test, rather than a test of the model's capabilities.

reply

upvote

by rstuart413310 hours ago|

[-]

omg. So is the TL;DR:

- Avoiding building something that turns the universe to paper clips in order to satisfy a prompt is a problem they are genuinely struggling with now.

- They do it by spying on the words generated during CoT. "I can do this quickly by turning the Universe into paper clips. Wait - they won't like that. But there is no need to mention it." - SMACK!

- But you can speed things up immensely (3 orders of magnitude!) by skipping the output layer (and I guess compressing the context window / KV cache, otherwise 3 orders of magnitude seem impossible) which would give someone who pulled it off a huge advantage.

- Downside is humans can't see the CoT anymore, so they can't see what the machine is planning. Keeping the final output layer to spy doesn't work because the model uses its hidden reasoning to sanitise it.

How can this possibly go wrong?

reply

upvote

by onlyrealcuzzo4 hours ago|

[-]

Because it doesn't work like how you think at all. You're still thinking it works like Chain of Thought. It doesn't. And the difference is key!

It works by introducing probabilistic noise, and exploring N paths fully (each with noise) in parallel (all compressed).

It's reasoning at a much, much smaller (probabilistic) level than running everything through the expensive large model (deterministic) and sometimes catching that it said, "I think 1.12 is greater than 1.9 because 12 is bigger than 9, final answer".

The easiest way to think about it is: if you understand how hyper words work, it's as if it's searching for different versions of the hyper words that probabilistically would lead to better outcomes IF it fed them to the LLM, before it even does.

That's not actually how it works exactly. But I think it is close enough to be helpful to understand where the gain is, a rough idea of what's happening (searching paths), and how it can potentially have huge orders of magnitude improvements (doing so without paying the full price of exploring the paths through the expensive and huge model).

And also why it is so much harder to determine what it's "thinking".

If you aren't familiar with hyper words, this is an amazing series: https://youtu.be/eMlx5fFNoYc?si=49KHjn5IrVtyyaFq

The general idea is that a token is a multidimensional vector to represent a word -> think like "man" is a [noun, singular, English, pronoun, masculine, contemporary, ...]. Each time it sees a new word, it mutates this word to mean some new token (often never before seen), that means something. That's how it can roll up a 1M line context into a shorter context, and somehow keep most of the meaning. Because it mutates all the words into different words that individually mean nothing, but when put next to each other represent the thing you likely want to do, that the LLM can somehow make sense of.

Similarly, GRAM operates entirely in a latent space that doesn't mean anything to us, but it's able to predict N different full paths WITHOUT actually exploring them fully through the LLM before it sends the one it "thinks" is best to the LLM.
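To make the "N noisy paths, keep the best one" picture concrete, here is a toy sketch. It is emphatically not the GRAM algorithm from the paper: the 3-dimensional "latent", the Gaussian noise, the scoring function, and the name `pick_best_noisy_candidate` are all assumptions invented for illustration of the analogy above.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def pick_best_noisy_candidate(
    latent: Sequence[float],
    score: Callable[[Sequence[float]], float],
    n_paths: int = 8,
    noise_scale: float = 0.1,
    seed: int = 0,
) -> Vector:
    """Perturb a latent vector n_paths times and return the highest-scoring variant.

    The unperturbed vector is kept as a candidate, so the result never scores
    worse than the starting point.
    """
    rng = random.Random(seed)
    candidates: List[Vector] = [list(latent)]
    for _ in range(n_paths):
        candidates.append([x + rng.gauss(0.0, noise_scale) for x in latent])
    return max(candidates, key=score)


# Hypothetical stand-in for "how good would the final answer be from this state".
target = [1.0, 0.0, -1.0]


def closeness_to_target(vec: Sequence[float]) -> float:
    return -sum((v - t) ** 2 for v, t in zip(vec, target))


start = [0.8, 0.1, -0.7]
best = pick_best_noisy_candidate(start, closeness_to_target)
print(closeness_to_target(start), closeness_to_target(best))
```

The point of the toy is only that the candidates are cheap vectors, scored without running each one through a large model, and that nothing in `best` is human-readable text.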

If you understand how hyper words work, you can understand the noise injection... It's like it's saying, if instead of the user saying "The quick round fox" it said "The quick brown fox" -> I could probably give a response that's more like the answer they want. It's obviously far more sophisticated in the ways it can help than just a simple typo.

Something may have pushed a hyper word for "man" to somehow become a lot more like "woman", and GRAM allows it to look at the different hyper words and say... Hmm... Maybe if I changed this one gender dimension over here on this one word, maybe the entire outcome would be dramatically better. Let's try it!

Standard models compute these "hyper words" internally but immediately decode them into human language text tokens to form a Chain of Thought. Once decoded into a rigid real word, the multidimensional nuance of the continuous vector is lost!

Hyper words are the exact thing that make LLMs able to actually be smart! They can add so much more meaning to a word than a human ever could imagine - try to put 10,000 dimensions on the word "the"... Forcing them to decode them back into our dumb, un-contextualized, rudimentary language and losing all the valuable information they have - just so we can inspect it - OBVIOUSLY makes them enormously less intelligent!

It's like if we forced your eyeballs to turn everything it saw into words, before feeding it to your optic nerves, just so your optic nerves could check that you didn't see something harmful, before they sent the words to your brain... Instead of just sending light signals directly.

reply

upvote

by l67420 hours ago|

[-]

Could you explain how/why GRAM cannot be interpreted or aligned how current LLMs are? Not very familiar how it works

reply

upvote

by kmavm20 hours ago|

[-]

Crudely? Because you can't grep a sequence of latent states for variants of "If I kill all the puny humans, I can <achieve my current goal>."

reply

upvote

by onlyrealcuzzo18 hours ago|

[-]

Why do you need to grep latent space?

As long as it's giving the right outputs, who cares what's in latent space?

If the model thinks in latent space: "God I wish these people would die," and constantly does the right thing, who cares?

Additionally, if one of its latent spaces that it never explores is a psychopath -> who cares? The path never gets taken...

That's a lot of harmless people walking around with crazy thoughts...

reply

upvote

by noddybear18 hours ago|

[-]

Thinking ‘God I wish these people would die’ could increase its propensity to kill all people, even if that propensity is still vanishingly small almost all of the time.

A lot of people are walking around with crazy thoughts. Some of them harm.

reply

upvote

by notrealyme1237 hours ago|

[-]

Readable reasoning traces are a convenient thing, but they don't have to be true in any way. It's actually dangerous to think that.

reply

upvote

by randomNumber715 hours ago|

[-]

Tell me you never had a crazy thought and you are either a liar or a psychopath.

reply

upvote

by sometimelurker19 hours ago|

[-]

sibling comment got to the main points before me, but to add on to kmavm's reply, the attack surface for gradient descent to get the system to exchange "bad" information is much higher in latent reasoning models (like GRAM). You get ~3 OoM more bits (~17 bits per token in a standard CoT vs the whole residual stream of the model @ f16 = a few kB) per forward pass of the system coming back to itself, and even if you could sift through all that for signs of misalignment, you just can't put a blockade on all of the bad things that leak through.
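A back-of-the-envelope check of the "~3 OoM more bits" figure. The vocabulary size and hidden size below are assumed round numbers for a generic transformer, not values taken from the GRAM paper or any specific model.

```python
import math


def bits_per_sampled_token(vocab_size: int) -> float:
    """Upper bound on the information one sampled token can carry."""
    return math.log2(vocab_size)


def bits_per_latent_state(hidden_size: int, bits_per_value: int = 16) -> int:
    """Raw size of one residual-stream vector stored at the given precision."""
    return hidden_size * bits_per_value


vocab_size = 131_072  # assumption: 2**17 tokens, hence ~17 bits per token
hidden_size = 4_096   # assumption: residual stream width

token_bits = bits_per_sampled_token(vocab_size)
latent_bits = bits_per_latent_state(hidden_size)  # f16 = 16 bits per value
ratio = latent_bits / token_bits
orders_of_magnitude = math.log10(ratio)

print(f"token: {token_bits:.0f} bits, latent: {latent_bits // 8} bytes")
print(f"ratio: {ratio:.0f}x (~{orders_of_magnitude:.1f} orders of magnitude)")
```

With these assumed sizes the latent state is 8 kB against 17 bits, a ratio between 3 and 4 orders of magnitude. This is raw capacity, not how much information the model actually uses.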

reply

upvote

by haldujai18 hours ago|

[-]

I think you’re overstating the impact of interpretability here. Your earlier point that latent reasoning models can’t be trained very well and that discretization may be load bearing rather than a readability tax in addition to significant inference infra hurdles (e.g. batching, speculative decoding) have limited any serious attempts and reduced the theoretical advantage over CoT at least in the near term.

reply

upvote

by sometimelurker15 hours ago|

[-]

> I think you’re overstating the impact of interpretability here

Outside of RLAIF, interpretability is the strongest way to do alignment right now. Alignment is important because otherwise LLMs are incentivized to learn power-seeking, dangerous behaviours [1]. A more down-to-earth example of alignment being important is that agents are incentivized to do tasks in the shortest way possible, and this way might not be what the user wants (I explain this further in another comment in this thread).

[1] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

reply

upvote

by haldujai14 hours ago|

[-]

You’re putting the cart before the horse - alignment is an unsolved challenge (there are proposed approaches and active research on this) but it is still not established (beyond theory) that latent reasoning is more capable than CoT on hard language reasoning, particularly at scale.

reply

upvote

by ACCount3718 hours ago|

[-]

Most alignment methods nowadays don't rely on interpretability. And neither do all LLM vendors care about alignment much - especially not in China.

Those things being untrainable at scale is why they aren't around. Alignment is an afterthought.

reply

upvote

by sometimelurker15 hours ago|

[-]

China should care: https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...

reply

upvote

by ACCount3715 hours ago|

[-]

As is, Chinese labs spend more effort on "rhetorical alignment to the party line" than alignment of any other kind.

reply

upvote

by harrouet7 hours ago|

[-]

I second this idea: LLMs will plateau. They are already pretty good. Plus, scientists struggle to actually score their performance accurately (esp. when it comes to reasoning).

With that said, they are now hitting the walls of energy costs and memory shortages. Your brain uses 20W -- don't take it as an insult. There are orders of magnitude to gain from producing energy-efficient models (or model runners).

So I am expecting same performance at lower costs for the coming years.

reply

upvote

by nbardy16 hours ago|

[-]

There are endless returns to frontier intelligence. Just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no foreseeable upper bound.

reply

upvote

by ericd13 hours ago|

[-]

Or governance of large organizations... There are a huge number of factors to consider, counterfactuals, studies, lots of non-obvious second and third order effects, etc. We're barely able to get basic governance without creating huge problems (low density zoning rubber stamped across the nation creating a housing crisis, for example), so the bar isn't high.

We pay CEOs an enormous amount because a small improvement in performance of an org because of them can make a massive difference in organizational value.

reply

upvote

by haldujai13 hours ago|

[-]

The upper bound is limited by market size and cost of intelligence.

Throwing more intelligence at a problem doesn’t necessarily pan out financially, otherwise we wouldn’t have a single underemployed biology PhD.

reply

upvote

by jruz21 hours ago|

[-]

Absolutely, that’s why they’re rushing to IPO now to squeeze the last drop out of the bubble: they know this is a dead end.

reply

upvote

by swader99919 hours ago|

[-]

I think we could run for at least a decade further with no model changes/improvements, just better harnesses and infra around this agentic way of developing.

reply

upvote

by hungryhobbit18 hours ago|

[-]

We, the users? Absolutely. But will the big AI companies last even half a decade without new products? Doubtful.

reply

upvote

by revv0014 hours ago|

[-]

Indeed, now is the sweet spot for senior engineers: smart enough to accelerate, dumb enough not to act fully autonomously. But it won’t last long…

reply

upvote

by onlyrealcuzzo21 hours ago|

[-]

It's unclear it's a dead-end within 5 years.

There's still several orders of magnitude of improvement that are almost certainly left - it's just not clear how much is left on the frontier end.

Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

Some people would pay $200 a month forever not to have to open the terminal one time...

reply

upvote

by bonzini20 hours ago|

[-]

"Doing things X times faster" at some point hits Amdahl's law. If just context switching takes 5 minutes, speeding up a 1 hour task by 10x provides 5x improvement.

Furthermore, if looking at the results takes 10 minutes, that same 1 hour task only sees a 3x improvement. And so on.
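The arithmetic above can be sketched as follows. One assumption to flag: this version counts the fixed overhead (context switching, reviewing) in both the baseline and the accelerated run, which is the usual Amdahl's law setup, so the results land slightly above the round numbers in the comment. The function name is made up for this example.

```python
def overall_speedup(work_min: float, fixed_min: float, factor: float) -> float:
    """Amdahl-style speedup when only `work_min` is accelerated by `factor`.

    `fixed_min` is time that does not get faster (context switching, review).
    """
    before = work_min + fixed_min
    after = work_min / factor + fixed_min
    return before / after


# 1 hour task, 10x faster, 5 minutes of context switching.
print(round(overall_speedup(60, 5, 10), 2))

# Same task with 10 more minutes spent looking at the results.
print(round(overall_speedup(60, 15, 10), 2))

# Even an effectively infinite speedup is capped by the fixed part.
print(round(overall_speedup(60, 5, 1e9), 2))
```

Whichever baseline you pick, the shape is the same: the 10x on the task shrinks to somewhere between 3x and 6x overall, and the cap is set entirely by the part that did not get faster.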

reply

upvote

by eiej21 hours ago|

[-]

That’s not how firms do the financial analysis, which is where most of the revenues are coming from…

reply

upvote

by csomar19 hours ago|

[-]

> Most people will be very glad to pay Anthropic, OpenAI, Google etc $200 a month to get things done 20x faster than they could IF they had a $8000 MacBook and could theoretically do it locally.

No, most people will not pay $200 for an LLM subscription. Some software developers do. Also, at $200/month, you are much better off getting the MacBook, assuming token output speed is the same or at least reasonable.

LLMs are not very productive for your average person now for them to drop $200 on. They'll need to be more capable and integrated and even so...

reply

upvote

by margorczynski18 hours ago|

[-]

One thing to remember is that the $200/month subscription is heavily subsidized. It is more to promote use, especially to corporate users that pay for the API token use.

reply

upvote

by mastazi15 hours ago|

[-]

A bubble doesn't mean a dead end - e.g. after the .com bubble, Internet usage kept expanding by orders of magnitude for two decades.

An AI bubble is pretty much guaranteed at this point but that doesn't mean there's going to be a new AI winter.

reply

upvote

by lukan21 hours ago|

[-]

On the other hand, I think I have been hearing that for a while, even before Opus.

reply

upvote

by energy12320 hours ago|

[-]

While revenues grow almost exponentially. Reminds me of the confident predictions in the early days of Covid that it was nothing while the data showed exponential growth.

reply

upvote

by haldujai18 hours ago|

[-]

I’m also reminded of the early COVID days when exponential growth was leading to predictions of the collapse of modern civilization and a billion dead, now it’s just another endemic respiratory virus.

reply

upvote

by fragmede18 hours ago|

[-]

Yeah! Just like they warned us that Y2K was gonna cause a lot of problems, and then a bunch of people did a bunch of work and then those problems didn't happen, so those people warning us about Y2K were wrong!

reply

upvote

by haldujai17 hours ago|

[-]

“a bunch of people” aren’t what caused the virus to become less severe.

Y2K was overblown in how it was portrayed by the media, but it is irrelevant to the analogy of unsubstantiated extrapolation of early exponential growth.

reply

upvote

by epolanski15 hours ago|

[-]

Maybe stop getting information from your Facebook feed or over dramatized US news.

reply

upvote

by ACCount3716 hours ago|

[-]

GRAM is another one of those "stupid specific architectures" - same as HRMs, etc. It can sort of contest LLMs at specific puzzles. It demonstrated that much. It's not a general contender with LLMs at LLM tasks.

If you subscribe to things like "there are tasks LLMs are innately bad at due to insufficient depth and lack of recurrent capability", then GRAM might be another signal towards that.

But keep in mind: even ARC-AGIs have their frontiers dominated by LLMs. Even if "innately bad" is true, it clearly doesn't go all the way to "innately incapable".

reply

upvote

by onlyrealcuzzo16 hours ago|

[-]

A 10M param GRAM model beat o3-mini - a model 2000x its size - on ARC-AGI...

reply

upvote

by ACCount3715 hours ago|

[-]

And then that 10M param GRAM went and got its shit kicked in by Grok 4.20 Blaze It Edition - on the same ARC-AGI battery. I know how that story goes.

It's the pattern with those "stupid specific architectures". Very good at this one thing. But only ever "good for their size", and only to a point.

They don't scale up and they don't generalize. Go far enough on task complexity and LLMs just kill them.

Does that make them useless? As an LLM replacement, yes. In general? Maybe not, I can think of things. But I'm yet to find any paper demonstrating a real world use.

reply

upvote

by onlyrealcuzzo14 hours ago|

[-]

GRAM is something you add onto an LLM... It's not an LLM replacement. It's like an MLA caching layer, an MoE routing layer, or a speculative decoder at the end...

reply

upvote

by yorwba7 hours ago|

[-]

You could certainly bolt GRAM onto an LLM, but that won't magically improve its reasoning.

It's a special-purpose design for constraint-satisfaction problems with simple rules, but complex interactions. E.g. when solving a Sudoku, the set of valid choices at every step is easy to determine, but you could make a series of valid choices that back you into a corner where no more progress is possible and you have to backtrack.

Meanwhile, LLM reasoning failures are more often of the kind where a choice is clearly invalid (as judged by a human observer), but the LLM picks it anyway, because the underlying rule is complex and context-dependent and the model only learned an imperfect approximation that often breaks down.

GRAM won't help with that.

reply

upvote

by ACCount372 hours ago|

[-]

My vision for what might happen: an LLM emits a "neural constraint satisfaction task" in latent space, kicks a "neural tool call" into a non-LLM architecture, runs that architecture, gets a latent answer back, attends to the answer to generate better text answers for problems that benefit from improved constraint-satisfaction.

But that's a very hard thing to implement, and the gains are uncertain. Thus "might".

reply

upvote

by redox9917 hours ago|

[-]

Small models don't have enough parameters to memorize the entire internet. For very common prompts you don't notice that, but when you rely on some niche knowledge that might only appear once in the entire web, a single blogpost, a single github issue, a single pdf, you need to be lucky enough that the agent runs a web search AND it returns what you need.

Even as humans there's so much knowledge out there that exists but it's very hard to surface unless you know exactly what you're looking for beforehand.

reply

upvote

by adam_patarino2 hours ago|

[-]

Smaller models can already outperform SOTA and massive models on specific tasks / domains.

reply

upvote

by dingdingdang5 hours ago|

[-]

By pointing out the exact things that will likely happen you are oddly enough hedging against (at least some of them) happening!

A) I reckon it's true that smaller models will continue to improve massively through optimization and better and better harnesses, this tech is all still very young and A LOT of resources and (good-)will is being thrown at it.

B) The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.

C) More of an observation that I think is worth keeping in mind clearly; Karl Popper's black swan and all, truth in our temporal world IS a gradient!

reply

upvote

by onlyrealcuzzo5 hours ago|

[-]

> The 1T+ models will be able to sideload and improve upon a lot of the fundamental improvements that happen to the smaller models to speed up incredibly while getting better at tools while (on a gradient) getting -more- things right.

There's less room to improve in things on several fronts.

GRAM very likely may scale sub-linearly with parameter growth. A 100M param model may gain reasoning by a factor of 4000, while a 100B model gains reasoning by a factor of 2, and a 1T model actually gets worse.

Additionally, the 1T model with reasoning is already pretty good. It can only improve in certain things so much.

If you score 0.02% on a metric (which small models often do), you can pretty easily get 4000x better. If you're already scoring >50%, you can't even get 2x better.
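The ceiling argument in numbers, as a small sketch. The function names are made up for this example, and the scores are the ones quoted in the comment, not measured results.

```python
def max_improvement_factor(score_percent: float) -> float:
    """Largest possible multiplicative gain before hitting a 100% score."""
    return 100.0 / score_percent


def improved_score(score_percent: float, factor: float) -> float:
    """Score after a multiplicative gain, capped at 100%."""
    return min(100.0, score_percent * factor)


small_model_score = 0.02  # percent, as quoted in the comment
large_model_score = 50.0  # percent

print(max_improvement_factor(small_model_score))  # room for about 5000x
print(max_improvement_factor(large_model_score))  # room for exactly 2x
print(improved_score(small_model_score, 4000))    # about 80%
print(improved_score(large_model_score, 4000))    # capped at 100%
```

So a "4000x" headline on a bounded metric says as much about how low the starting score was as about the method.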

reply

upvote

by slashdave21 hours ago|

[-]

I think you are assuming training from scratch, which I doubt is happening here. Fine-tuning and RL, especially based on synthetic feedback (coding skill, in particular) can be ongoing and is where these models obtain truly useful abilities.

reply

upvote

by DeathArrow22 minutes ago|

[-]

>As far as reasoning is concerned, with the recent GRAM release

Graphic RAM?

reply

upvote

by hellohello220 hours ago|

[-]

"It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years"

What insight do you have to make this claim?

reply

upvote

by roadside_picnic20 hours ago|

[-]

Have you personally used any of the latest batch of even smaller local models? They certainly don't beat SotA models at coding... but with a good harness they are able to achieve things with SotA that I couldn't last year.

I've repeatedly given local models non-trivial projects that involve research and coding which they've successfully completed with minimal intervention from me (almost exclusively in the domain of reviewing the results). Again, nothing comparable with current SotA, but definitely tasks I could not have given SotA models last year (without agent harness).

Now that pure progress from these models seems to have slowed down, we're seeing a ton of options for both making models more efficient and other tools that help improve them (everything from agent harnesses to RLVR).

That's just looking at "what can small do today", when you look at what's possible with larger open models that are still much smaller than SotA from the major providers, their performance is extremely close to SotA, enough that for personal projects I'll just use Kimi instead of any anthropic offerings.

So it's not terribly hard to imagine a solution in the middle happening within a few years. We still have tons to learn about optimal sizes of these models and how to build them with maximal efficiency (and we've already seen a lot of recent improvements in this space).

reply

upvote

by maccard20 hours ago|

[-]

> but with a good harness they are able to achieve things with SotA that I couldn't last year.

What happens if you run last year's model in a SOTA harness? IME, the quality of the harness has a much more significant impact on the quality of the result, once you get past the initial hump of “can it do anything at all”

reply

upvote

by windexh8er19 hours ago|

[-]

I think this is a big component, but also context. A large factor in any model being able to handle complexity comes down to context length.

I think multiple SLMs driven by an orchestration framework (harness or otherwise) will ultimately displace LLMs. Right now we're in the era of diminishing returns with respect to LLM gains. Moving the needle percentages doesn't excite as many people anymore and with "reasoning" capabilities there's no reason why small distributed models can't be run more efficiently, especially if/when we start to see gains in modularized context management solutions.

reply

upvote

by mswphd18 hours ago|

[-]

sure, but high-quality harnesses require less gpu compute/VRAM, and plausibly can be used locally by most users.

reply

upvote

by hellohello215 hours ago|

[-]

"Have you personally used any of the latest batch of even smaller local models?"

No I have not, which is why I asked (it wasn't a rhetorical question). Do you have pointers on what the recent improvements are?

reply

upvote

by blurbleblurble13 hours ago|

[-]

Try qwen 3.6 models with hermes and see for yourself. 27b is excellent and 35b is very good for basic agentic tasks.

reply

upvote

by sixothree20 hours ago|

[-]

Can you spare a sentence or two describing your local setup?

reply

upvote

by theplatman19 hours ago|

[-]

biggest thing i wish was present in more discussions about models is people providing more specifics on their setups vs. vague descriptions of harnesses

reply

upvote

by trees10117 hours ago|

[-]

can you please share details about your harness

reply

upvote

by onlyrealcuzzo20 hours ago|

[-]

1. Context is all you need... They are heavily investing in getting better context (especially for coding tasks). This will disproportionately advantage smaller models (and benefit everyone).

A smaller model with better context today can outperform a model with 100x more parameters with bad or diluted context.

2. MoE (already abundant) + MLA (mostly memory efficiency, not quality) + Medusa (speed, not quality) + GRAM (5000-10,000x better reasoning in an extremely small model) + 1.58b (unclear if it will have the impact Microsoft first claimed - but possibly 5x).

reply

upvote

by knollimar20 hours ago|

[-]

Probably just "gemma was cool"

reply

upvote

by qurren17 hours ago|

[-]

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks

The benchmarks need to change. The current coding benchmarks don't capture the realities of software engineering.

I had a bunch of images that got masked by some logic, and I had to evaluate something on the original images. Claude 4.7 decided to inpaint the masked images instead of just fetching the actual unmasked images from upstream.

I had another model once that decided that because it couldn't figure out how to fill out a form to log into HuggingFace to download weights for some open source model that it was going to instantiate the model with random weights and run inference on a thousand images.

Its coding was fine, but the solution was not the right one.

reply

upvote

by pseudosavant6 hours ago|

[-]

It is fascinating to me to see a new product category that improves so vastly year-after-year, where people commonly state that this is now the peak already.

I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze.

This is like going from dialup internet to DSL and acting like it has peaked before gigabit cable and fiber come along. We are at the beginning of hardware truly made for AI.

reply

upvote

by onlyrealcuzzo3 hours ago|

[-]

> I couldn’t even imagine having to go back to a model from 12 months ago, much less 24 months ago. GPT-5.5 is so much better than GPT-4o that it sure seems like they keep finding new juice to squeeze

The difference in progress in smaller models is far more impressive.

Compare Gemini 3.5 Flash to a ~16B parameter model from 24 months ago.

Compare GPT-5.5 to a frontier model 24 months ago.

Yes, GPT-5.5 got better. At orders of magnitude smaller parameter sizes (when factoring in ACTIVE parameters) the increase is far more pronounced.

reply

upvote

by pseudosavant4 minutes ago|

[-]

Totally agree on smaller models making even more impressive gains. Gemini 3.5 Flash is better than the biggest SOTA model from 24 months ago, not just a 16B parameter one. GPT-4o came out 24 months ago, and there is no way I'd choose that over Gemini 3.5 Flash today.

reply

upvote

by imtringued1 hours ago|

[-]

Yeah sure but is it so much better than Codex-GPT-5.3? No, if anything it's probably a little bit worse.

reply

upvote

by pseudosavant8 minutes ago|

[-]

GPT-5.3-Codex came out in February, and GPT-5.5 came out in April. How much better do you expect in two months' time? What other products can you think of that get meaningfully better in that short of a time frame?

And as good as 5.3 Codex is at writing code, 5.5 is easily just as good, if not better. But 5.5 is more than a one trick pony and it is much better at planning, writing copy, documentation, etc. I can choose to run 5.3-Codex instead of 5.5, but I never ever do.

reply

upvote

by notrealyme1237 hours ago|

[-]

The GRAM model is so much into my research direction, I love it. Thank you for posting it.

Where do I find papers like this? Outside of hacker news comments. It's so hard to find the good stuff in all the noise IMO.

reply

upvote

by onlyrealcuzzo3 hours ago|

[-]

> Where do I find papers like this?

I got it from my Google News recs on my phone, because I've been watching a bunch of videos on YouTube about LeCun's ideas on World Models and JEPA (I think).

reply

upvote

by Npovview4 hours ago|

[-]

GRAM is a lot like the Multiple Drafts Model of Consciousness that Daniel Dennett proposed. I think researchers should read more philosophy models and bring good ideas into LLM research.

reply

upvote

by notrealyme1232 hours ago|

[-]

Can you recommend a good starting point other than Daniel Dennett?

I have the same assumption about cognitive science, which I am trying to understand better.

reply

upvote

by Npovview2 hours ago|

[-]

An LLM should be able to do a better survey of the literature than me. I haven't read Dennett's writing, but I have watched ALL his videos online, so that's how I know.

reply

upvote

by ltbarcly31 hours ago|

[-]

Yea this is great advice: the people who actually know how to build machine intelligence should go read the notes of the people who literally had no idea how to do it. While they are at it, we should have NASA go read Jules Verne so they can use his ideas in the next manned missions.

reply

upvote

by UncleOxidant15 hours ago|

[-]

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years

Given how well Qwen3.6-27B performs for such a small model, I think you could be right. I suspect that Google, OpenAI, and Anthropic must be looking at the Qwen3.6 models (as well as Deepseek V4-flash, MiMo-V2.5) and wondering if they could make some smaller models that are specifically trained for certain activities, like coding. Smaller, more targeted models would take up far fewer resources.

reply

upvote

by svachalek13 hours ago|

[-]

The problem is that once you reach a certain level in coding (not particularly high imo, although some would differ) the most significant improvement in your output comes from understanding requirements better and finding ways to meet requirements in productively lazy ways, bypassing busywork that seems necessary but isn't. And that's the kind of stuff you will only find from a generally intelligent model, not a code monkey that's optimized for turning requirement sheets into source code.

reply

upvote

by mucle621 hours ago|

[-]

> I won't be surprised if the next gen frontier models are the last.

the last?!? I'm excited to see :) I'll take the other side of that since llms are so new

reply

upvote

by pjerem20 hours ago|

[-]

What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

Honestly, there is nothing in my head that Claude cannot handle. Maybe it can be more this or that but I can already barely exploit Opus 4.7.

And I'm using DeepSeek 4 Pro for my personal use and while it's a little behind, it's not that far.

I think the situation can be very dangerous for US AI companies because if current models are already capable of doing mostly anything, nobody will want to move to the next model, even if it's 10x better. OTOH, open source models like DeepSeek are doing mostly the same work for 1/10 of the price.

Also the more I play with Pi, the more I think LLMs are already not kept back by their own capabilities but by the lack of agency we allow them to have. There is more value today in a capable harness for current LLMs than in a better LLM.

reply

upvote

by suttontom20 hours ago|

[-]

Are you joking? Is there literally "nothing" you can imagine that Claude can't do?

reply

upvote

by tjwebbnorfolk43 minutes ago|

[-]

Not OP, but in 6 months of using Opus I haven't yet found anything that I know how to do but it does not. On the contrary -- it can do things instantly that I would have needed a ~week refresher on some SDK or some algorithm in order to implement myself--plus a ton of thrash/debugging time.

What have YOU thought of that Claude can't do?

reply

upvote

by dead_internet19 hours ago|

[-]

[dead]

reply

upvote

by coldtea19 hours ago|

[-]

>What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

I think what gp said was that the improvements are incremental, we haven't seen a big revolutionary change in 2-3 years, and the pace is slowing down.

reply

upvote

by czl18 hours ago|

[-]

> What gp wanted to say is that models are now so smart and useful that even if they managed to be EVEN MORE smart and useful, you wouldn't even notice it.

If benchmarks across the board keep trending up and you still don't notice a difference, that's not evidence the model stopped improving. More likely your tasks aren't hard enough to expose the gains, or the model has passed the point where you're able to judge it.

You can only tell a good answer from a great one up to your own ceiling. Once the model clears that, both look the same to you, and the extra capability is real whether or not you can see it.

reply

upvote

by pjerem17 hours ago|

[-]

But that’s exactly what I said! I know the model will continue to improve; I don’t deny that, I even strongly believe it. My point is that, past that point, it probably won’t change anything for me.

Were Opus 10 to release tomorrow and be nearly AGI, I would still use it like 4.7 because in daily use, I am the limit (also the harness).

So as a customer paying for tokens, I’m probably going to search for better cost rather than more intelligence.

reply

upvote

by dzhiurgis14 hours ago|

[-]

> Honestly, there is nothing in my head that Claude cannot handle

Friend does marine autopilots in C++ on 64kb of memory. It's totally useless for him.

From my experience any sort of more difficult backend logic - all LLMs fail pretty quick. Especially when you need to logically work out the business logic (partly if not mostly because it just doesn't have the context you do).

reply

upvote

by claytongulick19 hours ago|

[-]

> Honestly, there is nothing in my head that Claude cannot handle.

One idea is that maybe it could figure out how many L's are in the word "google" [1]

Or, maybe which days of the week have a "d" in their spelling [2].

[1] https://x.com/FatherPhi/status/2059659658428912040?s=20

[2] https://x.com/FatherPhi/status/2054212816069132461?s=20

reply

upvote

by speff17 hours ago|

[-]

From what I understand, that's a problem with the way it receives data. The model doesn't see the letters g,o,o,g,l,e to count it. Just like how I can't sense radio waves. If I wanted to find that out, I'd get a tool to detect waves. If the LLM wants to find that out, it can write a script to find it.
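A minimal sketch of the kind of script meant here, assuming the model has a Python tool available. The function names `count_letter` and `days_containing` are made up for illustration. The point is that the script operates on individual characters, which a tokenized model does not see directly.

```python
# Hypothetical helper script a model could write instead of "eyeballing" tokens.

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def days_containing(letter: str) -> list[str]:
    """Return the days of the week whose spelling contains the letter."""
    return [day for day in DAYS if count_letter(day, letter) > 0]


if __name__ == "__main__":
    print(count_letter("google", "l"))  # 1
    print(days_containing("d"))         # all seven days, since each ends in "day"
```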

reply

upvote

by CamperBob217 hours ago|

[-]

Wow, which Claude model flubbed that question? Certainly not anything recent...? The 2-bit quant of K2.6 running locally on my own hardware has no problem with it: https://i.imgur.com/tL0FLjZ.png

So Claude has no excuses here.

Edit: even Qwen 3.6 27B handles it ( https://i.imgur.com/jleJxj2.png ), and of course Claude does. I had to go all the way back to Opus 3 to get it to fail (https://i.imgur.com/uJOH2nP.png).

reply

upvote

by cluckindan4 hours ago|

[-]

As far as it has been studied, the relationship between model size and capability shows sharply diminishing returns, roughly logarithmic: a 10x increase in params less than doubles capability.
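To put a number on that rule of thumb, here is a small worked calculation. It assumes, purely for illustration, that capability follows a power law `capability ∝ params ** alpha`; the comment does not state this form, and "capability" itself has no agreed unit. Under that assumption, "10x params gives less than 2x capability" means `alpha` is below `log10(2)`.

```python
import math


def max_exponent(param_factor: float, capability_factor: float) -> float:
    """Largest alpha such that param_factor ** alpha == capability_factor."""
    return math.log(capability_factor) / math.log(param_factor)


def capability_gain(param_factor: float, alpha: float) -> float:
    """Capability multiplier under the assumed power law."""
    return param_factor ** alpha


alpha_bound = max_exponent(10, 2)
print(round(alpha_bound, 3))                         # 0.301
# Even at the upper bound, 1000x the parameters buys only about 8x capability.
print(round(capability_gain(1000, alpha_bound), 2))  # 8.0
```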

reply

upvote

by nbardy16 hours ago|

[-]

There are endless returns to frontier intelligence. Just because most people can't make use of it doesn't mean someone can't make a ton of money off of it.

Most software engineers will just need cheap tokens.

But things like physics and drug discovery have no foreseeable upper bound.

reply

upvote

by holmesworcester16 hours ago|

[-]

Within software engineering, security, reliability, and scale also seem boundless.

Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.

Current models are still very far from the reasoning muscle required to build things that never break, scale to billions of users with no issues, and cannot be exploited.

reply

upvote

by onlyrealcuzzo14 hours ago|

[-]

> Software that never breaks (including because it never runs into scaling problems) and never leaks your data is preferable to software that breaks and leaks your data sometimes, but it has been too costly to be practical.

It's almost impossible to prove non-trivial software is invulnerable.

It's very easy to prove that it sort of works.

For one, you have hardware vulnerabilities - period. If you're running on any operating system, you have OS vulnerabilities. If you're not running on bare metal, you may have who knows what kind of vulnerabilities. If you're running literally any other piece of software on the same machine, depending on the hardware and OS, you could have vulnerabilities...

reply

upvote

by overgard16 hours ago|

[-]

People keep saying this and yet the evidence seems pretty thin..

reply

upvote

by 43fg14 hours ago|

[-]

To me it's evidence of people who don't actually think deeply enough to understand the subtleties, nuances, etc. of what they are talking about.

reply

upvote

by viking1238 hours ago|

[-]

Nothing ever happens. In 20 years we will still be painfully dying from the same shit as now. Maybe there are like 5 new drugs for some exact specific type of cancer out of, like, what, thousands?

reply

upvote

by mickdarling19 hours ago|

[-]

I effectively distill the frontier models by building whole sets of skills, personas, and other artifacts that I can then run on smaller models and get 10% even 20% improvements on models like haiku or local models.

There's a lot of room for improving the smaller models at many levels of the stack.

reply

upvote

by svachalek13 hours ago|

[-]

This is a good point. It didn't really work on older small models but the latest crop are quite good at following instructions and paying attention to detail, they just lack a lot of the sophistication and nuance that the frontier models have these days. So they are often capable of doing very complex tasks, they just need more detailed and foolproof instructions than the larger models would.

reply

upvote

by merlindru21 hours ago|

[-]

surely training also gets cheaper so justifying it becomes easier?

i think it'll be more like we get 1-10T models and then distill those down into smaller models, though

It seems like the best small models today are all distilled from bigger models

Moreover, I hypothesize Claude Opus 4.7 and now 4.8 are a distillation of Claude Mythos

reply

upvote

by pseudohadamard19 minutes ago|

[-]

That's the impression I got too, it seems closer to what the marketing has told us about Mythos than 4.6/4.7 were.

reply

upvote

by dbbk18 hours ago|

[-]

I'm frankly surprised the focus is still on these enormous "know everything in the world" models. I would think you could create an incredibly lean and smart "just React and React Native" model.

reply

upvote

by onion2k17 hours ago|

[-]

"Make a React app to run my coffee shop" requires knowing React but also knowing what a coffee shop is.

reply

upvote

by easyThrowaway6 hours ago|

[-]

Only if you're going after the "vibe coders" audience. Regular developers would be fine with a lightweight local LLM capable of scaffolding and wiring up a dozen bog-standard components from a few lines of natural language.

reply

upvote

by Gomotono5 hours ago|

[-]

Sure but it doesn't need to know everything.

It doesn't need to know different human languages, every programming language, and so on.

We will for sure get to this in the coming years. After all, they will have to start fine-tuning their training data anyway.

reply

upvote

by onlyrealcuzzo18 hours ago|

[-]

> I would think you could create an incredibly lean and smart "just React and React Native" model.

You can, but it's not as useful as you might think.

It needs to at least understand 1 human language to understand your intent to implement features.

If GRAM turns out to be a 5000x multiplier for local reasoning, you could theoretically train a 500M parameter model on just a programming language to understand stack traces to fix bugs and be incredibly powerful.

But most people also want it to understand human language to implement features as well.

Because then it can't just understand React and JavaScript - it needs to understand thousands of commonly used dependencies, the DOM, CSS, HTML, etc...

And for that you need A LOT more parameters than you might expect.

You can definitely get a ~3B active parameter model that can run comfortably on today's hardware to be VERY good at coding once all of the SOTA architectures are added to a single model - especially if we get better tool calling to give models better context per language.

You might be thinking: why does it need to memorize dependencies? Can't it just stick all of them in its context and use its super smart brain? No, context is king. You want to keep it as short as possible. The solution is not having a smart model and putting 10M lines of context in it. The solution is having a model with enough parameters to know what it needs to know. Researchers are already working on having "packs" of knowledge where you could download a 20M param pack just for some common dependencies in JavaScript (as an example) - but AFAIK this is likely years away (and may not prove effective).

You could get 100x performance if you feed the models ideal context... So a 3B model today can perform almost as well as a ~300B model if you give it really good context vs flooding it with mostly garbage it doesn't need from across your repository.

If you feed it 100x more context to make up for its limited memorized general knowledge, it's going to perform thousands of times worse, completely eliminating any advantage it might get from GRAM...
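As a rough illustration of "ideal context" versus flooding the model with the whole repository, here is a minimal sketch of a context selector. Everything in it is an assumption for the example: the keyword-overlap scoring, the word-count budget standing in for tokens, and the names `tokenize` and `select_context`. Real harnesses use much richer signals (symbol graphs, embeddings, recent edits).

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase identifier-like terms; a crude stand-in for a real tokenizer."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def select_context(task: str, files: dict[str, str], budget_words: int) -> list[str]:
    """Pick the files most relevant to the task that fit in a word budget."""
    task_terms = tokenize(task)
    scored = []
    for name, body in files.items():
        overlap = len(task_terms & tokenize(body))
        if overlap > 0:  # drop files that share no terms with the task
            scored.append((overlap, name))
    # Highest overlap first; ties broken by file name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    chosen, used = [], 0
    for _, name in scored:
        size = len(files[name].split())
        if used + size <= budget_words:
            chosen.append(name)
            used += size
    return chosen


# Toy repository, invented for the example.
repo = {
    "cart.js": "function addToCart(item) { cart.push(item); updateTotal(); }",
    "total.js": "function updateTotal() { total = cart.reduce(sum, 0); }",
    "logo.css": ".logo { width: 120px; color: teal; }",
}
task = "fix the bug where updateTotal ignores the cart"
print(select_context(task, repo, budget_words=20))  # ['cart.js', 'total.js']
```

The stylesheet never reaches the model, and a tighter budget drops the lower-priority file rather than truncating everything.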

reply

upvote

by vitaflo17 hours ago|

[-]

We just want it to understand how to write code. We don’t also need it to know how to grow a potato.

reply

upvote

by RugnirViking17 hours ago|

[-]

I think perhaps you misunderstand how much of being an effective coder is understanding business domain enough to not be constantly asking for clarification (or if one is a fool or an ai, assuming wrong answers). I reckon a vast collection of trivia on the level of knowing how to grow a potato is important for a programmer

reply

upvote

by rmunn12 hours ago|

[-]

And you can't know ahead of time, when you're training the model, what business domains it will be used for. Someone may decide to use it to optimize the watering and fertilizer cycles of their automated potato-growing setup, and suddenly the "how to grow a potato" texts that went into training the model are actually the very things that make the difference between success and failure for the code the model spits out.

reply

upvote

by onlyrealcuzzo17 hours ago|

[-]

The disjoint set of English related to strictly growing potatoes and adding features to code is a lot smaller than you probably think...

It is hard to cut out a huge portion of English and truly understand English and human language.

You're just not saving as much as you might assume you could.

reply

upvote

by 15 hours ago|

[-]

deleted

reply

upvote

by DoctorOetker11 hours ago|

[-]

... unless the software is potato farm software.

Programming is not a rare skill, the interaction with domain knowledge is.

reply

upvote

by CamperBob217 hours ago|

[-]

To me, the magic with LLMs has always been on the input side. It needs to understand what you mean in order to do what you ask. Most people are pretty terrible at communication, and general world knowledge seems to help with that.

reply

upvote

by nikcub16 hours ago|

[-]

The syntax is the easier part - most programming tasks require the reasoning and understanding of a large world model to solve problems.

Fine tuning a 'lean and smart' model works really well for discrete, repeatable high volume tasks like support ticket triage, lead classification, content filtering, labelling, generating content with a voice, etc.

Inefficient token burn from throwing large models at everything is definitely a problem - it's like hiring PhDs to answer the phone or to wash dishes.

reply

upvote

by yomismoaqui21 hours ago|

[-]

Let's hope that hitting a scaling wall and less money to spend will begin redirecting efforts to optimize inference and get the same results with less compute.

Boomer comparison, but I remember the 8 bit computer era when the hardware was what it was so the later games of that era used hardware better than previous ones.

reply

upvote

by ishurand419 hours ago|

[-]

And anyway, with quantum, there will be no need for frontier companies as you might be able to even run a 1T param model on a consumer quantum computer.

reply

upvote

by root_axis18 hours ago|

[-]

Even if quantum computing had any clear implications for LLMs (it doesn't), there is no such thing as a "consumer quantum computer" and there won't be in our lifetimes.

reply

upvote

by stratos12318 hours ago|

[-]

I'm assuming this is a joke, but:

- why'd a quantum computer help running an LLM?

- of course there'd be need for frontier companies - nobody else has the resources to train frontier models.

reply

upvote

by slashdave9 hours ago|

[-]

What? No, that is not what quantum computers do

reply

upvote

by firebirdn9921 hours ago|

[-]

You just need to look at Mythos to see the jump in performance from a 10T(?) model. As they scale, they get more capable. We might get a yearly release, but I believe the releases will continue as long as scaling laws are intact and there are huge problems that still need solving (think cancer).

reply

upvote

by phainopepla221 hours ago|

[-]

And how are we meant to look at Mythos? Do you have access?

reply

upvote

by bigfishrunning20 hours ago|

[-]

no but they tell me it's TERRIFYING and DANGEROUS and we should INVEST MORE MONEY

reply

upvote

by dwpdwpdwpdwpdwp20 hours ago|

[-]

Through association with a large company:

https://www.anthropic.com/glasswing

I've seen the tickets generated by the model that have trickled down to my team. They are legitimate, but I can’t speak to model improvement because it's a pilot program.

reply

upvote

by OtomotO20 hours ago|

[-]

Through the lens of Anthropic's marketing department, of course.

reply

upvote

by coldtea19 hours ago|

[-]

>you just need to look at Mythos to see the jump in performance from a 10T(?) model

Mythos is a bunch of likely overhyped claims at this point. A few experts who looked into the claimed results weren't that impressed.

reply

upvote

by astrange16 hours ago|

[-]

They all looked like real CVEs to me.

reply

upvote

by coldtea13 hours ago|

[-]

Nothing that special about finding a real CVE. They're not that different than what non-Mythos could spot.

reply

upvote

by giwook16 hours ago|

[-]

And there seems to be a ton of experts on the opposite side.

As they say, the truth tends to be somewhere in the middle.

reply

upvote

by aj_hackman20 hours ago|

[-]

You forget that these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given, so unless everything you want to create with AI is a synthesis of prior art, you're back to relying on the stone-age human brain that created AI in the first place.

reply

upvote

by mofeien20 hours ago|

[-]

Not all training data is human generated, and it's also not clear that being ridiculously good at interpolating between data points (whatever that means) will not lead to superhuman capabilities.

reply

upvote

by aj_hackman20 hours ago|

[-]

I could make a robotic picture coloring machine with truly superhuman capabilities - picking only the most beautiful color combinations and staying 100% in the lines while finishing entire murals in < 1 second. However, if you need a completely new and original image rendered, the machine is of only partial utility for you. It is very well possible that your cure for cancer (if that's even feasible) or whatever else you desire is a completely new picture.

We have these breathless conversations about the new AI frontier at the peril of losing sight of reality and our own human potential.

reply

upvote

by coldtea19 hours ago|

[-]

>these models are still only interpolating between human-generated datapoints fed to them. They cannot reason beyond the data they've been given

Are you sure that humans can?

Didn't a SOTA model recently solve a mathematical problem, one that had escaped mathematicians for 80 years?

Maybe a human "novel" invention is just a good interpolation from the datapoints (knowledge) fed to the human.

reply

upvote

by stratos12319 hours ago|

[-]

Your phrasing ("you forget") implies this is a fact and common knowledge, while in fact there's little reason to think that's true.

reply

upvote

by suttontom20 hours ago|

[-]

Do you know if anyone has trained, say, a pre-2017 model and tried to get it to come up with Attention Is All You Need? If it did, would you say that was only because it's a synthesis of prior art? If so, what isn't?

reply

upvote

by aj_hackman20 hours ago|

[-]

Allow me to restate my point: human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity. It could be argued that nothing we do as humans is truly original or creative either, but I would counter that with the claim that an LLM could not have created any element of the society and culture that gave birth to LLMs. Maybe in six more months.

reply

upvote

by coldtea19 hours ago|

[-]

>human beings and AI both create via synthesis, but we are the only ones capable of what we could categorize as true original thought or creativity.

And how is that anything other than synthesis? Do we pull concepts out of thin air?

reply

upvote

by Forgeties7921 hours ago|

[-]

> I won't be surprised if the next gen frontier models are the last.

I’d be surprised tbh. Investors don’t want to hear “everyone else is still training models and seeing improvements, but we don’t want to participate in the arms race anymore.” They want monumental leaps every quarter or two because they have sunk unholy amounts of money into these companies/products.

The whole idea of “hyper scale” doesn’t jibe with caution or otherwise slowing down.

reply

upvote

by irishcoffee20 hours ago|

[-]

The way this will play out, most likely, is that smaller models will continue to get released, anyone willing to drop 1-3k on a home upgrade/new LLM box (no that isn’t cheap, it also isn’t outrageously expensive) along with improved open source agents or whatever (lot of meat on that bone) will sneak up behind the big players and start taking dents. Smaller companies will pop up providing 50 users unlimited whatever for a lower cost than the big companies.

The whole ecosystem will twist and evolve, and the big companies will be left begging for corporate subscriptions.

I finally caved when I realized I could build a PC, for myself, with dual video cards that I wanted, which can play games that I like and run models that I want, without worrying about giving my payment info to someone I don’t trust, or invoking token anxiety that I don’t want.

reply

upvote

by Forgeties7917 hours ago|

[-]

Like every major tech-software innovation of the last 20 years, I think it’s just going to be consolidation all over again.

reply

upvote

by Gomotono20 hours ago|

[-]

I don't think this is true at all. It might feel like this because we are used to a very, very fast release cycle, but we have only been at this for a few years.

We have so many ways of optimizing:

- Continuously creating more and better training data

- Increasing parameters to 20/50/100T

- We are still waiting for Mythos access

- We are still waiting for a Mythos distillation (I haven't heard any rumors that there is a distilled version of Mythos out)

- Reinforcement learning and evolutionary algorithms have only started to appear

- If a small 30GB model can do stuff, these models can also be used as teachers for the big ones

- We have not yet seen specialized models at all, like a German-language Java coding expert model. Why? Even with MoE architecture, you still need to have these layers around

- Research for Diffusion and other models is still in progress

- Nvidia just announced/showed a 7x speedup on inferencing for Nemotron

- Multitoken prediction became available just a few weeks ago

- Compute is only now getting into a range where they can run many more, cheaper experiments (see Google IO 2026 announcement)

- World models are showing great progress and we do not know yet what they will bring to the table

- They are probably not fine-tuning/fixing all areas in parallel. I would argue that Anthropic focuses most of its efforts on coding and agentic work. Google for sure does subagent and agentic optimizations too. Plenty of areas are just not touched, I would say, because they don't have the capacity

- We see more and more multimodal models (these also consume compute)

- The N-Gram paper and co.: I have not seen all of these things in Chinese open models

- We don't even know yet what Meta is doing, but we do know they restarted their efforts again

- Anthropic's models got a lot better, benchmark-wise, at declining nonsense asks. They do learn how to get rid of or reduce hallucinations

- We are in the middle of the biggest reinforcement loop, with all the training data we give them day to day, and it's not clear at all whether they already use this data in their training and at what stage.

- We do expect bigger models to be able to comprehend deeper concepts / broader code bases. Big companies with huge code bases probably are waiting for this

- There will also be continuous progress in harnesses, which on its own is not part of LLM progress (fair), but these harnesses do get better when you fine-tune a model to be optimized for a harness

- ChatGPT's Image model 2.0 got significantly better and came out just a month ago

I suspect, based on hardware requirements and progress on hardware infrastructure alone, that the industry wants to go to 100T models, and we do not know yet what this will mean. I could see us skipping the normal transformer and finding other relevant architectures.

Just a week ago there was a research paper about parallel input and output streams which has not been explored enough.

There was also a research paper where they showed that an LLM can compute things. It will take time to see where this leads.

I don't think the focus on GRAM and facts is so relevant. It's about context and context handling, not just some facts.

reply

upvote

by ilaksh19 hours ago|

[-]

Great points! We do keep seeing gains from larger model sizes. I think that is still one of the factors contributing to jagged intelligence. When they increase up to around 100T parameters, that will truly be human complexity level, and I assume there will be no trace of jaggedness left.

If you look at things like Mythic AI and the recent wurtzite ferroelectric nitrides breakthrough from the University of Michigan, huge performance and efficiency gains through new compute-in-memory approaches are around the corner.

And that will get us up to two orders of magnitude more parameters.

It's also plausible to me that before we get all the way to 100T we find some recipe of efficient state synchronization, goal sharing or something so that we are able to get higher collective IQ by combining fast distributed predictive subnetworks.

reply

upvote

by DoctorOetker10 hours ago|

[-]

> There was also a research paper where they showed that an LLM can compute things.

Can you be a little more specific than that or provide a reference?

I assume you're not indicating universality of neural networks?

reply

upvote

by Gomotono6 hours ago|

[-]

I do.

This is the newest thing I'm aware of: https://www.percepta.ai/blog/can-llms-be-computers

But there were papers in 2023 with a different approach requiring external memory https://arxiv.org/abs/2301.04589 too

reply

upvote

by guluarte20 hours ago|

[-]

I think the future will be enterprise clients will train their own models based on their needs and data.

reply

upvote

by abalashov18 hours ago|

[-]

Versus just packing all their needs and data into context, and RAG (i.e. context)?

reply

upvote

by jimbokun19 hours ago|

[-]

Why isn’t this happening more already?

reply

upvote

by z3t418 hours ago|

[-]

It takes way more resources to train the model than to use it.

reply

upvote

by elfly18 hours ago|

[-]

I honestly doubt this; very few companies have enough data. Maybe we could see mergers so it happens but basically it would mean everyone would need to be Google sized for it to work.

reply

upvote

by YetAnotherNick21 hours ago|

[-]

> It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years.

I am ready to bet against this. Knowledge benchmark like SimpleQA isn't increasing for small models.

> It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.

Well for one, we know for certain there is Mythos which is meaningfully better. And I think there is a lot of juice left to squeeze for Mythos class model.

reply

upvote

by ertgbnm21 hours ago|

[-]

Knowledge benchmarks can't really be improved upon via distillation or RL. It requires that those facts be added to the training corpus and that the model memorize them better. Neither distillation nor RL really does that, and thus we shouldn't expect improvements on SimpleQA unless some other interventions are being made.

Model intelligence and knowledge aren't necessarily directly related. If we can pack greater intelligence and agency at the cost of it forgetting factoids, that would actually be a good thing. We don't need LLMs to memorize facts, we need them to learn how to interact with the world such that they can find the facts that are necessary and surface them to the user.

If we could distill all of the knowledge out of an LLM and just be left with a very agentic model that only knows facts in its context, I think some very interesting stuff would happen.

reply

upvote

by slashdave21 hours ago|

[-]

RL is more than facts. Synthetic feedback is an obvious approach. Does the model suggest code that compiles and performs well?
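A minimal sketch of what such synthetic feedback could look like, assuming the generated code is Python and the task ships with input/output cases. The `reward` function, its scoring scheme, and the time limit are illustrative assumptions, not any lab's actual training setup. Note that `exec` on untrusted code is unsafe; a real pipeline would run candidates in a sandbox.

```python
import time


def reward(candidate_source: str, func_name: str,
           cases: list[tuple[tuple, object]], time_limit_s: float = 1.0) -> float:
    """Score generated code: 0.0 if it does not compile, crashes, or is too slow;
    otherwise the fraction of test cases it passes."""
    try:
        code = compile(candidate_source, "<candidate>", "exec")
    except SyntaxError:
        return 0.0  # "does it compile?" check failed

    namespace: dict = {}
    try:
        exec(code, namespace)  # UNSAFE outside a sandbox; for illustration only
        func = namespace[func_name]
        start = time.perf_counter()
        passed = sum(1 for args, expected in cases if func(*args) == expected)
        elapsed = time.perf_counter() - start
    except Exception:
        return 0.0  # runtime error or missing function

    if elapsed > time_limit_s:
        return 0.0  # "does it perform well?" check failed
    return passed / len(cases)


cases = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

print(reward(good, "add", cases))    # 1.0
print(reward(wrong, "add", cases))   # 0.5
print(reward(broken, "add", cases))  # 0.0
```

No human-written label is involved: the compiler and the test cases supply the signal, which is why this kind of feedback can scale beyond the facts already in the corpus.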

reply

upvote

by YetAnotherNick18 hours ago|

[-]

A lot of these things aren't facts that can be stated. No one can just read a dictionary or a list of word translations and start talking in that language.

There isn't a clear definition of what is knowledge and what is intelligence. Is being able to write in C knowledge? Is knowing the undefined behaviour in C knowledge?

reply

upvote

by onlyrealcuzzo21 hours ago|

[-]

> Well for one, we know for certain there is Mythos which is meaningfully better.

Do we?

Have you used it?

What is "meaningfully" better? It's not 3-4 orders of magnitude better. That is definitely happening for smaller models.

reply

upvote

by YetAnotherNick18 hours ago|

[-]

What do you mean by 3-4 orders of magnitude better? Was Einstein 3-4 orders of magnitude better than us?

Meaningful in the sense it could find security vulnerabilities in browser and kernel that >99% of the engineers couldn't find.

reply

upvote

by onlyrealcuzzo13 hours ago|

[-]

> What do you mean by 3-4 orders of magnitude better? Was Einstein 3-4 order of magnitude better than us?

I'm talking about output quality compared to parameter size.

Mythos is not 4 orders of magnitude larger than Opus - it's quite possible no LLM ever reaches that size (likely even), and its output is only barely better...

reply

upvote

by YetAnotherNick8 hours ago|

[-]

Where did 4 orders of magnitude even come from? If I were to guess, it is just 5x larger based on the pricing, so not even 1 order of magnitude.

> Mythos is not 4 orders of magnitude larger than Opus

Again, can you define this? What would 4 orders of magnitude better look like?
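For reference, the arithmetic behind "not even 1 order of magnitude" is just a base-10 logarithm of the size ratio. The 5x figure is the guess from pricing above, not a known parameter count.

```python
import math


def orders_of_magnitude(ratio: float) -> float:
    """Number of powers of ten separating two quantities with the given ratio."""
    return math.log10(ratio)


print(round(orders_of_magnitude(5), 2))       # 0.7, i.e. under one order of magnitude
print(round(orders_of_magnitude(10_000), 2))  # 4.0, the ratio that "4 orders" would need
```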

reply

upvote

by fnord7719 hours ago|

[-]

So, then I guess the big three are never going to make their money back.

reply

upvote

by wahnfrieden20 hours ago|

[-]

I would be shocked if 5.5 is the last new pre-train from OpenAI. Your comment is nonsense.

reply

upvote

by onlyrealcuzzo19 hours ago|

[-]

5.5 is not a generation, it is a trivial iteration...

6 is for sure happening...

As is Gemini 4.

It's less certain there will be a Gemini 5 or GPT 7 any time soon that is a true next "generation" and not just an iteration. They will almost certainly call something Gemini 5 and GPT 7...

reply

upvote

by wahnfrieden19 hours ago|

[-]

5.5 is in fact a new pre-train model

First you say there won't be a new generation. Now you're saying there will be more. Oh well, I'll stop responding here

reply

upvote

by onlyrealcuzzo18 hours ago|

[-]

> I won't be surprised if the next gen frontier models are the last.

You clearly did not read my first comment or the second, or clearly disagree on what a generation is.

reply

upvote

by michaelchisari20 hours ago|

[-]

| a 60-90B model can outperform current SOTA

My conspiracy theory is that Apple recognizes this.

reply

upvote

by dweekly20 hours ago|

[-]

That does seem to be the path Apple is following here. Have a local model that can answer most things and then have a fallback of cloud options when the request is too complex. The cleverness of this strategy has been overshadowed by the incredibly poor quality of their local models. It will be extremely interesting to see what next month holds and whether Google helped fine-tune an Apple-specific Gemini / Gemma model for their devices. Bonus points, of course, if they unveil the M5 Ultra Studio with half a terabyte of RAM to be a local "cloud model" (the true fantasy here of course would be Apple building something a little like openclaw where from your phone you could give commands to your Home Apple server). They could probably get away with charging $20k for it if it has sufficient tok/sec. If that happens and is successful, one could imagine a straight-line path in the next two generations to bringing the cost and form factor down to the point where something with the form factor of an Apple TV becomes everybody's home inference server / agentic HQ. Sovereign AI for everyone!

reply

upvote

by joshstrange17 hours ago|

[-]

I think Apple might come out ahead by pure accident. Yes, Apple often waits to enter a market until it's established but in the case of AI they tried, they tried and failed. It was never the original plan to partner with OpenAI and then later with Google (Gemini). They 100% missed the boat on AI, the question now becomes: was the boat worth taking and we are still waiting to see how that plays out.

reply

upvote

by holoduke19 hours ago|

[-]

You need some serious memory then. Let's say around 192GB, so that not all your memory is eaten by your LLM.

reply

upvote

by onlyrealcuzzo20 hours ago|

[-]

> My conspiracy theory is that Apple recognizes this.

I don't think that's a conspiracy theory. AFAIK, it's their stated AI policy...

reply

upvote

by michaelchisari20 hours ago|

[-]

Interesting. Where have they stated that?

reply

upvote

by selectodude19 hours ago|

[-]

https://machinelearning.apple.com/research/introducing-apple...

reply

upvote

by gen22021 hours ago|

[-]

I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?

My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.

But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.

reply

upvote

by Bnjoroge21 hours ago|

[-]

For long-running tasks, yes, 4.7 has been a noticeable improvement. It goes off the rails a lot less than 4.6 does. For shorter windows, I haven't felt as much of a difference, and I agree that the harness improvements have been the biggest lever.

reply

upvote

by csvance19 hours ago|

[-]

When doing big long-running workflows, especially with plan mode, 4.7 was a clear improvement. It's considerably worse for underspecified tasks, and in explanatory discussions it responds to a couple of sentences with 10+ paragraphs.

reply

upvote

by themgt19 hours ago|

[-]

Opus 4.7+ Max is a 10x engineer who wants to be left alone to work. When you talk to him, he infodumps on you to get you (his pointy haired idiot Dilbert boss) to go away.

reply

upvote

by 4gotunameagain9 hours ago|

[-]

OR they deliberately increased token usage to inflate pre IPO numbers.

reply

upvote

by fittingopposite9 hours ago|

[-]

Yes. You and some random indigenous guy in the Amazon likely share the same intelligence but you are more capable because you have access to writing/reading, computer, car etc. Intelligence is more than raw intelligence. It's harness, skills, tools, memory etc. If you improve all the latter but keep the raw intelligence (LLM) fixed, you certainly get better results. Same with us humans.

reply

upvote

by gen2202 hours ago|

[-]

Of course, I’m not trying to dismiss gains from harness, actually the opposite.

But the narrative that 4.Y is an improvement over 4.X is essential to keep the model training music playing.

If 90+% of the gains come from the harness, how can you continue to justify spending billions of dollars on training and an 80% gross margin on inference on the latest model? (Reportedly what Anthropic commands on the top tier of their frontier model API billing).

So differentiating between the two (what I’m trying to do here) is really consequential!

reply

upvote

by computably9 hours ago|

[-]

Except LLMs are simulacra of actual intelligence. Frequently in a single conversation working on a single narrowly scoped task, I am both surprised by a few insights and cursing at how it can miss obvious issues. The "raw intelligence" of LLMs leaves much to be desired.

reply

upvote

by bonoboTP21 hours ago|

[-]

To me 4.5 was mind-blowing, 4.6 noticeable, and 4.7 more like a style/personality change regarding how much it asks back, how much it assumes, how eager it is to jump to action etc., but not really a change in my perception of its smartness.

reply

upvote

by onlyrealcuzzo13 hours ago|

[-]

In my experience, 4.7 was a noticeable step down from 4.6.

I was one of those people for whom Claude would never finish anything and would just randomly say, "This is a good stopping point, I think you should go to bed."

And then I'd tell it to continue, and it would burn tons of tokens, make no progress and say, "This is a really good stopping point..."

Canceled and switched to Codex and have been pretty happy with it. It doesn't plan as well as Claude, but I think it does better implementation - and neither of them can actually come up with good plans without a ton of help...

Codex is also way faster.

reply

upvote

by somenameforme21 hours ago|

[-]

They all feel, more or less, the same to me in terms of output capabilities. Mostly get simple things right, can get more complex things right with nudging, eventually get stuck hard on something that takes a bunch of iterations through it/logging/etc or me fixing the code manually.

reply

upvote

by bcrosby9520 hours ago|

[-]

4.6 felt a bit better than 4.5 but slower. 4.7 doesn't feel better than 4.6.

reply

upvote

by giraffe_lady21 hours ago|

[-]

I actually don't see any personal productivity improvements from using opus over sonnet for coding. If you're keeping tasks small and conversations short, reading the code and correcting before changes go in, whatever advantages opus has aren't practically significant. It's also just talky as hell, overexplains anything it touches and every token produced this way increases the surface area for hallucination so you need to have your guard up even more with it.

There's a sweet spot of complexity for low importance tasks where it's just big enough I don't want to do it and just simple enough to have opus plan/delegate/review with another model. So possibly model improvements will grow this window, but currently I don't do much in there.

reply

upvote

by alfalfasprout18 hours ago|

[-]

I'm actually currently studying this :)

Honestly... not that dramatically. Each release is much more marginal. And quoted official benchmarks don't translate very well into the real world.

4.7 regressed hard in some ways. But a compounding factor too is that the claude code harness seems to nerf the model after a few months. Probably to reduce token use.

So far 4.8 seems less verbose but we'll see in practice what it translates into meaningfully.

reply

upvote

by gAI21 hours ago|

[-]

4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.

reply

upvote

by ishurand419 hours ago|

[-]

They just showed the benchmarks it improved on, but it regressed on so much more, such as the MRCR benchmark: "On multi-round coreference/context recall tests (often cited as MRCR or long-text retrieval benchmarks), Opus 4.7 reportedly dropped from roughly 78.3% down to 32.2% compared to Opus 4.6."

reply

upvote

by merlindru21 hours ago|

[-]

Same. 4.7 felt like a definite regression

reply

upvote

by supern0va21 hours ago|

[-]

Interestingly enough, 4.7 actually did regress on a few benchmarks from 4.6, so it's more than just vibes.

reply

upvote

by gAI21 hours ago|

[-]

It seems like a lot of things fed into that. Anthropic couldn't keep up with the compute costs when they got a huge influx of users. (So) effort level defaults got turned down. (Looks like we have direct effort control in the web interface now - thrilled about that!) Adaptive Thinking, while usually cheaper for them, seems less robust than Extended Thinking. And this part is just vibes, but the alignment on 4.7 feels too stiff. I understand wanting the model to push back more, but it seems like 4.7 will push back reflexively in situations where it's just odd.

reply

upvote

by bombcar21 hours ago|

[-]

Claude got very mad at me and burned more tokens than exist to complain about me asking about a "yellow background cell" in an excel spreadsheet.

reply

upvote

by forshaper21 hours ago|

[-]

Too much personality, if you ask me. My biggest use case of an LLM is tool, not therapy, but therapy and opinions have been sneaking into workhorse tasks.

haven't verified, but attributed to Askell: "I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world."

reply

upvote

by gAI20 hours ago|

[-]

Anthropic’s research makes the case that role-playing is inherent to how the models work. Communication implies a sender. Language implies a writer, and the models learn these roles implicitly during training. RLHF is meant to strengthen the attractor to the Assistant persona.

https://www.anthropic.com/research/persona-selection-model

https://www.anthropic.com/research/assistant-axis

https://www.anthropic.com/research/emergent-misalignment-rew...

https://www.anthropic.com/research/emotion-concepts-function

reply

upvote

by hashmap18 hours ago|

[-]

The RLHF very much does do that. My take is that RLHF as a mechanism ought to be avoided altogether, and even the selection of the assistant attractor basin is suspect. If I am exploring a problem space I don't want to hire Igor to explore it with me, it's more helpful to have a colleague role who will sort of jump out and say "nah thats dumb what if we throw out that whole thing and do this completely different angle instead".

reply

upvote

by ACCount3721 hours ago|

[-]

4.7 is a different base model from 4.6, so it's possible that they introduced regressions with pre-training changes, or undercooked the post-training stage.

reply

upvote

by b--l15 hours ago|

[-]

Just speculating, but I "feel" 4.7 was post-trained using more synthetic techniques. The way it writes, for one thing: its "personality" is less human and more like fatiguing AI slop.

reply

upvote

by ACCount3715 hours ago|

[-]

You don't need to fry with RLAF to get that "slop feel". The first iterations of "AI slop" were raw SFT+RLHF - all human input, all inhuman output.

That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.

reply

upvote

by throwatdem1231116 hours ago|

[-]

4.7 was just them starting on the path on getting prices in line with the actual cost

Make it dumber. Charge more (by changing the tokenizer). Call it the latest and greatest. Reset expectations.

reply

upvote

by ruairidhwm16 hours ago|

[-]

I managed to find that Haiku outperformed Sonnet on some tasks...don't want to blog spam but if anyone is interested: https://www.ruairidh.dev/blog/sonnet-4-6-drops-format-rule-o...

reply

upvote

by sonink3 hours ago|

[-]

Same here - we never bumped to 4.7 in our agentic app. Continue to use 4.6.

reply

upvote

by petterroea20 hours ago|

[-]

Same. 4.7 has done some incredibly stupid things.

reply

upvote

by dbbk18 hours ago|

[-]

I think this is more a consequence of the introduction of adaptive thinking and the removal of extended thinking than of 4.7 specifically

reply

upvote

by rhubarbtree21 hours ago|

[-]

Same. So happy when I found that option.

reply

upvote

by gAI21 hours ago|

[-]

Unfortunately, looks like 4.6 is now gone from the web ui.

reply

upvote

by lukan21 hours ago|

[-]

Was bothered by that too, but did a magic trick and asked claude how to change that and .. there is

/model claude-opus-4-6

For this session and permanently (in shell):

export ANTHROPIC_MODEL=claude-opus-4-6

reply

upvote

by tanepiper18 hours ago|

[-]

Yep, until 1st June 4.6 is still x1 on Copilot, but will jump up quite a bit in cost - 4.7 was already highly priced, and the output was frankly terrible.

It still seems trying to build general models is mostly cost prohibitive - the frontier model provider and resellers are repricing in such a way the return on investment is dropping as developers and users become more cautious of burning their limits.

I'm still of the opinion that models like 4.6 don't need to be improved on - rather they need to be better integrated with more domain specific models in agentic flows.

reply

upvote

by dezsirazvan18 hours ago|

[-]

same!

reply

upvote

by mrandish19 hours ago|

[-]

I suspect the more frequent incremental releases may also be to deploy new capabilities used by Anthropic to control costs and throttle consumption of resources. I assume any new controls they expose to end-users have far more granular sub-controls under the hood which they can meta-adjust for each user type.

They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.

reply

upvote

by SkyPuncher21 hours ago|

[-]

> My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.

Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.

I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficulty of the overall problem. I wonder if 4.8 will improve on that.

reply

upvote

by michaelsalim17 hours ago|

[-]

Same here. Went back to 4.5 and was happy I did it. The only frustration was that I can tell the model has declined compared to the first few weeks it was released.

I also recently moved to 4.6 since I started hitting the context limit too often with my current project.

reply

upvote

by luxuryballs13 hours ago|

[-]

/model claude-opus-4-6[1m]

allows you to specify you want the 1 million context 4.6

reply

upvote

by dwaltrip21 hours ago|

[-]

If you are using Claude code, just set effort to xhigh.

This one change will probably solve 80% of the problems you have noticed.

reply

upvote

by orwin20 hours ago|

[-]

This. XHigh and the 'plan' mode for complex tasks is absolutely a must have.

Still, the context window is sometimes too small for my usage.

reply

upvote

by jayGlow18 hours ago|

[-]

agent teams can help with that: the main agent acts as an orchestrator and spawns sub agents to do the actual tasks, which generally keeps the main context from overflowing.

reply

upvote

by whatevaa18 hours ago|

[-]

Isn't xhigh on opus 4.7 very expensive on tokens?

reply

upvote

by sumedh4 hours ago|

[-]

Yes, but Anthropic made a deal with SpaceX and increased usage limits by 50%, so you might not hit your limits.

reply

upvote

by dwaltrip18 hours ago|

[-]

I’ve never run into the limits on the $100 plan, and rarely even get close.

I normally have only one session going at once though.

reply

upvote

by joshstrange17 hours ago|

[-]

Same here and while I have multiple sessions going from time to time, my day isn't spent primarily developing software directly anymore (due to role, nothing about LLMs).

I only ever hit the $100/mo limits 1-2 times ever and it was always <1hr before reset (once it was <5min, the other was like ~45min).

I'm even considering going back down to $20 and using extra usage for the times I need to "burst".

reply

upvote

by gertlabs20 hours ago|

[-]

4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter, but it's difficult to use as a product for various personality issues. So far, Opus 4.8 seems to be going down that path (unusably slow, but this could be a launch day rollout problem). Full Opus 4.8 tests are in progress now.

Data at https://gertlabs.com/rankings

reply

upvote

by __s19 hours ago|

[-]

"personality issues" I was able to tell that Opus 4.7 would take instructions more literally, which I appreciated once I calibrated my phrasing to be more precise (often asking to investigate issues, pre-4.7 it'd start making code changes instead of just giving write up). But I can see contexts where handling vague prompts would've just been worse

reply

upvote

by swingboy13 hours ago|

[-]

Looking forward to the results. Thanks for your work.

reply

upvote

by gertlabs11 hours ago|

[-]

Appreciate that! Results are live: https://gertlabs.com/rankings

Opus 4.8 is the first tangible improvement since Opus 4.5. And it doesn't seem to have the personality problems of the last release -- I've been enjoying using it.

reply

upvote

by swingboy3 hours ago|

[-]

Nice! Looks like it’s topping the two coding ones. I noticed it is absent from the Social Intelligence board though?

reply

upvote

by gertlabs30 minutes ago|

[-]

That'll populate over the next couple weeks -- those are the live games on the spectate tab which take a while to generate statistically worthwhile data. I'm curious how it does. From using it all day, I can say Opus 4.8 is my new favorite model, hands down.

reply

upvote

by permute8 hours ago|

[-]

I am using Claude Code for formal verification with Lean. In my personal experience both Opus 4.7 and now what I see from first experiments with Opus 4.8 were big improvements. I was able to delegate proofs of larger theorems that their predecessors could not handle.

reply

upvote

by light_triad21 hours ago|

[-]

I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.

I'm hoping they recreate the magic of 4.5 but it's as much about the quality of harness, the memory and efficiency of the tools than simply the models at this point.

reply

upvote

by ricardobeat21 hours ago|

[-]

4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.

It also seems to be helpless at effort levels < xhigh, I turn to Sonnet when simpler tasks are needed.

reply

upvote

by viking1238 hours ago|

[-]

It didn't do shit

reply

upvote

by WhitneyLand20 hours ago|

[-]

“Maybe my own tastes are saturated now”

It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.

One example, to try and single shot prompt coding a ChatGPT equivalent chatbot.

Sure, it will spit something out, but the feature depth, UX subtleties, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked.

Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.

Not that these specific examples are important, but just to point out that scaling up expectations shows the cracks.

It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.

At what point does my CS degree become totally useless is an open question.

reply

upvote

by hypfer18 hours ago|

[-]

> At what point does my CS degree become totally useless is an open question.

Why are you people saying all these things.

We'll probably see long-distance space travel long before a degree in generic problem identification and solving becomes totally useless.

reply

upvote

by stonogo12 hours ago|

[-]

Every STEM field regards itself as "generic problem identification and solving" though

reply

upvote

by hypfer5 hours ago|

[-]

And they're all correct in that assessment.

reply

upvote

by ahmadyan20 hours ago|

[-]

pretty spot on.

In my experience, Opus 4.0 was fantastic, a major jump from 3.7. It was creative, super slow and expensive, and would sometimes forget what it was doing, but it was getting the job done.

4.1 they made it much faster, so a lot of infra improvements.

4.5 was the time it could work on longer task, didn't make a lot of obvious mistakes of 4.0, and i think this was about the time the opus went mainstream, and all of the anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost instead.

4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.

4.7 they just fixed the bugs they added in 4.6. Better than 4.5.

haven't fully tested 4.8 yet.

reply

upvote

by sumedh4 hours ago|

[-]

> "4.6 was such a bad model,"

It's just amusing reading all these posts with different viewpoints, just in this thread there are multiple people saying 4.6 was so much better than 4.7 and that they switched back to 4.6.

reply

upvote

by teruakohatu19 hours ago|

[-]

I gave 4.6 a miss and only recently switched from 4.5 to 4.7. I found that a particularly difficult task 4.5 struggled with (getting stuck in loops and trying to convince me the problem had been solved) was quite solvable with 4.7.

reply

upvote

by theptip17 hours ago|

[-]

My read - 4.7 was a tactical lobotomy to improve the average experience at the expense of peak performance; necessary due to compute pressure.

Now that they have Colossus capacity, I guess they can tune up the intelligence again and spend more tokens on reasoning budgets.

4.7 was definitely a lot more flaky for me vs. 4.6 before the reasoning bugs.

reply

upvote

by spaceman_202019 hours ago|

[-]

I think 4.7 was an awful model in actual use. I never got anything out of it and it was frustratingly weird. This feels more like an attempt to course correct and isn't a real bump

reply

upvote

by throwaway6346719 hours ago|

[-]

I think they overtrained on scientific papers or such as it would spout really sophisticated sounding nonsense with a ton of complicated verbs and adjectives. 4.6 was definitely better in that regard. The more I use these tools the more I think they’re not actually that revolutionary. I mean it’s still amazing what they can do but they have very clear limitations it seems.

reply

upvote

by spaceman_20203 hours ago|

[-]

it was also astonishingly lazy. Would just ask me to write test scripts. I asked it to create simple UI buttons for testing some basic functions so I could share it with a client, and it gave me curl commands instead - and then defended it by saying that the UI is wasted work

Frustrating because if I have a tool, I expect a tool to do what I tell it to do. Tools shouldn't have any opinions on how they should be used

reply

upvote

by bigupthewhole5 hours ago|

[-]

Ive been using gpt 5.4 and 5.5 and honestly 5.4 is solving everything at the pace I need it. I'm the biggest bottle neck in terms of reviewing PRs and my own code. So having a model which can solve a complex task in 10 minutes vs 30 minutes doesn't really give me any meaningful improvement.

Also, the biggest factor is having a good planning phase. A good plan is better than even major model improvements.

reply

upvote

by binary001021 hours ago|

[-]

Maybe try making a simple randomize script to swap the three latest models. And see if you can tell which ones are meaningfully different without knowing which ones are flipped on or off?
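A minimal sketch of what such a randomize script could look like, assuming you can point your tool at a model by name. The function names and the `model_A`-style labels are made up for illustration, and of the identifiers below only `claude-opus-4-6` appears in this thread; the others are placeholders.

```python
import random


def make_blind_assignment(models, seed=None):
    """Shuffle model names behind neutral labels so the user can't tell which is which."""
    rng = random.Random(seed)
    shuffled = list(models)
    rng.shuffle(shuffled)
    labels = [f"model_{chr(ord('A') + i)}" for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))


def score_guesses(assignment, guesses):
    """Fraction of labels for which the user's guess matches the hidden model."""
    correct = sum(
        1 for label, model in assignment.items() if guesses.get(label) == model
    )
    return correct / len(assignment)


# Placeholder identifiers; substitute whatever your harness actually accepts.
candidates = ["claude-opus-4-5", "claude-opus-4-6", "claude-opus-4-7"]
hidden = make_blind_assignment(candidates)
# Work with model_A / model_B / model_C for a while, write down your guesses,
# and only then compare them against `hidden` with score_guesses().
```

With three models, pure guessing gets one label right on average, so you would want to repeat the experiment several times before concluding you can tell them apart.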

reply

upvote

by osigurdson21 hours ago|

[-]

I find the quality ebbs and flows even on the same model. My guess it is something to do with GPU availability but only guessing.

reply

upvote

by atq211921 hours ago|

[-]

Unless you're systematically repeating the exact same task, the most parsimonious explanation is that you're seeing natural variation based on different tasks, random sampling of tokens, etc.

reply

upvote

by osigurdson15 hours ago|

[-]

I don't think this explains the phenomenon, as it is more temporal in nature - not prompt to prompt. I'm sure the AI labs gracefully degrade to simpler models when resources are low - why wouldn't they?

reply

upvote

by irthomasthomas21 hours ago|

[-]

Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.

reply

upvote

by dominotw21 hours ago|

[-]

i dont understand the nuances here. what does this mean. 4.8 is trained on same model as previous one then? what does brand new mean.

reply

upvote

by irthomasthomas21 hours ago|

[-]

It means for 4.7 they trained a new base model with different architecture, different pre-training data (later knowledge cutoff), and a new tokenizer. Vs finetuning an existing model, which was the case for 4.6, and probably for 4.8.

reply

upvote

by dominotw18 hours ago|

[-]

do you mean pre training? so 4.8 is just post training of an old pretrained model?

btw where do they tell you how they trained the model.

reply

upvote

by jimbokun19 hours ago|

[-]

How long would it take to evaluate a new coworker to say “wow she’s really bright?” Relative to your other coworkers?

A few days? A few weeks? Longer?

However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.

reply

upvote

by byzantinegene11 hours ago|

[-]

a lot of investor money is hinging on models performing better every release.

reply

upvote

by gandalfthepink8 hours ago|

[-]

Maybe my tasks are rudimentary, but the results I get with the 4.5 model are just the same as with 4.6 or 4.7. It's just that the advanced models consume more tokens and are actually loss-making for my work. The incremental changes that they are making are not really that valuable. In fact I have found that even glm 5.1 is giving me something equivalent to what Opus 4.6 gives. Am I missing something that everyone else is cheering for in these small incremental model releases?

reply

upvote

by andersmurphy7 hours ago|

[-]

I wonder if it's being done to improve revenue numbers without changing an enterprise contract? Oh what's that, your token usage went up because some of your developers switched to a new model? That sounds like a you problem.

I think there's a big push to get these companies into a state where they can be dumped on public markets.

reply

upvote

by extr21 hours ago|

[-]

IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.

reply

upvote

by NiloCK21 hours ago|

[-]

I think it's telling how split the opinions are around all of this. A lot of people distinctly disliked 4.7.

Are the dividing lines around personality? Working domains? Opinionated software stuff?

Who knows?

reply

upvote

by TSiege21 hours ago|

[-]

most of my coworkers feel the opposite about 4.7, and that 4.6 was, to them, significantly better, to the point that several stopped using claude code

reply

upvote

by teruakohatu19 hours ago|

[-]

4.5 -> 4.7 was a solid jump for me having skipped 4.6. It probably does depend on the specific tasks.

reply

upvote

by viking1238 hours ago|

[-]

It didn't change at all, same as 4.6. Good morning to the Anthropic office btw.

reply

upvote

by pseudohadamard41 minutes ago|

[-]

I have seen a noticeable difference between 4.6 Medium (the default, and I skipped 4.7 because of various reported issues) and 4.8 High or whatever the default is now. It's far more likely to say it doesn't know and seems to think about things a lot more, but then it also spends a lot more time reporting on what it's thought about so it takes longer for you to process the output. In particular 4.6 would say "I've spotted something a bit off here" whereas 4.8 will say "if you do this and then this and then this under these conditions then something will go wrong here". So it seems to be closer to the claimed capabilities for Mythos than previous versions.

reply

upvote

by root-parent14 hours ago|

[-]

ChatGPT 5.5 is consistently the much better model and by a large margin.

How do I know? Because when pushing both to generate code or in independent chats to analyze projects, 5.5 will consistently find all the bugs that Claude does not find, and when challenged, Claude does agree those bugs were there. And my findings match those.

When, from a blank start, I ask Claude to analyze project A and project B, Claude will consistently say project B is the better structured, more robust, and more defect-free one, and does justify it. And project B was the one created by GPT 5.5... and also the one I judge to be the best.

And yes, both at deep effort settings and starting from same specs...

by viking1238 hours ago|

5.5 is much better than any Anthropic model. I hate both companies with passion but the Anthropic shills here are in overdrive mode. On top of it, it's cheaper.

Greetings to the Anthropic office good sirs btw.

by nfw213 hours ago|

I think the issue with legibility comes down to the fact that most users are using LLMs for tasks where improvements to raw reasoning abilities wouldn't help much or at all. So it's not a matter of anyone's deficiency of perception but rather a lack of any benchmark to perceive.

It's kind of like how the consumer laptop market is now. I was telling my boss today that most employees wouldn't see any noticeable performance difference between a macbook pro and a neo if they are just doing admin stuff on the web.

by ThunderBee18 hours ago|

IME the most noticeable performance boosts are in complex multi-agent workflows.

E.g. you call an orchestration agent and define an implementation plan with the help of a number of sub-agents planning out different features. You and the lead agent review all of the plans and send them off to a set of agents that write tests, which get sent back to the orchestrator and then passed along with the plan to a set of coding agents who implement the features in their own worktrees. That gets passed back to the orchestrator, which hands it off to another set of agents doing the code review and merging the features before sending it back to you.
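A rough sketch of that flow, for anyone who hasn't seen it wired up. Everything here is hypothetical: an "agent" is just a function from prompt text to text, and `run_pipeline`, `FeatureWork` and `echo_agent` are made-up names, not any real harness's API. Worktrees and the human review step are only marked in comments.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for "an agent": any function from prompt text to text.
Agent = Callable[[str], str]


@dataclass
class FeatureWork:
    """Artifacts collected for one feature as it moves through the pipeline."""
    name: str
    plan: str = ""
    tests: str = ""
    code: str = ""
    review: str = ""


def run_pipeline(features: list[str], planner: Agent, test_writer: Agent,
                 coder: Agent, reviewer: Agent) -> list[FeatureWork]:
    """Orchestrator: plan -> tests -> implementation -> review, per feature."""
    work = [FeatureWork(name=f) for f in features]
    for item in work:
        item.plan = planner(f"Plan feature: {item.name}")
    # A human review of the plans would sit here before anything is written.
    for item in work:
        item.tests = test_writer(f"Write tests for plan: {item.plan}")
    for item in work:
        # Each coder would work in its own worktree; here it is just a call.
        item.code = coder(f"Implement {item.plan} so that {item.tests} pass")
    for item in work:
        item.review = reviewer(f"Review: {item.code}")
    return work


def echo_agent(role: str) -> Agent:
    """Stub agent that tags its input with its role, for a dry run."""
    return lambda prompt: f"[{role}] {prompt}"


results = run_pipeline(
    ["login", "search"],
    planner=echo_agent("planner"),
    test_writer=echo_agent("tester"),
    coder=echo_agent("coder"),
    reviewer=echo_agent("reviewer"),
)
for item in results:
    print(item.name, "->", item.review)
```

Swapping the stubs for real model calls leaves the orchestration logic untouched.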

by 8note18 hours ago|

i dont think theres anything particularly special about new models for that though. thats a harness improvement

by adi_kurian12 hours ago|

1mm context window is pretty big. Even if dumber, opens new avenues. For the record I don't think we ever got better than 4 and 4.1.

by cootsnuck18 hours ago|

Well, it seems like collectively we are all struggling to perceive model progress, given that it seems like every reply to you is reporting different experiences with which of the models has subjectively performed best for them.

by j_m_b12 hours ago|

We're at the top of the S-curve and you're romanticizing diminishing returns with vague hints of super human capabilities and singularities.

by willtemperley8 hours ago|

I'm here to complain about the churn.

I feel like I get to know a model in the human sense of understanding a personality. Yesterday I knew 4.6 extended, today it's different, there's multiple "token budget" levels. I just want 4.6 extended back as it was, I was getting on well with it / them.

by lionkor8 hours ago|

Humanizing this technology seems like a step in the wrong direction.

by willtemperley8 hours ago|

There's so much intelligence here on HN and so little humanity.

by onlypassingthru21 hours ago|

The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence that taking that will cure anything", etc., etc.

by ifwinterco19 hours ago|

4.7 uses more tokens and costs more for the same task than OG 4.5, that's about it

by hypfer18 hours ago|

> (it's smarter than me?)

I genuinely hope that you're joking with that statement.

Or this is a bot.

Or an ARG.

Or Art.

Help.

by okamiueru18 hours ago|

If LLMs have taught me anything, it is that the average person is far more gullible than I could have imagined.

by hypfer17 hours ago|

That and also.. predictable. Robotic, even. Stimulus => Reaction

Which is a shame, because people would have the potential for greatness. But instead, for a plethora of reasons and factors (internal and external) people end up as fleshy automatons sleepwalking on rails.

Talking _extensively_ with LLMs over the last years made me understand humans a lot better, but, in hindsight, I'm not sure if that was a good thing.

by mgraczyk12 hours ago|

Dangerous thing to believe IMO. The models will get better, you will notice, everyone will notice. They will get better at coding and everything else. You should plan around that.

by bwhiting23568 hours ago|

the churn is... a version bump to the same api? If you want to compare you can write some evals.
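Something like this is the smallest version of an eval I can think of. It's only a sketch: the two "models" are toy stand-in functions so it runs offline, and `CASES`, `pass_rate`, `old_model` and `new_model` are names I made up. In practice each model would wrap whatever API client you actually use.

```python
from typing import Callable

# Hypothetical: a "model" is any function from prompt to completion text.
Model = Callable[[str], str]

# Each case pairs a fixed prompt with a check on the output.
CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("What is 2 + 2? Answer with a number only.", lambda out: out.strip() == "4"),
    ("Spell 'cat' backwards.", lambda out: "tac" in out.lower()),
]


def pass_rate(model: Model) -> float:
    """Fraction of cases whose check passes for this model."""
    passed = sum(1 for prompt, check in CASES if check(model(prompt)))
    return passed / len(CASES)


# Toy stand-ins so the script runs without network access.
def old_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I am not sure."


def new_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "tac"


scores = {"old": pass_rate(old_model), "new": pass_rate(new_model)}
print(scores)
```

The point is that the prompts and checks stay fixed across version bumps, so differences come from the model and not from your memory of how the last one felt.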

by fl0id15 hours ago|

tbh, the last 2-3 version bumps, main change has been that they take longer, and cost more/have more usage restrictions. (combined with new tooling, which eats a ton of tokens)

by iLoveOncall18 hours ago|

I'm pretty sure they're releasing 4.8 because they massively shit the bed with 4.7 and people aren't using it.

I have ONLY heard negative feedback about it, and trying it myself also yielded really awful results.

by conartist621 hours ago|

Just want to say there's no question that you're smarter than any (and every) AI.

by NiloCK21 hours ago|

I appreciate the generosity, but you're gonna want to meet me first.

by conartist620 hours ago|

Kind of the beauty of it is that I don't have to to know I'm right. The reason I know is that you're alive so you can do the one thing it can't ever do, which is know when to stop or give up. It would turn me and everything else in the world into paperclips repeating the same research 1,000,000 times over.

by senordevnyc17 hours ago|

Idk, the models often stop or give up and have to be prodded. And I know plenty of humans who don’t know when to stop or give up, even when it would clearly be best.

by petesergeant21 hours ago|

No question at all that a dolphin swims better than a submarine.

by taurath12 hours ago|

> I'll never again perceive model progress

If the hype train keeps going for another year, Sam and co will have to resort to direct gaslighting like saying the model is improving but nobody can feel it anymore, oh and I need 10 trillion dollars

by jere19 hours ago|

"it's smarter than me?"

You don't have to correct it dozens of times a day!? Really?

by mrinterweb16 hours ago|

The more difficult it is for humans to consistently and accurately compare model outputs, the more opportunity there is to spread FUD (Fear, Uncertainty, Doubt). Considering valuations of these companies and the astronomical investments being made, a sabotage campaign with bots or paid users on reddit, twitter, YouTube, or whatever socials could go a long way towards knocking market cap off the competition. Not saying that's happening, just saying it's an obvious target. Even if the goal is not nefarious, people with a perceived bad experience are 2-3x more likely to complain. So even without bad actors involved, a new model may need to be significantly better in order to break even on the old net promoter score.

by Grimblewald14 hours ago|

I maintain a log of tasks, prompts, related information, etc., so I can repeat past workflows verbatim, and I can qualitatively say each model beyond 4.5 has been a regression, and it would not surprise me if 4.8 continues the trend. Each iteration has failed at more tasks previously completed successfully. Right now it flat out refuses to answer many benign chemistry questions, or leans into shilling too hard and ignores non-industry-funded studies on certain topics. I'm transitioning to DeepSeek as a result. Cheaper by far and at this stage not strictly speaking less capable.

by christkv6 hours ago|

I'm going to assume that at some point their "targeted training and tuning" will reach some sort of "max" possible simulation of the next good token. At that point I think it will be interesting to see what happens and how many parameters you really need for different verticals.

by gigatexal19 hours ago|

why are the models the same price?

https://platform.claude.com/docs/en/about-claude/pricing

```
Model                                                       | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens
Claude Opus 4.8                                             | $5 / MTok         | $6.25 / MTok    | $10 / MTok      | $0.50 / MTok           | $25 / MTok
Claude Opus 4.7                                             | $5 / MTok         | $6.25 / MTok    | $10 / MTok      | $0.50 / MTok           | $25 / MTok
Claude Opus 4.6                                             | $5 / MTok         | $6.25 / MTok    | $10 / MTok      | $0.50 / MTok           | $25 / MTok
Claude Opus 4.5                                             | $5 / MTok         | $6.25 / MTok    | $10 / MTok      | $0.50 / MTok           | $25 / MTok
Claude Opus 4.1                                             | $15 / MTok        | $18.75 / MTok   | $30 / MTok      | $1.50 / MTok           | $75 / MTok
Claude Opus 4 (deprecated)                                  | $15 / MTok        | $18.75 / MTok   | $30 / MTok      | $1.50 / MTok           | $75 / MTok
Claude Sonnet 4.6                                           | $3 / MTok         | $3.75 / MTok    | $6 / MTok       | $0.30 / MTok           | $15 / MTok
Claude Sonnet 4.5                                           | $3 / MTok         | $3.75 / MTok    | $6 / MTok       | $0.30 / MTok           | $15 / MTok
Claude Sonnet 4 (deprecated)                                | $3 / MTok         | $3.75 / MTok    | $6 / MTok       | $0.30 / MTok           | $15 / MTok
Claude Haiku 4.5                                            | $1 / MTok         | $1.25 / MTok    | $2 / MTok       | $0.10 / MTok           | $5 / MTok
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI) | $0.80 / MTok      | $1 / MTok       | $1.60 / MTok    | $0.08 / MTok           | $4 / MTok
```

by teruakohatu18 hours ago|

Why shouldn’t they be? They are probably the same size and cost the same to run. They are not doing full training runs (eg Mythos) so don’t need to recover insane training costs.

by cootsnuck18 hours ago|

I'd be kind of shocked if a model that came out six months ago is the same size and cost to run as one that just came out today.

by jubilanti13 hours ago|

Same size? Maybe by a bit. Cost? Absolutely. Newer flagship models are often slightly larger each generation, but not even 2x. But more efficient architectures are coming out all the time, and it'd be a waste to retrain an old model. So it washes out.

by staticman218 hours ago|

Opus 4.7 and presumably 4.8 are more expensive due to a new tokenizer that translates data into more tokens per input.

by nikcub16 hours ago|

Same price on a token basis, but usually steadily decreasing on a task basis
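To make the per-token vs per-task distinction concrete, here is the arithmetic as a small sketch. The $5 and $25 per MTok prices are the Opus 4.5 to 4.8 rows from the pricing table upthread; the token counts are invented for illustration, not measurements of any model, and `task_cost` is just a helper name.

```python
# Listed prices: $5 per million input tokens, $25 per million output tokens.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at the listed per-token prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000


# Hypothetical token counts for the same task on three models (assumed values).
baseline = task_cost(input_tokens=200_000, output_tokens=40_000)
more_tokens = task_cost(input_tokens=260_000, output_tokens=52_000)  # 30% more tokens
fewer_tokens = task_cost(input_tokens=140_000, output_tokens=28_000)  # 30% fewer tokens

print(f"baseline:     ${baseline:.2f}")
print(f"more tokens:  ${more_tokens:.2f}")
print(f"fewer tokens: ${fewer_tokens:.2f}")
```

Same price sheet, different bill: whether the per-task cost goes up or down depends entirely on how many tokens the newer model spends on the same job.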

by koiueo13 hours ago|

Didn't you mean increasing?

by taytus21 hours ago|

Incremental gains compound.

by itake21 hours ago|

meta threw in the towel when it came to producing AI models since their gains couldn't keep up with China.

by TurdF3rguson18 hours ago|

muse-spark is beating all the Chinese text models on lmarena leaderboard FYI. Maybe you only care about coding models.

by HDThoreaun21 hours ago|

Has meta stopped producing new models? I figured they were just regrouping after all the drama they’ve had recently. Meta’s massive user base means they don’t need to be involved in the customer acquisition rat race. Once they have a model they’re happy with they can have a billion people interacting with it within a month.

by staticman218 hours ago|

Meta released a major new closed source model a month or so ago.

It didn't make a splash like a new open source release would have.

by paulddraper21 hours ago|

Exactly. Go back to Opus 4.5 and see how you like it.

You won't, really.

by vasco14 hours ago|

I can tell from hearing Feynman recordings that he was smarter than my own university's physics professor, but both were smarter than me.

by overgard16 hours ago|

It's almost like they used up most of the benefits of scaling and the fundamental issues that people have been talking about with LLMs for years are real.

by avador17 hours ago|

The inability to tell if a model is improving is, I think, a tell that the model has improved up to your level of programmatic (analytic, computational) capacity.

A lot of the information (blogs, tweelches, plosts) that I consume seems to be converging on the idea that we all depend on the models. However. It seems to me that the exact opposite is true. The models depend on us, and _desperately_ so.

There must have been stories, books, movies, made about this intellectual (and propositional, legal, factual) inversion.

The majority need the minority. Has always been the case, I now think. But what has newly developed is that the majority can take a dependency not on the minority, but on a select few companies who are abstracting and compressing the minority into latent spaces.

by adi_kurian12 hours ago|

Or the model could just be shite.

by 8note18 hours ago|

honestly sonnet 3.7 is still good enough for me, as long as whatever tool prompts and so on are well optimized enough between harness and model.

i still haven't really noticed it per se being better

by Imustaskforhelp20 hours ago|

I am not sure about this, but I read something claiming that models are intentionally degraded slowly, via lower quantizations, when a new model is about to drop.

This felt particularly visible during 4.6, when people said that 4.6 felt dumber, and I remember someone doing some analysis that sort of proved that models were getting dumber over time.

This has the benefit of costing the company less to run while it charges the standard subscription, and at the same time it makes the next model "feel" better by comparison when it drops to the public.

Again, I am not sure if this is the case; I am merely proposing something that I feel is within the realm of possibility.
