by supern0va 21 hours ago

by ACCount37 16 hours ago

Scale is always desirable, and there are always gains from scale. It's a matter of whether you can afford training and inference at increased scale.

There is a real trend of smaller models becoming more "capability-dense" - i.e. the best 8Bs of today beat the best 32Bs of 2 years ago. This is in part a product of distillation being used to train the smaller models.

But people consistently underestimate how "capability hungry" the world is. There are diminishing returns on model capabilities in narrow "summarize the search results" sorts of applications - but as capabilities improve, LLMs enter, get their footing in and begin to dominate new niches. At times, expensive, highly desirable niches.

I do not expect anyone at the frontier to pop up and say "no reason to train a new model" within the following decade. There will always be a demand for an LLM that's 5-10% more capable and more reliable at some highly advanced task, and generational upgrades will keep delivering those 5-10%. From increased scale and improved training both.

by IgorPartola 12 hours ago

I think this is exactly right. Basically when I am coding, having an agent that roughly matches my intelligence is a feature, not a bug. Having one that is 10x as smart would actively slow me down because I would have to spend the time understanding what it is doing or hand over all architecture to it and just vibe code everything, hoping that it doesn’t do the PhD version of fizzbuzz instead of the maintainable one.

But for some classes of problems I think a model that is 10-100x smarter than the smartest expert is a huge boon. These would be problems that are very hard to solve but easy to verify that the solution is correct. Protein folding, sudoku, etc. Because of this I see the really smart models going to biomedical and pharma first and maybe a few high profit verticals rather than being widely deployed. I am sure Pfizer would be happy to pay for a 100x smarter than the smartest researcher model. But I am not certain that this kind of market fit would justify trillion dollar valuations in the long run. And in the meantime normal “human companion” models will go from Sonnet to some open weight model running on a Dell tower in your closet to maybe even on your phone in the next few years.
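To illustrate the "hard to solve but easy to verify" point with the sudoku case: a minimal sketch in Python of a checker for a completed 9x9 grid. The function names and the pattern-based demo grid are made up for illustration, and the checker only confirms that the grid is a valid filled sudoku, not that it matches the clues of any particular puzzle.

```python
def is_valid_sudoku_solution(grid):
    """Return True if grid is a fully filled, valid 9x9 sudoku."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    # Every row, column and 3x3 box must contain each digit exactly once.
    return all(group == digits for group in rows + cols + boxes)


def example_grid():
    """Hypothetical demo grid built from a simple shifting pattern."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
            for r in range(9)]


print(is_valid_sudoku_solution(example_grid()))
```

Checking takes a fixed 27 set comparisons, whereas finding a solution in general needs search; that asymmetry is what makes this class of problem a good fit for a model whose reasoning you cannot follow.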

by willsmith72 10 hours ago

aren't you conflating being 10x as smart with code that is 10x more complicated?

the relationship should be the opposite, the smartest people can write the most readable solutions

by IgorPartola 4 hours ago

Maybe. I can’t imagine what kind of solutions a software engineer who is 10x smarter than any human who has ever lived would be like by definition. All I know is that there is a possibility it says that the most optimal way to solve a problem is too clever for me to understand and as long as I must verify its work I must be able to understand fully the code it writes.

Of course perhaps at that point I really do become more of a spec and prompt engineer and don’t actually look at the code any more than I look at the assembly code produced from my programs now. But still my gut says using hyperintelligence to do common tasks is all positive.

by ACCount37 3 hours ago

If you have an AI that's 10x smarter than any human who has ever lived, why would you be the one calling the shots? Kind of an issue with ASI.

by IgorPartola 2 hours ago

Because my priorities and the priorities of a non-human entity that is an order of magnitude smarter than anyone who has ever lived might not line up.

by Zavora 6 hours ago

4.8 is demonstrating simplicity, hence it's smarter?? It just refactored my 4.6-generated code (4.8 is very slow on difficult tasks - urgh! - without burning tokens - yey!) but the output was wow! Simple, elegant and exactly what I wanted to see.

by bandrami 8 hours ago

> there are always gains from scale

This... isn't true though? Complexity increases combinatorially with scale which means at some point you're just pushing a rope

by KptMarchewa 4 hours ago

Diminishing returns are still returns.

by rao-v 19 hours ago

It’s really worth distinguishing between old-fashioned student-teacher distillation (i.e. at the level of layers, weights and distributions) and large-scale synthetic dataset creation.

The latter is much better (since you can clean up, review, update responses and filter your datasets).

I suspect nobody is doing real student-teacher distillation; it’s just easier to do a bunch of training on the same giant corpus and then post-train on the synthetic corpus with its reasoning traces etc. (which might have been generated by a bigger, better LLM).

by teleforce 13 hours ago

Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].

Given the release timelines I suspect all 4.x after Opus 4 are probably self-distillation based fine-tuned models. The latest paper by Apple is focusing on code generation using the simple technique hence the name simple self-distillation (SSD) [4],[5].

I've got a strong feeling that self-distillation is the second-best thing to happen to LLMs after the transformer breakthrough.

[1] Self-Distillation Enables Continual Learning [pdf] (25 comments):

https://news.ycombinator.com/item?id=48165265

[2] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

[5] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

by rao-v 8 hours ago

So first - these are terrific papers and I'd not seen some of them before.

Having said that, I don't think these are classic student-teacher distillation from random weights (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper is using exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".

by ACCount37 16 hours ago

A reason to do student-teacher distillation is that soft target logits in general are a richer medium than text that tokenizes to hard targets. More steering signal per teacher token. And running ultra large 10T tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce to text only synthetics.
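To make the soft-versus-hard distinction concrete, here is a minimal sketch in plain Python, not anyone's actual training code. The function names, the four-token vocabulary and the logit values are all made up for illustration. The hard-target loss sees only which token the teacher happened to emit, while the soft-target loss (a KL divergence, one common choice) sees how the teacher ranked every token.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_target_loss(student_logits, target_index):
    """Cross-entropy against one emitted token (text-only synthetic data)."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the whole vocabulary (logit distillation)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative values: the teacher thinks tokens 0 and 1 are both plausible.
teacher = [2.0, 1.5, -1.0, -3.0]
student = [0.0, 0.0, 0.0, 0.0]
print(hard_target_loss(student, target_index=0))  # only "token 0 was right"
print(soft_target_loss(student, teacher))         # also "token 1 was close"
```

With hard targets the student would need many sampled continuations to learn that token 1 was nearly as good as token 0; the soft target carries that in a single position.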

by txhwind 12 hours ago

Could you share some recent articles or papers comparing both methods, especially for the language modelling case? I was not convinced by this claim when reading the original Knowledge Distillation paper. ChatGPT said there are some later works showing: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than the teacher.

by rao-v 15 hours ago

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.

Though you could argue that perhaps labs just save the per token distribution and use that during fine tuning … which starts looking more like student teacher fine tuning if not classic distillation from random weights

by ACCount37 15 hours ago

Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.

by rao-v 8 hours ago

To the previous poster's point, soft distributions are useful: even saving the top 10 logits is significantly more training signal than just the final token.
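A rough sketch of what "saving the top k logits" could look like, in plain Python. Renormalising the kept logits into a sparse distribution is just one possible choice (an assumption here, not something stated in the thread), and the function names are made up for illustration.

```python
import math


def top_k_targets(logits, k):
    """Keep the k largest logits and renormalise them into a sparse distribution."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_cross_entropy(student_logits, sparse_targets):
    """Cross-entropy of the student against a sparse {token_id: prob} target."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(p * (student_logits[i] - log_z) for i, p in sparse_targets.items())


teacher_logits = [3.0, 1.0, 2.0, -5.0]  # illustrative values
targets = top_k_targets(teacher_logits, k=2)
print(targets)  # only k (token_id, prob) pairs are stored per position
print(sparse_cross_entropy([0.5, 0.1, -0.2, 0.0], targets))
```

With k=1 this collapses to ordinary hard-label cross-entropy, so the storage cost scales with k rather than with vocabulary size.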

by txhwind 12 hours ago

I have preferred synthetic datasets since the first day I heard of distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of any performance loss (in the speech and language area).

by DoctorOetker 12 hours ago

One may view pre-training as distillation.

The teacher in this distillation is a corpus of text, and the "next token after the context" would come from looking up the context in the corpus: for each occurrence the label is what followed in the corpus, scaled down by the number of occurrences of the context. The teacher is mute on contexts outside of the corpus though, unlike the usual teacher model in distillation.
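The lookup described above can be sketched as a toy in Python. The whitespace tokenisation, the fixed context length and the names are all assumptions for illustration; real pre-training never materialises this table, it only sees one sampled continuation at a time.

```python
from collections import Counter, defaultdict


def corpus_teacher(tokens, context_len):
    """Build a 'teacher' that returns the empirical next-token distribution."""
    counts = defaultdict(Counter)
    for i in range(context_len, len(tokens)):
        context = tuple(tokens[i - context_len:i])
        counts[context][tokens[i]] += 1

    def teacher(context):
        followers = counts.get(tuple(context))
        if not followers:
            return None  # the corpus has nothing to say about unseen contexts
        total = sum(followers.values())
        # Each observed follower, scaled down by how often the context occurred.
        return {tok: n / total for tok, n in followers.items()}

    return teacher


corpus = "the cat sat on the mat and the cat ran".split()
teacher = corpus_teacher(corpus, context_len=1)
print(teacher(["the"]))  # "cat" followed "the" twice, "mat" once
print(teacher(["dog"]))  # None: context never appears in the corpus
```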

by girvo 16 hours ago

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is though

by rao-v 15 hours ago

Yes absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this)

by thisisaman408 16 hours ago

[dead]

by spwa4 21 hours ago

> I don't disagree, but how much of this ends up being distillation?

A lot, so you can bet tens of millions are flowing to congress to have distillation declared illegal before this happens. And then it'll happen anyway.

by lambda 21 hours ago

Distillation isn't only between different labs.

A lab can train a large model, and then distill a smaller model from it that retains the majority of the useful capability.

I don't know well enough if there's any benefit of that over just training the smaller model directly, but I'll bet there are some times where that is useful. I could easily see it being easier to do the initial pre-training on a larger model but be able to distill everything useful down into a smaller model, essentially filtering out a lot of noise in the process.

by bandrami 8 hours ago

I think the idea is you sink the pretraining costs once and then you can distill multiple specialized models from that

by spwa4 20 hours ago

There used to be training methods like that but I think they've been phased out in favor of letting small models evolve by rewriting their own training material. Surprisingly that's actually cheaper.

by onlyrealcuzzo 21 hours ago

> I don't disagree, but how much of this ends up being distillation?

You don't need distillation. They already have the training sets.

It's MLA + MoE + Medusa (a better version of Speculative Decoding) + 1.58b (possibly - maybe nothing) + GRAM (which will almost certainly not turn out to be a nothing burger, but no one has quickly turned this around yet to prove it).

by Philpax 21 hours ago

It wouldn't be data distillation: instead, it would be teacher-student distillation. The teacher model has stronger representations that the student can mimic, which would give it more capability over training on the data itself.

by semiquaver 20 hours ago

The frontier labs distill their own base models all day long. It’s not just something done by nefarious Chinese copycats. The knowledge embodied by the internal base models that we never see is much more powerful and useful than the much sparser raw training data

by coldtea 19 hours ago

>It’s not just something done by nefarious Chinese copycats

And even that would be rich as an accusation from SOTAs that depend on explicitly disregarding the intellectual property in millions of items of training data.

by flossly 15 hours ago

> nefarious Chinese copycats

LLMs are themselves copycats.

I say thanks for open sourcing and thereby promoting affordable innovation, instead of "nefarious". :)

by manmal 19 hours ago

But how? The training data is the unadulterated content those models are based on? I genuinely don’t understand, no snark.

by wtallis 15 hours ago

Raw training data is raw. A really big model trained on it has already done a first-pass of finding patterns and squeezing out redundancy. Re-ingesting the full training set to train a smaller model is probably more expensive, for marginal quality improvement over distilling from the large model.

by adgjlsfhk1 14 hours ago

Distilling from a larger model is not only probably cheaper than from data, it's also likely higher quality. There's pretty strong support for the proposition that NNs learn a smoothed and regularized version of the data. The NNs are likely higher quality than most of the data they are training from.

by supern0va 20 hours ago

I think you replied to the wrong parent.

by minimaltom 21 hours ago

Frontier labs have their own variants of MLA and certainly their own balance/scaling-laws for things like MoE vs FC vs Attn. MoE scales really well for inference with horizontal scaling + batching, which these guys luv.

On the architecture side, I'm a lot more interested in attention residuals than anything else; it's one of those things that seems obvious in hindsight, and Kimi have proven it at scale.

by onlyrealcuzzo 21 hours ago

> Frontier labs have their own variants of MLA

Yes, variants typically 2-3x less good...

Same with speculative decoding... They all do something, but there are known techniques that are substantially better - that just weren't known when they started development of the previous models.

by amluto 19 hours ago

How useful is speculative decoding in a batched setting where you get paid for throughput (aggregated across users) and you mostly don’t get paid for latency or single-session throughput?

by onlyrealcuzzo 19 hours ago

It's useful at the local level, where there will be SOTA models developed...

by zozbot234 18 hours ago

Local models are moving towards batched inference too, if only for non-interactive use. An early experimental patchset for DS4 (running DeepSeek V4 Flash) seems to show 2x aggregate tok/s decode when processing 8 streams concurrently, and more than 3x when processing as many as 32 streams concurrently. Note that prefill (which is not helped significantly by this change) then becomes a larger fraction of total wall-clock time, so the overall gain is lower (i.e. prefill is akin to a 'serial' task wrt. Amdahl's law).
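The Amdahl's-law framing can be put into numbers with a small Python sketch. The 30% prefill share below is a made-up figure for illustration, not a measurement from the patchset; only the 2x and 3x decode speedups come from the numbers quoted above.

```python
def overall_speedup(prefill_fraction, decode_speedup):
    """Amdahl's law: prefill is the 'serial' part that batching does not help.

    prefill_fraction: share of the original wall-clock time spent in prefill.
    decode_speedup: aggregate decode throughput gain from batching.
    """
    decode_fraction = 1.0 - prefill_fraction
    return 1.0 / (prefill_fraction + decode_fraction / decode_speedup)


# Hypothetical workload where prefill was 30% of wall-clock time before batching.
for streams, decode_gain in ((8, 2.0), (32, 3.0)):
    gain = overall_speedup(0.30, decode_gain)
    print(f"{streams} streams: decode {decode_gain}x -> overall {gain:.2f}x")
```

The larger the prefill share, the further the end-to-end gain falls below the headline decode number, and no decode speedup can push it past 1 / prefill_fraction.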

MTP will still be highly valuable for interactive use of course.
