Please check the recent self-distillation work by MIT-ETH, UCLA and Apple [1],[2],[3],[4],[5].

Given the release timelines, I suspect that all 4.x models after Opus 4 are fine-tuned with self-distillation. The latest paper, from Apple, focuses on code generation using a simple technique, hence the name simple self-distillation (SSD) [4],[5].

I have a strong feeling that self-distillation is the second-best thing to happen to LLMs, after the transformer breakthrough.

[1] Self-Distillation Enables Continual Learning [pdf] (25 comments):

https://news.ycombinator.com/item?id=48165265

[2] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[3] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models:

https://arxiv.org/abs/2601.18734

[4] Embarrassingly simple self-distillation improves code generation (201 comments):

https://news.ycombinator.com/item?id=47637757

[5] Embarrassingly Simple Self-Distillation Improves Code Generation:

https://arxiv.org/abs/2604.01193

by rao-v, 9 hours ago

So first - these are terrific papers and I'd not seen some of them before.

Having said that, I don't think these are classic student-teacher distillation from random weights (which was my point). In fact, the "Embarrassingly Simple Self-Distillation" paper uses exactly what I was talking about: "fine-tune on those samples with standard supervised fine-tuning".

by ACCount37, 16 hours ago

A reason to do student-teacher distillation is that soft target logits are, in general, a richer medium than text that tokenizes to hard targets: more steering signal per teacher token. And running ultra-large, 10T-tier models in autoregressive generation mode can get expensive. So there are reasons not to reduce everything to text-only synthetics.
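
A minimal sketch of the "richer medium" point, in plain Python. The logits below are made up for illustration, and `hard_target_loss` and `soft_target_loss` are hypothetical helper names, not anyone's actual training code. Two students put the same probability on the teacher's top token, so a hard-target loss cannot tell them apart, while a KL loss against the teacher's full distribution can.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_target_loss(student_logits, target_index):
    """Cross-entropy against a single sampled/argmax token (text-only signal)."""
    return -math.log(softmax(student_logits)[target_index])

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) over the whole vocabulary (soft-target signal)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy 4-token vocabulary; all values are illustrative assumptions.
teacher_logits = [2.0, 1.5, -1.0, -3.0]
student_a = [2.0, 1.5, -1.0, -3.0]   # agrees with the teacher everywhere
student_b = [2.0, -3.0, -1.0, 1.5]   # same top token, wrong ranking of the rest
target = 0                           # the teacher's most likely token

hard_a = hard_target_loss(student_a, target)
hard_b = hard_target_loss(student_b, target)
soft_a = soft_target_loss(student_a, teacher_logits)
soft_b = soft_target_loss(student_b, teacher_logits)

print(f"hard-target loss: a={hard_a:.4f}  b={hard_b:.4f}")
print(f"soft-target loss: a={soft_a:.4f}  b={soft_b:.4f}")
```

On this toy example the hard-target losses are identical, while the soft-target loss is zero for `student_a` and clearly positive for `student_b`. With many sampled tokens the hard-target signal would eventually separate the two as well; the difference is how much signal each teacher token carries.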

by txhwind, 12 hours ago

Could you share some recent articles or papers comparing the two methods, especially for the language modelling case? I was not convinced by this claim when I read the original Knowledge Distillation paper. ChatGPT said some later works show that: 1. the gain may come from label smoothing; 2. soft logits are more meaningful for students much smaller than the teacher.

by rao-v, 15 hours ago

I agree, and if my suspicion is right, it’s rarer because it’s much easier to deploy the large LLM and filter for its best output than to waste time running it on arbitrary output just to train the student.

Though you could argue that perhaps labs just save the per-token distribution and use that during fine-tuning … which starts looking more like student-teacher fine-tuning, if not classic distillation from random weights.

by ACCount37, 15 hours ago

Full distributions are a fucking pain to save - at this point just save the hidden states. But there are lossy compression tricks there.

by rao-v, 9 hours ago

To the previous poster's point, soft distributions are useful; even saving the top 10 logits gives significantly more training signal than just the final token.
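
A rough sketch of what "saving the top 10" could look like, in plain Python. The helper names (`top_k_targets`, `sparse_cross_entropy`) and the synthetic 1000-token distribution are assumptions for illustration only; real teacher distributions will have a different shape, so the captured mass here says nothing about actual models.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def top_k_targets(teacher_logits, k):
    """Keep only the k most likely tokens and renormalize them to sum to 1.

    Returns ({token_id: prob}, mass), where mass is the share of the
    teacher's probability that the kept tokens covered before renormalizing.
    """
    probs = softmax(teacher_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}, mass

def sparse_cross_entropy(student_logits, targets):
    """Cross-entropy of the student against the stored sparse soft targets."""
    log_q = log_softmax(student_logits)
    return -sum(p * log_q[i] for i, p in targets.items())

# Synthetic teacher over a 1000-token vocabulary (illustrative assumption).
vocab_size = 1000
teacher_logits = [-0.5 * i for i in range(vocab_size)]

targets, mass = top_k_targets(teacher_logits, k=10)
loss = sparse_cross_entropy(teacher_logits, targets)

print(f"stored {len(targets)} of {vocab_size} entries, covering {mass:.2%} of the mass")
print(f"sparse soft-target loss for a student equal to the teacher: {loss:.4f}")
```

Storing ten (token id, probability) pairs per position is far smaller than a full-vocabulary distribution, which is one of the lossy compression tricks alluded to upthread; the renormalization step is a design choice, and the leftover tail mass could be handled in other ways.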

by txhwind, 12 hours ago

I have preferred synthetic datasets since the first day I heard about distillation. The engineering friction is much lower than with soft logits, and I have not observed or heard of any performance loss (in the speech and language areas).

by DoctorOetker, 12 hours ago

One may view pre-training as distillation.

The teacher is a corpus of text, and the "next token after the context" is found by looking up the context in the corpus: for each occurrence, the label is whatever followed in the corpus, scaled down by the number of occurrences of the context. The teacher is silent on contexts outside of the corpus though, unlike the usual teacher model in distillation.
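
A toy sketch of this "corpus as teacher" view, in plain Python. It assumes a fixed-length context and whitespace tokens to keep the lookup small (the description above uses the whole context), and `corpus_teacher` and `teacher_distribution` are hypothetical names.

```python
from collections import Counter, defaultdict

def corpus_teacher(tokens, context_len):
    """Count, for every context seen in the corpus, which token followed it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    return counts

def teacher_distribution(counts, context):
    """Soft label for a context: follower counts scaled by occurrences.

    Returns None when the context never appears in the corpus, i.e. the
    corpus teacher has nothing to say about it.
    """
    followers = counts.get(tuple(context))
    if not followers:
        return None
    total = sum(followers.values())
    return {token: n / total for token, n in followers.items()}

# Tiny illustrative corpus (an assumption, not real training data).
corpus = "the cat sat on the mat the cat ran".split()
teacher = corpus_teacher(corpus, context_len=1)

seen = teacher_distribution(teacher, ["the"])
unseen = teacher_distribution(teacher, ["dog"])

print("after 'the':", seen)
print("after 'dog':", unseen)
```

Here "the" is followed by "cat" twice and "mat" once, so the soft label puts two thirds on "cat" and one third on "mat", while the unseen context "dog" gets no label at all. A model teacher would still produce a distribution for "dog", which is the contrast drawn above.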

by girvo, 16 hours ago

> I suspect nobody is doing real student teacher distillation

It gets used for quantisation, basically recovering accuracy for lower quants (Nvidia calls it QAD). Can’t speak to how widespread it is, though.

by rao-v, 16 hours ago

Yes, absolutely! I should have been more specific - I don’t believe people are using it to train 30B models from 300B models (and I’d love to learn that I’m off about this).
