> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.
In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).
What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
I love that we're still learning the emergent properties of LLMs!
TBH, this is (very much my opinion btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, for example, to what extent free will is a thing). Hell, the emergent properties of traffic were not understood or given proper attention, even though a researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:
> 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)
So it's pretty cool we're learning new things about LLMs, sure, but it's barely surprising that we're still learning it.
(Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)
I dare say that in some ways, we understand LLMs better than humans, or at least the interpretability tools are now superior. Awkward place to be, but an interesting one.
Are you surprised we understand them better than brains?
That's a bit of an overstatement.
The entire field of ML is aimed at problems where deterministic code would work just fine, but the amount of cases it would need to cover is too large to be practical (note, this has nothing to do with the impossibility of its design) AND there's a sufficient corpus of data that allows plausible enough models to be trained. So we accept the occasionally questionable precision of ML models over the huge time and money costs of engineering these kinds of systems the traditional way. LLMs are no different.
What you are saying is fantasy nonsense.
> but the amount of cases it would need to cover is too large to be practical (note, this has nothing to do with the impossibility of its design)
So it doesn't work.
The GP said, "I'm a psychiatry resident".
The entire industry is propped up by misinformed people burping up the CEO farts they are sucking.
You would be sorely mistaken to think I'm utterly uninformed about LLM-research, even if I would never dare to claim to be a domain expert.
Very, monsieur Laplace.
We have tons of low-hanging fruit across all fields of science and engineering to be picked, in the form of different ways to apply and chain the models we have, different ways to interact with them, etc. - enough to fuel a good decade of continued progress in everything.
Much as Diogenes mocked Plato's definition of a man with a plucked chicken, LLMs revealed what "real" AI would require: continual learning. That isn't to diminish the power of LLMs (they are useful), but that limitation is a fairly hard one to overcome if true AGI is your goal.
From what I understand, a living neural network learns several orders of magnitude more efficiently than an artificial one.
I'm not sure where that difference comes from. But my brain probably isn't doing backpropagation; it's probably doing something very different.
(E.g. different kinds of learning for long-term memory, short-term memory, language, faces, and reflexes.)
The intersection of what with physics?
Sir Roger Penrose, on quantum consciousness (and there is some regret on his part here) -- OR -- Jacob Barandes, for much more current thinking in this sort of intersectional, exploratory direction.
> The earliest reference to the brain occurs in the Edwin Smith Surgical Papyrus, written in the 17th century BC.
I was actually thinking of the ancient Greeks when writing my comment, but I suppose the Egyptians have even older records than them.
I think that with grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether if only one token is allowed by grammar and just insert it, but I don't think that any of the current, widely used combinations of models/harnesses use it. And it only skips inference in rare edge cases.
I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.
[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...
[1] https://developers.redhat.com/articles/2025/06/03/structured...
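The token-skipping trick looks something like this (a toy sketch; the grammar and helper names are made up, not llama.cpp's actual API):

```python
# Toy sketch: skip the model call when the grammar forces a single token.
# `allowed_tokens` and `model_pick` are made-up stand-ins, not a real API.

def allowed_tokens(prefix):
    """Pretend grammar: right after '{' only '"' is legal; elsewhere several are."""
    if prefix.endswith("{"):
        return ['"']                 # "lock": grammar forces one token
    return ['"', "}", ":", "a"]      # "fork": the model must choose

def model_pick(prefix, candidates):
    # Stand-in for an actual (expensive) forward pass plus sampling.
    return candidates[0]

def decode(prefix, steps):
    skipped = 0
    for _ in range(steps):
        cands = allowed_tokens(prefix)
        if len(cands) == 1:
            prefix += cands[0]       # free token: no inference needed
            skipped += 1
        else:
            prefix += model_pick(prefix, cands)
    return prefix, skipped
```

As noted above, this only pays off in the rare positions where the grammar pins down exactly one continuation.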
Making coding agents spit out syntactically correct code token by token is like asking a human to code on a whiteboard.
We kinda have a little bit of it with some coding harnesses giving the model access to an LSP, but I think we could insert this knowledge at a lower level if we found a clever way to utilize it during sampling.
I think that there is a lot of low hanging fruit in this area.
And in general, I think people try to use LLMs too much to solve problems that can be easily solved by computationally cheaper and, more importantly, deterministic tools.
For example, back in the day when LLM-assisted coding just became a thing people very often complained about models generating syntactically incorrect code and inventing non-existent library methods.
Well, I, an experienced human programmer, probably would also be making syntax mistakes and inventing non-existent methods if you stripped me of my tools and made me write code in a bare text editor without syntax highlighting.
Thankfully, my IDE would autocomplete real syntax and actually existing library methods for me and immediately give me feedback if I make a mistake anyway. And all of it is achieved using reliable deterministic code without the inherent issues of statistical models.
I think that it is really inefficient to reach for an expensive and unreliable tool when a cheap and reliable tool will do.
1. code
2. syntax check / build / format / lint (details language dependent)
3. test
and they can hop between 1 and 2 however many times they want.
I do think there is some merit in a tool that dumps all namespaces and reachable symbols so the agent can do its own autocomplete without a round-trip.
As a human coder you don’t summon IntelliSense. It just pops up in your visual field as extra input: contextual cues.
You could force intellisense state into the context vector the LLM receives.
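A crude version of forcing that state into the context (the prompt format and helper here are entirely made up) would be to splice the completion candidates at the cursor into the text the model sees:

```python
def build_prompt(code_before_cursor, completions):
    """Fold 'intellisense state' into the text the model sees (made-up format)."""
    hints = ", ".join(completions)
    return (
        f"# valid completions at cursor: {hints}\n"
        f"{code_before_cursor}"
    )
```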
I once asked an LLM if it could ingest code from an interactive session more easily if it were in appropriately-typed markdown fences, and it said absolutely yes, and that the syntax highlighting fed to it that way helps it immensely. I was downright shocked that syntax highlighting was anything more than noise to them.
I think speculative decoding counts as a (perhaps crude) way of implementing this?
There's a lot of work going on in various streams towards making it possible to vary compute per-token, dynamically, e.g. universal transformers. Maybe one day it'll work well enough to beat conventional techniques.
I got unstuck by randomizing the field order for each row?!? At training time, and now I'm thinking I should do the same at inference time...
> This is probably due to the way larger numbers are tokenised, as big numbers can be split up into arbitrary forms. Take the integer 123456789. A BPE tokenizer (e.g., GPT-style) might split it like: ‘123’ ‘456’ ‘789’ or: ‘12’ ‘345’ ‘67’ ‘89’
One of the craziest LLM hacks that doesn't get love is https://polymathic-ai.org/blog/xval/
xVal basically says "tokenizing numbers is hard: what if instead of outputting tokens that combine to represent numbers, we just output the numbers themselves, right there in the output embedding?"
It works! Imagine you're discussing math with someone. Instead of saying "x is twenty five, which is large" in words, you'd say "x is", then switch to making a whistling noise in which the pitch of your whistle, in its position within your output frequency range, communicated the concept of 25.00 +/- epsilon. Then you'd resume speech and say "which is large".
I think the sentiment is that today's models are big and well-trained enough that receiving and delivering quantities as tokens representing numbers doesn't hurt capabilities much, but I'm still fascinated by xVal's much more elegant approach.
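A toy sketch of what that looks like mechanically (dimensions and names made up; the real xVal multiplies a learned [NUM] token embedding by the scalar inside a transformer):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # toy embedding width
token_emb = {"x": rng.normal(size=d), "is": rng.normal(size=d)}
num_emb = rng.normal(size=d)            # one shared [NUM] direction

def embed(tokens):
    """xVal-style: a number becomes the shared [NUM] vector scaled by its value,
    so magnitude, not token identity, carries the quantity."""
    rows = []
    for t in tokens:
        if isinstance(t, (int, float)):
            rows.append(t * num_emb)    # value lives in the vector's magnitude
        else:
            rows.append(token_emb[t])
    return np.stack(rows)

seq = embed(["x", "is", 25.0])          # the 25.0 is continuous, not tokenized
```

The upshot is that "123456789" is one point on a line instead of an arbitrary split into BPE pieces.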
Purely artistic creation, creating something that does not exist and cannot be derived from anything prior, means that locking can be more diffuse, not as settled.
I think you are implying a reverse causation. They used a metaphor from us.
In Nemotron, the high-perplexity solutions are selected for RL; in VLM training, a few people are looking at the entropy distributions of the training set; etc.
>I love that we're still learning the emergent properties of LLMs!
There's tons of low-hanging fruit there.
"Simple Self-Distillation". We already had that acronym for Solid-State Drive. Don't know about the technique, but the naming sure sounds... simple?
Self-distillation was recently shown to be very efficient and effective, back in January this year, by an MIT and ETH team in their Self-Distillation Fine-Tuning (SDFT) LLM system [1], [2].
That work is also this paper's closest competitor, named On-Policy Self-Distillation in the comparison table.
I hope they keep the original work's real name, Self-Distillation Fine-Tuning or SDFT. Imagine a later paper citing this very paper as cross-entropy self-distillation instead of its very own given name, Simple Self-Distillation or SSD. Although I'll admit it's a lousy name that breaks the namespace with the common SSD nomenclature for solid-state drive, as others have rightly pointed out.
I think they should have given proper credit to this earlier seminal work on SDFT, but apparently they just put it in as one of the systems in their benchmark without explaining much of the connection and lineage, which is a big thing in research publication.
[1] Self-Distillation Enables Continual Learning:
https://arxiv.org/abs/2601.19897
[2] Self-Distillation Enables Continual Learning:
That already looks like Sonnet 3.x and 4 level capabilities to me, where the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv, etc.
Add this Simple Self-Distillation to the picture, and by 2028 I see cheaper coding model providers with much more generous usage limits, with power users mostly running their own models anyway.
Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI providers.
Right now it feels like hammering a house onto a nail instead of the other way around.
LLMs have something that's not entirely unlike the "g factor" in humans - a broad "capability base" that spans domains. The best of the best "coding LLMs" need both good "in-domain training" for coding specifically and a high "capability base". And a lot of where that "base" comes from is: model size and the scale of data and compute used in pre-training.
Reducing the model scale and pruning the training data would result in a model with a lower "base". It would also hurt in-domain performance - because capabilities generalize and transfer, and pruning C code from the training data would "unteach" the model things that also apply to code in PHP.
Thus, the pursuit of "narrow specialist LLMs" is misguided, as a rule.
Unless you have a well defined set bar that, once cleared, makes the task solved, and there is no risk of scope adjustment, no benefit from any future capability improvements above that bar, and enough load to justify the engineering costs of training a purpose-specific model? A "strong generalist" LLM is typically a better bet than a "narrow specialist".
In practice, this is an incredibly rare set of conditions to be met.
There are hardware-based limitations in the size of LLMs you can feasibly train and serve, which imposes a limit in the amount of information you can pack into a single model's weights, and the amount of compute per second you can get out of that model at inference-time.
My company has been working on this specifically, because even now most researchers don't seem to really understand that this is just as much an economics and knowledge problem (cf. Hayek) as it is an "intelligence" one.
It is much more efficient to strategically delegate specialized tasks, or ones that require a lot of tokens but not a lot of intelligence, to models that can be served more cheaply. This is one of the things that Claude Code does very well. It's also the basis for MoE and some similar architectures with a smarter router model serving as a common base between the experts.
...with a fair amount of supervision, while frontier models would be running circles around them using project-specific memory and on-demand training (or whatever we would have by then).
If you're building something groundbreaking and new, the advantage will be slim to none.
https://ai.meta.com/research/publications/adaptive-decoding-...
(Not fine-tuning, but interesting nonetheless. If a model can so easily find a more elegant solution, why didn't it pick that in the first place?)
We really need to develop better tools to understand what's happening inside these NNs. Working with high-D spaces is not something we're good at, and we're basically throwing stuff at it and seeing if it sticks.
I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.
I often find, if I've got a complicated solution, it’s because I haven’t fully examined the problem.
So no, they are not fine-tuning a general purpose model to produce "valid benchmark code results."
So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?
They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.
This effectively "folds" the logit tail truncation behavior into the model itself.
Not entirely unlike a few "model controlled sampling settings" things I've seen in what it does, but different in execution.
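If I read it right, the moving parts are roughly these (a NumPy sketch of the general idea, not the paper's actual code; the temperature/top-k settings are the ones quoted elsewhere in the thread):

```python
import numpy as np

def teacher_target(logits, temperature=1.6, top_k=20):
    """Soft target: the model's OWN distribution, temperature-shifted and with
    the logit tail truncated, then renormalized."""
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    keep = np.argsort(p)[-top_k:]        # survivors of the truncation
    q = np.zeros_like(p)
    q[keep] = p[keep]
    return q / q.sum()

def distill_loss(student_logits, target):
    """Cross-entropy of the student against its own truncated 'teacher'."""
    z = student_logits - student_logits.max()
    log_p = z - np.log(np.exp(z).sum())
    return float(-(target * log_p).sum())
```

Minimizing this over many prompts is what "folds" the truncation into the weights: after training, the student puts near-zero mass on the tail even under plain sampling.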
You use the outputs from the first run (right or wrong) as answers for the second training run, and repeat. Magically it works. That's what's so surprising.
I guess a theory is that, because there are so many diverse ways to be wrong, the errors don't accumulate... still seems surprising, and it would be interesting to see if it works in other domains.
Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further to suggest that it magnified biases. I had my doubts about their conclusions. If it were true, it would be a much greater breakthrough, because the ability to magnify a property represents a way to measure a weak version of that property. The ability to do that would mean they had found a way to provide a training signal to avoid bias. It would be great if that's what they did, but I suspect there would have been more news about it.
Perhaps this paper will put to rest the notion that AI output is useless as training data. It has only ever been the case that it was useless as an indiscriminate source of data.
It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."
EDIT: For context:
> Shumailov et al. (2024) — "AI models collapse when trained on recursively generated data" (Nature, 2024)

Their hypothesis as to why this works requires a bit more knowledge about model architecture, but basically when a model generates code, some positions have only one right answer and some have many valid options - but the model has to use one global confidence setting for both. Sampling with a specific temperature + a garbage-token filter, then training on those outputs, teaches the model to internalize 'be precise where there's one answer, stay open-minded where there are several' - without anyone labeling which is which.
Note that there's a lot more nuance to this and I simplified a lot.
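A toy way to see the lock/fork distinction is through next-token entropy (all numbers made up for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a next-token distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

lock = [0.97, 0.01, 0.01, 0.01]   # e.g. a closing paren: one right answer
fork = [0.30, 0.28, 0.22, 0.20]   # e.g. loop vs comprehension: several work

assert entropy(lock) < entropy(fork)
```

One global temperature has to serve both kinds of position at once, which is the "precision-exploration conflict" the paper names.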
You teach the machine by asking it to solve some problems, and then whatever answer it gives you say "That's exactly right. Now we train on those answers YOU just gave me" (even if they are wrong) and repeat. Somehow THAT works over time.
You can generate and train on answers by varying the length of the generated code.
I know virtually nothing about this area but my naive take is that something that means it still only passes tests around half the time doesn't seem like a particularly big jump forwards.
What am I missing?
But no one quotes those any more because, if everyone passes them, they don't serve any useful purpose in discriminating between different models or identifying advancements.
So people switch to new benchmarks which either have more difficult tasks or some other artificial constraints that make them in some way harder to pass, until the scores are low enough that they're actually discriminating between models. And a 50% score is in some sense ideal for that: there's lots of room for variance around 50%.
(whether the thing they're measuring is something that well correlates to real coding performance is another question)
So you can't infer anything in isolation from a given benchmark score being only 50%, other than that benchmarks are calibrated to make such scores the likely outcome.
This feels eerily similar to sleep consolidation or synaptic pruning
I think the analogy is actually pretty specific to this paper, not just self-distillation in general.
During sleep, your brain replays experiences, but noisily and distorted. The replays are often incoherent as narratives (dreams are weird). But the consolidation still works because the value isn't in the narrative coherence; it's in the activation patterns at each moment. Important pathways get strengthened, weak ones get pruned. Section 4.4 of this paper is what makes the connection click. They cranked training temperature to 2.0 with no truncation. 62% of the sampled outputs had no extractable code: coherent Python devolving into multilingual gibberish halfway through. The model still improved (+5.7pp pass@1).
This makes no sense if you think the model is learning from good code examples. But it makes a lot of sense if you think of it as the model replaying its own knowledge back to itself in a noisy/distorted form, and the replay process strengthening what matters (sharp distributions at "lock" positions where one token is correct, broad distributions at "fork" positions where multiple approaches work) while pruning what doesn't (distractor tails). The model doesn't learn anything new. It just wakes up performing better because what it already knew got cleaned up.
How is this comment not at number 1??
Self-distillation shifts the behavior of the model towards that of the model + steering. As such, you don't strictly "need" the tokens to be in-domain for it to work. The logits are a vessel for transferring the steering into the model's internals.
The tokens can be gibberish. What transfers isn't whether they're gibberish or not, but how the flavor of model predictions, if given gibberish, differs from that of an unsteered version of itself.
In this specific case, the behavioral difference comes from the "temperature-shifted, truncated samples" in the "teacher" sampling strategy, and it is that difference that is internalized by the "student" model.
I.e., sleep replays don’t need to replay Tuesday’s meeting accurately. They just need to activate the relevant pathways so that the strong ones fire and the weak ones don’t. The pattern of what fires versus what doesn’t is the signal. The “content” of the dream is basically irrelevant.
My locally running Mistral 7b is 100x better at modern JavaScript than any model on the market, mainly just from RAG on my own code samples.
That's basically what they are describing with "post-training"; the TL;DR is that code, especially code of a certain style, is vastly simpler than written language.
You really don't need a huge model or data centers, etc. You just need a small but good model like Mistral 7b and literally a few good samples.
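For anyone wondering what "RAG on my own code samples" can mean at its most minimal (a sketch; a real setup would use embeddings rather than token overlap):

```python
def retrieve(query, snippets, k=2):
    """Rank stored code snippets by naive token overlap with the query."""
    q = set(query.lower().split())
    return sorted(snippets,
                  key=lambda s: -len(q & set(s.lower().split())))[:k]

def rag_prompt(query, snippets):
    """Prepend the retrieved snippets to the task as few-shot context."""
    context = "\n---\n".join(retrieve(query, snippets))
    return f"Relevant examples:\n{context}\n\nTask: {query}"
```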
But you guys keep doing you lol. A bunch of non-devs trying to solve code is pretty funny to watch.
Mutation-rate modulation is the AI engineers' heat. And selection does the trimming of the outliers.
Some more serious biomorphic thinking and we may get to the next big insight, courtesy of 3+ billion years of evolution: evolution that enabled a great ape species to write a paper like this and build LLMs like Gemma 4 that totally rock on a 3.5-pound MacBook Pro M5 Max with 128 GB of RAM.
If you sample from the base model with T=1.6, top_k=20, top_p=0.8 (i.e., the decode settings used for the distillation's ground truth), does it match the SSD'd model + some decoding, performance-wise?
Their sweep is missing this and only covers "standard" decoding settings.
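For reference, those settings amount to this sampling rule (a NumPy sketch; the model's forward pass that produces the logits is out of scope):

```python
import numpy as np

def sample(logits, temperature=1.6, top_k=20, top_p=0.8, rng=None):
    """Temperature + top-k + nucleus (top-p) sampling in one pass."""
    if rng is None:
        rng = np.random.default_rng()
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    order = np.argsort(p)[::-1]                      # most likely first
    p_sorted = p[order]
    # nucleus cut: smallest prefix whose cumulative mass reaches top_p
    n_nucleus = int(np.searchsorted(np.cumsum(p_sorted), top_p)) + 1
    keep = min(top_k, n_nucleus)
    cands, probs = order[:keep], p_sorted[:keep]
    return int(rng.choice(cands, p=probs / probs.sum()))
```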
But to filter based on author's names sounds pretty darn racist.
They seemed like they had to be churning out papers and any little adaptation to existing research triggered a new publication.
But it may have changed now.
"Made in China, designed by Apple in California"
should be:
"Made in China, designed by Chinese people in California"?
Sorry apple, SSD is already taken, you can't use that acronym.
Consistency Preservation Update (CPU)
Guided Probability Update (GPU)
History-aware Distillation Driving (HDD)
Probability Smoothing Update (PSU)
Title should be: Simple Self-Distillation Improves Code Generation
Many computer science paper titles allude to past titles in other CS papers.
Calling it “cringe worthy” is unnecessarily mean. There is context and history you don’t understand.
There are two distinct billions. https://en.wikipedia.org/wiki/Billion