This is a property of self-distillation.

Self-distillation shifts the behavior of the model towards that of the model + steering. As such, you don't strictly "need" the tokens to be in-domain for it to work. The logits are a vessel for transferring the steering into the model's internals.

The tokens can be gibberish. What transfers isn't anything about the tokens themselves, but how the steered model's predictions on that gibberish differ from those of an unsteered version of itself.

In this specific case, the behavioral difference comes from the "temperature-shifted, truncated samples" in the "teacher" sampling strategy, and it is that difference that is internalized by the "student" model.
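To make that concrete, here is a minimal sketch (not anyone's actual training setup) of where the distillation signal lives. All names are hypothetical; the "steering" is stood in for by exactly the temperature-shifted, truncated sampling described above, applied to the same logits the unsteered model produces on an arbitrary context:

```python
import math
import random

random.seed(0)

def softmax(logits, temp=1.0):
    zs = [z / temp for z in logits]
    m = max(zs)  # subtract max for numerical stability
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

# Hypothetical stand-in: logits the base model assigns at one position,
# given an arbitrary -- even gibberish -- context.
base_logits = [random.gauss(0, 1) for _ in range(8)]

def teacher_probs(logits, temp=0.7, k=3):
    # "Teacher" = base model + steering, modeled here as a
    # temperature-shifted, top-k-truncated distribution.
    p = softmax(logits, temp)
    cutoff = sorted(p)[-k]                      # k-th largest probability
    p = [q if q >= cutoff else 0.0 for q in p]  # truncate the tail
    s = sum(p)
    return [q / s for q in p]                   # renormalize

student = softmax(base_logits)        # unsteered predictions
teacher = teacher_probs(base_logits)  # steered predictions on the SAME tokens

# The distillation signal is the gap between these two distributions,
# not the content of the tokens themselves.
kl = sum(t * (math.log(t) - math.log(s))
         for t, s in zip(teacher, student) if t > 0)
print(f"KL(teacher || student) = {kl:.4f}")  # strictly positive: signal exists
```

The KL divergence here is nonzero even though the "context" is noise: minimizing it would pull the student toward the steered behavior, which is the sense in which the logits are a vessel for the steering.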

reply
I think we’re agreeing. The point of the sleep parallel is exactly that the content doesn’t matter, and that it’s the filtering process that does the work. Brains replay noisy, sometimes incoherent patterns during sleep, and the value is in how that replay reshapes connection weights, not in whether the replay is accurate. That’s the same principle you’re describing with the steering signal.

I.e., sleep replays don’t need to replay Tuesday’s meeting accurately. They just need to activate the relevant pathways so that the strong ones fire and the weak ones don’t. The pattern of what fires versus what doesn’t is the signal. The “content” of the dream is basically irrelevant.
