No. There's no "answer" really.

They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.

This effectively "folds" the logit tail truncation behavior into the model itself.

In what it does, it's not entirely unlike some of the "model-controlled sampling settings" ideas I've seen, but different in execution.
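As a rough sketch of what that "folding in" looks like (all names and settings here are made up for illustration, not taken from any particular paper): the distillation target is the model's own output distribution after temperature scaling and top-p (nucleus) truncation, and the student — the same model, sampled plainly — is trained against it with an ordinary cross-entropy loss:

```python
import numpy as np

def truncated_teacher_dist(logits, temperature=0.7, top_p=0.9):
    # Hypothetical helper: build the distillation target by applying
    # temperature scaling and top-p truncation to the model's own
    # logits -- the "teacher" is the same model, just sampled with
    # different temperature/truncation settings.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Nucleus truncation: keep the smallest set of tokens whose
    # cumulative probability reaches top_p, zero out the tail.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1
    target = np.zeros_like(probs)
    keep = order[:cutoff]
    target[keep] = probs[keep]
    target /= target.sum()
    return target

def distill_loss(student_logits, target):
    # Cross-entropy of the student's full (untruncated) log-softmax
    # against the truncated teacher distribution. Minimizing this
    # pushes the student's probability mass off the logit tail, so
    # plain sampling from the trained model behaves like truncated
    # sampling from the original.
    s = student_logits - student_logits.max()
    log_q = s - np.log(np.exp(s).sum())
    return -(target * log_q).sum()
```

For example, `truncated_teacher_dist(np.array([2.0, 1.0, 0.5, -1.0, -3.0]))` zeroes the low-probability tail and renormalizes, and `distill_loss` scores the same logits against that target.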

reply
Yeah basically.

You use the outputs from the first run (right or wrong) as answers for the second training run, and repeat. Magically it works. That's what's so surprising.

I guess one theory is that there are so many diverse ways to be wrong that the errors don't accumulate... it still seems surprising, and it would be interesting to see whether it works in other domains.
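A toy version of that loop, just to show the shape (everything here is invented for illustration — a dict of per-question answer probabilities stands in for the model, and nothing ever checks whether an answer is right):

```python
import random

def self_train(model, questions, rounds=3, samples=8, lr=0.5):
    # Toy sketch of iterated self-training: each round, sample answers
    # from the current model (right or wrong), take the most frequent
    # sample as the "label", and train on it -- then repeat with the
    # updated model.
    for _ in range(rounds):
        labels = {}
        for q in questions:
            dist = model[q]
            draws = random.choices(list(dist), weights=list(dist.values()),
                                   k=samples)
            labels[q] = max(set(draws), key=draws.count)  # self-generated label
        for q, ans in labels.items():
            dist = model[q]
            for a in dist:
                # Move probability mass toward the self-label; the
                # update preserves the total probability of 1.
                target = 1.0 if a == ans else 0.0
                dist[a] += lr * (target - dist[a])
    return model

model = {"q1": {"A": 0.6, "B": 0.3, "C": 0.1},
         "q2": {"A": 0.2, "B": 0.5, "C": 0.3}}
self_train(model, ["q1", "q2"])
```

In this toy, the distributions just sharpen toward whatever the model already favored, correct or not — which is exactly why it's surprising that the real version doesn't compound its own mistakes.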

reply