The amount of tokens required to properly distill a frontier model is so large that by the time you could consume the # of tokens you would either be banned for extremely obvious abuse or a new model would be released, rendering your efforts less and less valuable over time. Intelligence is not a linear thing. Being behind just a little bit can have exponential consequences.
That seems to be the argument of Dario, Sam et. al., but I'm not ready to believe it. Time will tell, but this can be a marathon and Anthropic and OpenAI is in getting ready to sprint the last lap of the first mile.
Isn't "distillation" of another provider's model exactly how these models got training date in the first place: Massive amounts of the written word + Prompt -> Answer. Why wouldn't distillation produce similar "reasoning" in the new model? It's just inputs and outputs.
The intuition is that distillation exploits not only the "right" answer but the relationship between answers (what's the second most right answer? the third? etc).