Appreciate the long answer. Why is it more likely that Gemini 3 Pro/Flash/Lite are distillations of the same parent model than that they’re different training runs on the same dataset, with minor version bumps being different post-training setups?
reply
The biggest tell is how far labs stagger their small-model releases from their big-model releases. If the small models (Flash, Sonnet/Haiku) were being distilled from pro models, you'd consistently see them released fairly soon after each new pro release to maximize their competitiveness (and this was the case early on for Anthropic). Instead, it seems like releases are timed to build and maintain hype.

Also keep in mind: if a lab releases a smaller model halfway between two well-spaced big-model releases, and that small model was distilled from the next big model, then the big model was already good enough to distill from. So why wait so long to release it? Being able to demonstrate AI superiority is worth a ton, so there's no reason to hold back.