They also changed the image encoder, so I'm thinking "new base model". Whatever base that was powering 4.5/4.6 didn't last long then.
It's just speculative decoding but for training. If they did at this scale it's quite an achievement because training is very fragile when doing these kinds of tricks.
Not really similar to speculative decoding?
I don't think that's what they've done here though. It's still black magic, I'm not sure if any lab does it for frontier runs, let alone 10T scale runs.