undefined

points

[-]

From what I understand, at this point, the main value of stronger model outputs is simply to bootstrap reasoning behavior during the RL post-training phase. It gets you past the “cold start” problem with RL, after which the outputs aren’t needed anymore. From then on, it’s hill climbing and that requires environments for the model to interact with get rewards from.