Not only that: they also ran an experiment with the training temperature turned way up (2.0) and truncation turned off, such that the majority of SFT examples were incoherent (63%, IIRC). Yet the model finetuned on these broken examples still improved over the baseline.
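(For anyone unfamiliar with why temperature 2.0 produces mostly gibberish: sampling temperature divides the logits before the softmax, so T > 1 flattens the token distribution and pushes probability mass onto unlikely tokens. A minimal sketch, with made-up logits just for illustration:)

```python
import math

def temperature_softmax(logits, temperature=1.0):
    # Scale logits by 1/T before softmax. T > 1 flattens the
    # distribution (more incoherent samples); T < 1 sharpens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; higher means a flatter distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for four candidate tokens.
logits = [4.0, 2.0, 1.0, 0.5]
p_normal = temperature_softmax(logits, temperature=1.0)
p_hot = temperature_softmax(logits, temperature=2.0)
# At T=2.0 the entropy is higher: probability spreads toward
# low-likelihood tokens, which is why most sampled examples
# came out incoherent.
```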
They are training the model to (1) produce code (as opposed to answering a question, writing a poem, etc.) and (2) produce output long enough to be a valid solution. So they are doing exactly what I said. Cheers.
In layman's terms: they are putting wet tyres on when it is raining and saying the car performs better over the next lap?