You're probably overcomplicating it; as the paper says, it's embarrassingly simple: given a problem set, generate one response per problem at a fixed temperature and a fixed truncation length, then fine-tune the model on those generations.
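A minimal sketch of that loop, using a toy stand-in for the model (a dict of candidate continuations with logits; a real setup would sample from an actual LM):

```python
import math
import random

def sample_with_temperature(options, temperature, rng):
    """Sample one continuation from (text, logit) pairs at a fixed temperature."""
    texts, logits = zip(*options)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(texts, weights=weights, k=1)[0]

def build_self_training_set(problems, toy_model, temperature=0.7, max_len=32, rng=None):
    """One generation per problem, truncated, paired back with its prompt."""
    rng = rng or random.Random(0)
    dataset = []
    for problem in problems:
        generation = sample_with_temperature(toy_model[problem], temperature, rng)
        dataset.append((problem, generation[:max_len]))  # fixed truncation
    return dataset
```

The resulting (problem, generation) pairs are what you'd hand to an ordinary supervised fine-tuning step; nothing in the data-collection side is more exotic than that.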

Their hypothesis for why this works takes a bit more architectural background, but roughly: when a model generates code, some positions have only one right answer and others have many valid options, yet the model has to use one global confidence setting (the sampling temperature) for both. Sampling at a specific temperature, filtering out garbage tokens, and then training on those outputs teaches the model to internalize 'be precise where there's one answer, stay open-minded where there are several', without anyone labeling which positions are which.
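To see why one global temperature is a compromise, here's a small demo with two made-up logit vectors: a peaked position (one right answer) and a flat position (several valid options). The same temperature barely moves the peaked distribution but spreads the flat one further:

```python
import math

def softmax_T(logits, T):
    """Softmax at temperature T: higher T flattens, lower T sharpens."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((l - m) / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# hypothetical logits for the two kinds of positions
peaked = [8.0, 0.0, 0.0, 0.0]   # e.g. the token after `1 + 1 = `
flat   = [2.0, 1.8, 1.7, 0.0]   # e.g. the token after `position: `

for T in (0.5, 1.0, 1.5):
    print(T, round(max(softmax_T(peaked, T)), 3), round(max(softmax_T(flat, T)), 3))
```

Whatever single T you pick is a trade-off between the two kinds of positions; the claim is that training on the model's own temperature-sampled outputs bakes the right per-position sharpness into the weights instead.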

Note that there's a lot more nuance to this and I simplified a lot.

ELI 5

You teach the machine by asking it to solve some problems; then, whatever answers it gives, you say "that's exactly right, now we train on those answers YOU just gave me" (even if they're wrong), and repeat. Somehow THAT works over time.

If the probability mass is on a single token, it's a precise answer, like the token after `1 + 1 = `. If the predicted next token shares probability with other tokens, there are multiple valid answers, like the token after `position: `.
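That distinction can be quantified as the entropy of the next-token distribution (the probability values below are made up for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; 0 means all mass is on one token."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# hypothetical next-token distributions
after_math = [0.98, 0.01, 0.01]        # `1 + 1 = ` -> almost surely "2"
after_css  = [0.30, 0.28, 0.22, 0.20]  # `position: ` -> several valid values

print(entropy(after_math))  # low: one right answer
print(entropy(after_css))   # near log2(4) = 2 bits: many live options
```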

You can also get more training data out of each problem by varying the length at which the generated code is truncated.
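One way to read that suggestion (my interpretation, not spelled out above): turn a single sampled generation into several training pairs by cutting it at different lengths.

```python
def length_variants(prompt, generation, lengths=(16, 32, 64)):
    """Turn one generation into several training pairs by truncating it at
    different lengths (keeping only lengths the generation actually reaches)."""
    return [(prompt, generation[:n]) for n in lengths if n <= len(generation)]
```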
