undefined

points

[-]

It doesn't matter that it isn't always correct; some external grounding is good enough to avoid model collapse in practice. Otherwise training coding agents with RL wouldn't work at all.

by judahmeek2 days ago|

parent|

[-]

And how do you verify that external grounding?

by catlifeonmars2 days ago|

parent|

prev|

[-]

What precisely do you mean by external grounding? Do you mean the laws of physics still apply?

by andy12_2 days ago|

parent|

[-]

I mean it in the sense that tokens that pass some external filter (even if that filter isn't perfect) are from a very different probability distribution than those that an LLM generates indiscriminately. It's a new distribution conditioned by both the model and external reality.

Model collapse happens in the case where you train your model indefinitely with its own output, leading to reinforcing the biases that were originally picked up by the model. By repeating this process but adding a "grounding" step, you avoid training repeatedly on the same distribution. Some biases may end up being reinforced still, but it's a very different setting. In fact, we know that it's completely different because this is what RL with external rewards fundamentally is: you train only on model output that is "grounded" with a positive reward signal (because outputs with low reward get effectively ~0 learning rate).

by catlifeonmars5 hours ago|

parent|

[-]

Oh interesting. I guess that means you need to deliberately select a grounding source with a different distribution. What sort of method would you use to compare distributions for this use case? Is there an equivalent to an F-test for high dimensional bit vectors?