by greesil 1 hour ago

by esafak 3 minutes ago

https://en.wikipedia.org/wiki/Reinforcement_learning#Policy

by antonvs 27 minutes ago

> one could just call it model output.

That would be incorrect. My other reply attempts to address this.

by greesil 18 minutes ago

But the probability vector is the output of the LLM, no?
