IMHO this is where code review goes until we fix the individualized model thing: you need to review the decisions the agent made, where you didn't steer. Most will be right. A few will be disastrously wrong. But decision-by-decision is a lot less to review than line-by-line of code.
I wonder if this is even desirable from a product perspective. You probably don't want online learning in a product that you are selling because you can't guarantee a consistent quality of the product.
And to be fair, the ability to fire employees and hire new ones is pretty important for that reason. In cases where you can't easily fire employees (e.g. unions), you encounter the very problem you're describing, and it often leads to companies preferring more consistent automations.