I assume a selfish benefit is that OpenAI and Google don't want their models to train on their own output. There is just /so much/ AI-generated content online that they definitely need to filter it out somehow when assembling the training data. This is a pretty effective way to do that, with the nice bonus of being mostly good from a PR standpoint.
I immediately thought that was the real reason. Their models will quickly break without some sort of consensus on how to reliably exclude AI-generated content.