By the way, I don't expect you to pay me for this comment. You can just read it for free. You're welcome.
Also, how about making proper arguments yourself? The vast majority of the training data isn't generated by company-paid AI experts either.
Notably, books, even though they don't form a large part of the training data, significantly improve performance on some tasks (the same way expert-generated data does).
Why do you think the AI labs are so eager to scan (and then destroy) every book on the planet?
If you removed all copyrighted works from the training corpus, the model would be notably weaker.
It doesn't absolve them of any theft, but it does make the assertion that they should be required to release their models to the public seem, to me, a bit farcical. There are dozens of free and open-weights models that have all been trained on exactly the same web crawls and books as GPT-5 and Opus. The proprietary models are better because of proprietary data.
Even if the other models were trained on the same data (which is unlikely, since they had less time and money to scrape it and fewer lawyers to let them risk something like piracy), the proprietary models are still largely built on public data and wouldn't exist without it. At the very least, they should release the intermediate model, before training on their proprietary data. Not that that's how that works...
Source? Otherwise this is pure speculation.