Big labs ripped videos off YouTube without caring about the ToS, and grabbed as much published literature as they could get their hands on, regardless of legality (Books3, The Pile). The goal of "democratizing human knowledge" by way of thinking machines is far too noble to worry about frivolities like copyright and authorial consent, they said. Until it was their output being exploited, and their earning potential threatened.
We just had years of US model providers arguing it was fine to rip off the world’s cultural output for their own profit. Why should their work be treated any differently?
True, but why would end users care about that? If anything, training on synthetic AI output is more ethical than training on scraped human works (which is not to say the Chinese labs aren't doing the latter too).