At that point, a lot depends on the quality of the preprocessing applied to the raw text dumps. It is reportedly not that trivial to go from DumpOfSketchyRussianPirateSite.zip to a data set suitable for ingestion during pretraining. A few bad chunks of data can apparently do more harm than one would expect.
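To give a flavor of what that preprocessing involves, here is a minimal Python sketch of the kind of heuristic filtering and exact-dedup pass people describe for raw text dumps. Everything in it is hypothetical and for illustration only: the `dump/*.txt` layout, the `looks_clean` helper, and all thresholds are made up, and real pipelines add fuzzy dedup, language ID, perplexity filters, and much more.

```python
# Illustrative sketch only: cheap heuristics for cleaning a raw text dump
# before pretraining. All names and thresholds here are assumptions, not a
# real pipeline.
import hashlib
import re
from pathlib import Path


def looks_clean(chunk: str) -> bool:
    """Cheap quality heuristics: length, printable ratio, line repetition."""
    if len(chunk) < 200:  # too short to carry useful signal
        return False
    alnum_ratio = sum(c.isalnum() or c.isspace() for c in chunk) / len(chunk)
    if alnum_ratio < 0.8:  # likely OCR junk, markup residue, or binary spill
        return False
    lines = chunk.splitlines()
    if lines and len(set(lines)) / len(lines) < 0.5:  # repeated headers/footers
        return False
    return True


def dedupe_and_filter(paths):
    """Yield chunks that pass the heuristics, skipping exact duplicates."""
    seen = set()
    for path in paths:
        text = Path(path).read_text(errors="ignore")
        # naive chunking on blank lines; real pipelines chunk more carefully
        for chunk in re.split(r"\n{2,}", text):
            digest = hashlib.sha256(chunk.encode()).hexdigest()
            if digest in seen or not looks_clean(chunk):
                continue
            seen.add(digest)
            yield chunk


if __name__ == "__main__":
    kept = list(dedupe_and_filter(Path("dump").glob("*.txt")))
    print(f"kept {len(kept)} chunks")
```

Even a toy filter like this makes the point: a handful of bad chunks (repeated boilerplate, OCR garbage) can slip through and get memorized, which is why the cleaning step matters so much.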
AFAIK Google scans almost everything in print as part of the Google Books initiative, so they may have been able to skip the torrenting step.