At that point, a lot depends on the quality of the preprocessing applied to the raw text dumps. It is reportedly not that trivial to go from DumpOfSketchyRussianPirateSite.zip to a data set suitable for ingestion during pretraining. A few bad chunks of data can apparently do more harm than one would expect.
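To give a flavor of what that preprocessing involves, here is a minimal Python sketch of the kind of heuristic filtering and exact-dedup pass people describe for raw text dumps. Everything in it is hypothetical and for illustration only: the `dump/*.txt` layout, the `looks_clean` helper, and all thresholds are made up, and real pipelines add fuzzy dedup, language ID, perplexity filters, and much more.

```python
# Illustrative sketch only: cheap heuristics for cleaning a raw text dump
# before pretraining. All names and thresholds here are assumptions, not a
# real pipeline.
import hashlib
import re
from pathlib import Path


def looks_clean(chunk: str) -> bool:
    """Cheap quality heuristics: length, printable ratio, line repetition."""
    if len(chunk) < 200:  # too short to carry useful signal
        return False
    alnum_ratio = sum(c.isalnum() or c.isspace() for c in chunk) / len(chunk)
    if alnum_ratio < 0.8:  # likely OCR junk, markup residue, or binary spill
        return False
    lines = chunk.splitlines()
    if lines and len(set(lines)) / len(lines) < 0.5:  # repeated headers/footers
        return False
    return True


def dedupe_and_filter(paths):
    """Yield chunks that pass the heuristics, skipping exact duplicates."""
    seen = set()
    for path in paths:
        text = Path(path).read_text(errors="ignore")
        # naive chunking on blank lines; real pipelines chunk more carefully
        for chunk in re.split(r"\n{2,}", text):
            digest = hashlib.sha256(chunk.encode()).hexdigest()
            if digest in seen or not looks_clean(chunk):
                continue
            seen.add(digest)
            yield chunk


if __name__ == "__main__":
    kept = list(dedupe_and_filter(Path("dump").glob("*.txt")))
    print(f"kept {len(kept)} chunks")
```

Even a toy filter like this makes the point: a handful of bad chunks (repeated boilerplate, OCR garbage) can slip through and get memorized, which is why the cleaning step matters so much.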
AFAIK Google scans almost everything in print as part of the Google Books initiative, so they may have been able to skip the torrenting step.