This sounds very wrong to me.

Take the C4 training dataset, for example. Uncompressed and uncleaned, it is ~6 TB and contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1 TB.
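
If anyone wants to eyeball it themselves, here's a minimal sketch that streams the cleaned split without downloading the whole thing, assuming the Hugging Face datasets library and its "allenai/c4" mirror:

    # Stream the cleaned English C4 split; nothing is downloaded up front.
    # Assumes: pip install datasets, and the "allenai/c4" hub mirror.
    from datasets import load_dataset

    ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
    sample = next(iter(ds))
    print(sample["url"])
    print(sample["text"][:200])  # first 200 characters of one cleaned document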

I could go on, but I think it's already pretty obvious that 1 TB is more than enough storage to represent a significant portion of the internet.

reply
This would imply that the English-language internet is not much bigger than ~20x the size of English Wikipedia.

That seems implausible.
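
For what it's worth, the arithmetic behind the 20x figure, assuming roughly 50 GB for English Wikipedia's uncompressed article text (an assumption for illustration, not a measured number):

    # Back-of-envelope for the "20x" claim. Both figures are assumptions.
    wikipedia_gb = 50                      # assumed uncompressed English Wikipedia text
    cleaned_web_gb = 1000                  # the ~1 TB cleaned-C4 figure from upthread
    print(cleaned_web_gb / wikipedia_gb)   # -> 20.0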

reply
> That seems implausible.

Why, exactly?

Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.

reply
A lot of the internet is duplicate data, low-quality content, SEO spam, etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.
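
As a toy illustration of how much exact duplication alone collapses a crawl, here's a minimal sketch of hash-based deduplication. Real cleaning pipelines (C4's included) use fuzzier span- or MinHash-level methods, so this understates the shrinkage:

    import hashlib

    def dedup_exact(docs):
        """Yield each distinct document once, keyed by a hash of its text."""
        seen = set()
        for text in docs:
            h = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                yield text

    docs = ["same page", "same page", "different page"]
    print(list(dedup_exact(docs)))  # -> ['same page', 'different page']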
reply
I would be extremely surprised if it was that small.
reply
This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.
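
To put rough numbers on the lossy-compression framing (all figures here are illustrative assumptions, not measurements):

    # If a model could reproduce its corpus, the "compression ratio" would be
    # corpus bytes / weight bytes. Recall is partial and lossy, so the model
    # stores far less than this; the point is just the order of magnitude.
    params = 70e9                        # hypothetical 70B-parameter model
    bytes_per_param = 2                  # fp16/bf16 weights
    weight_bytes = params * bytes_per_param   # 1.4e11 bytes, ~140 GB
    corpus_bytes = 1e12                  # ~1 TB of cleaned text, per upthread
    print(corpus_bytes / weight_bytes)        # ~7.1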
reply