That seems like a reductive way to consider it. What percent of music was created by Led Zeppelin? What percent of art was painted by Monet? What percent of films by Alfred Hitchcock? It may be a small percentage objectively but they are hugely influential.
reply
I don't think backpropagation cares whose text it is backpropagating.
reply
The data sets aren't naively fed into the training runs.

Instead, training attempts to sample more heavily from higher-quality sources, with, I'm sure, a mix of manual and heuristic labeling.
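
To make "sample more heavily" concrete, here is a minimal sketch of weighted sampling across sources. The source names and weights are made up for illustration, and nothing here is claimed to match how any real training pipeline assigns or uses quality labels.

```python
import random
from collections import Counter

# Hypothetical sources and quality weights (illustrative numbers only).
SOURCE_WEIGHTS = {
    "raw_web_crawl": 1.0,
    "forums": 2.0,
    "news": 5.0,
}

def sampling_share(weights):
    """Expected fraction of draws per source: weight / total weight."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def sample_sources(weights, n, seed=0):
    """Draw n source names, each with probability proportional to its weight."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    names = list(weights)
    return rng.choices(names, weights=[weights[s] for s in names], k=n)

if __name__ == "__main__":
    print(sampling_share(SOURCE_WEIGHTS))
    print(Counter(sample_sources(SOURCE_WEIGHTS, 10_000)))
```

With these assumed weights, a source that is a small slice of the raw data can still account for most of what gets sampled, which is the point being made about influence versus raw percentage.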

reply
fwiw, no LLM I've ever used generates text in the writing style that newspapers and news sites use, so I honestly doubt they've been given a meaningful boost in relevancy.

their idioms would leak occasionally otherwise

reply
90% of Common Crawl is complete junk, while the tiny fraction that is news articles powers almost all the AI answers in Google Search.
reply
How many Reddit, HN, etc. posts are based on NYT articles? How many derivative news articles, blog posts, YouTube videos, TikToks, etc. are responses to those articles?

At least NYT is probably on the correct side of Sturgeon’s Law: https://en.wikipedia.org/wiki/Sturgeon%27s_law

reply
> How many Reddit, HN, etc. posts are based on NYT articles? How many derivative news articles, blog posts, YouTube videos, TikToks, etc. are responses to those articles?

You may get an inconvenient answer when you ask the question the other way around.

reply
0.06% is way higher than I would expect
reply