The NY Times is 0.06% of Common Crawl.
These news media outlets provide a drop in the ocean's worth of information, both quantitatively and qualitatively.
The news/media industry is really just clinging to its lifeboat before inevitably becoming entirely irrelevant.
(I do find this sad, but it is the reality: I can already get considerably better journalism from LLMs than from actual journalists, both the clickbait stuff and the high-quality stuff.)
Instead, training samples more heavily from higher-quality sources, with, I'm sure, a mix of manual and heuristic labeling. Their idioms would occasionally leak through otherwise.
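Mechanically, that upweighting can be as simple as drawing training documents with probability proportional to a per-source quality score. A minimal sketch (the source names and weights here are made up for illustration; real pipelines derive such scores from classifiers and human review, not hand-set constants):

```python
import random

# Hypothetical per-source quality weights, purely illustrative.
source_weights = {
    "curated_news": 3.0,   # upweighted: edited, factual prose
    "wikipedia": 2.0,
    "common_crawl": 1.0,   # bulk web text, sampled less often per doc
}

# Toy corpus: (source, document id) pairs.
documents = [
    ("curated_news", "doc_a"),
    ("wikipedia", "doc_b"),
    ("common_crawl", "doc_c"),
    ("common_crawl", "doc_d"),
]

def sample_batch(docs, weights, k, seed=0):
    """Draw k documents (with replacement), each chosen with
    probability proportional to its source's quality weight."""
    rng = random.Random(seed)
    w = [weights[source] for source, _ in docs]
    return rng.choices(docs, weights=w, k=k)

batch = sample_batch(documents, source_weights, k=8)
```

In expectation, a curated-news document here is seen three times as often as a Common Crawl one, which is all "sampling more heavily" needs to mean.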
At least NYT is probably on the correct side of Sturgeon’s Law: https://en.wikipedia.org/wiki/Sturgeon%27s_law
You may get an inconvenient answer when you ask the question the other way around.
LLMs are (apparently) massively used to get information about topics in the real world. Novels aren't going to be much help there. Journalism, particularly in written form, provides a fount of facts presented from different angles, as well as opinions, and it was all there free for the taking…
Wikipedia provides the scantest summary of that, fora and social media give you banter, fake news, summaries of news, and a whole lot of shaky opinions, at best. Novels give you the foundations of language, but in terms of knowledge nothing much beyond what the novel is about.
But even taking it literally, isn't that one of the things LLMs could actually do? You're essentially asking how a text generator could generate text. The real question is whether the questions would be any good, but the answer isn't necessarily no.
You used to need them, because journalists had the distribution and the sources didn't. In a world of printed newspapers, you couldn't get your story distributed nationally (much less worldwide) without the help of a journalist, doubly so if you wanted to stay anonymous.
Nowadays, you just start a Substack and that's that.
See that recent expose on the Delve fraud as just one example. No journalists were harmed in the making of that article.
Journalism is by definition a secondary source. (Notwithstanding edge cases like articles reporting directly on the news industry itself.)
If a journalist is on location covering a flood, for example, they are a primary source.
A journalist conducting an interview would also be a primary source.
Imagine if all info about Facebook came from Facebook...
Preventing new human generated text from being used by AI firms (without consent) seems like a valid strategy.
Modern LLMs are trained on a large percentage of synthetic data.
This sentiment is largely legacy (even though it's only a couple of years old).