Source: Work at a lab, common knowledge.
Source: also work at a lab.
Reddit data is just not that interesting, that deal is worth like $60m/year. Labs spend 10x as much on computer-use RL environments.
It would also help if you could substantiate your initial claim (i.e. "internet training data is not where frontier capabilities come from")
In that case, it should be no problem for the labs to train their new models without using public data, right?
Sure, we ask a lot more of modern models, but private training data also got a lot better. You would loose out on a lot of long-tail knowledge, but that can be fixed with web search tools. You'd limit the styles, dialects and colloquial phrases the model understands and can use, but for many use cases that would be fine
But why would any frontier lab do that? Throwing in more training data still leads to better results in pretraining. And showing that they don't need to hoover up the internet and Anna's Archive only empowers regulators to prevent them from doing that
Even accepting the copying-as-theft framing, if I go to a village, steal some vegetables from everyone's gardens and ham from their sheds, and then add some prohibitively expensive spices I bought myself to make soup, do I get to claim it as mine and punish the villagers for trying to take it?
We 100% would not be at the current progress without it, though. And it's not like they only train on this once. They keep training on all the internet data PLUS the private data. Private data only (probably) wouldn't work, as learning the base regularities of language takes a lot of weights.
There no reason to not to otherwise outside of the poor little billion dollar corporations not wanting to provide a public utility they stolen from the public.
Anything that removes control from American big tech is a good thing for American citizens and the world writ large.
Copyright needs abolishing.
Companies can't be trusted with societies need for open progress.
The concept of Intellectual property exists not because it's fair but because it creates incentive to make said "intellectual property" exist. If intellectual property can be instantly copied by a competitor... why would I spend a dime to even create such a thing? I want to profit off of what I make because I'm a capitalist and money is what drives me (as a capitalist).
Anthropic models wouldn't exist if they couldn't keep a unholy grip on it. Same with openAI. Same with many life saving drugs.
Of course everyone here is talking about the obvious stuff like how it's morally wrong to with-hold life saving drugs or to have AI literally take over the world and be under the control of one company and all of this is true. But it is also true that greed is the engine that drives our economy and if you want our economy to produce "intellectual property" you must allow people to "capitalize" on that greed.
There are two controversial issues here. What is moral/fair? And what is realistically practical in optimizing the economy if said economy is based on money.
The distillation in my mind is a win for practicality because Competition also drives our economic engine. First you don't want a monopoly, but you also don't want these models to be so damn open that there's zero incentive to make them.
Why should anyone publish anything if it can be stolen with impunity? Is the value of these LLMs even remotely close to the amount of value they stole and the amount of value they will detract from economy because people will be more hesitant to publish anything now?