I'm starting to come around to this idea TBH. For a while my position was: "these companies have invested billions into training these models, therefore they should be able to control them and profit off them" but looking deeper at where they got their training data, my view is starting to shift.
IMHO I feel like we need new laws around AI, specifically training data. Something like: "you can train an AI model and ignore copyright laws, BUT you must then make the model open weight". A company could still develop closed-weight models, but then it must acquire permission to use the training data.
But it gets murky because if something like that was on the books then AI labs would just train open weight models and then distill them into their closed weight models.
Source: Work at a lab, common knowledge.
Source: also work at a lab.
Reddit data is just not that interesting, that deal is worth like $60m/year. Labs spend 10x as much on computer-use RL environments.
It would also help if you could substantiate your initial claim (i.e. "internet training data is not where frontier capabilities come from")
In that case, it should be no problem for the labs to train their new models without using public data, right?
Sure, we ask a lot more of modern models, but private training data also got a lot better. You would lose out on a lot of long-tail knowledge, but that can be fixed with web search tools. You'd limit the styles, dialects and colloquial phrases the model understands and can use, but for many use cases that would be fine.
But why would any frontier lab do that? Throwing in more training data still leads to better results in pretraining. And showing that they don't need to hoover up the internet and Anna's Archive only empowers regulators to prevent them from doing that.
Even accepting the copying-as-theft framing, if I go to a village, steal some vegetables from everyone's gardens and ham from their sheds, and then add some prohibitively expensive spices I bought myself to make soup, do I get to claim it as mine and punish the villagers for trying to take it?
We 100% would not be at the current progress without it, though. And it's not like they only train on this once. They keep training on all the internet data PLUS the private data. Private data only (probably) wouldn't work, as learning the base regularities of language takes a lot of weights.
There's no reason not to otherwise, outside of the poor little billion-dollar corporations not wanting to provide a public utility they stole from the public.
Anything that removes control from American big tech is a good thing for American citizens and the world writ large.
Copyright needs abolishing.
Companies can't be trusted with society's need for open progress.
The concept of Intellectual property exists not because it's fair but because it creates incentive to make said "intellectual property" exist. If intellectual property can be instantly copied by a competitor... why would I spend a dime to even create such a thing? I want to profit off of what I make because I'm a capitalist and money is what drives me (as a capitalist).
Anthropic models wouldn't exist if they couldn't keep an unholy grip on them. Same with OpenAI. Same with many life-saving drugs.
Of course everyone here is talking about the obvious stuff like how it's morally wrong to withhold life-saving drugs or to have AI literally take over the world and be under the control of one company, and all of this is true. But it is also true that greed is the engine that drives our economy and if you want our economy to produce "intellectual property" you must allow people to "capitalize" on that greed.
There are two controversial issues here. What is moral/fair? And what is realistically practical in optimizing the economy if said economy is based on money?
The distillation in my mind is a win for practicality because competition also drives our economic engine. First you don't want a monopoly, but you also don't want these models to be so damn open that there's zero incentive to make them.
Why should anyone publish anything if it can be stolen with impunity? Is the value of these LLMs even remotely close to the amount of value they stole and the amount of value they will detract from the economy because people will be more hesitant to publish anything now?
/edit Added a note to make it more obvious that the material is included in the playlist, just like the material is incorporated as part of curated AI models.
If the contract was "work-for-hire" then yes, of course I can.
By the way, I don't expect you to pay me for this comment. You can just read it for free. You're welcome.
Also, how about making proper arguments yourself? The vast majority of the training data isn't generated by company-paid AI experts either.
Notably, books, even though they don't form a large part of the training data, significantly improve performance on some tasks (same way as expert-generated data).
Why do you think the AI labs are so eager about scanning (and then destroying) every book on the planet?
If you removed all copyrighted works from the training corpus, the model would be notably weaker.
It doesn't absolve them of any theft, but it does make the assertion that they should be required to release their models to the public seem, to me, a bit farcical. There are dozens of free and open-weights models that have all trained on exactly the same web crawls and books as GPT-5 and Opus. The proprietary models are better because of proprietary data.
Even if the other models were trained on the same data, which is unlikely, since they had less time and money to scrape it and fewer lawyers to let them get away with something like piracy, the proprietary models are still largely built on the public data and wouldn't exist without it. At the very least, they should release the intermediate model, before training on their proprietary data. Not that that's how that works...
Source? Otherwise this is pure speculation.
> It’s clear they are scapegoating national security and China at this point to build an anti-competitive moat.
If all that is required to train these models is public data, why can't Alibaba just use that?
The fact that Alibaba has to resort to scraping Claude suggests there already is a moat...
Should Boeing airplane designs be public domain since the underlying math is public domain?
Isn't that a bit like saying if you read books in a public library to pick up a new skill you should work for free?
> What an absurd overreach to call this an attack.
Would it be an attack to take your meal by force if you used a public recipe to prepare the meal?
Only if you’re trying to muddy the waters. No, obviously it’s not. One can also support licensing for driving a car on public roads but not for walking, even though both involve traveling. This is only confusing to people pretending to be confused, for effect.
> Would it be an attack to take your meal by force if you used a public recipe to prepare the meal?
“You wouldn’t download a car…” (unless it worked like copying an MP3, then, of course, you would, everyone would)
It’s as if you’re using terrible analogies and comparisons because stronger ones don’t exist. Great news for the AI-should-be-open crowd.
> If it was the underlying data, that's freely available.
A bunch of it is not, but was pirated. And "underlying data"—JFC, that's billions of person-hours of thoughtful work by real people, practically infinitely more worthy of respect and care than what these LLM companies have done, without which they would have nothing. Alibaba's being more above-board about this than the major American firms have been (are they in general? Oh no, I doubt it, but in this particular case, yes). Extra accounts to get around TOS restrictions is the lesser evil here, and it's being done to companies that did worse. This is the least they should suffer, and their complaining about it is as comical as a professional fence crying about how unfair it is their shop got burgled.
Live by the sword...
This also shows how weak Chinese firms are in AI algorithms: they can't build a model without stealing from American firms.
We should probably leave this here, because I don't think this is even close to true (that it's what they're trying to do, or that it's what they've done—I do believe it's the sort of claim their marketing departments and investor-hype-meisters might make, though).
They are also fearmongering (and getting shills to as well) about the idea that once open weight (Chinese) models catch up to Mythos we're all doomed. Maybe I'd be a bit less cynical if they weren't prepping for an IPO?
Wasn't OpenAI spreading similar FUD back when GPT-2 came out?
Guys... AGI is right around the corner. Pinky swear. Now buy our stock.
Keep in mind that the entire US economy is currently propped up by AI spending, so a lot of people (banks, government) are incentivized to make sure these companies succeed. Expect this propaganda to ratchet up a notch if / when the economy starts to nose dive.
Regardless of how sad late-stage capitalism makes you, or how outrageous one claims to find "hypocrisy", any national security argument about limiting Chinese AI capability stands on its own, at least for nations likely to be drawn into a war.
Also, all the local model enthusiasts who assume Chinese firms are going to be allowed to endlessly release models if they have the disruptive potential attributed to Mythos are probably in for a rude awakening. Just because the PRC is content with what has happened in the past doesn't mean that they would tolerate an open model that could be truly destabilizing.
I know most Americans are fed a steady diet of “evil China” and China MAY have issues. But on the AI front they are heaps better. Even if everything got closed tomorrow, we have a plethora of good models we can inspect and tweak while from the US labs we have… a single old 120b model?
And with the way the US is treating its allies, maybe a bunch of us are quite content with a more even match rather than US hegemony.
However, they could have used it as a judge etc. during training.
Everyone in the AI industry wants to fight dirty, but gets angry when their competitor fights dirty as well. And I’ve mentioned before how I generally like Ant and its products.
There’s nothing fundamentally wrong with distillation.
How can you “steal” public information?