by onetrickwolf 5 hours ago

by _fat_santa 4 hours ago

> If anything these models should be compelled to be public since they have been trained off public data

I'm starting to come around to this idea TBH. For a while my position was: "these companies have invested billions into training these models, therefore they should be able to control them and profit off them" but looking deeper at where they got their training data, my view is starting to shift.

IMHO I feel like we need new laws around AI, specifically training data. Something like: "you can train an AI model and ignore copyright laws, BUT you must then make the model open weight". A company can still develop closed-weight models, but then it must acquire permission to use the training data.

But it gets murky because if something like that was on the books then AI labs would just train open weight models and then distill them into their closed weight models.

by ivanovm 4 hours ago

Labs each invest multiple billions of dollars a year in private data, and that number is growing. Internet training data is not where frontier capabilities come from; this view is outdated.

by Salgat 3 hours ago

This is a misleading statement. The "private data" is still largely publicly produced data that has been curated through private agreements instead of scraping, such as Reddit posts/comments (these are the "third-party data agreements" that companies like OpenAI mention). And yes, there is still a lot of processing done on this data, which is the norm for preparing training data.

by throw295729 3 hours ago

This is doubly misleading. A lot of private data is sourced through providers like Mercor, who pay experts to answer questions and write out their reasoning (e.g. paying a software engineer to write a project from scratch and recording every keystroke, paying a Chem PhD to answer hard Chem questions, etc.). A second source of private data comes from custom RL environments with fine-grained intermediate rewards for e.g. software engineering, financial modeling, etc. Also, imagine the amount of usage data recorded by Claude Code, etc. Pretraining is mostly curated public data; post-training is increasingly private expert data and tests.

Source: Work at a lab, common knowledge.

by jackie293746 2 hours ago

Well since you work at a lab you should know that most capabilities arise in pretraining, not posttraining or mid training, and the latter two mostly function to bring out the hidden intelligence in these models more than anything else.

Source: also work at a lab.

by ivanovm 2 hours ago

No, it isn't. The private data is largely private data, created by highly specialized, highly paid contracted teams of experts for domains like finance, SWE, consulting, etc.

Reddit data is just not that interesting, that deal is worth like $60m/year. Labs spend 10x as much on computer-use RL environments.

by pera 2 hours ago

Sorry but your argument doesn't seem coherent: How is the cost of RL relevant here?

It would also help if you could substantiate your initial claim (i.e. "internet training data is not where frontier capabilities come from")

by ivanovm 1 hour ago

The RL environment (instruction, stateful container, reward function) is the training data product being bought.

by maplethorpe 2 hours ago

Why are the leading models capable of regurgitating full copyrighted works such as "Harry Potter" and "On the Road"? Did they hire someone to type those out for them?

https://arxiv.org/abs/2601.02671

by calgoo 3 hours ago

When did they start doing so? We all know that they DID train on all the available public information, so at what point did they stop? Is the public information still in the training set? If so, they should STILL release ALL the data as public, as they are including training data that was acquired without permission.

by disgruntledphd2 3 hours ago

They haven't stopped. I honestly don't understand how they ever could.

by no_multitudes 3 hours ago

> internet training data is not where frontier capabilities come from

In that case, it should be no problem for the labs to train their new models without using public data, right?

by islandfox100 3 hours ago

Then it should be simple for one of the frontier labs to produce a model trained only on private data. We haven't seen that.

by wongarsu 2 hours ago

Didn't the famous "Textbooks are all you need" paper already prove that point three years ago?

Sure, we ask a lot more of modern models, but private training data also got a lot better. You would lose out on a lot of long-tail knowledge, but that can be fixed with web search tools. You'd limit the styles, dialects and colloquial phrases the model understands and can use, but for many use cases that would be fine.

But why would any frontier lab do that? Throwing in more training data still leads to better results in pretraining. And showing that they don't need to hoover up the internet and Anna's Archive only empowers regulators to prevent them from doing that

by pera 1 hour ago

Maybe I am missing your point but "Textbooks are all you need" distilled from GPT-3.5

by 4bpp 2 hours ago

Define "come from". Could they have gotten those frontier capabilities, or any capabilities, without internet training data? It seems to me that without the private data, you might get a slightly less competitive model, but without the CommonCrawl-style data piles used in "pretraining", you get no model at all.

Even accepting the copying-as-theft framing, if I go to a village, steal some vegetables from everyone's gardens and ham from their sheds, and then add some prohibitively expensive spices I bought myself to make soup, do I get to claim it as mine and punish the villagers for trying to take it?

by disgruntledphd2 3 hours ago

> internet training data is not where frontier capabilities come from

We 100% would not be at the current progress without it, though. And it's not like they only train on this once. They keep training on all the internet data PLUS the private data. Private data only (probably) wouldn't work, as learning the base regularities of language takes a lot of weights.

by Guillaume86 3 hours ago

Great way to launder illegally obtained data too.

by pastel8739 3 hours ago

Does this private data come from places like Reddit, Twitter, etc., where it’s contributed by users? I think it is unethical for these companies to accept payment for user-contributed data.

by shimman 3 hours ago

Okay that's fine, then make the law say they must provide publicly owned models off of publicly obtained data. To think that such a baseline of critical information isn't the literal foundation of everything they will do, both now and in the future, is just exposing what their end game is: control.

There's no reason not to otherwise, outside of the poor little billion-dollar corporations not wanting to provide a public utility they have stolen from the public.

Anything that removes control from American big tech is a good thing for American citizens and the world writ large.

by bfjvibybd6cuvu6 3 hours ago

No, you're talking about fine tuning and most of it is coming from your customers or someone else's. Get off ya high horse.

Copyright needs abolishing.

Companies can't be trusted with society's need for open progress.

by threethirtytwo 3 hours ago

I'm not taking sides here but this situation is not so black and white and it has always been the darker side of capitalism.

The concept of Intellectual property exists not because it's fair but because it creates incentive to make said "intellectual property" exist. If intellectual property can be instantly copied by a competitor... why would I spend a dime to even create such a thing? I want to profit off of what I make because I'm a capitalist and money is what drives me (as a capitalist).

Anthropic models wouldn't exist if they couldn't keep an unholy grip on them. Same with OpenAI. Same with many life-saving drugs.

Of course everyone here is talking about the obvious stuff, like how it's morally wrong to withhold life-saving drugs or to have AI literally take over the world and be under the control of one company, and all of this is true. But it is also true that greed is the engine that drives our economy, and if you want our economy to produce "intellectual property" you must allow people to "capitalize" on that greed.

There are two controversial issues here. What is moral/fair? And what is realistically practical in optimizing the economy if said economy is based on money?

The distillation in my mind is a win for practicality because competition also drives our economic engine. First you don't want a monopoly, but you also don't want these models to be so damn open that there's zero incentive to make them.

by nightski 3 hours ago

That intellectual property argument goes both ways. The model might not exist without protection, but it also would not exist without the data.

by ozgrakkurt 1 hour ago

This perfectly explains why current LLMs should be illegal in an actual capitalist market.

Why should anyone publish anything if it can be stolen with impunity? Is the value of these LLMs even remotely close to the amount of value they stole and the amount of value they will detract from the economy because people will be more hesitant to publish anything now?

by rafram 4 hours ago

The core of the training data is public, but the part that actually makes these models smart came from (pretty highly-paid) experts via platforms like Mercor. Claude didn't magically learn to write good code by reading all of GitHub - humans trained it in that, more or less manually.

by rapind 3 hours ago

If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist (*including the curated material)? Can we do the same with movies? Books?

/edit Added a note to make it more obvious that the material is included in the playlist, just like the material is incorporated as part of curated AI models.

by tanseydavid 2 hours ago

>> If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist?

If the contract was "work-for-hire" then yes, of course I can.

by rapind 2 hours ago

Maybe I wasn't clear. The playlist includes the material in it. Just like curated AI does.

by datsci_est_2015 2 hours ago

Given the breadth of LLM knowledge, I somehow doubt this. Sure, it’s probably responsible for the quality of LLM insights, but I don’t think anyone was asking experts about e.g. the complex ecological effects of invasive zebra mussels and their provenance in Lake Michigan.

by visarga 4 hours ago

No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And probably use Claude data too, with human in the loop and tool feedback.

by jaen 4 hours ago

...and the rest of the training data (i.e. the entire corpus of copyrighted works) was not written by experts expecting compensation? Double standards.

by Ajedi32 3 hours ago

No, public data is not generally written by "experts expecting compensation".

By the way, I don't expect you to pay me for this comment. You can just read it for free. You're welcome.

by jaen 2 hours ago

Ugh, please don't read strawmen into others' arguments and try to follow the HN guidelines.

Also, how about making proper arguments yourself? The vast majority of the training data isn't generated by company-paid AI experts either.

Notably, books, even though they don't form a large part of the training data, significantly improve performance on some tasks (same way as expert-generated data).

Why do you think the AI labs are so eager about scanning (and then destroying) every book on the planet?

If you removed all copyrighted works from the training corpus, the model would be notably weaker.

by calgoo 3 hours ago

No, but people do upload data with an expectation that the data will not be used without their permission (unless they use a BSD/MIT/public-domain-like license). Otherwise, the platform AND/OR the user do expect the data NOT to be used for purposes other than what it was intended for. Your comment is still your comment, and the Hacker News platform also has a say in this. If there had been an opt-in, then fine, no problem, but there was none; they just trained on everything available, including downloading pirated books from the internet.

by Ajedi32 3 hours ago

I think it's unreasonable to post anything on a public forum and then expect to be able to control who reads it and for what purpose.

by calgoo 3 hours ago

Answering here as it won't let me reply: Just because you feel that something is public does not mean you can do whatever you want with it. You can't just copy an article from a news site and paste it on yours; that's theft. If you don't agree, fine, but that is the law, and ALL the mega corps have been fighting to keep it this way for the last 20 years. If they want to steal everyone's info, fine, then lobby to change the copyright laws and no problem.

by sneak 2 hours ago

Copying isn’t theft, we settled this in the 90s.

by pastel8739 3 hours ago

Books?

by Ajedi32 3 hours ago

The vast, vast, majority of AI training data is not books. I wouldn't be surprised if there's more text on HN alone than every book in the history of mankind (most of which are also no longer copyrighted).

by rafram 4 hours ago

I didn't say that.

by thom 4 hours ago

No, you just parroted an increasingly popular talking point, the entire purpose of which seems to be to absolve AI companies of the enormous theft that put them in the position to hire experts in the first place.

by rafram 4 hours ago

Well, I'd never heard anyone make it before, but sure. (I looked into Mercor a bit and know some people who've worked in data generation/labeling, which is what exposed me to that side of the operation.)

It doesn't absolve them of any theft, but it does make the assertion that they should be required to release their models to the public seem, to me, a bit farcical. There are dozens of free and open-weights models that have all trained on exactly the same web crawls and books as GPT-5 and Opus. The proprietary models are better because of proprietary data.

by franga2000 4 hours ago

Cool, then they can train their proprietary models on their proprietary data only.

Even if the other models were trained on the same data, which is unlikely, since they had less time and money to scrape it and fewer lawyers to be able to do something like pirate it, the proprietary models are still largely built on the public data and wouldn't exist without it. At the very least, they should release the intermediate model, before training on their proprietary data. Not that that's how that works...

by thom 4 hours ago

I agree that saying that they have now trained on lots of proprietary data allows them to muddy the legislative waters further than they already have. What a happy coincidence!

by noitemtoshow 3 hours ago

I’d suggest you learn more about how LLM training works. Training on internet data alone will not result in an agent answering your questions.

by thom 2 hours ago

Sure as shit won't answer them without that though.

by mbesto 3 hours ago

> The proprietary models are better because of proprietary data

Source? Otherwise this is pure speculation.

by jaen 4 hours ago

Indeed, that's exactly why I replied - you omitted one side from the discussion.

by freejazz 3 hours ago

So? What about the authors of all the works these companies stole?

by slibhb 4 hours ago

> If anything these models should be compelled to be public since they have been trained off public data. What an absurd overreach to call this an attack.

> It’s clear they are scapegoating national security and China at this point to build an anti-competitive moat.

If all that is required to train these models is public data, why can't Alibaba just use that?

The fact that Alibaba has to resort to scraping Claude suggests there already is a moat...

by KerryJones 4 hours ago

This feels more nuanced than you are giving it credit for? Much of the training data that was available has been withdrawn. At least for OpenAI, we know that much of the training data was garnered through less than above-board methods.

by flowerlad 4 hours ago

Should Google search index be forced to be public too?

by calgoo 3 hours ago

Honestly, yes it should in some form. If their index contains the actual data from the sites, and they are making that information public in one way or another, then it should be available as a downloadable dataset.

by flowerlad 3 hours ago

How far can we take this?

Should Boeing airplane designs be public domain since the underlying math is public domain?

by r-w 2 hours ago

I don't think that slope is as slippery as you think it is.

by zobzu 4 hours ago

It's mainly just a lot cheaper. Copying is always cheaper anyway, with very little R&D, AI or no AI.

by petilon 4 hours ago

> If anything these models should be compelled to be public since they have been trained off public data.

Isn't that a bit like saying if you read books in a public library to pick up a new skill you should work for free?

> What an absurd overreach to call this an attack.

Would it be an attack to take your meal by force if you used a public recipe to prepare the meal?

by topgrain2 3 hours ago

> Isn't that a bit like saying if you read books in a public library to pick up a new skill you should work for free?

Only if you’re trying to muddy the waters. No, obviously it’s not. One can also support licensing for driving a car on public roads but not for walking, even though both involve traveling. This is only confusing to people pretending to be confused, for effect.

> Would it be an attack to take your meal by force if you used a public recipe to prepare the meal?

“You wouldn’t download a car…” (unless it worked like copying an MP3, then, of course, you would, everyone would)

It’s as if you’re using terrible analogies and comparisons because stronger ones don’t exist. Great news for the AI-should-be-open crowd.

by petilon 3 hours ago

I think the analogies are appropriate. Anthropic took public data and added value on top of it. It is that added value that Alibaba is targeting. If it was the underlying data, that's freely available.

by runtime_terror 3 hours ago

If by "public domain data" you mean stealing ungodly amounts of copyrighted works then sure

by topgrain2 3 hours ago

Alibaba's asking for things, and receiving what they asked for.

> If it was the underlying data, that's freely available.

A bunch of it is not, but was pirated. And "underlying data"—JFC, that's billions of person-hours of thoughtful work by real people, practically infinitely more worthy of respect and care than what these LLM companies have done, without which they would have nothing. Alibaba's being more above-board about this than the major American firms have been (are they in general? Oh no, I doubt it, but in this particular case, yes). Extra accounts to get around TOS restrictions is the lesser evil here, and it's being done to companies that did worse. This is the least they should suffer, and their complaining about it is as comical as a professional fence crying about how unfair it is their shop got burgled.

Live by the sword...

by petilon 2 hours ago

What AI firms are trying to build is the artificial equivalent of a human brain. If a human learns from a source material and uses the knowledge in their career that doesn't violate the copyright law. If an artificial brain does the same then it doesn't violate it either. This is up to the courts to decide. Alibaba can't take the law into its own hands and decide what the punishment ought to be.

This also shows how Chinese firms are weak in AI algorithms: they can't build a model without stealing from American firms.

by topgrain2 1 hour ago

> What AI firms are trying to build is the artificial equivalent of a human brain.

We should probably leave this here, because I don't think this is even close to true (that it's what they're trying to do, or that it's what they've done—I do believe it's the sort of claim their marketing departments and investor-hype-meisters might make, though).

by rapind 3 hours ago

> It’s clear they are scapegoating national security and China at this point to build an anti-competitive moat.

They are also fear mongering (and getting shills to do so as well) about the idea that once open weight (Chinese) models catch up to Mythos we're all doomed. Maybe I'd be a bit less cynical if they weren't prepping for IPO?

Wasn't OpenAI spreading similar FUD back when GPT 2 came out?

Guys... AGI is right around the corner. Pinky swear. Now buy our stock.

Keep in mind that the entire US economy is currently propped up by AI spending, so a lot of people (banks, government) are incentivized to make sure these companies succeed. Expect this propaganda to ratchet up a notch if / when the economy starts to nose dive.

by ok123456 3 hours ago

Yes. They're turning on the consent manufacturing machine to make it an issue of "national security" to download some gguf file from Hugging Face. Absolutely disgusting.

by msabalau 1 hour ago

There's probably a 10-15% chance of a war between the US and China over the next 10 years. Maybe a better than even chance of a militarized crisis that might have led to war, but somehow de-escalates.

Regardless of how sad late stage capitalism makes you, or how outrageous one claims to find "hypocrisy", any national security argument about limiting Chinese AI capability stands on its own, at least for nations likely to be drawn into a war.

Also, all the local model enthusiasts who assume Chinese firms are going to be allowed to endlessly release models if they have the disruptive potential attributed to Mythos are probably in for a rude awakening. Just because the PRC is content with what has happened in the past doesn't mean that they would tolerate an open model that could be truly destabilizing.

by pseudony 1 hour ago

As a third party I would rather be happy about the way Chinese labs are acting in the here and now while US labs first masquerade as a public good, then turn around, bail on all promises of open AI, turn into a corporation and attempt to own the world while its runner-up is trying to scaremonger people into buying their product.

I know most Americans are fed a steady diet of “evil China” and China MAY have issues. But on the AI front they are heaps better. Even if everything got closed tomorrow, we have a plethora of good models we can inspect and tweak, while from the US labs we have… a single old 120B model?

And with the way the US is treating its allies, maybe a bunch of us are quite content with a more even match rather than US hegemony.

by cma 4 hours ago

Since they hide their thinking traces it really doesn't make too much sense. We know one of the degradations they fixed and talked about in a recent blog post was that if you left Claude Code idle for too long, they would rehydrate it without the thinking traces in the context, and that degraded performance. So direct forms of distillation wouldn't be expected to get results as good as they are getting.

However, they could have used it as a judge etc. during training.

by coliveira 4 hours ago

What they're trying to do under the umbrella of "national security" is to legislate how we can use the results we pay for when accessing these models. This way they will control the "intellectual property" that was acquired illegally.

by TZubiri 5 hours ago

Two wrongs don't make a right

by tokioyoyo 5 hours ago

In this scenario it does, because consumers win.

Everyone in the AI industry wants to fight dirty, but gets angry when their competitor fights dirty as well. And I’ve mentioned before how I generally like Ant and its products.

by justapassenger 4 hours ago

The closest analogy to distillation is API reimplementation, without which the current software industry wouldn’t exist.

There’s nothing fundamentally wrong with distillation.

by moistoreos 4 hours ago

Pretty sure the second rectified the first.

by rayiner 3 hours ago

> The public’s life is getting worse while these companies consolidate power using data they stole from the public

How can you “steal” public information?

by calgoo 3 hours ago

Really? You know this just like everyone else: just because the information is available publicly does not mean that you can do whatever you want with the information. Copyright exists for a reason, and if the copyright lobby is going to continue to push for the poor poor media companies to keep their copyrights, then we should do the same towards the AI companies. So yes, they stole the information from everyone else, and they keep doing so, as you can see their scanners still hitting every website on the web to get an updated dataset. It does not matter what they do AFTER they steal all the information, as they already stole it.
