That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.
There are research models out there which are trained only on permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks compared to the state of the art.
But I guess the funniest consequence of "model outputs are a derivative work of their training data" would be that it'd essentially wipe out (or at the very least force a revert to a pre-AI-era commit in) every open source project which may have included any AI-generated or AI-assisted code, which currently means pretty much every major open source project out there. It would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.
Models whose authors tried to train only on permissively licensed data.
For example, https://huggingface.co/bigcode/starcoder2-15b tried to use a permissively licensed dataset, but it filtered only on the repository-level license, not per-file. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was still working, you would find it was trained on many files carrying a GPL header.
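To make the repo-level vs file-level distinction concrete, here's a minimal sketch (not the actual StarCoder2 pipeline; the marker string, function, and example repo are all hypothetical) of why filtering only on a repository's top-level license misses per-file headers:

```python
# Hypothetical sketch: a repo whose LICENSE file is permissive can still
# contain individual files that carry their own GPL headers.
GPL_MARKER = "under the terms of the GNU General Public License"

def flag_gpl_files(files: dict) -> list:
    """Return paths whose header (first ~30 lines) mentions the GPL."""
    flagged = []
    for path, text in files.items():
        header = "\n".join(text.splitlines()[:30])
        if GPL_MARKER in header:
            flagged.append(path)
    return flagged

# Repo-level filtering sees only the MIT LICENSE file and passes everything:
repo = {
    "LICENSE": "MIT License ...",
    "src/util.c": "/* This file is free software; you can redistribute it\n"
                  " * under the terms of the GNU General Public License. */",
    "src/main.c": "/* no license header */",
}
print(flag_gpl_files(repo))  # → ['src/util.c']
```

A file-level pass like this would have caught exactly the kind of file the search tool surfaced.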
That's what the whole copyright and patent regimes are designed to achieve.
It's to encourage the creation of knowledge.
US Constitution, Article I, section 8:
To promote the Progress of Science and useful Arts, by
securing for limited Times to Authors and Inventors the
exclusive Right to their respective Writings
and Discoveries;

That's not what I favor because you are inserting a middleman, the Government, into the mix. The Government ALWAYS wants to maximize tax collections AND fully utilize its budget. There is no concept of "savings" in any Government anywhere in the World. And Government spending is ALWAYS wasteful. Tenders floated by Government will ALWAYS go to companies that have senators/ministers/prime ministers/presidents/kings etc as shareholders. In other words, the tax money collected will be redistributed again amongst the top 500 companies. There is no trickle down. Which is why agreements need to be between creators and those who are enjoying the fruits of the creation. What have Governments ever created except for laws that stifle innovation/progress every single time?
https://www.youtube.com/watch?v=Qc7HmhrgTuQ
In all seriousness, without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants, and a stable and lawful society that allow you to do any kind of innovation.
Apart from that, you have answered a strawman. I said redistribute, not give to the government. I explicitly worded things that way because I didn't want this to turn into a discussion on policy.
I think we are moving to an economy where the share of profits taken by capital becomes much larger than the one taken by labor. If that happens, then laborers will have very little discretionary income to fuel consumption, and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally; however, that usually happens in a much more violent way, be it hyperinflation, famine, war, or revolution.
You said: "let's just tax the hell out of AI companies and redistribute." Only the Government has the power to tax. The question of redistribution does not even arise without first having access to the coffers of the Company, which neither you nor I have. The Government CAN have it if it wants, by either Nationalizing the Company or, as you said, "taxing the hell out of" it. Please explain how you would go about taxing and redistributing without involving the Government?
> In all seriousness without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants a stable and lawful society that allow you to do any kind of innovation.
These fall under the ambit of governance and hence why you have a Government. That's the only power Governments should have. Governments SHOULD NOT be managing private enterprises.
> I think we are moving to an economy where the share of profits taken by capital becomes much larger than the one take from labor. If that happens then laborers will have very little discretionary income to fuel consumption and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally, however that usually happens in a much more violent way, be it hyperinflation, famine, war or revolution.
Agreed. Which is why I was proposing private agreements in the first place (without involving a third-party like the Government which, more often than not, mismanages funds).
Just because you have a failure of imagination for how government should work, doesn’t mean it can’t work. And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.
> As its name suggests, the Government Pension Fund Global is invested in international financial markets, so the risk is independent from the Norwegian economy. The fund is invested in 8,763 companies in 71 countries (as of 2024).
Basically what I said above. You give your tax dollars to Government and it will invest it into top 500 companies. In the Norway Pension Fund case it is 8,763 companies in 71 countries. None of them are startups/small businesses/creators.
> And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.
You are confusing current lack of laws regulating this space with innovation being evil. Innovation is not evil. The technology per se is not evil. Every innovation brings with it a set of challenges which requires us to think of new legislation. This has ALWAYS been the case for thousands of years of human innovation.
That wouldn't be fair because these models are not only trained on code. A huge chunk of the training data are just "random" webpages scraped off the Internet. How do you propose those people are compensated in such a scheme? How do you even know who contributed, and how much, and to whom to even direct the money?
I think the only "fair" model would be to essentially require models trained on data that you didn't explicitly license to be released as open weights under a permissive license (possibly with a slight delay to allow you to recoup costs). That is: if you want to gobble up the whole Internet to train your model without asking for permission then you're free to do so, but you need to release the resulting model so that the whole humanity can benefit from it, instead of monopolizing it behind an API paywall like e.g. OpenAI or Anthropic does.
Those big LLM companies harvest everyone's data en masse without permission, train their models on it, and then not only do they not release jack squat, they have the gall to put up explicit, malicious roadblocks (hiding CoT traces, banning competitors, etc.) so that no one else can do it to them, and when people try they call it an "attack"[1]. This is what people should be angry about.
[1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...
well, assuming all data that is itself not permissively licensed is excluded
AI can't claim ownership, and humans can't either, as they haven't produced it. If it is guaranteed that no one can claim ownership, it is often seen as being in the public domain.
In general, it is irrelevant what the copyright status of the AI training data is. At least in the US, judges have been relatively clear about that. (Except if the AI reproduces input data close to verbatim. _But in general we aren't speaking about an AI being trained on a code base, but an AI using/rewriting it_.)
(1): Which isn't the same as no one seeming to know who has ownership. It also might be owned by no one in the sense that no one can grant you copyright permission (the opposite of public domain), but also no one can sue (so de facto public domain).
The main analogy is this one: you take a massive pile of copyrighted works, cut them up into small sections and toss the whole thing in a centrifuge, then, when prompted to produce a work, you use a statistical method to pull pieces of those copyrighted works back out of the centrifuge. Sometimes you may find that you are pulling pieces out in the order in which they went in, which after a certain number of tokens becomes a copyright violation.
This suggests there are some obvious ways in which AI companies could protect themselves from claims of infringement, but as far as I'm aware not a single one has protections in place to ensure that they do not materially reproduce any fraction of the input texts, beyond recognizing prompts that explicitly ask for such a reproduction.
So it won't produce the lyrics of 'Let it be'. But they'll be happy to write you mountains of prose that strongly resembles some of the inputs.
The fact that they are not doing that tells you all you really need to know: they know that everything that their bots spit out is technically derived from copyrighted works. They also have armies of lawyers and technical arguments to claim the opposite.
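One way such an output-side protection could look (a minimal sketch of my own, not anything these companies are known to run): index the training corpus by overlapping word n-grams and flag any output that shares a long enough verbatim run with it. The corpus string, threshold, and function names here are all illustrative assumptions.

```python
def ngrams(words, n):
    # All runs of n consecutive words, as a set for fast intersection.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_long_run(output: str, corpus: str, n: int = 8) -> bool:
    """True if `output` reproduces any n consecutive words from `corpus`."""
    return bool(ngrams(output.split(), n) & ngrams(corpus.split(), n))

# Toy "training corpus" (here, a snippet of the lyrics mentioned upthread):
corpus = "when I find myself in times of trouble mother mary comes to me"
ok = "a completely different sentence about something else entirely here now"
bad = "he said when I find myself in times of trouble mother mary appears"

print(shares_long_run(ok, corpus))   # → False
print(shares_long_run(bad, corpus))  # → True
```

A real deployment would need hashing and sharding to scale to trillions of tokens, but the point stands: checking for material reproduction is mechanically straightforward, which makes its absence notable.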
sure,
but that is completely unrelated to this discussion
which is about AI using code as input to produce similar code as output
not about AI being trained on code
> not about AI being trained on code
The two are very directly connected.
The LLM would not be able to do what it does without being trained, and it was trained on copyrighted works of others. Giving it a piece of code for a rewrite is a clear case of transformation, no matter what, but now it also rests on a mountain of other copyrighted code.
So now you're doubly in the wrong, you are willfully using AI to violate copyright. AI does not create original works, period.
it isn't clear how/if an LLM is different from the brain, but we have all trained by looking at copyrighted source code at some time.
It's very clear: the one is a box full of electronics, the other is part of the central nervous system of a human being.
> but we have all trained by looking at copyrighted source code at some time.
That may be so, but not usually the copyrighted source code that we are trying to reproduce. And that's the bit that matters.
You can attempt to whitewash it but at its core it is copyright infringement and the creation of derived works.
The single word "training" is here being used to describe two very different processes; what an LLM does with text during training is at basically every step fundamentally distinct from what a human does with text.
Word embedding and gradient descent just aren't anything at all like reading text!
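To make that concrete, here's a deliberately oversimplified toy (my own illustration; real models use cross-entropy over next-token predictions, not a squared-distance loss) of what "training" means mechanically: tokens become vectors, and a loss is nudged downhill. It's numerical optimization over arrays of floats, not reading.

```python
import random

random.seed(0)

# Toy "embedding": each token in the vocabulary gets a small random vector.
vocab = ["the", "cat", "sat"]
emb = {w: [random.uniform(-0.1, 0.1) for _ in range(4)] for w in vocab}

def loss(vec, target):
    # Squared distance to a target vector; a stand-in for the real objective.
    return sum((v - t) ** 2 for v, t in zip(vec, target))

target = [0.5, -0.5, 0.5, -0.5]
lr = 0.1

before = loss(emb["cat"], target)
for _ in range(100):
    # One gradient-descent step: d(loss)/d(v_i) = 2 * (v_i - t_i)
    grad = [2 * (v - t) for v, t in zip(emb["cat"], target)]
    emb["cat"] = [v - lr * g for v, g in zip(emb["cat"], grad)]
after = loss(emb["cat"], target)

print(after < before)  # → True
```

Nothing in that loop resembles comprehension; "the" and "cat" are just coordinates being dragged toward a numerical target.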
I have a lot of music in my head that I've listened to for decades. I could probably replicate it note-for-note given the right gear and enough time. But that would not make any of my output copyrightable works. But if I doodle for three minutes on the piano, even if it is going to be terrible that is an original work.
Says who? The US ruling the article refers to does not cover this.
It is different in other countries. Even if US law says it is public domain (which is probably not the case) you had better not distribute it internationally. For example, UK law explicitly says a human is the author of machine generated content: https://news.ycombinator.com/item?id=47260110
I think it will depend on how the AI arrived at the new code.
If it was using the original source code, then it is probably guilty by association. But in theory an AI model could also generate a rewrite if fed intermediary data not based on that project.
it depends on the country you are in
but overall in the US judges have mostly consistently ruled it as legal
and this is extremely unlikely to change or be effectively interpreted differently
but where things are more complex is:
- the model containing training data (instead of generic abstractions based on it), determined by whether or not it can be convinced to produce close-to-verbatim output of the training data the discussion is about
- the model producing close-to-verbatim training data
the latter seems to be mostly (always?) seen as a copyright violation, with the issue that the person who commits the violation (i.e. uses the produced output) might not know it
the former could mean that not just the output but the model itself counts as a form of database containing copyright-violating content. In that case the model provider has to remove it, which is technically impossible(1)... The pain point with that approach is that it would likely kill public models, while privately kept models will in every case put in a filter, _claim_ to have removed it, and likely get away with it. So while IMHO it should be a violation conceptually, it probably is better if it isn't.
But also, the case the original article refers to is more about models interacting with/using a code base than about them being trained on one.
(1): It is very much removable from knowledge bases used by LLMs (e.g. retrieval systems), just not from the model weights themselves.
That horse has bolted. No one knows where all the AI-generated code is any more, and it would no longer be possible to comply with a ruling that no one can use AI-generated code.
There may be some mental and legal gymnastics to make it possible, but it will be made legal because it’s too late to do anything else now.
I think this is down to the community and the culture to draw our red lines on and enforce them. If we value open source, we will find a way to prevent its complete collapse through model-assisted copyright laundering. If not, OSS will be slowly enshittified as control of projects flows to the most profit-motivated entities.
I don’t know what happens next, honestly.