That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.
There are research models out there which are trained only on permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks compared to the state of the art.
But I guess the funniest consequence of "model outputs are a derivative work of their training data" would be that it'd essentially wipe out (or at the very least force a revert to a pre-AI-era commit in) every open source project which may have included any AI-generated or AI-assisted code, which currently means pretty much every major open source project out there. It would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.
Models whose authors tried to train only on permissively licensed data.
For example, https://huggingface.co/bigcode/starcoder2-15b tried to use a permissively licensed dataset, but it filtered only on the repository-level license, not per-file. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was still working, you would find it was trained on many files carrying a GPL header.
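To make the repo-level vs file-level distinction concrete, here's a minimal sketch (not the actual StarCoder2 pipeline; the marker string, function, and example repo are all hypothetical) of why filtering only on a repository's top-level license misses per-file headers:

```python
# Hypothetical sketch: a repo whose LICENSE file is permissive can still
# contain individual files that carry their own GPL headers.
GPL_MARKER = "under the terms of the GNU General Public License"

def flag_gpl_files(files: dict) -> list:
    """Return paths whose header (first ~30 lines) mentions the GPL."""
    flagged = []
    for path, text in files.items():
        header = "\n".join(text.splitlines()[:30])
        if GPL_MARKER in header:
            flagged.append(path)
    return flagged

# Repo-level filtering sees only the MIT LICENSE file and passes everything:
repo = {
    "LICENSE": "MIT License ...",
    "src/util.c": "/* This file is free software; you can redistribute it\n"
                  " * under the terms of the GNU General Public License. */",
    "src/main.c": "/* no license header */",
}
print(flag_gpl_files(repo))  # → ['src/util.c']
```

A file-level pass like this would have caught exactly the kind of file the search tool surfaced.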
That's what the whole copyright and patent regimes are designed to achieve.
It's to encourage the creation of knowledge.
US Constitution, Article I, section 8:
To promote the Progress of Science and useful Arts, by
securing for limited Times to Authors and Inventors the
exclusive Right to their respective Writings
and Discoveries;

That's not what I favor because you are inserting a middleman, the Government, into the mix. The Government ALWAYS wants to maximize tax collections AND fully utilize its budget. There is no concept of "savings" in any Government anywhere in the World. And Government spending is ALWAYS wasteful. Tenders floated by Government will ALWAYS go to companies that have senators/ministers/prime ministers/presidents/kings etc as shareholders. In other words, the tax money collected will be redistributed again amongst the top 500 companies. There is no trickle down. Which is why agreements need to be between creators and those who are enjoying the fruits of the creation. What have Governments ever created except for laws that stifle innovation/progress every single time?
https://www.youtube.com/watch?v=Qc7HmhrgTuQ
In all seriousness, without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants, and a stable and lawful society that allow you to do any kind of innovation.
Apart from that, you have answered a strawman. I said redistribute, not give to the government. I explicitly worded things that way because I didn't want this to turn into a discussion on policy.
I think we are moving to an economy where the share of profits taken by capital becomes much larger than the one taken by labor. If that happens, then laborers will have very little discretionary income to fuel consumption, and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally; however, that usually happens in a much more violent way, be it hyperinflation, famine, war, or revolution.
You said: "let's just tax the hell out of AI companies and redistribute." Only the Government has the power to tax. The question of redistribution does not even arise without first having access to the coffers of the Company, which neither you nor I have. The Government CAN have it if it wants, by either Nationalizing the Company or, as you said, "taxing the hell out of" it. Please explain how you would go about taxing and redistributing without involving the Government?
> In all seriousness without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants a stable and lawful society that allow you to do any kind of innovation.
These fall under the ambit of governance and hence why you have a Government. That's the only power Governments should have. Governments SHOULD NOT be managing private enterprises.
> I think we are moving to an economy where the share of profits taken by capital becomes much larger than the one take from labor. If that happens then laborers will have very little discretionary income to fuel consumption and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally, however that usually happens in a much more violent way, be it hyperinflation, famine, war or revolution.
Agreed. Which is why I was proposing private agreements in the first place (without involving a third-party like the Government which, more often than not, mismanages funds).
Just because you have a failure of imagination for how government should work, doesn’t mean it can’t work. And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.
> As its name suggests, the Government Pension Fund Global is invested in international financial markets, so the risk is independent from the Norwegian economy. The fund is invested in 8,763 companies in 71 countries (as of 2024).
Basically what I said above. You give your tax dollars to Government and it will invest it into top 500 companies. In the Norway Pension Fund case it is 8,763 companies in 71 countries. None of them are startups/small businesses/creators.
> And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.
You are confusing current lack of laws regulating this space with innovation being evil. Innovation is not evil. The technology per se is not evil. Every innovation brings with it a set of challenges which requires us to think of new legislation. This has ALWAYS been the case for thousands of years of human innovation.
That wouldn't be fair because these models are not only trained on code. A huge chunk of the training data are just "random" webpages scraped off the Internet. How do you propose those people are compensated in such a scheme? How do you even know who contributed, and how much, and to whom to even direct the money?
I think the only "fair" model would be to essentially require models trained on data that you didn't explicitly license to be released as open weights under a permissive license (possibly with a slight delay to allow you to recoup costs). That is: if you want to gobble up the whole Internet to train your model without asking for permission then you're free to do so, but you need to release the resulting model so that the whole humanity can benefit from it, instead of monopolizing it behind an API paywall like e.g. OpenAI or Anthropic does.
Those big LLM companies harvest everyone's data en masse without permission, train their models on it, and then not only do they not release jack squat, they have the gall to put up explicit, malicious roadblocks (hiding CoT traces, banning competitors, etc.) so that no one else can do it to them, and when people try they call it an "attack"[1]. This is what people should be angry about.
[1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...
well, assuming all data that is itself not permissively licensed is excluded
AI can't claim ownership, and humans can't either, as they haven't produced it. If it is guaranteed that no one can claim ownership, it is often seen as being in the public domain.
In general, it is irrelevant what the copyright status of the AI training data is. At least in the US, judges have been relatively clear about that. (Except if the AI reproduces input data close to verbatim. _But in general we aren't speaking about an AI being trained on a code base, but an AI using/rewriting it_.)
(1): Which isn't the same as no one seeming to know who has ownership. It also might be owned by no one in the sense that no one can grant you copyright permission (the opposite of public domain), but also no one can sue (so de facto public domain).
The main analogy is this one: you take a massive pile of copyrighted works, cut them up into small sections and toss the whole thing in a centrifuge, then, when prompted to produce a work, you use a statistical method to pull pieces of those copyrighted works back out of the centrifuge. Sometimes you may find that you are pulling pieces out in the order in which they went in, which after a certain number of tokens becomes a copyright violation.
This suggests there are some obvious ways in which AI companies could protect themselves from claims of infringement, but as far as I'm aware not a single one has protections in place to ensure that they do not materially reproduce any fraction of the input texts, beyond recognizing prompts that explicitly ask for such a reproduction.
So it won't produce the lyrics of 'Let it be'. But they'll be happy to write you mountains of prose that strongly resembles some of the inputs.
The fact that they are not doing that tells you all you really need to know: they know that everything that their bots spit out is technically derived from copyrighted works. They also have armies of lawyers and technical arguments to claim the opposite.
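One way such an output-side protection could look (a minimal sketch of my own, not anything these companies are known to run): index the training corpus by overlapping word n-grams and flag any output that shares a long enough verbatim run with it. The corpus string, threshold, and function names here are all illustrative assumptions.

```python
def ngrams(words, n):
    # All runs of n consecutive words, as a set for fast intersection.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_long_run(output: str, corpus: str, n: int = 8) -> bool:
    """True if `output` reproduces any n consecutive words from `corpus`."""
    return bool(ngrams(output.split(), n) & ngrams(corpus.split(), n))

# Toy "training corpus" (here, a snippet of the lyrics mentioned upthread):
corpus = "when I find myself in times of trouble mother mary comes to me"
ok = "a completely different sentence about something else entirely here now"
bad = "he said when I find myself in times of trouble mother mary appears"

print(shares_long_run(ok, corpus))   # → False
print(shares_long_run(bad, corpus))  # → True
```

A real deployment would need hashing and sharding to scale to trillions of tokens, but the point stands: checking for material reproduction is mechanically straightforward, which makes its absence notable.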
sure,
but that is completely unrelated to this discussion
which is about AI using code as input to produce similar code as output
not about AI being trained on code
> not about AI being trained on code
The two are very directly connected.
The LLM would not be able to do what it does without being trained, and it was trained on copyrighted works of others. Giving it a piece of code for a rewrite is a clear case of transformation, no matter what, but now it also rests on a mountain of other copyrighted code.
So now you're doubly in the wrong, you are willfully using AI to violate copyright. AI does not create original works, period.
it isn't clear how/if an LLM is different from the brain, but we have all trained by looking at copyrighted source code at some time.
It's very clear: the one is a box full of electronics, the other is part of the central nervous system of a human being.
> but we have all trained by looking at copyrighted source code at some time.
That may be so, but not usually the copyrighted source code that we are trying to reproduce. And that's the bit that matters.
You can attempt to whitewash it but at its core it is copyright infringement and the creation of derived works.
The single word "training" is here being used to describe two very different processes; what an LLM does with text during training is at basically every step fundamentally distinct from what a human does with text.
Word embedding and gradient descent just aren't anything at all like reading text!
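To make that concrete, here's a deliberately oversimplified toy (my own illustration; real models use cross-entropy over next-token predictions, not a squared-distance loss) of what "training" means mechanically: tokens become vectors, and a loss is nudged downhill. It's numerical optimization over arrays of floats, not reading.

```python
import random

random.seed(0)

# Toy "embedding": each token in the vocabulary gets a small random vector.
vocab = ["the", "cat", "sat"]
emb = {w: [random.uniform(-0.1, 0.1) for _ in range(4)] for w in vocab}

def loss(vec, target):
    # Squared distance to a target vector; a stand-in for the real objective.
    return sum((v - t) ** 2 for v, t in zip(vec, target))

target = [0.5, -0.5, 0.5, -0.5]
lr = 0.1

before = loss(emb["cat"], target)
for _ in range(100):
    # One gradient-descent step: d(loss)/d(v_i) = 2 * (v_i - t_i)
    grad = [2 * (v - t) for v, t in zip(emb["cat"], target)]
    emb["cat"] = [v - lr * g for v, g in zip(emb["cat"], grad)]
after = loss(emb["cat"], target)

print(after < before)  # → True
```

Nothing in that loop resembles comprehension; "the" and "cat" are just coordinates being dragged toward a numerical target.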
I have a lot of music in my head that I've listened to for decades. I could probably replicate it note-for-note given the right gear and enough time. But that would not make any of my output copyrightable works. But if I doodle for three minutes on the piano, even if it is going to be terrible that is an original work.
Says who? The US ruling the article refers to does not cover this.
It is different in other countries. Even if US law says it is public domain (which is probably not the case) you had better not distribute it internationally. For example, UK law explicitly says a human is the author of machine generated content: https://news.ycombinator.com/item?id=47260110
I think it will depend on how the AI arrived at the new code.
If it was using the original source code, then it is probably guilty by association. But in theory an AI model could also generate a rewrite if fed intermediary data not based on that project.
it depends on the country you are in
but overall in the US judges have mostly consistently ruled it as legal
and this is extremely unlikely to change or be effectively interpreted differently
but where things are more complex is:
- the model containing training data (instead of generic abstractions based on it), determined by whether or not it can be convinced to produce close-to-verbatim output of the training data the discussion is about
- the model producing close-to-verbatim training data
the latter seems to be mostly (always?) seen as a copyright violation, with the issue that the person who commits the violation (i.e. uses the produced output) might not know it
the former could mean that not just the output but the model itself counts as a form of database containing copyright-violating content. In that case the model provider has to remove it, which is technically impossible(1)... The pain point with that approach is that it would likely kill public models, while privately kept models will in every case put in a filter, _claim_ to have removed it, and likely get away with it. So while IMHO it should be a violation conceptually, it probably is better if it isn't.
But also, the case the original article refers to is more about models interacting with/using a code base than about them being trained on one.
(1): It is very much removable from knowledge bases used by LLMs (e.g. retrieval systems), just not from the model weights themselves.
That horse has bolted. No one knows where all the AI-generated code is any more, and it would no longer be possible to comply with a ruling that no one can use AI-generated code.
There may be some mental and legal gymnastics to make it possible, but it will be made legal because it’s too late to do anything else now.
I think this is down to the community and the culture to draw our red lines on and enforce them. If we value open source, we will find a way to prevent its complete collapse through model-assisted copyright laundering. If not, OSS will be slowly enshittified as control of projects flows to the most profit-motivated entities.
I don’t know what happens next, honestly.