The solution documented here seems very pragmatic: you as a contributor simply state that you are making the contribution and that, under the GPLv2, you are not infringing on other people's work with it. And you document the fact that you used AI, for transparency.
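For illustration, a patch's commit message might end with trailers along these lines (a hypothetical sketch; the exact tag used to credit the tool is illustrative, not necessarily the kernel's required wording):

    mm: fix off-by-one in bounds check

    <patch description>

    Co-developed-by: <AI tool name and model version>
    Signed-off-by: Jane Developer <jane@example.com>

The human contributor adds the Signed-off-by line, certifying the Developer Certificate of Origin; the AI tool is merely credited, since it cannot certify anything.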
There is a lot of legal murkiness around how training data is handled, around the output of the models, and even around the models themselves. Is something that in no way, shape, or form resembles a copyrighted work (i.e. a model) actually distributing that work? The legal arguments here will probably take a long time to settle, but the fair use concept seems to offer a way out: you might create potentially infringing work with a model that may or may not be covered by fair use, but that would be your decision.
For small contributions to the Linux kernel, it would be hard to argue that a passing resemblance between, say, a for loop in the contribution and some for loop in somebody else's code base is anything other than coincidence or fair use.
AI output can be copyrighted when copyrightable elements are expressed in it: for example, if you put your own copyrighted content in a prompt and it is perceptible in the output, or if the output is substantially transformed by human creativity in arrangement, form, composition, etc. [1]
[1] https://newsroom.loc.gov/news/copyright-office-releases-part...
It's also not really clear whether you can copyright AI output. The case that everyone cites didn't even reach the point where the courts had to rule on that: the human in that case chose to register the copyright in the AI's name, and the courts ruled that under existing law copyright must be claimed by a person/human/whatever.
So we don't yet have case law where someone used an AI generator and claimed the output as written by them.
Does a digitally encoded version resemble a copyrighted work in some shape or form? </snark>
Where is this hangup coming from, that models are something entirely different from an encoding? Given enough prodding they can reproduce training data verbatim, or close to it. Okay, given enough prodding Notepad can do that too, so the uncertainty is understandable.
This is one of the big reasons companies are putting effort into so-called "safety": when the legal battles are eventually fought, they will have an argument that they did their best to make the amount of prodding required to extract any liability-creating information too great to matter.
Well, that's different, because an encoded image or video clearly intends to reproduce the original perfectly, and the end result after decoding is (intentionally) very close to the form of the original. That makes it a clear-cut case of being a copy of the original.
The reason so many cases don't get very far is that judges and lawyers mostly don't think like engineers. Copyright law predates most modern technology, so everything needs to be rephrased in terms of people copying stuff for commercial gain. The original target of the law was people using printing presses to create copies of books written by others, which was hugely annoying to publishers who thought they had exclusive deals with authors. But what about academics quoting each other? Or literary reviews? Or summaries? Or people reading from a book on the radio? This stuff gets complicated quickly. Most of those things were settled a long time ago, and fair use is a concept that gets wielded a lot for this: yes, it's a copy, but it's entirely reasonable for the one making the copy to be doing what they are doing, and therefore it is not considered an infringement.
The rest is just centuries of legal interpretation of that and of how it applies to modern technology, whether that's DJs sampling music or artists working visual imagery into their artworks. AI is mostly just more of the same here. Yes, there are some legally interesting aspects to AI, but not that many new ones. Judges are unlikely to rethink centuries of legal interpretation; they are more likely to try to reconcile AI with existing decisions. Any changes to the law would have to be driven by politicians, since judges tend to be conservative in their interpretations.
So if the AI outputs Starry Night, or Starry Night in a different color theme, that would likely be infringement without permission (if Starry Night were still under copyright; van Gogh's works are in fact public domain), and the rights holder would have recourse against someone, either the user or the AI provider.
But a starry-night style picture of an aquarium might not be infringing at all.
>For small contributions to the Linux kernel, it would be hard to argue that a passing resemblance between, say, a for loop in the contribution and some for loop in somebody else's code base is anything other than coincidence or fair use.
I would argue that if it were a verbatim reproduction of a copyrighted piece of software, that would likely be infringing. But if it were similar only in style, with different function names and structure, probably not infringing.
Folks will argue that some things might be too small to do any differently, for example a tiny snippet like Python's print("hello"), or 1+1=2, or a for loop as in your example. In that case it's too lacking in original expression to qualify for copyright protection anyway.
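For instance, a snippet like the following (Python, purely illustrative) appears byte-for-byte in countless unrelated codebases, so there is no original expression left to protect:

    # Print the numbers 0 through 9; about as generic as code gets.
    for i in range(10):
        print(i)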
But your point still stands.
That is a non sequitur. Also, I'm not sure whether copyright applies to humans or to persons generally (not that I have encountered particularly creative corporations, but Taranaki Maunga has been known for large-scale decorative works).
However, if the code has been slightly changed by a human, it can be copyrighted again. I think.
US Copyright Office guidance in 2023 said work created with the help of AI can be registered as long as there is "sufficient human creative input". I don't believe that has ever been clarified with respect to code, but my instinct is that the way most people use coding agents (especially for something like kernel development) would qualify.
Though I guess such a suit is unlikely if the defendant could just AI-wash the work in the first place.
I don't believe the question of whether humans can claim copyright over AI-authored works has been tested. The Copyright Office says your prompt doesn't count and that you need some human-authored element in the final work. We'll have to see.
Copyright requires some amount of human originality. You could copyright the prompt, and if you modify the generated code you can claim copyright on your modifications.
The closest applicable case would be the monkey selfie.
https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
No, my understanding is that AI-generated content can't be copyrighted by the AI. A human can still copyright it, however.
Whether a person can claim copyright of the output of a computer program is generally understood as depending on whether there was sufficient creative effort from said person, and it doesn't really matter whether the program is Photoshop or ChatGPT.
But you shouldn't be right. I mean, morally.
The law is a compromise between what the people in power want and what they can get away with without people revolting. It has nothing to do with morality, fairness or justice. And we should change that. The promise of democracy was (among other things) that everyone would be equal, everybody would get to vote and laws would be decided by the moral system of the majority. And yet, today, most people will tell you they are unhappy about the rising cost of living and rising inequality...
The law should be based on a complete and consistent moral system. And then plagiarism (taking advantage of another person's intellectual work without credit or compensation) would absolutely be a legal matter.
LLMs are not persons, not even legal ones (legal personhood itself being a massive hack that causes serious problems, such as the use of corporate finances for political gain).
A human has moral value; a text model does not. A human is limited in both time and memory; a model of text is not. I don't see why comparisons to humans have any relevance. Just because a human can do something does not mean machines run by corporations should be able to do it en masse.
The rules of copyright allow humans to do certain things because:
- Learning enriches the human.
- Once a human consumes information, he can't willingly forget it.
- It is impossible to prove how much a human-created intellectual work is based on others.
With LLMs:
- Training (let's not anthropomorphize: lossily compressing input data by detecting and extracting patterns) enriches only the corporation that owns the model.
- It's perfectly possible to create a model based only on content with specific licenses or only public domain.
- It's possible to trace every single output byte to quantifiable influences from every single input byte. It's just not an interesting line of inquiry for the corporations benefiting from the legal gray area.
If it's too hard to check outputs, don't use the tool.
Your arguments about copyright being different for LLMs: at the moment that's still being defined legally. So for now it's an ethical concern rather than a legal one.
For what it's worth, I agree that training LLMs on copyrighted material is an abuse of current, human-oriented copyright laws. There's no way this will just be allowed to continue: megacorps aren't going to lie down if there's a piece of the pie on the table, and then there's precedent for everyone else (a class action, perhaps).
As for checking outputs: I don't believe that's sufficient. Maybe the letter of the law is flawed, but according to its spirit the model itself is a derivative work.
A model embodies several orders of magnitude more work in its training data than in the code of the training algorithm itself. To any reasonable and sane person, that makes it a derivative work of the training data by nearly 100%; we can only argue about how many nines it should be.
> precedent
Yeah, but the US system makes me very uneasy about it. The right way to do this is to sit down, talk about the options and their downstream implications, talk about fairness and justice, and then decide what the law should be. If we did that, copyright law would look very different in the first place, and this whole thing would have an obvious solution.
That is not the case when using AI-generated code. There is no way to use it without the chance of introducing infringing code.
Because of that, if you tell a user they can use AI-generated code and they introduce infringing code, that was a foreseeable outcome of your action. And if you are the owner of a company or the head of an organization that benefits from contributors using AI code, your company or organization could be liable.
But if a lawsuit were later brought, who would be sued? The individual author or the organization? In other words, can an organization reduce its liability by telling its employees "You can break the law as long as you agree you are solely responsible for such illegal actions"?
It would seem to me that the employer would be liable if they "encourage" this way of working.
I think you're looking for problems that don't really exist here; you seem committed to an anti-AI stance where none is justified.
If you don't think this is a problem, take a look at the terms of the enterprise agreements from OpenAI and Anthropic. Companies recognize this is an issue, and so they were forced to add indemnification clauses explicitly saying they'll pay for any damages resulting from infringement lawsuits.
Humans routinely produce code similar to or identical to existing copyrighted code without direct copying.
On independent creation: you are conflating the tool with the user. The defense turns on whether the developer had access to the copyrighted work, not whether their tools did. A developer using an LLM did not access the training set directly; they used a synthesis tool. By your logic, any developer who has ever read GPL code on GitHub should lose the independent creation defense, because they have "demonstrated capability to produce code directly from" their memory.
LLM memorization/regurgitation is a documented failure mode, not normal operation (nor the typical case). Training set contamination happens, but it is rare and considered a bug. Humans also occasionally reproduce code from memory; we do not deny them the independent creation defense wholesale because of that capability!
In any case, the legal question is not settled, but the argument that LLM-assisted code categorically cannot qualify for the independent creation defense creates a double standard that human-written code does not face.
Practically speaking, humans do not unintentionally produce code that would be found infringing in court.
It is theoretically possible, but it is not something that a reasonable person would foresee as a potential consequence.
That’s the difference.
> LLM memorization/regurgitation is a documented failure mode, not normal operation (nor the typical case).
Exactly. It is a documented failure mode that you, as a user, have no capacity to mitigate or even be aware is happening.
Double standards are perfectly fine. LLMs are not conscious beings that deserve protection under the law.
>not settled.
What does appear likely to be settled is that human authorship is required, so there's no way an LLM itself could qualify for independent creation.
They wouldn't be some patsy that is around just to take the blame, but the actual party responsible for the issue.
You hire an independent contractor and tell him that he can drive 60 miles per hour if he wants to, but that if the vehicle explodes he accepts responsibility.
He does, and it explodes, killing 10 people. If the families of those 10 people have evidence that you created the conditions that caused the explosion in order to benefit your company, you're probably going to lose in civil court.
Linus benefits from the increased velocity of people using AI. He doesn't get to put all the liability on the people contributing.
Anyone who thinks they have a strong infringement case isn’t going to stop at the guy who authored the code, they’re going to go after anyone with deep pockets with a good chance of winning.
There is still the "mens rea" principle. If you distribute infringing material unknowingly, it would very likely not result in any penalties.
As long as everything is GPLv2-compatible it's okay.
Surely the person doing so would be responsible for doing so, but are they doing anything wrong?
You're perfectly at liberty to relicense public domain code if you wish.
The only thing you can't do is enforce the new license against people who obtain the code independently - either from the same source you did, or from a different source that doesn't carry your license.
If I use public domain code in a project under a license, the work as a whole is under that license, but the public domain code within it remains public domain.
I'm not sure what the hullabaloo is about.
No, because they've independently obtained it from the same source that you did, so their copy is "upstream" of your imposing of a new license.
Realistically, adding a license to public domain work is only really meaningful when you've used it as a starting point for something else, and want to apply your license to the derivative work.
Remember that licenses are powered by copyright - granting a license to non-copyrighted code doesn't do anything, because there's no enforcement mechanism.
This is also why copyright reform for software engineering is so important: code entering the public domain cuts the Gordian knot of licensing issues.
If your license allows others to take the code and redistribute it with extra conditions, your code can be imported into the kernel. AFAIK there are parts of the kernel that are BSD-licensed.
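For illustration, dual-licensed files in the kernel tree carry an SPDX identifier on their first line, e.g.:

    // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause

which tells recipients they may take the file under either license.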
Claiming copyright on an unmodified public domain work is a lie, so in some circumstances could be an element of fraud, but still wouldn’t be a copyright violation.
LLM-creation ("training") involves detecting/compressing patterns of the input. Inference generates statistically probable based on similarities of patterns to those found in the "training" input. Computers don't learn or have ideas, they always operate on representations, it's nothing more than any other mechanical transformation. It should not erase copyright any more than synonym substitution.
There's a pretty compelling argument that this is essentially what we do, and that what we think of as creativity is just copying, transforming, and combining ideas.
LLMs are interesting because that compression forces distilling the world down into its constituent parts and learning about the relationships between ideas. While it's absolutely possible (or even likely for certain prompts) that models can regurgitate text very similar to their inputs, that is not usually what seems to be happening.
They actually appear to be little remix engines that can fit the pieces together to solve the thing you're asking for, and we do have some evidence that the models are able to accomplish things that are not represented in their training sets.
Kirby Ferguson's video on this is pretty great: https://www.youtube.com/watch?v=X9RYuvPCQUA
If people find this cool and want to play with it, they can; just make sure to mix only compatible licenses in the training data and license the output appropriately. Well, the attribution issue is still there, so maybe they should restrict themselves to public domain material. If LLMs are so capable, that shouldn't limit the quality of their output too much.
Now for the real issue: what do you think the world will look like in 5 or 10 years if LLMs surpass human abilities in all areas revolving around text input and output?
Do you think the people who made it possible, who spent years of their lives building and maintaining open source code, will be rewarded? Or will the rich reap most of the benefit while simultaneously turning us into beggars?
Even if you assume 100% of the people doing intellectual work now will convert to manual work (i.e. there's enough work for everyone) and robots don't advance at all, that will drive the value of manual labor way down. Have you gamed it out in your head and concluded that life will somehow be better for you, let alone for most people? Or have you not thought about it at all yet?
I think they should be rewarded more than they currently are. But isn't the GNU General Public License basically saying you can use such source code without giving any reward whatsoever?
But I see your point: the reward for Open Source developers is the public recognition for their work, and LLMs can take that recognition away.