I am not against AI coding in general. But too many people are "contributing" AI-generated code to open source projects without understanding what their code does, just so they can say on their resumes that they once contributed to a big open source project. And when the maintainers call them out, they blame the AI coding tools they are using, as if they were not opening PRs under their own names. I can't blame any open source maintainer for being at least a little sceptical when it comes to AI-generated contributions.
It looks to me like a more restrictive policy will be flat-out impossible.
Even people I trust are going along with this stuff, akin to CAD replacing drafting. Code is logic as language, and starting with web code and rapidly metastasizing to C++ (due to complexity and the sheer size of the extant codebase, good and bad) the AI has turned slop-coding into a 'solved problem'. If you don't mean to do the best possible thing or a new thing, there is no excuse for existing as a coder in the world of AI.
If you do expect to do a new thing or a best thing, in theory you're required to put out the novel information as AI cannot reach it until you've entered it into the corpus of existing code the AI's built on. However, if you're simply recombining existing aspects of the code language in a novel way, that might be more reachable… that's probably where 'AI escape velocity' will come from should it occur.
In practice, everybody I know is relegating the busywork of coding to AI. I don't feel social pressure to do the same but I'm not a coder. I'm something else that produces MIT-licensed codebases for accomplishing things that aren't represented in code AS code, rather it's for accomplishing things that are specific and experiential. I write code to make specific noises I'm not hearing elsewhere, and not hearing out of the mainstream of 'sound-making code artifacts'.
Therefore, it's impractical for Linux to take any position forbidding AI-assisted code. People will just lie and claim they did it. Is primitive tab-complete also AI? Where's the line? What about when coding tools uniformly begin to tab-complete with extensive reasoning and code prototyping? I already see this in the JetBrains Rider editor I use for Godot hacking, even though I've turned off everything I can related to AI. It'll still try to tab-complete patterns it thinks it recognizes, rarely with what I intend.
And so the choice is to enforce responsibility. I think this is appropriate because that's where the choices will matter. Additions and alterations will be the responsibility of specific human people, which won't handle everything negative that's happening but will allow for some pressures and expectations that are useful.
I don't think you can be a collaborative software project right now and not deal with this in some way. I get out of it because I'm read-only: I'm writing stuff on a codebase that lives on an antique laptop without internet access that couldn't run AI if it tried. Very likely the only web browsers it can run are similarly unable to handle 2026 web pages, though I've not checked in years. You've only got my word for that, though, and your estimation of my veracity based on how plausible it seems (I code publicly on livestreams, and am not at all an impressive coder when I do that). Linux can't do what I do, so it's going to do what Linux does, and this seems the best option.
… my dad is 86 and only after I signed him up to Claude could he write Arduino code without a phone call to me after 5 minutes of trying himself. So now, he’s spending 4+ hours at a time focused writing code and building circuits of things he only dreamt about creating for decades.
Unless you’re doing something for the personal love of the craft and sharpening your tools, use every advantage you can get in order to do the job.
But… as above, if you’re doing it for the love of it, sure - hand crafted code does taste better and you know all the ingredients are organic
Can't really blame people for reducing their level of effort. It's very easy to put in a lot of effort and end up with absolutely nothing to show for it. Before AI came along, my realization was that begging the maintainers to implement the features I wanted was the right move. They have all the context and can do it better than us in a fraction of the time it'd take us to do it. Actually cloning someone else's repository and working on it should only be attempted if one is willing to literally fork it and own the project should things go south. Now that we have AI, it's actually possible to easily understand and modify complex codebases, and I simply cannot find the will to blame people for using it to the fullest extent. Getting the AI to maintain the fork is really easy too.
I don't think it's insane. It seems reasonable that people could disagree about how much attribution and disclosure there should be about AI assistance, or if it's even allowed, etc.
Every document in that `process` directory explains stuff that could be obvious to some people but not others.
What's missed is that neither contributors nor maintainers are usually paid for their effort, and nobody has standing to demand that they do anything they are not doing already. Don't like a messy vibe-coded PR but need the functionality? Then clean it up yourself and send an improved version for review. Or let it stay unmerged. But don't assign work to others you don't employ.
On the other hand, companies like NVIDIA should be publicly taken to task for changing their mind about the instruction set for every new GPU and then not supporting them properly in popular inference engines. They certainly have enough money to hire people who will learn vLLM inside out and ensure high-quality patches.
Plenty see Torvalds as a traitor for this policy and will never contribute again if any clearly labeled AI generated code is actually allowed to merge.
Obviously these issues existed before AI, but they required active deception before. Regurgitating other people's code just becomes the norm now.
It obviously depends on how powerful AI is going to become. These scenarios are mutually exclusive because some assume that AI is actually not very powerful and some assume that it is very powerful. I think one of these things happening is not at all unlikely.
In essence, we get the output without the matching mental structures being developed in humans.
This is great if you have nothing left to learn; it's not that great if you are a newbie or have low confidence in your skill.
> LLM users also struggled to accurately quote their own work. While LLMs offer immediate convenience, our findings highlight potential cognitive costs. Over four months, LLM users consistently underperformed at neural, linguistic, and behavioral levels.
> https://arxiv.org/abs/2506.08872
> https://www.media.mit.edu/publications/your-brain-on-chatgpt...
But in the present case the authorship is just removed by shredding the library and then piecing back together the sentences. The fact that under some circumstances AIs will happily reproduce code that was in the training data is proof positive they are to some degree lossy compressors. The more generic something is ("for (i=0;i<MAXVAL;i++) {") the lower the claim for copyright infringement. But higher level constructs past a couple of lines that are unique in the training set that are reproduced in the output modulo some name changes and/or language changes should count as automatic transformation (and hence infringing or creating a derivative work).
The people using GenAI should be the ones doing the verification. The maintainer's job should not meaningfully change (other than the maintainer using AI to review incoming code, of course).
Why does everyone who hears "AI code" automatically think "vibe-coded"?
People are generally against change that forces them to change the way they used to do things. I'm sure most will have their reasons why they are against this particular change, but I don't think it will affect anything. The genie is out of the bottle, AI is here to stay. You either adapt or you will slowly wither away.
You missed the whole arab spring thing?
It needs to be modified by a human. No amount of prompting counts, and you can only copyright the modified parts.
Any license on "100% vibecoded" projects can be safely ignored.
I expect litigation in a few years where people argue about how much they can steal and relicense "since it was vibecoded anyway".
> In these cases, copyright will only protect the human-authored aspects of the work, which are “independent of” and do “not affect” the copyright status of the AI-generated material itself.
[0] https://www.federalregister.gov/documents/2023/03/16/2023-05...
There's really 2 ways to argue this:
- Either AI exists and then it's something new and the laws protecting human creativity and work clearly could not have taken it into account and need to be updated.
- Or AI doesn't exist, LLMs are nothing more than lossily compressed models violating the licenses of the training data, their probabilistically decompressed output is violating the licenses as well and the LLM companies and anyone using them will be punished.
Ultimately LLMs (the first L stands for large and for a good reason) are only possible to create by taking unimaginable amounts of work performed by humans who have not consented to their work being used that way, most of whom require at least being credited in derivative works and many of whom have further conditions.
Now, consent in law is a fairly new concept and for now only applied to sexual matters but I think it should apply to every human interaction. Consent can only be established when it's informed and between parties with similar bargaining power (that's one reason relationships with large age gaps are looked down upon) and can be revoked at any time. None of the authors knew this kind of mass scraping and compression would be possible, it makes sense they should reevaluate whether they want their work used that way.
There are 3 levels to this argument:
1) The letter of the law - if you understand how LLMs work, it's hard to see them as anything more than mechanical transformers of existing work so the letter should be sufficient.
2) The intent of the law - it's clear it was meant to protect human authors from exploitation by those who are in positions where they can take existing work and benefit from it without compensating the authors.
3) The ethics and morality of the matter - here it's blatantly obvious that using somebody's work against their wishes and without compensating them is wrong.
In an ideal world, these 3 levels would be identical but they're not. That means we should strive to make laws (in both intent and letter) more fair and just by changing them.
You could even say it would very strongly incentivize the LLM companies to be on their best behavior, otherwise people would start revoking consent en masse and they'd have to keep training new models all the time.
If you want something more realistic, there would probably be time limits on how long they have to comply and rules on how much they have to compensate the authors for the time it took them to comply.
There absolutely are ways to make it work in mutually beneficial ways, there's just no political will because of the current hype and because companies have learned they can get away with anything (including murder BTW).
(Much of the apparent gain of automatic search-copy-paste is wasted by skipping the review that happened along the way when it was done manually. That review must then be done more slowly, on the entire harder-to-understand program generated by the AI assistant.)
Despite the fact that AI coding assistants are copyright breaking tricks, the fact that this has become somehow allowed is an overall positive development.
The concept of copyright for programs has been completely flawed from its very beginning. The reason is that it is absolutely impossible to write any kind of program that is not a derivative of earlier programs.
Any program is made by combining various standard patterns and program structures. You can construct a derivation sequence between almost any 2 programs, where you decompose the first into some typical blocks, then compose the second program from such blocks, while renaming all identifiers.
It is quite subjective to decide when a derivation sequence becomes complex enough that the second program should not be considered as a derivative of the first from the point of view of copyright.
The only way to avoid the copyright restrictions is to exploit loopholes in the law, e.g. if translating an algorithm to a different programming language does not count as being derivative or when doing other superficial automatic transformations of a source program changes its appearance sufficiently that it is not recognized as derivative, even if it actually is. Or when combining a great number of fragments from different programs is again not recognized as derivative, though it still kind of is.
The only reason it became possible for software companies like Microsoft or Adobe to copyright their s*t is that the software industry based on copyrighted programs was jumpstarted by a few decades of programming during which programs were not copyrighted, and those programs could then be used as a base by the first copyrighted programs.
So AI coding agents allow you to create programs that you could not have written when respecting the copyright laws. They also may prevent you from proving that a program written by someone else infringes upon the copyright that you claim for a program written with assistance.
I believe that both these developments are likely to have more positive consequences than negative consequences. The methods used first in USA and then also in most other countries (due to blackmailing by USA) for abusing the copyright laws and the patent laws have been the most significant blockers of technical progress during the last few decades.
The most ridiculous claim about the copyright of programs is that it is somehow beneficial for "creators". Artistic copyrights sometimes are beneficial for creators, but copyrights on non-open-source programs are almost never owned by creators, but by their employers, and even those only seldom get any direct benefit from the copyright; they use it in the hope that it might prevent competition.
And that's why copyright has exceptions for humans.
You're right copyright was the wrong tool for code but for the wrong reasons.
It shouldn't be binary. And the law should protect all work, not just creative. Either workers would come to a mutual agreement how much each contributed or the courts would decide based on estimates. Then there'd be rules about how much derivation is OK, how much requires progressively more compensation and how much the original author can plainly tell you what to do and not do with the derivative.
It's impossible to satisfy everyone but every person has a concept of fairness (it has been demonstrated even in toddlers). Many people probably even have an internally consistent theory of fairness. We should base laws on those.
> abusing the copyright laws and the patent laws have been the most significant blockers of technical progress during the last few decades
Can you give examples?
> copyrights on non-open-source programs are almost never owned by creators, but by their employers
Yes and that's another thing that's wrong with the system, employment is a form of abusive relationship because the parties are not equal. We should fix that instead of throwing out the whole system. Copyright which belongs to creators absolutely does give creators more leverage and negotiating power.
Look, if you think I am wrong, you can surely put it into words. OTOH, if you don't think I am wrong but feel that way, then it explains why I see no coherent criticism of my statements.
The signal you’re sending is that you are not open to discussing the issue.
The playing field is level now, and corpo moats no longer exist. I happily take that trade.
They can wash the copyright by AI training, but the AIs don't get trained on closed source.
"corpo" also has a ton of patents, which still can't be AI-washed.
What will become unenforceable are Open Source Licenses exclusively, how does that make it a "level field"?
It's going to be very interesting to see 'cleanroom' kind of development in the AI age, but I suspect it's not going to be such a walk in the park as some seem to think it will be. There are just too many vested interests. But: it would be nice to see someone do a release of, say, the Oracle source code as rewritten by AI through this process, just to see how fast the IP hammer will come down on this kind of trick.
If the argument is just "They won't catch me", then yes you are correct.
But some of us are still forced to follow the law, whatever it might be.
Also: They still have patents on it.
Not to mention companies will try to mandate hardware decryption keys so the binary is encrypted and your AI never even gets to analyze the code which actually runs.
It's not sci-fi, it's a natural extension of DRM.
1) The financial aspect: As you say, more and more advanced DRM requires more and more advanced tools. Even assuming advanced AI can guide any human to do the physical part, that still means you have to pay for the hardware. And the hardware has to be available (companies have been known to harass people into giving up perfectly moral and legal projects).
2) The legal aspect: Possession of burglary tools is illegal in some places. How about possession of hacking tools? Right now it's not a priority for company lobbying, what about when that's the only way to decompile? Even today, reverse engineering is a legal minefield. Did you know in some countries you can technically legally reverse engineer but under some conditions such as having disabilities necessitating it and only using the result for personal use?[0]
3) The TOS aspect: What makes you think AI will help you? If the company owning the AI says so, you're on your own.
---
You need to understand 2 things:
- Just because something is possible doesn't mean somebody is gonna do it. Effort, cost and risk play huge roles. And that assumes no active hostile interference.
- History is a constant struggle between groups with various goals and incentives. Some people just want to live a happy life, have fun and build things in their free time. Other people want to become billionaires, dream about private islands, desire to control other people's lives and so on. People are good at what they focus on. There's perhaps more of the first group but the second group is really good at using their money and connections to create more money and connections which they in turn use to progress towards their primary objectives, usually at the expense of other people. People died[1] over their right to unionize. This can happen again.
Somebody might believe historical people were dumb or uncivilized and it can't happen today because we've advanced so much. That's bullshit. People have had largely the same wetware for hundreds of thousands of years. The tools have evolved but their users have not.
[0]: https://pluralistic.net/2026/03/16/whittle-a-webserver/ - "... aren't tools exemptions, they're use exemptions ... You have that right. Your mechanic does not have that right."
[1]: https://en.wikipedia.org/wiki/Pinkerton_(detective_agency)
AI proponents completely ignore the disparity of resources available to an individual and a corporation. If I and a company of 1000 people create the same product and compete for customers, the company's version will win. Every single time. Or maybe at least 1000:1 if you're an optimist.
They have access to more money for advertising, they have an already established network of existing customers, they have legal and marketing experts on payroll. Or just look at Microsoft, they don't even need advertising, they just install their product by default and nobody will even hear about mine.
Not to mention, as you said, the training advances only go from open source to closed source, not the other way around.
AI proponents who talk about "democratization" are nuts, it would be laughable if it wasn't so sad.
As a person who works for a company with 25k people, I would disagree. You, a single person will often get to the basic product that a lot of people will want much faster than a company with 1k, 5k and 25k people.
Bigger companies are constrained by internal processes, piles of existing stuff, an inability to hire at the scale they need, and the larger context required. Also regulation and all that. Bigger companies are also really slow to adapt, so they would rather let you build the product and then buy out your company with your product and the people who built it. They are at a temporary disadvantage every time the landscape shifts.
Besides that, your whole argument hinges on large companies being inflexible, inefficient and poorly run. Isn't that exactly the kind of problem AI promises to solve? Complete AI surveillance of every employee, tasks and instructions tailored to each individual, and superhuman planning. Of course at that point, the only employees will be manual workers, because actual AI will be much better and cheaper at everything than every human, except those things where it needs to interact with the physical world. Even contract negotiations with both employees and customers will be done with AI instead of humans. The human will only sign off on it for legal requirements, just like today you technically enter a contract with a representative of the company who is not even there when you talk to a negotiator.
If/when superhuman AI is achieved, those limitations will all go away. An owner will just give it money and control and tell it to optimize for more money or political power or whatever he wants.
That's a much scarier future than a paperclip maximizer because it's much closer and it doesn't require complete takeover first. It'll be just business as usual, except somehow more sociopathic.
Nitpicking on the license here, but please don't use MIT, it has no patent grant protections.
And those are never covered in any AI-washing anyway.
There are equivalent licenses with patent grant protection, like 'Apache2+LLVM exception' or 'Mozilla Public License 2' and others...
You cannot keep a purely legally-enforced moat in the face of advancing technology.
In the USA the DMCA can make it illegal to even own and use tools meant to bypass even the weakest of protections.
This law has already been used to ruin lives.
"They might catch the individual but not us all" is nice and fine until it is your turn, so check your legislation.
IP law means nothing once tens of millions of people are openly violating it.
The software industry is about to learn this lesson too.
Uhm... yes? The cost of downloading pirated music is essentially zero. The only reason why people use services like Spotify is because it's extremely cheap while being a bit more convenient. But jack up the price and the masses will move to sail the sea again.
That is not necessarily true, depending on the level of enforcement and the availability of opportunities to steal.
> Same argument can be made for streaming, and yet Netflix is neither cheap nor struggling for subscribers.
Netflix is still pretty cheap for the convenience it provides. Again, jack up the price and see the masses move to torrent movies/shows again.
Yet.
A whole bunch of people I watch on youtube (politics, analysts, a weatherman) are already seeing AI impersonation videos, sometimes misrepresenting their positions and identities. This will grow.
So, you can't create art because that's extruded at scale in such a way that it's just turning on the tap to fill a specified need, and you can't be a person because that can also be extruded at scale pretty soon, either to co-opt whatever you do that's distinct, or to contradict whatever you're trying to say, as you.
As far as being a person able to exist and function through exchanging anything you are or anything you do for recompense, to survive, I'm not sure that's in the cards. Which seems weird for a technology in the guise of aiding people.
As far as I know that has only been decided in US so far, which is far from the whole world.
Everything else is various shades of "No, unless a human modified it"
edit: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
How am I gonna prove I did?
They can just generate the same code with an AI assistant, and then it is you who cannot claim that their code infringes the copyright that you claim for the code that you have written with assistance.
So neither of the 2 parties that have used an AI assistant is able to prevent the other party from using the generated code.
I consider this a rather good outcome and not a disadvantage of using AI assistants. However, this may be construed as a problem by the stupid corporate lawyers who insist that any product of the company must use only software IP that is the property of the company.
These kinds of lawyers are encountered in many companies and they are the main reason for the low software productivity that was typical in many places before the use of AI assistants.
I wonder how many of those lawyers have already understood that this new fashion of using AI is incompatible with their mandated policies, which have always been the main blocker against efficient software reuse.
Who can prove that I didn't write the code myself? And if I did, how am I to prove it?
That goes in both directions.
It's not like there is a watermark in the code telling the whole wide world that this was AI generated or human made.
So I write code (with or without an AI assistant) and claim copyright... they generate the same code. I sue them.
How does any of us prove that we wrote the code by hand?
It’s weird how people on HN state legal opinion as fact… e.g. if someone in the Philippines vibecodes an app and a person in Ecuador vibecodes a 100% copy of the source, what now?
Model outputs are not copyrightable at all, only human work. That means the prompt, and whatever modifications a human makes to the output, are copyrighted, but nothing else.
HOWEVER, that does not mean the output cannot violate copyright. Output of the model falls under the same "derivative work" rules as anything else; AI just can't add its own "authorship". So if you, accidentally or not, recover the script for a movie with the serial numbers filed off, then it's a derivative work, etc. Same with code.
Everywhere else in the world is in various shades of "No, unless a human modified it"
https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
There's a threshold where you modify it enough, it is no longer recognizable as being a modification of the original and you might get away with it, unless you confess what process you used to create it.
This is different to learning from the original and then building something equivalent from scratch using only your memory without constantly looking back and forth between your copy and the original.
This is how some companies do "clean room reimplementations" - one team looks at the original and writes a spec, another team which has never seen the original code implements an entirely standalone version.
And of course there are people who claim this can be automated now[0]. This one is satire (read the blog) but it is possible if the law is interpreted the way LLM companies work and there are reports the website works as advertised by people who were willing to spend money to test it.
[0]: https://malus.sh/
These sorts of things are almost never tested legally and it seems even less likely now.
Plenty see {{some_woodworker}} as a traitor for this policy and will never contribute again if any clearly labeled table saw cuts are actually allowed to be used in furniture making.
A table saw isn't a probabilistic device.
Also I, a programmer, can immediately see whether the "probabilistic device" generated code that looks like it should.
Both just let me get to the same result faster with good enough quality for the situation.
I can grab a tape measure or calipers and examine the piece of wood I cut on the table saw and check if it has the correct measurements. I can also use automated tests and checks to see that the code produced looks as it should and acts as it should.
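To make the tape-measure comparison concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `clamp` is a made-up stand-in for a function some tool handed back, and `check_clamp` is a hypothetical set of acceptance checks. The checks do not know or care who or what wrote the function.

```python
# Hypothetical example: `clamp` stands in for a function produced by any tool
# (an LLM, a template generator, or a human). The checks treat it the same way.

def clamp(value: int, low: int, high: int) -> int:
    """Return value limited to the inclusive range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


def check_clamp() -> bool:
    """Deterministic acceptance checks, the code equivalent of a tape measure."""
    assert clamp(5, 0, 10) == 5      # value already in range is unchanged
    assert clamp(-3, 0, 10) == 0     # below the range is raised to low
    assert clamp(42, 0, 10) == 10    # above the range is lowered to high
    try:
        clamp(1, 10, 0)              # invalid bounds must be rejected
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    print("checks passed" if check_clamp() else "checks failed")
```

Checks like these only cover the cases someone thought to write down, so they measure the piece against the spec you have rather than proving it correct.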
If it looks like a duck and quacks like a duck... Do we really need to care if the duck was generated by an AI?
I highly doubt that.
Empirical studies show that humans have very little effect on error rates when reviewing code. That effect disappears quickly the more code you read.
Most programmers are bad at detecting UB and memory ownership and lifetime errors.
A piece of wood comes off the table: it’s cut or it’s not.
Code is far more complex.
And this is why we have languages and tooling that takes care of it.
There's only a handful of people who can one-shot perfect code in a language that doesn't guard against memory ownership or lifetime errors every time.
But even the crappiest programmer has to actually work against the tooling in a language like Rust to create ownership issues. Add linters, formatters and unit tests on top of that and it becomes nigh-impossible.
Now put an LLM in the same position, it's also unable to create shitty code when the tooling prevents it from doing so.
These tools are nothing alike and the reductionism of this metaphor isn’t helpful.
Maybe someone bumped the fence while you were on a break, or the vibration of it caused the jig to get a bit out of alignment.
The basic point is that whether a human or some kind of automated process, probabilistic or not, is producing something you still need to check the result. And for code specifically, we've had deterministic ways of doing that for 20 years or so.
As with LLMs, where careless use results in you dropping prod db or exposing user data.
The worst part about all reactionary scares is that, because the behaviors are driven by emotion and feeling as opposed to any intentional course of action, the outcomes are usually counterproductive. The current AI scare is exactly what you would want if you are OpenAI. Convince OSS, not to mention "free" software people, to run around dooming and ant milling each other about "AI bad" and pretty soon OSS is a poisonous minefield for any actual open AI, so OSS as a whole just sabotages itself and is mostly out of the fight.
I'm currently in the middle of trying to blow straight past this gatekeepy outer layer of the online discourse. What is a bit frustrating is knowing that while the seed will find the niches and begin spreading through invisible channels, in the visible channels, there's going to be all kinds of knee-jerk pushback from these anti-AI hardliners who can't distinguish between local AI and paying Anthropic for a license to use a computer. Worse, they don't care. The social psychosis of being empowered against some "others" is more important. Either that or they are bots.
And all of this is on top of what I've been saying for over a year. VRAM efficiency will kill the datacenter overspend. Local, online training will make it so that skilled users get better models over time, on their own data. Consultative AI is the future.
I have to remind myself that this entire misstep is a result of a broken information space, late-stage traditional social, filled with people (and "people") who have been programmed for years on performative clap-backs and middling ideas.
So fortunate to have some life before internet perspective to lean back on. My instinct and old-world common sense can see a way out, but it is nonetheless frustrating to watch the online discourse essentially blinding itself while doubling down on all this hand wringing to no end, accomplishing nothing more than burning a few witches and salting their own lands. You couldn't want it any better if you were busy entrenching.
The Linux Foundation itself is just one big, woke, leftist mess, with CV-stuffers from corporations in every significant position.
The rest of the world looks on in wonder at both sides of this.
The solution documented here seems very pragmatic. You as a contributor simply state that you are making the contribution and that you are not infringing on other people's work with that contribution under the GPLv2. And you document the fact that you used AI for transparency reasons.
There is a lot of legal murkiness around how training data is handled, and the output of the models. Or even the models themselves. Is something that in no way or shape resembles a copyrighted work (i.e. a model) actually distributing that work? The legal arguments here will probably take a long time to settle but it seems the fair use concept offers a way out here. You might create potentially infringing work with a model that may or may not be covered by fair use. But that would be your decision.
For small contributions to the Linux kernel it would be hard to argue that a passing resemblance of, say, a for loop in the contribution to some for loop in somebody else's code base would be anything other than coincidence or fair use.
AI output can be copyrighted when copyrighted elements are expressed in it, like if you put copyrighted content in a prompt and it is expressed in the output, or when the output is transformed substantially with human creativity in arrangement, form, composition, etc.
[1] https://newsroom.loc.gov/news/copyright-office-releases-part...
It's also not really clear if you can or cannot copyright AI output. The case that everyone cites didn't even reach the point where courts had to rule on that. The human in that case decided to file the copyright for an AI, and the courts ruled that according to the existing laws copyright must be filed by a person/human/whatever.
So we don't yet have caselaw where someone used AIgen and claimed the output as written by them.
Does a digitally encoded version resemble a copyrighted work in some shape or form? </snark>
Where is this hangup on models being something entirely different than an encoding coming from? Given enough prodding they can reproduce training data verbatim or close to that. Okay, given enough prodding notepad can do that too, so uncertainty is understandable.
This is one of the big reasons companies are putting effort into the so-called "safety": when the legal battles are eventually fought, they would have an argument that they did their best so that the amount of prodding required to extract any information potentially putting them under liability is too great to matter.
Well, that's different because an encoded image or video clearly intends to reproduce the original perfectly and the end result after decoding is (intentionally) very close to the form of the original. Which makes it a clear-cut case of being a copy of the original.
The reason so many cases don't get very far is that judges and lawyers mostly don't think like engineers. Copyright law predates most modern technology. So, everything needs to be rephrased in terms of people copying stuff for commercial gain. The original target of the law was people using printing presses to create copies of books written by others. Which was hugely annoying to some publishers who thought they had exclusive deals with authors. But what about academics quoting each other? Or literary reviews? Or summaries? Or people reading from a book on the radio? This stuff gets complicated quickly. Most of those things were settled a long time ago. Fair use is a concept that gets wielded a lot for this. Yes, it's a copy, but it's entirely reasonable for the copy holder to be doing what they are doing, and it is therefore not considered an infringement.
The rest is just centuries of legal interpretation of that and how it applies to modern technology. Whether that's DJs sampling music or artists working visual imagery into their artworks. AI is mostly just more of the same here. Yes, there are some legally interesting aspects with AI but not that many new ones. Judges are unlikely to rethink centuries of legal interpretations here and are more likely to try to reconcile AI with existing decisions. Any changes to the law would have to be driven by politicians; judges tend to be conservative with their interpretations.
So if the AI outputs Starry Night or Starry Night in different color theme, that's likely infringement without permission from van Gogh, who would have recourse against someone, either the user or the AI provider.
But a starry-night style picture of an aquarium might not be infringing at all.
>For small contributions to the Linux kernel it would be hard to argue that a passing resemblance of say a for loop in the contribution to some for loop in somebody else's code base would be anything else than coincidence or fair use.
I would argue that if it was a verbatim reproduction of a copyrighted piece of software, that would likely be infringing. But if it was similar only in style, with different function names and structure, probably not infringing.
Folks will argue that some things might be too small to do any different, for example a tiny snippet like python print("hello") or 1+1=2 or a for loop in your example. In that case it's too lacking in original expression to qualify for copyright protection anyway.
But your point still stands.
That is a non sequitur. Also, I'm not sure if copyright applies to humans, or persons (not that I have encountered particularly creative corporations, but Taranaki Maunga has been known for large scale decorative works)
However, if the code has been slightly changed by a human, it can be copyrighted again. I think.
US Copyright Office guidance in 2023 said work created with the help of AI can be registered as long as there is "sufficient human creative input". I don't believe that has ever been qualified with respect to code, but my instinct is that the way most people use coding agents (especially for something like kernel development) would qualify.
Though I guess such a suit is unlikely if the defendant could just AI wash the work in the first place.
I don't believe the idea that humans can or can't claim copyright over AI-authored works has been tested. The Copyright Office says your prompt doesn't count and you need some human-authored element in the final work. We'll have to see.
Copyright requires some amount of human originality. You could copyright the prompt, and if you modify the generated code you can claim copyright on your modifications.
The closest applicable case would be the monkey selfie.
https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
No, my understanding is that AI generated content can't be copyrighted by the AI. A human can still copyright it, however.
Whether a person can claim copyright of the output of a computer program is generally understood as depending on whether there was sufficient creative effort from said person, and it doesn't really matter whether the program is Photoshop or ChatGPT.
But you shouldn't be right. I mean, morally.
The law is a compromise between what the people in power want and what they can get away with without people revolting. It has nothing to do with morality, fairness or justice. And we should change that. The promise of democracy was (among other things) that everyone would be equal, everybody would get to vote and laws would be decided by the moral system of the majority. And yet, today, most people will tell you they are unhappy about the rising cost of living and rising inequality...
The law should be based on a complete and consistent moral system. And then plagiarism (taking advantage of another person's intellectual work without credit or compensation) would absolutely be a legal matter.
LLMs are not persons, not even legal ones (which itself is a massive hack causing massive issues such as using corporate finances for political gain).
A human has moral value; a text model does not. A human has limitations in both time and memory available; a model of text does not. I don't see why comparisons to humans have any relevance. Just because a human can do something does not mean machines run by corporations should be able to do it en masse.
The rules of copyright allow humans to do certain things because:
- Learning enriches the human.
- Once a human consumes information, he can't willingly forget it.
- It is impossible to prove how much a human-created intellectual work is based on others.
With LLMs:
- Training (let's not anthropomorphize: lossily-compressing input data by detecting and extracting patterns) enriches only the corporation which owns it.
- It's perfectly possible to create a model based only on content with specific licenses or only public domain.
- It's possible to trace every single output byte to quantifiable influences from every single input byte. It's just not an interesting line of inquiry for the corporations benefiting from the legal gray area.
If it's too hard to check outputs, don't use the tool.
Your arguments about copyright being different for LLMs: at the moment that's still being defined legally. So for now it's an ethical concern rather than a legal one.
For what it's worth, I agree that LLMs being trained on copyrighted material is an abuse of current human-oriented copyright laws. There's no way this will just continue to happen. Megacorps aren't going to lie down if there's a piece of the pie on the table, and then there's precedent for everyone else (class action, perhaps).
As for checking outputs - I don't believe that's sufficient. Maybe the letter of the law is flawed but according to the spirit the model itself is derivative work.
A model takes several orders of magnitude more work in the form of training data than it takes to code the training algorithm itself. To any reasonable and sane person, that makes it a derivative work of the training data by nearly 100% - we can only argue how many nines it should be.
> precedent
Yeah but the US system makes me very uneasy about it. The right way to do this is to sit down, talk about the options and their downstream implications, talking about fairness and justice and then deciding what the law should be. If we did that, copyright law would look very different in the first place and this whole thing would have an obvious solution.
That is not the case when using AI generated code. There is no way to use it without the chance of introducing infringing code.
Because of that if you tell a user they can use AI generated code, and they introduce infringing code, that was a foreseeable outcome of your action. In the case where you are the owner of a company, or the head of an organization that benefits from contributors using AI code, your company or organization could be liable.
But if a lawsuit were later brought, who would be sued? The individual author or the organization? In other words, can an organization reduce its liability if it tells its employees "You can break the law as long as you agree you are solely responsible for such illegal actions"?
It would seem to me that the employer would be liable if they "encourage" this way of working?
I think you’re looking for problems that don’t really exist here, you seem committed to an anti AI stance where none is justified.
If you don't think this is a problem, take a look at the terms of the enterprise agreements from OpenAI and Anthropic. Companies recognize this is an issue and so they were forced to add an indemnification clause, explicitly saying they'll pay for any damages resulting from infringement lawsuits.
Humans routinely produce code similar to or identical to existing copyrighted code without direct copying.
On independent creation: you are conflating the tool with the user. The defense applies to whether the developer had access to the copyrighted work, not whether their tools did. A developer using an LLM did not access the training set directly, they used a synthesis tool. By your logic, any developer who has read GPL code on GitHub should lose independent creation defense because they have "demonstrated capability to produce code directly from" their memory.
LLM memorization/regurgitation is a documented failure mode, not normal operation (nor typical case). Training set contamination happens, but it is rare and considered a bug. Humans also occasionally reproduce code from memory: we do not deny them independent creation defense wholesale because of that capability!
In any case, the legal question is not settled, but the argument that LLM-assisted code categorically cannot qualify for independent creation defense creates a double standard that human-written code does not face.
Practically speaking humans do not produce code that would be found in court to be infringing without intent.
It is theoretically possible, but it is not something that a reasonable person would foresee as a potential consequence.
That’s the difference.
> LLM memorization/regurgitation is a documented failure mode, not normal operation (nor typical case).
Exactly. It is a documented failure mode that you as a user have no capacity to mitigate or to even be aware is happening.
Double standards are perfectly fine. LLMs are not conscious beings that deserve protection under the law.
>not settled.
What appears to likely be settled is that human authorship is required, so there’s no way that an LLM could qualify for independent creation.
They wouldn't be some patsy that is around just to take blame, but the actual responsible party for the issue.
You hire an independent contractor and tell him that he can drive 60 miles per hour if he wants to but if it explodes he accepts responsibility.
He does and it explodes killing 10 people. If the family of those 10 people has evidence you created the conditions to cause the explosion in order to benefit your company, you're probably going to lose in civil court.
Linus benefits from the increased velocity of people using AI. He doesn't get to put all the liability on the people contributing.
Anyone who thinks they have a strong infringement case isn’t going to stop at the guy who authored the code, they’re going to go after anyone with deep pockets with a good chance of winning.
There is still the "mens rea" principle. If you distribute infringing material unknowingly, it would very likely not result in any penalties.
As long as everything is GPLv2-compatible it's okay.
Surely the person doing so would be responsible for doing so, but are they doing anything wrong?
You're perfectly at liberty to relicense public domain code if you wish.
The only thing you can't do is enforce the new license against people who obtain the code independently - either from the same source you did, or from a different source that doesn't carry your license.
If I use public domain code in a project under a license, the whole work remains under the license, but not the public domain code.
I'm not sure what the hullabaloo is about.
No, because they've independently obtained it from the same source that you did, so their copy is "upstream" of your imposing of a new license.
Realistically, adding a license to public domain work is only really meaningful when you've used it as a starting point for something else, and want to apply your license to the derivative work.
Remember that licenses are powered by copyright - granting a license to non-copyrighted code doesn't do anything, because there's no enforcement mechanism.
This is also why copyright reform for software engineering is so important, because code entering the public domain cuts the gordian knot of licensing issues.
If your license allows others to take the code and redistribute it with extra conditions, your code can be imported into the kernel. AFAIK there are parts of the kernel that are BSD-licensed.
Claiming copyright on an unmodified public domain work is a lie, so in some circumstances could be an element of fraud, but still wouldn’t be a copyright violation.
LLM-creation ("training") involves detecting/compressing patterns of the input. Inference generates statistically probable output based on similarities of patterns to those found in the "training" input. Computers don't learn or have ideas; they always operate on representations, and it's nothing more than any other mechanical transformation. It should not erase copyright any more than synonym substitution.
There's a pretty compelling argument that this is essentially what we do, and that what we think of as creativity is just copying, transforming, and combining ideas.
LLMs are interesting because that compression forces distilling the world down into its constituent parts and learning about the relationships between ideas. While it's absolutely possible (or even likely for certain prompts) that models can regurgitate text very similar to their inputs, that is not usually what seems to be happening.
They actually appear to be little remix engines that can fit the pieces together to solve the thing you're asking for, and we do have some evidence that the models are able to accomplish things that are not represented in their training sets.
Kirby Ferguson's video on this is pretty great: https://www.youtube.com/watch?v=X9RYuvPCQUA
If people find this cool and wanna play with it, they can, just make sure to only mix compatible licenses in the training data and license the output appropriately. Well, the attribution issue is still there, so maybe they can restrict themselves to public domain stuff. If LLMs are so capable, it shouldn't limit the quality of their output too much.
Now for the real issue: what do you think the world will look like in 5 or 10 years if LLMs surpass human abilities in all areas revolving around text input and output?
Do you think the people who made it possible, who spent years of their life building and maintaining open source code, will be rewarded? Or will the rich reap most of the benefit while also simultaneously turning us into beggars?
Even if you assume 100% of the people doing intellectual work now will convert to manual work (i.e. there's enough work for everyone) and robots don't advance at all, that'll drive the value of manual labor down a lot. Do you have it gamed out in your head and believe somehow life will be better for you, let alone for most people? Or have you not thought about it at all yet?
I think they should be rewarded more than they are currently. But isn't the GNU General Public License basically saying you can use such source code without giving any rewards whatsoever?
But I see your point. The reward for Open Source developers is the public recognition for their works. LLMs can take that recognition away.
That is at the moment:
- Nobody knows for sure what agents might add and their long term effects on codebases.
- It's at best unclear that AI content in a codebase can be reliably determined automatically.
- Even if it's not malicious, at least some of its contributions are likely to be deleterious and pass undetected by human review.
It's different from the regular single purpose static tools.
> AI agents MUST NOT add Signed-off-by tags. Only humans can legally certify the Developer Certificate of Origin (DCO).
They mention an Assisted-by tag, but that also contains stuff like "clang-tidy". Surely you're not interpreting that as people "attributing" the work to the linter?
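To make the Signed-off-by vs. Assisted-by distinction concrete, here is a minimal sketch of how a reviewer's script might check a commit message for both trailers. It is only an illustration, not the kernel's actual tooling: the function names `parse_trailers` and `check_commit`, the `known_agents` list, and the "ExampleAgent" name are all hypothetical, and the sketch makes no assumption about the exact value format of an Assisted-by tag beyond "Key: value" on its own line.

```python
import re

# Generic "Key: value" trailer line, e.g. "Signed-off-by: Jane Dev <jane@example.org>".
TRAILER_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.+)$")


def parse_trailers(message):
    """Return (key, value) pairs found in the last paragraph of a commit message."""
    paragraphs = [p for p in re.split(r"\n\s*\n", message.strip()) if p.strip()]
    if not paragraphs:
        return []
    trailers = []
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match:
            trailers.append((match.group(1), match.group(2).strip()))
    return trailers


def check_commit(message, known_agents):
    """Collect sign-offs and assistance tags, and flag obvious problems.

    known_agents is a hypothetical, caller-supplied list of tool names that
    should never appear as the signer of a commit.
    """
    trailers = parse_trailers(message)
    signers = [v for k, v in trailers if k.lower() == "signed-off-by"]
    assisted = [v for k, v in trailers if k.lower() == "assisted-by"]

    problems = []
    if not signers:
        problems.append("missing Signed-off-by")
    for signer in signers:
        if any(agent.lower() in signer.lower() for agent in known_agents):
            problems.append(f"Signed-off-by names a tool: {signer}")
    return {"signers": signers, "assisted_by": assisted, "problems": problems}


if __name__ == "__main__":
    msg = (
        "fix: handle null pointer\n\n"
        "Longer explanation of the change.\n\n"
        "Assisted-by: clang-tidy\n"
        "Signed-off-by: Jane Dev <jane@example.org>\n"
    )
    print(check_commit(msg, ["clang-tidy", "ExampleAgent"]))
```

Note that a check like this only records what the contributor declared. It treats Assisted-by as disclosure, the same way for a linter or an LLM, and leaves responsibility with whoever signed off.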