The short answer is that we don't know. The longer answer based purely on this case is that there's an argument that training is fair use and so copyleft doesn't have any impact on the model, but this is one case in California and doesn't inherently set precedent in the US in general and has no impact at all on legal interpretations in other countries.
reply
The dearth of case law here still makes a negative outcome for the FSF pretty dangerous, even if they don't appeal it and set precedent in higher courts. It might not be binding, but every subsequent case will be able to cite it, potentially even in other common law countries that lack case law on the topic.

And then there is the chilling effect. If the FSF can't enforce their license, who is going to sue to overturn the precedent? Large companies, publishers, and governments have mostly all done deals with the devil now. Is Joe Blow random developer going to get a strip mall lawyer and overturn this? Seems unlikely.

reply
I don't think this argument is a winner. It fails on a few grounds:

First, unless you can point to regurgitation of memorized code, you're not able to make an argument about distribution or replication. This is part of the problem that most publishers are having with prose text and LLMs. Modern LLMs don't memorize Harry Potter like GPT-3 did. The memorization older models showed came from problems in the training data, e.g. Harry Potter and people writing about Harry Potter are extraordinarily over-represented. It's similar to how with Stable Diffusion you could prompt for anything in the region of "Van Gogh's Starry Night" and get it, since it was in the training data 50-100 different ways. You can't reliably do this with Opus or GPT-5. If they're not redistributing the code verbatim, they're not in violation of the license. One could argue that the models produce "derivative works," but...

The derivative works argument is inapt. The point of it is to disrupt someone's end-run around the license by saying that building on top of GPL code is not enough to non-GPL it. We imagine this will still work for LLMs because of the GPL's virality--I can't enclose a critical GPL module in non-GPL code and not release the GPL code. But the models aren't DOING THAT. They're not reaching for XYZ GPL'd project to build with. They're vibing out a sparsely connected network of information about literally trillions of lines of software. What comes out is a mishmash of code from here and there, and it only coincidentally resembles GPL code, when it does. In order to make this argument work, you need a theory of how LLMs are trained and operate that supports it. Regardless of whether or not such a theory exists, in court you'd need to show that your theory was better than the company's expert witness's theory. Good luck.

Second, infringement would need discovery to uncover and would be contingent on user input. This is why the NYT sued for deleted user prompts to ChatGPT--the plaintiffs can't show in public that the content is infringing, so they need to seek discovery to find evidence. That's only going to work in cases where you survive a motion to dismiss--which is EXACTLY where a few of these suits have failed. You need to show first that you can succeed on the merits, then you proceed. That will cut down many of these challenges since they just can't show the actual infringement.

Third, and I think this is the most important: the license protections here are enforced by *copyright*. For copyright it very much matters whether something is lifted verbatim vs. independently produced--things like clean room design have mattered to real courts on real matters. In contrast, copyright doesn't care if the outcome is merely close. That's very much a concern for patents. If I patent a gizmo and you produce a gizmo that operates through nearly identical mechanisms to those I patented, then you can be sued--they don't need to be exact. If I write a novel about a boy wizard with glasses who takes a train to a school in Scotland and you write a novel about a boy wizard with glasses who takes a boat to a school in Inishmurray, I can't sue you for copyright infringement. You need to copy the words I wrote and distribute them for it to rise to a violation.

reply
> Modern LLMs don't memorize harry potter like GPT3 did. [...] You can't reliably do this with Opus or GPT5.

If you try any modern LLM, you will find that you can. Easily [0], reliably [1], consistently [2]. All these examples are with models released in 2025/26.

[0] https://arxiv.org/html/2601.02671

[1] https://arxiv.org/abs/2506.12286

[2] https://ai.stanford.edu/blog/verbatim-memorization/

reply
You can't do that without already having the contents of the book, in which case getting an LLM to regurgitate it with partial prompting shouldn't be legally relevant at all. What it regurgitates will have errors, and if you try to chain that as prompt cues without re-basing each cue to the actual text (which you have separately), the LLM's output will rapidly lose coherence with the original work.

If its responses were perfect so that you could chain them, or if you could ask "please give me words 10-15 of chapter 3 paragraph 4 of HPatSS" and it did so, then you'd have a better case to complain. Still, the counterargument is that repeated prompting like that, explicitly asking for copyright violation, is the real crime. Are you going to throw someone in prison if they memorize the entirety of HPatSS and recite arbitrary parts of it on demand?

Combining both issues: that LLMs are only regurgitating mostly accurate continuations, and they're only providing that to the person who explicitly asked... any meaningful copyright violation moves downstream. If you record someone reciting HPatSS from memory, and post it on youtube, you are (or should be considered) the real copyright violator, not them.

If you ask for an identifiable short segment of writing, or a piece of art, and get something close enough to violate copyright, that should really be your problem if you redistribute it (whether manually, or because you've built something that lets third parties submit LLM prompts and feeds the answers back to them, and they go on to redistribute the output).

Blaming LLMs for "copyright violation" is like persuading someone who can't understand what they're doing to do something illegal and then blaming them for it.

reply
So, did they have to do anything special to those models in order to get them to regurgitate ~100%? Any special prompts they needed to use to get Sonnet to cough that up?

What is the real copyright risk of there being an arcane procedure to sometimes recover most of a text? So far it’s nothing. Which is what I’m saying. Pragmatically this is a loser of an argument in a court room. It is too easy for the chain of reasoning to be disrupted and even undisrupted the argument for model maker liability is attenuated.

reply
> unless you can point to regurgitation of memorized code

I have, on many occasions, gotten an LLM to do just this. It's not particularly hard. In the most recent case, Google's search bar LLM happily regurgitated a Digital Ocean article as if it were its own output. Searching for some strings in the comments located the original page, and it was a 95% match between origin and output.

> The memorization older models showed came from problems in the training data,

And what proof do you have that they "fixed" this? And what was the fix?

> harry potter and people writing about harry potter

I'm not sure that's how you get GPT to reproduce upwards of 85% of Harry Potter novels.

> Second, infringement would need discovery to uncover and would be contingent on user input.

That's not at all how copyright infringement works. That would only matter if you wanted to prove willfulness and get enhanced damages. Copyright infringement is an exceptionally simple violation of the law. You either copied, or you did not.

> For copyright it very much matters if something is lifted verbatim vs modified.

Transformation is a valid defense for _some_ uses. It is far weaker for commercial uses. Using LLM-generated code for commercial purposes is a hazard.

reply
This must be why all of these copyright plaintiffs are having tremendous days in court! If even half of this were correct, they wouldn’t be losing in summary judgment.

We have yet to see a single judgment come down against a model maker for distributing the gist of content. We have yet to see a single judgment come down against a model maker for infringement at all.

Copyright is just an inapt tool here. It’s not going to do the job. It is not as though big interests have not tried to use this tool. It just doesn’t reflect what’s actually happening and it’s going to lose again and again.

We can imagine a theoretical legal regime where what is done with large language models counts as copyright infringement, we just don’t live in a world where that regime holds.

reply