That's not what the law says [1]. If two people happen to independently create the same thing they each have their own copyright.
If it's highly improbable that two works are independent (e.g. the gcc code base), the first author would probably go to court claiming copying, but their case would still fail if the second author could show that their work was independent, no matter how improbable.
[1] https://lawhandbook.sa.gov.au/ch11s13.php?lscsa_prod%5Bpage%...
It is also true that in all the cases that I know about where that has occurred the courts have taken a very, very, very close look at the situation and taken extensive evidence to convince the court that there really wasn't any copying. It was anything but a "get out of jail free" card; it in fact was difficult and expensive, in proportion to the size of the works under question, to prove to the court's satisfaction that the two things really were independent. Moreover, in all the cases I know about, they weren't actually identical, just, really really close.
No rational court could possibly ever accept a claim that someone independently came up with a line-by-line copy of gcc. The probability of that is one out of ten to the "doesn't even remotely fit in this universe so forget about it". The bar to overcoming it is simply impossibly high, unlike two songs that happen to have similar harmonies and melodies, given the exponentially more constrained space of "simple song" as compared to a compiler suite.
I know it's a popular misconception that "impossible" = a strict, statistical, mathematical 0, but if you try to use that in real life it turns out to be pretty useless. It also tends to bother people that there isn't a bright shining line between "possible" and "impossible" like there is between "0 and strictly not 0", but all you can really do is deal with it. Wherever the line is, this is literally millions of orders of magnitude on the wrong side of it. Not a factor of millions, a factor of ten to the millions. It's not possible to "accidentally" duplicate a work of that size.
I suppose a different way of stating my position is that some activities that don't look like copying are in fact copying. For instance it would not be required to find a literal copy of the GCC codebase inside of the LLM somehow, in order for the produced work to be a copy. Likewise if I specify that "Harry Potter and the Philosopher's Stone is the text file with hash 165hdm655g7wps576n3mra3880v2yzc5hh5cif1x9mckm2xaf5g4" and then someone else uses a computer to brute force find a hash collision, I suspect this would still be considered a copy.
I think there is a substantial risk that the automatic translation done in this case is, at least in part, copying in the above sense.
It's an interesting case. As I understand it, there is an ongoing debate within the AI research community as to whether neural nets are encoding verbatim blocks of information or creating a model which captures the "essence" or "ideas" behind a work. If they are capturing ideas, which are not copyrightable, it would suggest that LLMs can be used to "launder" copyright. In this case, I get the feeling that, for legal clarity, we would both say that the work in question (or works derived from it) should not be part of the training set or prompt, emulating a clean room implementation by a human. (Is that a fair comment?)
I've no direct experience here, but I would come down on the side of "LLMs are encoding (copyrightable) verbatim text", because others are reporting that LLMs do regurgitate word-for-word chunks of text. Is this always the case though? Do different AI architectures, or models that are less well fitted, encode ideas rather than quotes?
Edit: It would be an interesting experiment to use two LLMs to emulate a clean room implementation. The first is instructed to "produce a description of this program". The second, having never seen the program, in its prompt or training set, would be prompted to "produce a program based on this description". A human could vet the description produced by the first LLM for cleanliness. Surely someone has tried this, though it might be a challenge to get an LLM that is guaranteed not to have been exposed to a particular code base or its derivatives?
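A minimal sketch of that two-model pipeline, with `complete` as a stand-in you'd wire to whatever LLM API you use (all names here are invented for illustration, not from any real library):

```python
def clean_room_rewrite(original_code: str, complete) -> str:
    """Two-phase 'clean room' pipeline sketch.

    `complete` is a hypothetical callable (prompt -> text) wrapping an
    LLM; in a real setup the two calls would go to two separate models,
    with a human vetting the spec between the phases.
    """
    # "Dirty" phase: the model that has seen the original emits only a spec.
    spec = complete(
        "Describe what this program does as a functional specification. "
        "Do not quote or paraphrase any code:\n" + original_code
    )
    # Human review of `spec` for copied expression happens here.
    # "Clean" phase: a model that has never seen the original implements it.
    return complete("Write a program implementing this specification:\n" + spec)
```

The hard part remains the one noted above: guaranteeing that the "clean" model's training data never contained the original or its derivatives, which no hosted model currently promises.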
Patent law is different and doesn't rely on information flow in the same way.
"Like any reader aspiring to be a writer, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different"
However, describing the path you need to get there requires copyright infringement.
I know you were simplifying, and not to take away from your well-made broader point, but an API-derived implementation can still result in problems, as in Google vs Oracle [1]. The Supreme Court found in favor of Google (6-2) along "fair use" lines, but the case dodged setting any precedent on the nature of API copyrightability. I'm unaware if future cases have set any precedent yet, but it just came to mind.
[1]: https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
Also, I find it important that here the API is really minimal (compared to the Java std lib), the real value of the library is in the internal detection logic.
I think there is precedent that says exactly this - for example the BIOS rewrites for the IBM PC from people like Phoenix. And it would be trivial to instruct an LLM to prefer to use (say, in assembler) register C over register B wherever that was possible, resulting in different code.
If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright. You're going to get sued and lose, but platonically, you're in the clear. If it's merely somewhat similar, then you're probably in the clear in practice too: it gets very easy very fast to argue that the similarities are structural consequences of the uncopyrightable parts of the functionality.
> The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly).
This is almost the opposite of correct. A clean room implementation's dirty phase produces a specification that is allowed to include uncopyrightable implementation details. It is NOT defined as producing an API, and if you produce an API spec that matches the original too closely, you might have just dirtied your process by including copyrightable parts of the shape of the API in the spec. Google vs Oracle made this more annoying than it used to be.
> Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.
If you follow CRRE, it's not a copy, full stop, even if it's somehow 1:1 identical. It's going to be JUDGED as a copy, because substantial similarity for nontrivial amounts of code means that you almost certainly stepped outside of the clean room process and it no longer functions as a defense, but if you did follow CRRE, then it's platonically not a copy.
> What the chardet maintainers have done here is legally very irresponsible.
I agree with this, but it's probably not as dramatic as you think it is. There was an issue with a free Japanese font/typeface a decade or two ago that was accused of mechanically (rather than manually) copying the outlines of a commercial Japanese font. Typeface outlines aren't copyrightable in the US or Japan, but they are in some parts of Europe, and the exact structure of a given font is copyrightable everywhere (e.g. the vector data or bitmap field for a digital typeface, as opposed to the idea of its shape). What was the outcome of this problem? Distros stopped shipping the font and replaced it with something vaguely compatible. Was the font actually infringing? Probably not, but better safe than sorry.
I don't believe this, and I doubt that the sense of copying in copyright law is so literal. For instance, if I generated the exact text of a novel by looking for hash collisions, or by producing random strings of letters, or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it. There need not have been any copy anywhere near me for this to happen. Whether it is likely or not depends on the technique used - naive techniques make this very unlikely, but techniques can improve.
It is also true that similarity does not imply copying - if you and I take an identical photograph of the same skyline, I have not copied you and you have not copied me, we have just fixed the same intangible scene into a medium. The true subjective test for copying is probably quite nuanced; I am not sure whether it is triggered in this case, but I don't think "clean room LLMs" are a panacea either.
> dirty phase produces a specification ... it is NOT defined as producing an API
This does not really sound like "the opposite of correct". APIs are usually not copyrightable; the truth is of course more complicated. If you are happy to replace "API" with "uncopyrightable specification" then we can probably agree and move on.
> it's probably not as dramatic as you think it is
In reality I am very cynical and think nothing will come of this, even if there are verbatim snippets in the produced code. People don't really care very much, and copyright cases that aren't predicated on millions of dollars do not survive the court system very long.
It is actually that literal, really.
> For instance, if I generated the exact text of a novel by looking for hash collisions,
This is a copyright violation because you're using the original to construct the copy. It's not a pure RNG.
> or by producing random strings of letters,
This wouldn't be a copyright violation, but nobody would believe you.
> or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it.
This would probably be a copyright violation.
You probably think that this is hypothetical, but problems like this do actually go to court all the time, especially in the music industry, where people try to enforce copyright on melodies that have the informational uniqueness of an eight-word sentence.
> APIs are usually not copyrightable,
This was commonly believed among developers for a long time, but it turned out to not be true.
> This does not really sound like "the opposite of correct".
The important part is that information about the implementation can absolutely be in the spec without necessarily being copyrightable (and in real world clean room RE, you end up with a LOT of implementation details). You were saying the opposite, that it was a spec of the API as opposed to a spec of the implementation.
What color are your bits? That's all the law cares about.
The first sentence is the title of an essay.
a bunch of people get together, rewrite something while making a pinky promise not to look at the original source code
guaranteeing the premise is basically impossible, it sounds like some legal jester dance done to entertain the already absurd existing copyright laws
Clean room implementations are a jester dance around the judiciary. The whole point is to avoid legal ambiguity.
You are not required to do this by law, you are doing this voluntarily to make potential legal arguments easier.
The alternative is going over the whole codebase in question and arguing basically line by line whether things are derivative or not in front of a judge (which is a lot of work for everyone involved, subjective, and uncertain!).
I've always taken "clean room" to be the kind of manufacturing clean room (sealed/etc). You're given a device and told "make our version". You're allowed to look, poke, etc but you don't get the detailed plans/schematics/etc.
In software, you get the app or API and you can choose how to re-implement.
In open source, yes, it seems like a silly thing and hard to prove.
This is incorrect, and thinking this can get you sued.
https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...
Per your link, the Supreme Court's thinking on "structure, sequence and organization" (Oracle's argument why Google shouldn't even be allowed to faithfully produce a clean-room implementation of an API) has changed since the 1980s out of concern that using it to judge copyright infringement risks handing copyright holders a copyright-length monopoly over how to do a thing:
> enthusiasm for protection of "structure, sequence and organization" peaked in the 1980s [..] This trend [away from "SS&O"] has been driven by fidelity to Section 102(b) and recognition of the danger of conferring a monopoly by copyright over what Congress expressly warned should be conferred only by patent
The Supreme Court specifically recognised Google's need to copy the structure, sequence and organization of Java APIs in order to produce a cleanroom Android runtime library that implemented Java APIs so that existing Java software could work correctly with it.
Similarly, see Oracle v. Rimini Street (https://cdn.ca9.uscourts.gov/datastore/opinions/2024/12/16/2...) where Rimini Street has been producing updates that work with Oracle's products, and Oracle claimed this made them derivative works. The Court of Appeals decided that no, the fact A is written to interoperate with B does not necessarily make A a derivative work of B.
When a developer reimplements a completely new version of code from scratch, working from an understanding alone, the new implementation should generally be an improvement on the original source code, not its equal.
In today’s world, letting LLMs replicate anything will generate merely average code and, unless well managed, create equivalent or even more bloat anyway.
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
> They did it by making the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch.
Finding a middle ground of building a roadmap to refactoring your way forward is often much better.
Appreciate the Joel link, nice to see that kind of stuff again.
With that being said, if it's the same small team that built the first version, there can be a calculated risk to driving a refactor towards a rewrite under the right conditions. I say this because I have been able to do it under those conditions a few times; it still remains very risky. If it's a new or different team later on trying to rewrite, all bets are off anyways.
We have to remember 70% of software projects fail at the best of times, independent of rewrites.
Perhaps the maintainer wants to force the issue?
> Any downstream user of the library is at risk of the license switching from underneath them.
Checking the license of the transitive closure of your dependencies is table stakes for using them.
I doubt it, and I don't see any evidence that's what they're doing. There are probably better ways, if that's what they want.
> Checking the license of the transitive closure of your dependencies is table stakes for using them.
Checking the license of the transitive closure of your dependencies is only feasible when the library authors behave responsibly.
i.e. a re-implementation
which can either
- be still derived work, i.e. seen as you just obfuscating a copyright violation
- be a new work doing the same
nothing prevents an AI from producing a spec based on a API, API documentation and API usage/fuzzing and then resetting the AI and using that spec to produce a rewrite
I mean "doing the same" is NOT covered by copyright protection, you need patent law for that. Except even with patent law you protect innovations/concepts, not the exact implementation details. Which means that even if there are software patents (theoretically (1)), most things done in software wouldn't be patentable (as they are just implementation details, not inventions).
(1): I say theoretically because there is a very long track record of a lot of patents being granted which really should never have been granted. This, combined with the high cost of invalidating patents, has caused a great deal of economic damage.
Ted Nelson was years ahead of the future where we really needed his Xanadu to keep track of fractional copyright. Likely if we had such a mechanism, and AI authors respected it then we would be able to say that your work is derived from 3000 other original works and that you added 6 lines of new code.
AI/ML is complex, so as a simpler analogy: If I watch The Simpsons, and I create an amusing infographic of how often Homer says "D'oh!" over time, my infographic would be an original work. AI training follows the same principle.
> AI training follows the same principle.
If you really believe that then we can't have a meaningful conversation about this, that's not even ELI5 territory, that's just disconnected. You should be asking questions, not telling people how it works.
In fact we could make this concrete: use the model as the prediction stage in a compressor, and compress gcc with it. The residual is the extent to which it doesn't contain gcc.
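As a toy illustration of that measurement (a character-bigram model standing in for the LLM; the principle is the same: the better the model predicts a work, the fewer residual bits are needed to reconstruct it):

```python
import math
from collections import Counter, defaultdict

def bigram_model(corpus: str):
    """Character-bigram model: P(next char | current char), add-one smoothed."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    alphabet = sorted(set(corpus))
    def prob(ctx: str, ch: str) -> float:
        c = counts[ctx]
        return (c[ch] + 1) / (sum(c.values()) + len(alphabet))
    return prob

def residual_bits(text: str, prob) -> float:
    """Ideal compressed size of `text` under the model (Shannon code length).
    A model that has effectively memorized `text` assigns it probability
    near 1, and the residual approaches zero; a model that hasn't leaves
    most of the bits in the residual."""
    return sum(-math.log2(prob(a, b)) for a, b in zip(text, text[1:]))
```

The same measurement works with an LLM's token probabilities in place of `prob`: compress the gcc source with the model as predictor, and the size of the residual is exactly the extent to which the model does not contain gcc.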
https://osyuksel.github.io/blog/reconstructing-moby-dick-llm...
I see a test where one model managed to reproduce a paragraph with 85% accuracy, given 3 input paragraphs, less than 50% of the time.
So it can't even produce 1 paragraph given 3 as input, and it can't even get close half the time.
"Contains Moby Dick" would be something like you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that when given passages can do an okay job at predicting a sentence or two, but otherwise quickly diverges.
Getting close less than half the time given three paragraphs as input still sounds like red-handed copyright infringement to me.
If I sample a copyrighted song in my new track, clip it, slow it down, and decimate the bit rate, a court would not let me off the hook.
It doesn't matter how much context you push into these things. If I feed them 50% of Moby Dick and they produce the next word, and I can repeatedly do that to produce the entire book (I'm pretty sure the number of attempts is wholly irrelevant: we're impossibly far from monkeys on typewriters) then we can prove the statistical model encodes the book. The further we are from that (and the more we can generate with less) then the stronger the case is. It's a pretty strong case!
> If I feed them 50% of Moby Dick and they produce the next word and I can repeatedly do that to produce the entire book... then we can prove the statistical model encodes the book.
It can't because it doesn't. That's what it means to say it diverges.
The "number of attempts" is you cheating. You're giving it the book when you let it try again word by word until it gets the correct answer, and then claiming it produced the book. That's exactly the residual that I said characterizes the extent to which it doesn't know the book. Trivially, no matter how bad the model is, if you give it the residual, it can losslessly compress anything at all.
If you had a simple model that just predicts next word given current word (trained on word pair frequency across all English text, or even all text excluding Moby Dick), and then give it retries until it gets the current word right, it will also quickly produce the book. Because it was your retry policy that encoded the book, not the model. Without that policy, it will get it wrong within a few words, just like these models do.
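That point can be made concrete: with an accept/reject oracle, even a model that knows nothing "reproduces" any text, because the verdicts themselves carry the information. A sketch (character-level, names invented for illustration):

```python
import random

def reproduce_with_oracle(target: str, sample, seed: int = 0):
    """Reproduce `target` one token at a time, resampling until an oracle
    (comparison against the target itself) accepts each guess.

    Returns the reconstruction and the number of accept/reject verdicts;
    that feedback channel, not the model, is what encodes the text.
    """
    rng = random.Random(seed)
    out, verdicts = [], 0
    for true_tok in target:
        while True:
            guess = sample(rng, out)
            verdicts += 1  # each verdict leaks at most one bit of the target
            if guess == true_tok:
                break
        out.append(guess)
    return "".join(out), verdicts
```

Even `sample = lambda rng, ctx: rng.choice(alphabet)`, a model with zero knowledge of the text, reconstructs the whole thing under this protocol, which is exactly why retry-until-match tells you nothing about what the model contains.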
If you had access to a model's top p selection then I'd bet the book is in there consistently for every token. Is it statistically significant? Might be!
I'm not cheating because the number of attempts is so low it's irrelevant.
If I were to take a copyrighted work and chunk it up into 1000 pieces and encrypt each piece with a unique key, and give you all the pieces and keys, would it still be the copyrighted work? What if I shave off the last bit of each key before I give them to you, so you have a 50% chance of guessing the correct key for each piece? What if I shave two bits? What if it's a million pieces? When does it become transformative or no longer infringing for me to distribute?
The answer might surprise you.
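To see why shaving bits changes so little, here is a sketch of the recovery side (one-byte XOR keys as a toy cipher; all names invented for illustration). Each shaved bit merely doubles the recipient's work, so distributing ciphertext plus almost-complete keys is informationally very close to distributing the work itself.

```python
def xor_byte(data: bytes, key: int) -> bytes:
    """'Encrypt'/decrypt a chunk with a one-byte XOR key (toy cipher)."""
    return bytes(b ^ key for b in data)

def recover(chunk_ct: bytes, key_hi: int, shaved: int, looks_right):
    """`key_hi` is the key with its low `shaved` bits withheld. Trying the
    2**shaved completions until the plaintext passes `looks_right` is all
    the 'work' the recipient has to do: four guesses for two shaved bits."""
    for low in range(2 ** shaved):
        pt = xor_byte(chunk_ct, (key_hi << shaved) | low)
        if looks_right(pt):
            return pt
    return None
```

With two bits shaved per key and a million chunks, the recipient does about four million cheap trials, which is nothing; the question is whether a court would care about the arithmetic at all, or just look at who supplied the information.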
Consider a password consisting of random words each chosen from a 4k dictionary. Say you choose 10 words. Then your password has log_2(4k)*10 entropy.
Now consider a validator that tells you when you get a word right. Then you can guess one word at a time, and your password strength is log_2(4k*10). Exponentially weaker.
You're constructing the second scenario and pretending it's the first.
Also in your 50% probability scenario, each word is 1 bit, and even 50-100 bits is unguessable. A 1000 word key where each word provides 1 bit would be absurdly strong.
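The two strengths in that comparison, worked out (using the 4k-word dictionary and 10 words from above):

```python
import math

words = 4000  # dictionary size ("4k")
n = 10        # words in the password

# Guessing all at once: entropies add, so strength is exponential in n.
all_at_once = math.log2(words) * n   # ~ 119.7 bits

# A per-word validator turns it into n independent 4k-way puzzles,
# so the attacker's work only adds: exponentially weaker.
word_by_word = math.log2(words * n)  # ~ 15.3 bits
```

About 120 bits versus about 15: the validator collapses the search from one space of size 4000^10 into ten spaces of size 4000 each.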
I wonder what the results would be if I spent time to train a model up from scratch without any such constraints. But I'm much too busy with other stuff right now, but that would be an interesting challenge.
These companies just don't want to deal with people complaining that it reproduces something when they don't understand that they're literally giving it the answer.
For a fan fiction episode that is different from all official episodes, you may cross your fingers.
For a remake of one of the episodes with a different camera angle and similar dialog, I expect that you will run into problems.
I beg to differ. Please examine any of my recent codebases on github (same username); I have cleanroom-reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now, right here, so here we go...)
https://github.com/pmarreck?tab=repositories&type=source
I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.
With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
Researchers have shown that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.
Kinda weird argument; in their research (https://forum.gnoppix.org/t/researchers-extract-up-to-96-of-...) the LLM was explicitly asked to reproduce the book. There are people who can do so without LLMs; by this logic, everything they write is a copyright infringement of every book they can reproduce.
> Yes if you are solving the exact problem that the original code solved and that original code was labeled as solving that exact problem then that’s very good reason for the LLM to produce that code.
I think you're overestimating LLM ability to generalize.
My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.