That's not what the law says [1]. If two people happen to independently create the same thing they each have their own copyright.
If it's highly improbable that two works are independent (e.g. the gcc code base), the first author would probably go to court claiming copying, but their case would still fail if the second author could show that their work was independent, no matter how improbable.
[1] https://lawhandbook.sa.gov.au/ch11s13.php?lscsa_prod%5Bpage%...
It is also true that in all the cases that I know about where that has occurred the courts have taken a very, very, very close look at the situation and taken extensive evidence to convince the court that there really wasn't any copying. It was anything but a "get out of jail free" card; it in fact was difficult and expensive, in proportion to the size of the works under question, to prove to the court's satisfaction that the two things really were independent. Moreover, in all the cases I know about, they weren't actually identical, just, really really close.
No rational court could possibly ever accept a claim that someone independently came up with a line-by-line copy of gcc. The probability of that is one out of ten to the "doesn't even remotely fit in this universe so forget about it". The bar to overcoming it is simply impossibly high, unlike two songs that happen to have similar harmonies and melodies, given the exponentially more constrained space of "simple song" as compared to a compiler suite.
I know it's a popular misconception that "impossible" = a strict, statistical, mathematical 0, but if you try to use that in real life it turns out to be pretty useless. It also tends to bother people that there isn't a bright shining line between "possible" and "impossible" like there is between "0 and strictly not 0", but all you can really do is deal with it. Wherever the line is, this is literally millions of orders of magnitude on the wrong side of it. Not a factor of millions, a factor of ten to the millions. It's not possible to "accidentally" duplicate a work of that size.
I suppose a different way of stating my position is that some activities that don't look like copying are in fact copying. For instance it would not be required to find a literal copy of the GCC codebase inside of the LLM somehow, in order for the produced work to be a copy. Likewise if I specify that "Harry Potter and the Philosopher's Stone is the text file with hash 165hdm655g7wps576n3mra3880v2yzc5hh5cif1x9mckm2xaf5g4" and then someone else uses a computer to brute force find a hash collision, I suspect this would still be considered a copy.
I think there is a substantial risk that the automatic translation done in this case is, at least in part, copying in the above sense.
It's an interesting case. As I understand it, there is an ongoing debate within the AI research community as to whether neural nets are encoding verbatim blocks of information or creating a model which captures the "essence" or "ideas" behind a work. If they are capturing ideas, which are not copyrightable, it would suggest that LLMs can be used to "launder" copyright. In this case, I get the feeling that, for legal clarity, we would both say that the work in question (or works derived from it) should not be part of the training set or prompt, emulating a clean room implementation by a human. (Is that a fair comment?)
I've no direct experience here, but I would come down on the side of "LLMs are encoding (copyrightable) verbatim text", because others are reporting that LLMs do regurgitate word-for-word chunks of text. Is this always the case though? Do different AI architectures, or models that are less well fitted, encode ideas rather than quotes?
Edit: It would be an interesting experiment to use two LLMs to emulate a clean room implementation. The first is instructed to "produce a description of this program". The second, having never seen the program, in its prompt or training set, would be prompted to "produce a program based on this description". A human could vet the description produced by the first LLM for cleanliness. Surely someone has tried this, though it might be a challenge to get an LLM that is guaranteed not to have been exposed to a particular code base or its derivatives?
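A minimal sketch of that two-model pipeline, with `complete` as a stand-in you'd wire to whatever LLM API you use (all names here are invented for illustration, not from any real library):

```python
def clean_room_rewrite(original_code: str, complete) -> str:
    """Two-phase 'clean room' pipeline sketch.

    `complete` is a hypothetical callable (prompt -> text) wrapping an
    LLM; in a real setup the two calls would go to two separate models,
    with a human vetting the spec between the phases.
    """
    # "Dirty" phase: the model that has seen the original emits only a spec.
    spec = complete(
        "Describe what this program does as a functional specification. "
        "Do not quote or paraphrase any code:\n" + original_code
    )
    # Human review of `spec` for copied expression happens here.
    # "Clean" phase: a model that has never seen the original implements it.
    return complete("Write a program implementing this specification:\n" + spec)
```

The hard part remains the one noted above: guaranteeing that the "clean" model's training data never contained the original or its derivatives, which no hosted model currently promises.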
Patent law is different and doesn't rely on information flow in the same way.
"Like any reader aspiring to be a writer, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different"
However, describing the path you need to get there requires copyright infringement.
I know you were simplifying, and not to take away from your well-made broader point, but an API-derived implementation can still result in problems, as in Google vs Oracle [1]. The Supreme Court found in favor of Google (6-2) along "fair use" lines, but the case dodged setting any precedent on the nature of API copyrightability. I'm unaware if future cases have set any precedent yet, but it just came to mind.
[1]: https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
Also, I find it important that here the API is really minimal (compared to the Java std lib), the real value of the library is in the internal detection logic.
I think there is precedent that says exactly this - for example the BIOS rewrites for the IBM PC from people like Phoenix. And it would be trivial to instruct an LLM to prefer to use (say, in assembler) register C over register B wherever that was possible, resulting in different code.
If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright. You're going to get sued and lose, but platonically, you're in the clear. If it's merely somewhat similar, then you're probably in the clear in practice too: it gets very easy very fast to argue that the similarities are structural consequences of the uncopyrightable parts of the functionality.
> The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly).
This is almost the opposite of correct. A clean room implementation's dirty phase produces a specification that is allowed to include uncopyrightable implementation details. It is NOT defined as producing an API, and if you produce an API spec that matches the original too closely, you might have just dirtied your process by including copyrightable parts of the shape of the API in the spec. Google vs Oracle made this more annoying than it used to be.
> Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.
If you follow CRRE, it's not a copy, full stop, even if it's somehow 1:1 identical. It's going to be JUDGED as a copy, because substantial similarity for nontrivial amounts of code means that you almost certainly stepped outside of the clean room process and it no longer functions as a defense, but if you did follow CRRE, then it's platonically not a copy.
> What the chardet maintainers have done here is legally very irresponsible.
I agree with this, but it's probably not as dramatic as you think it is. There was an issue with a free Japanese font/typeface a decade or two ago that was accused of mechanically (rather than manually) copying the outlines of a commercial Japanese font. Typeface outlines aren't copyrightable in the US or Japan, but they are in some parts of Europe, and the exact structure of a given font is copyrightable everywhere (e.g. the vector data or bitmap field for a digital typeface, as opposed to the idea of its shape). What was the outcome of this problem? Distros stopped shipping the font and replaced it with something vaguely compatible. Was the font actually infringing? Probably not, but better safe than sorry.
I don't believe this, and I doubt that the sense of copying in copyright law is so literal. For instance, if I generated the exact text of a novel by looking for hash collisions, or by producing random strings of letters, or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it. There need not have been any copy anywhere near me for this to happen. Whether it is likely or not depends on the technique used - naive techniques make this very unlikely, but techniques can improve.
It is also true that similarity does not imply copying - if you and I take an identical photograph of the same skyline, I have not copied you and you have not copied me, we have just fixed the same intangible scene into a medium. The true subjective test for copying is probably quite nuanced; I am not sure whether it is triggered in this case, but I don't think "clean room LLMs" are a panacea either.
> dirty phase produces a specification ... it is NOT defined as producing an API
This does not really sound like "the opposite of correct". APIs are usually not copyrightable; the truth is of course more complicated. If you are happy to replace "API" with "uncopyrightable specification" then we can probably agree and move on.
> it's probably not as dramatic as you think it is
In reality I am very cynical and think nothing will come of this, even if there are verbatim snippets in the produced code. People don't really care very much, and copyright cases that aren't predicated on millions of dollars do not survive the court system very long.
It is actually that literal, really.
> For instance, if I generated the exact text of a novel by looking for hash collisions,
This is a copyright violation because you're using the original to construct the copy. It's not a pure RNG.
> or by producing random strings of letters,
This wouldn't be a copyright violation, but nobody would believe you.
> or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it.
This would probably be a copyright violation.
You probably think that this is hypothetical, but problems like this do actually go to court all the time, especially in the music industry, where people try to enforce copyright on melodies that have the informational uniqueness of an eight-word sentence.
> APIs are usually not copyrightable,
This was commonly believed among developers for a long time, but it turned out to not be true.
> This does not really sound like "the opposite of correct".
The important part is that information about the implementation can absolutely be in the spec without necessarily being copyrightable (and in real world clean room RE, you end up with a LOT of implementation details). You were saying the opposite, that it was a spec of the API as opposed to a spec of the implementation.
What color are your bits? That's all the law cares about.
The first sentence is the title of an essay.
a bunch of people get together, rewrite something while making a pinky promise not to look at the original source code
guaranteeing the premise is basically impossible, it sounds like some legal jester dance done to entertain the already absurd existing copyright laws
Clean room implementations are a jester dance around the judiciary. The whole point is to avoid legal ambiguity.
You are not required to do this by law, you are doing this voluntarily to make potential legal arguments easier.
The alternative is going over the whole codebase in question and arguing basically line by line whether things are derivative or not in front of a judge (which is a lot of work for everyone involved, subjective, and uncertain!).
I've always taken "clean room" to be the kind of manufacturing clean room (sealed/etc). You're given a device and told "make our version". You're allowed to look, poke, etc but you don't get the detailed plans/schematics/etc.
In software, you get the app or API and you can choose how to re-implement.
In open source, yes, it seems like a silly thing and hard to prove.
This is incorrect, and thinking this can get you sued.
https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...
Per your link, the Supreme Court's thinking on "structure, sequence and organization" (Oracle's argument why Google shouldn't even be allowed to faithfully produce a clean-room implementation of an API) has changed since the 1980s out of concern that using it to judge copyright infringement risks handing copyright holders a copyright-length monopoly over how to do a thing:
> enthusiasm for protection of "structure, sequence and organization" peaked in the 1980s [..] This trend [away from "SS&O"] has been driven by fidelity to Section 102(b) and recognition of the danger of conferring a monopoly by copyright over what Congress expressly warned should be conferred only by patent
The Supreme Court specifically recognised Google's need to copy the structure, sequence and organization of Java APIs in order to produce a cleanroom Android runtime library that implemented Java APIs so that existing Java software could work correctly with it.
Similarly, see Oracle v. Rimini Street (https://cdn.ca9.uscourts.gov/datastore/opinions/2024/12/16/2...) where Rimini Street has been producing updates that work with Oracle's products, and Oracle claimed this made them derivative works. The Court of Appeals decided that no, the fact A is written to interoperate with B does not necessarily make A a derivative work of B.
When a developer reimplements a completely new version of code from scratch, working from an understanding alone, the new implementation should generally be an improvement on the original source code, not its equal.
In today’s world, letting LLMs replicate anything will generate merely average code and, unless well managed, create equivalent or even more bloat anyway.
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
> They did it by making the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch.
Finding a middle ground of building a roadmap to refactoring your way forward is often much better.
Appreciate the Joel link, nice to see that kind of stuff again.
With that being said, if it's the same small team that built the first version, there can be a calculated risk to driving a refactor towards a rewrite under the right conditions. I say this because I have been able to do it under those conditions a few times; it still remains very risky. If it's a new or different team later on trying to rewrite, all bets are off anyways.
We have to remember 70% of software projects fail at the best of times, independent of rewrites.
Perhaps the maintainer wants to force the issue?
> Any downstream user of the library is at risk of the license switching from underneath them.
Checking the license of the transitive closure of your dependencies is table stakes for using them.
I doubt it, and I don't see any evidence that's what they're doing. There are probably better ways, if that's what they want.
> Checking the license of the transitive closure of your dependencies is table stakes for using them.
Checking the license of the transitive closure of your dependencies is only feasible when the library authors behave responsibly.
i.e. a re-implementation
which can either
- be still derived work, i.e. seen as you just obfuscating a copyright violation
- be a new work doing the same
nothing prevents an AI from producing a spec based on a API, API documentation and API usage/fuzzing and then resetting the AI and using that spec to produce a rewrite
I mean "doing the same" is NOT covered by copyright protection, you need patent law for that. Except even with patent law you protect innovations/concepts, not the exact implementation details. Which means that even if there are software patents (theoretically (1)), most things done in software wouldn't be patentable (as they are just implementation details, not inventions).
(1): I say theoretically because there is a very long track record of a lot of patents being granted which really should never have been granted. This, combined with the high cost of invalidating patents, has caused a great deal of economic damage.
Ted Nelson was years ahead of the future where we really needed his Xanadu to keep track of fractional copyright. Likely if we had such a mechanism, and AI authors respected it then we would be able to say that your work is derived from 3000 other original works and that you added 6 lines of new code.
AI/ML is complex, so as a simpler analogy: If I watch The Simpsons, and I create an amusing infographic of how often Homer says "D'oh!" over time, my infographic would be an original work. AI training follows the same principle.
> AI training follows the same principle.
If you really believe that then we can't have a meaningful conversation about this, that's not even ELI5 territory, that's just disconnected. You should be asking questions, not telling people how it works.
In fact we could make this concrete: use the model as the prediction stage in a compressor, and compress gcc with it. The residual is the extent to which it doesn't contain gcc.
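As a toy illustration of that measurement (a character-bigram model standing in for the LLM; the principle is the same: the better the model predicts a work, the fewer residual bits are needed to reconstruct it):

```python
import math
from collections import Counter, defaultdict

def bigram_model(corpus: str):
    """Character-bigram model: P(next char | current char), add-one smoothed."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    alphabet = sorted(set(corpus))
    def prob(ctx: str, ch: str) -> float:
        c = counts[ctx]
        return (c[ch] + 1) / (sum(c.values()) + len(alphabet))
    return prob

def residual_bits(text: str, prob) -> float:
    """Ideal compressed size of `text` under the model (Shannon code length).
    A model that has effectively memorized `text` assigns it probability
    near 1, and the residual approaches zero; a model that hasn't leaves
    most of the bits in the residual."""
    return sum(-math.log2(prob(a, b)) for a, b in zip(text, text[1:]))
```

The same measurement works with an LLM's token probabilities in place of `prob`: compress the gcc source with the model as predictor, and the size of the residual is exactly the extent to which the model does not contain gcc.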
https://osyuksel.github.io/blog/reconstructing-moby-dick-llm...
I see a test where one model managed to reproduce a paragraph with 85% accuracy, given 3 input paragraphs, less than 50% of the time.
So it can't even produce 1 paragraph given 3 as input, and it can't even get close half the time.
"Contains Moby Dick" would be something like you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that when given passages can do an okay job at predicting a sentence or two, but otherwise quickly diverges.
Getting close less than half the time given three paragraphs as input still sounds like red-handed copyright infringement to me.
If I sample a copyrighted song in my new track, clip it, slow it down, and decimate the bit rate, a court would not let me off the hook.
It doesn't matter how much context you push into these things. If I feed them 50% of Moby Dick and they produce the next word, and I can repeatedly do that to produce the entire book (I'm pretty sure the number of attempts is wholly irrelevant: we're impossibly far from monkeys on typewriters) then we can prove the statistical model encodes the book. The further we are from that (and the more we can generate with less) then the stronger the case is. It's a pretty strong case!
> If I feed them 50% of Moby Dick and they produce the next word and I can repeatedly do that to produce the entire book... then we can prove the statistical model encodes the book.
It can't because it doesn't. That's what it means to say it diverges.
The "number of attempts" is you cheating. You're giving it the book when you let it try again word by word until it gets the correct answer, and then claiming it produced the book. That's exactly the residual that I said characterizes the extent to which it doesn't know the book. Trivially, no matter how bad the model is, if you give it the residual, it can losslessly compress anything at all.
If you had a simple model that just predicts next word given current word (trained on word pair frequency across all English text, or even all text excluding Moby Dick), and then give it retries until it gets the current word right, it will also quickly produce the book. Because it was your retry policy that encoded the book, not the model. Without that policy, it will get it wrong within a few words, just like these models do.
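That point can be made concrete: with an accept/reject oracle, even a model that knows nothing "reproduces" any text, because the verdicts themselves carry the information. A sketch (character-level, names invented for illustration):

```python
import random

def reproduce_with_oracle(target: str, sample, seed: int = 0):
    """Reproduce `target` one token at a time, resampling until an oracle
    (comparison against the target itself) accepts each guess.

    Returns the reconstruction and the number of accept/reject verdicts;
    that feedback channel, not the model, is what encodes the text.
    """
    rng = random.Random(seed)
    out, verdicts = [], 0
    for true_tok in target:
        while True:
            guess = sample(rng, out)
            verdicts += 1  # each verdict leaks at most one bit of the target
            if guess == true_tok:
                break
        out.append(guess)
    return "".join(out), verdicts
```

Even `sample = lambda rng, ctx: rng.choice(alphabet)`, a model with zero knowledge of the text, reconstructs the whole thing under this protocol, which is exactly why retry-until-match tells you nothing about what the model contains.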
If you had access to a model's top p selection then I'd bet the book is in there consistently for every token. Is it statistically significant? Might be!
I'm not cheating because the number of attempts is so low it's irrelevant.
If I were to take a copyrighted work and chunk it up into 1000 pieces and encrypt each piece with a unique key, and give you all the pieces and keys, would it still be the copyrighted work? What if I shave off the last bit of each key before I give them to you, so you have a 50% chance of guessing the correct key for each piece? What if I shave two bits? What if it's a million pieces? When does it become transformative or no longer infringing for me to distribute?
The answer might surprise you.
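To see why shaving bits changes so little, here is a sketch of the recovery side (one-byte XOR keys as a toy cipher; all names invented for illustration). Each shaved bit merely doubles the recipient's work, so distributing ciphertext plus almost-complete keys is informationally very close to distributing the work itself.

```python
def xor_byte(data: bytes, key: int) -> bytes:
    """'Encrypt'/decrypt a chunk with a one-byte XOR key (toy cipher)."""
    return bytes(b ^ key for b in data)

def recover(chunk_ct: bytes, key_hi: int, shaved: int, looks_right):
    """`key_hi` is the key with its low `shaved` bits withheld. Trying the
    2**shaved completions until the plaintext passes `looks_right` is all
    the 'work' the recipient has to do: four guesses for two shaved bits."""
    for low in range(2 ** shaved):
        pt = xor_byte(chunk_ct, (key_hi << shaved) | low)
        if looks_right(pt):
            return pt
    return None
```

With two bits shaved per key and a million chunks, the recipient does about four million cheap trials, which is nothing; the question is whether a court would care about the arithmetic at all, or just look at who supplied the information.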
Consider a password consisting of random words each chosen from a 4k dictionary. Say you choose 10 words. Then your password has log_2(4k)*10 entropy.
Now consider a validator that tells you when you get a word right. Then you can guess one word at a time, and your password strength is log_2(4k*10). Exponentially weaker.
You're constructing the second scenario and pretending it's the first.
Also in your 50% probability scenario, each word is 1 bit, and even 50-100 bits is unguessable. A 1000 word key where each word provides 1 bit would be absurdly strong.
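The two strengths in that comparison, worked out (using the 4k-word dictionary and 10 words from above):

```python
import math

words = 4000  # dictionary size ("4k")
n = 10        # words in the password

# Guessing all at once: entropies add, so strength is exponential in n.
all_at_once = math.log2(words) * n   # ~ 119.7 bits

# A per-word validator turns it into n independent 4k-way puzzles,
# so the attacker's work only adds: exponentially weaker.
word_by_word = math.log2(words * n)  # ~ 15.3 bits
```

About 120 bits versus about 15: the validator collapses the search from one space of size 4000^10 into ten spaces of size 4000 each.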
I wonder what the results would be if I spent time to train a model up from scratch without any such constraints. But I'm much too busy with other stuff right now, but that would be an interesting challenge.
These companies just don't want to deal with people complaining that it reproduces something when they don't understand that they're literally giving it the answer.
For a fan fiction episode that is different from all official episodes, you may cross your fingers.
For a remake of one of the episodes with a different camera angle and similar dialog, I expect that you will run into problems.
I beg to differ. Please examine any of my recent codebases on github (same username); I have cleanroom-reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now, right here, so here we go...)
https://github.com/pmarreck?tab=repositories&type=source
I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.
With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
Researchers have shown that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.
Kinda weird argument; in their research (https://forum.gnoppix.org/t/researchers-extract-up-to-96-of-...) the LLM was explicitly asked to reproduce the book. There are people who can do so without LLMs; by this logic, everything they write is a copyright infringement of every book they can reproduce.
> Yes if you are solving the exact problem that the original code solved and that original code was labeled as solving that exact problem then that’s very good reason for the LLM to produce that code.
I think you're overestimating LLM ability to generalize.
My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.