It is true that if two people happen to independently create the same thing, they each have their own copyright.

It is also true that in all the cases I know about where that has occurred, the courts took a very, very, very close look at the situation and required extensive evidence before accepting that there really wasn't any copying. It was anything but a "get out of jail free" card; it was in fact difficult and expensive, in proportion to the size of the works in question, to prove to the court's satisfaction that the two things really were independent. Moreover, in all the cases I know about, the works weren't actually identical, just really, really close.

No rational court could ever conclude, when someone claims a line-by-line copy of gcc as their own work, that they must have independently come up with it. The probability of that is one in ten to the "doesn't even remotely fit in this universe, so forget about it". The bar to overcoming that is simply impossibly high, unlike two songs that happen to share similar harmonies and melodies, since the space of "simple song" is exponentially more constrained than that of a compiler suite.

reply
All of this is moot for the purposes of LLMs, because it's almost certain that the LLMs were trained on the code base, and their output is therefore "tainted". You couldn't do this with humans either: clean-room design requires separate people for the spec and the implementation.
reply
That's the "but their case would still fail if the second author could show that their work was independent, no matter how improbable" part of the post you're responding to.
reply
One out of ten to the power of "forget about it" is not improbable, it's impossible.

I know it's a popular misconception that "impossible" means a strict, statistical, mathematical 0, but if you try to use that definition in real life it turns out to be pretty useless. It also tends to bother people that there isn't a bright shining line between "possible" and "impossible" the way there is between "0" and "strictly not 0", but all you can really do is deal with it. Wherever the line is, this is literally millions of orders of magnitude on the wrong side of it. Not a factor of millions, a factor of ten to the millions. It's not possible to "accidentally" duplicate a work of that size.
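Just to gesture at the scale involved, here's a back-of-the-envelope sketch (my own numbers, not anyone's measured figures): grant an absurdly generous 50% chance that an independent author happens to produce each individual token of gcc identically, and assume roughly 10 million lines at ~5 tokens per line.

```python
import math

# Back-of-the-envelope only: assume a (wildly generous) 50% chance of
# independently matching each token, over ~10M lines x ~5 tokens/line.
tokens = 10_000_000 * 5

# log10 of the probability that *every* token matches by chance
log10_p = tokens * math.log10(0.5)

print(f"P(identical by chance) ~ 10^{log10_p:.0f}")  # ~ 10^-15051500
```

Even under those cartoonishly favorable assumptions, the exponent itself has eight digits, which is what "ten to the millions of orders of magnitude" means in practice.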

reply
It sounds to me like you're responding to a different argument than the one they're actually making, and reading intent into it that isn't there.
reply
Thank you for providing a reference! I certainly admit that "very similar photographs are not copies," as the reference states. And certainly physical copying qualifies as copying in the sense of copyright. However, I still think copying can happen even if you never have access to a copy.

I suppose a different way of stating my position is that some activities that don't look like copying are in fact copying. For instance it would not be required to find a literal copy of the GCC codebase inside of the LLM somehow, in order for the produced work to be a copy. Likewise if I specify that "Harry Potter and the Philosopher's Stone is the text file with hash 165hdm655g7wps576n3mra3880v2yzc5hh5cif1x9mckm2xaf5g4" and then someone else uses a computer to brute force find a hash collision, I suspect this would still be considered a copy.
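To make the hash-pinning idea concrete (using real SHA-256 rather than the made-up digest above, and a short stand-in text):

```python
import hashlib

# A cryptographic digest pins down one specific byte sequence, yet the
# digest itself is a fixed 256 bits no matter how long the input is.
text_a = b"Mr. and Mrs. Dursley, of number four, Privet Drive..."
text_b = b"Mr. and Mrs. Dursley, of number four, Privet Drive .."  # tiny edit

digest_a = hashlib.sha256(text_a).hexdigest()
digest_b = hashlib.sha256(text_b).hexdigest()

print(len(digest_a))         # 64 hex characters (256 bits), always
print(digest_a != digest_b)  # True: any change scrambles the digest
```

The fixed output length is exactly why someone "finding" the full text from the digest alone would be so suspicious: the digest doesn't contain the book, so matching it byte-for-byte strongly implies access to the original.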

I think there is a substantial risk that the automatic translation done in this case is, at least in part, copying in the above sense.

reply
I fully agree with you. (A small information-theory nitpick with your example: the hash plus the program that regenerates the text would have to be at least as long as a perfectly compressed copy of Harry Potter and the Philosopher's Stone. If not, you've just invented a better compressor and are in the running for the Hutter Prize[1]! A hash and "decompressor" of the required length would likely be considered to embody the work.)

It's an interesting case. As I understand it, there is an ongoing debate within the AI research community as to whether neural nets are encoding verbatim blocks of information or creating a model which captures the "essence" or "ideas" behind a work. If they are capturing ideas, which are not copyrightable, it would suggest that LLMs can be used to "launder" copyright. In this case, I get the feeling that, for legal clarity, we would both say that the work in question (or works derived from it) should not be part of the training set or prompt, emulating a clean room implementation by a human. (Is that a fair comment?)

I've no direct experience here, but I would come down on the side of "LLMs are encoding (copyrightable) verbatim text", because others are reporting that LLMs do regurgitate word-for-word chunks of text. Is this always the case though? Do different AI architectures, or models that are less well fitted, encode ideas rather than quotes?

[1] https://en.wikipedia.org/wiki/Hutter_Prize

Edit: It would be an interesting experiment to use two LLMs to emulate a clean room implementation. The first is instructed to "produce a description of this program". The second, having never seen the program, in its prompt or training set, would be prompted to "produce a program based on this description". A human could vet the description produced by the first LLM for cleanliness. Surely someone has tried this, though it might be a challenge to get an LLM that is guaranteed not to have been exposed to a particular code base or its derivatives?
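A toy sketch of what that two-LLM protocol might look like, with `spec_llm` and `impl_llm` as hypothetical stand-ins for two models with disjoint training data (here they're hard-coded stubs so the flow actually runs; nothing below is a real model API):

```python
# Toy sketch of the two-stage clean-room idea. Both "LLMs" are stubs.

def spec_llm(program: str) -> str:
    # Stage 1: describe behaviour only; no source text may appear.
    return "Read two integers from stdin and print their sum."

def human_vets(description: str, program: str) -> bool:
    # A human reviewer checks the spec leaks no verbatim source lines.
    lines = [ln.strip() for ln in program.splitlines() if ln.strip()]
    return not any(ln in description for ln in lines)

def impl_llm(description: str) -> str:
    # Stage 2: a second model, never shown the original, reimplements.
    return "a, b = map(int, input().split())\nprint(a + b)"

original = "x, y = map(int, input().split())\nprint(x + y)"
spec = spec_llm(original)
assert human_vets(spec, original), "spec leaks verbatim source"
clean = impl_llm(spec)
print(clean != original)  # True: same behaviour, independent expression
```

The hard part, as noted above, is the guarantee that the second model was never exposed to the original or its derivatives; the vetting step only catches verbatim leakage through the spec, not contamination via training data.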

reply