upvote
I fully agree with you. (A small information-theory nitpick with your example: the hash and program together would have to be at least as long as a perfectly compressed copy of Harry Potter and the Philosopher's Stone. If not, you've just invented a better compressor and are in the running for the Hutter Prize[1]! A hash and "decompressor" of the required length would likely be considered to embody the work.)

It's an interesting case. As I understand it, there is an ongoing debate within the AI research community as to whether neural nets are encoding verbatim blocks of information or building a model that captures the "essence" or "ideas" behind a work. If they are capturing ideas, which are not copyrightable, that would suggest LLMs can be used to "launder" copyright. In this case, I get the feeling that, for legal clarity, we would both say that the work in question (or works derived from it) should not be part of the training set or prompt, emulating a clean-room implementation by a human. (Is that a fair comment?)

I've no direct experience here, but I would come down on the side of "LLMs are encoding (copyrightable) verbatim text", because others report that LLMs do regurgitate word-for-word chunks of text. Is this always the case, though? Do different AI architectures, or models that are less tightly fitted, encode ideas rather than quotes?

[1] https://en.wikipedia.org/wiki/Hutter_Prize

Edit: It would be an interesting experiment to use two LLMs to emulate a clean-room implementation. The first is instructed to "produce a description of this program". The second, having never seen the program in its prompt or training set, would be prompted to "produce a program based on this description". A human could vet the description produced by the first LLM for cleanliness. Surely someone has tried this, though it might be a challenge to find an LLM that is guaranteed never to have been exposed to a particular code base or its derivatives.
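As a rough sketch of the human-vetting step in that pipeline: before LLM B sees the description, you could mechanically flag any run of consecutive words copied verbatim from the source. This is a made-up helper, not a real tool, and it would only assist the manual review, not replace it:

```python
# Crude verbatim-leak check for a clean-room pipeline (hypothetical).
# Flags the description if it shares any run of `min_run` consecutive
# words with the original source text.

def vet_description(description: str, source: str, min_run: int = 5) -> bool:
    """Return True if no run of `min_run` consecutive words from
    `source` appears verbatim in `description`."""
    words = source.split()
    runs = {
        " ".join(words[i:i + min_run])
        for i in range(len(words) - min_run + 1)
    }
    return not any(run in description for run in runs)

source = "def add(a, b): return a + b # add two numbers and return the sum"
clean = "A function taking two arguments and yielding their total."
dirty = "It does: return a + b # add two numbers and more."

print(vet_description(clean, source))  # paraphrase: passes
print(vet_description(dirty, source))  # copies a verbatim run: fails
```

Five words is an arbitrary threshold; too short and it flags unavoidable phrases, too long and it misses short quoted fragments, so a reviewer would still read the description end to end.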

reply