It is enough to have read even parts of a work for something to be considered a derivative.
I would also argue that language models, which need gargantuan amounts of training material in order to work, can by definition only output derivative works.
It does not help that certain people in this thread (not you) edit their comments to backpedal and make the follow-up comments look illogical, but that is in line with their sleazy post-LLM behavior.
For IP rights, I'll buy that. Not as important when the question is capabilities.
> I would also argue that language models, which need gargantuan amounts of training material in order to work, can by definition only output derivative works.
For similar reasons, I'm not going to argue against anyone saying that all machine learning today doesn't count as "intelligent":
It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.
ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.
https://arxiv.org/pdf/2601.02671
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.
Because it _has_ been enough: if you can recall things, your implementation ends up not being "clean room", and it gets trashed by the lawyers who get involved.
I mean... It's in the name.
> The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.
If it can recall... then it is not a clean room implementation. Fin.