Hmm, training on a book’s text smears the content all over the weights, merging it with all other texts. The original text isn’t intentionally supposed to be reproducible in any larger part (although IIRC models were able to emit fairly large chunks verbatim).

By contrast, training on a model's behavior purportedly replicates that behavior approximately. It gets replicated intentionally, as a whole.

IANAL, but I see a significant difference: intentionally copying a significant part as a whole into a competing product surely shouldn’t fit under the legal concept of fair use, regardless of whether scanning books for LLM training does.

Whether such things (behaviors) are copyrightable - and whether they should be - is another interesting question. Those aren’t algorithms or databases (stuff clearly and explicitly covered in many copyright laws), those are human expectation models, something like how we train animals or teach our own.

reply
It's the exact same training process for both of your examples. I don't really see how you can claim books are not replicated, but that output from other LLMs is.
reply
> Hmm, training on a book’s text smears the content all over the weights, merging it with all other texts. The original text isn’t intentionally supposed to be reproducible in any larger part (although IIRC models were able to emit fairly large chunks verbatim).

I agree with that; however, that doesn't make the output copyrightable.

I think these AI companies live in a legal fantasy where they can take any content they want, put it into the mixer without caring about copyright and then what comes out of it is somehow copyrighted.

They have to pick one or the other: either the content's copyright taints the model, or it doesn't, but then the model isn't subject to copyright either.

> those are human expectation models, something like how we train animals or teach our own.

But more importantly, they're made by machines, and one of the requirements for copyright is the human factor.

reply
> I think these AI companies live in a legal fantasy where they can take any content they want, put it into the mixer without caring about copyright and then what comes out of it is somehow copyrighted.

The mixer you're talking about is what they seem to claim to be transformative use, no? Unless I'm misunderstanding something, it's not a legal fantasy.

reply
> The mixer you're talking about is what they seem to claim to be transformative use, no? Unless I'm misunderstanding something, it's not a legal fantasy.

If it's transformative use, then it's transformative use of ... what exactly? Copyrighted works? I think the law is pretty clear on what happens with transformative use of copyrighted works.

reply
Probably, yes. It's likely just a breach of their terms of service. You'll note that they're not suing them – they're trying to get the government to do their work for them.
reply