You are allowed to quote from copyrighted works without needing permission. Trying to assert copyright because of a quote of, say, a mere 60 words in length would get you thrown out of any judge’s court.

It was shown, in this case, that the LLMs wouldn’t generate accurate quotes more than 60 words in length.

This is not comparable to encoding a full video file.

I think the better analogy is if you had someone with a superhuman, but not perfect memory read a bunch of stuff, then you were allowed to talk to the person about the things they’d read, does that violate copyright? I’d say clearly no.

Then what if their memory is so good, they repeat entire sections verbatim when asked. Does that violate it? I’d say it’s grey.

But that’s a very specific case - reproducing large chunks of owned work is something that can be quite easily detected and prevented, and I’m almost certain the frontier labs are already doing this.
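To make "easily detected" a bit more concrete, here is a rough sketch of one way such a check could work: find the longest run of consecutive words that a model output shares with a known source text, and flag it if the run is too long. This is only an illustration. The function names are made up, splitting on whitespace is a simplifying assumption, the 60-word limit is just the figure mentioned in this thread, and nothing here is claimed to be what any lab actually runs.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(output: str, source: str) -> int:
    """Length, in words, of the longest word-for-word run shared by both texts."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    out_words = output.lower().split()
    src_words = source.lower().split()
    # autojunk=False so common words are not silently ignored on long inputs.
    matcher = SequenceMatcher(None, out_words, src_words, autojunk=False)
    match = matcher.find_longest_match(0, len(out_words), 0, len(src_words))
    return match.size


def exceeds_quote_limit(output: str, source: str, max_words: int = 60) -> bool:
    """True if the output copies more than max_words consecutive words."""
    return longest_verbatim_run(output, source) > max_words


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    output = "he said the quick brown fox jumps high"
    print(longest_verbatim_run(output, source))
    print(exceeds_quote_limit(output, source, max_words=4))
```

A real system would need to check against a large corpus rather than a single source, which is a much harder indexing problem than this pairwise comparison suggests.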

So I think it’s just very unclear - the reality is this is a novel situation, and the job of the courts is now to basically decide what’s allowed and what’s not. But the rationale shouldn’t be ‘this can’t be fair use, it’s just compression’. Because it’s clearly something fundamentally different and existing laws just aren’t applicable imo.

This is a strawman, in the sense that it is not accurate to think about AI models as a compressed form of their training data, since the lossiness is so high. One of the insights from the trial is that LLMs are particularly poor at reproducing original texts (60 tokens was the max found in this trial, IIRC). This is taken into account when considering fair use under the fourth fair use factor: the effect of the use on the market for the original work. It's hard to make an argument that LLMs are replacing long-form text works, since they have so much trouble actually reproducing them.

There's a whole related topic here in the realm of news (since it's shorter form), but it also has a much shorter half-life. Not sure what I think there yet.
