The main analogy is this one: you take a massive pile of copyrighted works, cut them up into small sections and toss the whole thing in a centrifuge. Then, when prompted to produce a work, you use a statistical method to pull pieces of those copyrighted works back out of the centrifuge. Sometimes you may find that you are pulling pieces out in the order in which they went in, which after a certain number of tokens becomes a copyright violation.
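A toy sketch of that centrifuge, purely illustrative (a trigram model over one sentence, nothing like a production LLM): train it on a single text, prompt it with the opening tokens, and greedy sampling hands the source back in exactly the order the pieces went in.

```python
# Toy "centrifuge": a trigram model that can regurgitate its
# training text verbatim when prompted with its opening tokens.
from collections import Counter, defaultdict

def train_trigrams(tokens):
    """Count which token follows each (prev, cur) pair."""
    model = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)][c] += 1
    return model

def generate(model, seed, length):
    """Greedily emit the most frequent continuation at each step."""
    out = list(seed)
    for _ in range(length):
        counts = model.get((out[-2], out[-1]))
        if not counts:
            break  # no continuation seen in training
        out.append(counts.most_common(1)[0][0])
    return out

corpus = "the quick brown fox jumps over the lazy dog".split()
model = train_trigrams(corpus)
# Prompted with the first two tokens, the model replays the
# training text in the order the pieces went in.
generated = " ".join(generate(model, corpus[:2], 10))
print(generated)
```

Real models are vastly larger and sample probabilistically, but the failure mode sketched here (verbatim reproduction of a training sequence) is the same one the comment describes.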
This suggests there are some obvious ways in which AI companies could protect themselves from claims of infringement. But as far as I'm aware, not a single one has protections in place to ensure that their models do not materially reproduce any fraction of the input texts, other than recognizing prompts that explicitly ask them to do so.
So it won't produce the lyrics of 'Let It Be'. But it will happily write you mountains of prose that strongly resembles some of the inputs.
The fact that they have not put such protections in place tells you all you really need to know: they know that everything their bots spit out is technically derived from copyrighted works. They also have armies of lawyers and technical arguments to claim the opposite.
Sure, but that is completely unrelated to this discussion, which is about AI using code as input to produce similar code as output, not about AI being trained on code.
> not about AI being trained on code
The two are very directly connected.
The LLM would not be able to do what it does without being trained, and it was trained on the copyrighted works of others. Giving it a piece of code for a rewrite is a clear case of transformation no matter what, but the result now also rests on a mountain of other copyrighted code.
So now you're doubly in the wrong: you are willfully using AI to violate copyright. AI does not create original works, period.
It isn't clear how, or whether, an LLM is different from the brain, but we have all trained by looking at copyrighted source code at some point.
It's very clear: the one is a box full of electronics, the other is part of the central nervous system of a human being.
> but we all have training by looking at copyrighted source code at some time.
That may be so, but not usually the copyrighted source code that we are trying to reproduce. And that's the bit that matters.
You can attempt to whitewash it but at its core it is copyright infringement and the creation of derived works.
The single word "training" is here being used to describe two very different processes; what an LLM does with text during training is at basically every step fundamentally distinct from what a human does with text.
Word embedding and gradient descent just aren't anything at all like reading text!
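To make that distinction concrete, here is a minimal sketch of the two operations just named, in pure Python (a toy skip-gram-style update; the vocabulary, dimensions, and learning rate are all illustrative, not from any real model). A "word embedding" is just a vector of numbers, and "gradient descent" is just repeatedly nudging those numbers to shrink a loss; no step resembles reading.

```python
# Toy illustration: training = nudging word vectors with gradients.
import math
import random

random.seed(0)
DIM = 8
vocab = ["copyright", "infringement", "code", "rewrite"]
# Word embedding: each word starts as a small random vector.
emb = {w: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for w in vocab}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sgd_step(center, context, label, lr=0.1):
    """One gradient-descent update: push the pair's dot product
    toward `label` (1 = the words co-occur, 0 = they don't)."""
    u, v = emb[center], emb[context]
    pred = sigmoid(sum(a * b for a, b in zip(u, v)))
    g = pred - label  # gradient of logistic loss w.r.t. the dot product
    for i in range(DIM):
        u[i], v[i] = u[i] - lr * g * v[i], v[i] - lr * g * u[i]

# "Training" on one co-occurring pair: repeat the nudge many times.
for _ in range(200):
    sgd_step("copyright", "infringement", 1.0)

# After training, the model's score for the pair is close to 1.
score = sigmoid(sum(a * b for a, b in
                    zip(emb["copyright"], emb["infringement"])))
print(round(score, 2))
```

All the model "learned" is that two vectors should point the same way; there is no comprehension anywhere in the loop, which is the point being made above.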
I have a lot of music in my head that I've listened to for decades. I could probably replicate it note-for-note given the right gear and enough time, but that would not make my output a copyrightable work. Yet if I doodle for three minutes on the piano, even if it is going to be terrible, that is an original work.
Says who? The US ruling the article refers to does not cover this.
It is different in other countries. Even if US law says it is public domain (which is probably not the case), you had better not distribute it internationally. For example, UK law explicitly says a human is the author of machine-generated content: https://news.ycombinator.com/item?id=47260110