It's perfectly reasonable to say it's okay for humans to do something but not okay for a computer program to do the same thing. We don't have to equate AI to humans; that's a choice, and usually a bad one.
It would not be reasonable to allow machines to do that at unlimited scale without restrictions.
(Hopefully the fossil fuels industry won't draw inspiration from the legal arguments made by AI companies...)
Is there any line past which it becomes unreasonable?
> It would not be reasonable to allow machines to do that at unlimited scale without restrictions.
If the machines were a replacement for a damaged respiratory system in a human, would it be reasonable?
What about if the machine were being used by a human to do something else that was important?
Where is the line where it becomes reasonable?
That's exactly the question we should be asking about AI and fair use.
Now, if you'll excuse me, I need to catch a metal shuttle that chucks itself through the air on wings.
The relevant extension of your analogy is: should birds be required to obey FAA rules? Or should plane factories be protected as nesting sites?
The mental calisthenics required to justify this stuff must be exhausting.
It's only exhausting if you think copyright ever reasonably settled the matter of ownership of knowledge and want to morally justify an incoherent set of outcomes that you personally favor. In practice it's primarily been a tool for the powerful party in any dispute to hammer others for disrupting their business model. I think that's pretty much the only way attempting to apply ownership semantics to knowledge or information can end up.
Knowledge consists of, roughly speaking, thoughts.
(a "justified true belief" - per https://plato.stanford.edu/entries/knowledge-analysis/ - is a kind of thought)
The "thinking" part of a "thinking being" - that also consists of thoughts.
If your knowledge is someone's property, you are someone's property.
A society where all knowledge is proprietary is a society of ubiquitous slavery.
Maybe multi-layered, maybe fractional, maybe with a smiley-face drawn on top.
Doesn't matter.
I mean, I don't think I could find a better description for following the derivatives of error in reproducing a set of works than creating a "derivative work".
I agree. However, the reverse is also likely true: it cannot currently be denied that learning in humans is different from learning in artificial neural networks, at least from the point of view of producing works that mix ideas/memes from several of the works processed/read. Surely, as the article says, copyright law talks exclusively about humans, not machines, not animals.
Edit: Or, put more pseudo-legally, that the created works infringe on the copyrights of the original human creators.
The above does not rest on, and does not imply, any claim about learning in artificial neural networks and humans being similar or dissimilar.
Copy/pasting at scale, yes
Code gets turned into tokens, and the model learns to predict the next most likely token.
The issue I see most people talk about is the scale at which it is learnt.
A human will learn from other people's code, but not from every person's code.
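To make that concrete, here's a toy sketch of "turn text into tokens, learn the next most likely token" using a simple bigram counter. Real models use subword tokenizers and transformers trained by gradient descent, so the tokenizer and corpus here are purely illustrative:

```python
from collections import Counter, defaultdict

# Toy illustration of "turn code into tokens, learn the next most likely
# token." A real LLM uses subword tokenization (e.g. BPE) and a transformer;
# this bigram counter only shows the shape of the objective.

corpus = [
    "def add(a, b): return a + b",
    "def sub(a, b): return a - b",
]

def tokenize(text: str) -> list[str]:
    # Crude whitespace tokenizer standing in for BPE.
    return text.split()

# Count, for every token, which token tends to follow it.
following: dict[str, Counter] = defaultdict(Counter)
for line in corpus:
    toks = tokenize(line)
    for cur, nxt in zip(toks, toks[1:]):
        following[cur][nxt] += 1

def most_likely_next(token: str):
    # "Prediction" here is just the argmax of the counts.
    counts = following.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(most_likely_next("return"))  # -> "a" (seen after "return" in both lines)
```

The scale point is exactly that: this table was built from two lines, while a frontier model builds the equivalent from roughly every line of public code it can get.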
Copyright law is very clear that if a machine does it, the original copyright on the input is kept. This is why your distributed binaries are still copyrighted: the machine transformed the source code, very significantly, into a binary, and the copyright is maintained throughout.
It would be inconsistent for the courts to suddenly decide that "actually, this specific type of machine transformation is innovative."
I know this is generally really bad for the AI industry, so they just ignore it until a court tells them they can't anymore. And they might get away with it as I don't have faith that the courts will be consistent.
And the specific nature of autoregressive pretraining is that it is lossy compression. Good luck finding which copyrighted materials have made it into the final weights.
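For what it's worth, the compression framing can be made literal: a model's cross-entropy on a text is the bit cost an arithmetic coder driven by that model would pay to encode it. A toy unigram sketch of that accounting, where the training string and the smoothing are made up for illustration:

```python
import math
from collections import Counter

# The "language model = compressor" view: the cross-entropy a model assigns
# to a text is the bit cost an arithmetic coder driven by that model would
# pay to encode it. Toy unigram character model; a real LLM simply achieves
# far lower per-token costs, and whatever detail it can't pay for is lost.

train = "the quick brown fox jumps over the lazy dog"
counts = Counter(train)
total = sum(counts.values())

def bits_to_encode(text: str) -> float:
    # Sum of -log2 p(char) under the model; unseen chars get a floor prob.
    bits = 0.0
    for ch in text:
        p = counts.get(ch, 0.5) / (total + 1)
        bits += -math.log2(p)
    return bits

msg = "the lazy fox"
print(f"{bits_to_encode(msg):.1f} bits vs {len(msg) * 8} raw bits")
```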
Yup, it absolutely does. In fact, that's why you are still violating copyright law by using bittorrent even though each of the users is only giving out a small slice or shred of the original content.
The US grants a defense that covers cases like shredding, called "fair use", but that doesn't mean or imply that a copyright is void simply because of a fair use claim.
> And the specifics of autoregressive pretraining is that it is lossy compression.
That doesn't matter. Why would it? If I take a FLAC recording and convert it to an MP3, the fact that it was a lossy transform doesn't suddenly give me the legal right to distribute the MP3.
> Good luck finding which copyrighted materials have made it into the final weights.
That's what the NYT v. OpenAI lawsuit is all about. And for earlier models they could, in fact, pull out full NYT articles which proved they made it into the final weights.
Further, the NYT is currently in discovery, which means OpenAI must open up to the NYT what goes into their weights. A move that, if OpenAI loses, other litigants can also use, because there's a real good shot that OpenAI included their works in the dataset too.
Well, it's not the first time the law has contradicted the laws of nature (for the entertainment of future generations). BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.
> in fact, pull out full NYT articles
That's when they used their knowledge of the exact text they wanted to "retrieve" in order to get the text? It wouldn't be so efficient with a random number generator, but it's doable.
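Roughly, yes: the published extractions seed the model with the exact opening of a suspected-memorized document and decode greedily. A sketch of that probe using Hugging Face transformers, with gpt2 and a public-domain passage standing in for the actual model and articles in question:

```python
# Prefix-prompting probe: seed the model with the exact opening of a document
# you suspect is memorized, decode greedily, and check how much of the known
# continuation comes back verbatim.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

known_prefix = "Four score and seven years ago our fathers brought forth"
known_rest = " on this continent, a new nation, conceived in Liberty"

inputs = tokenizer(known_prefix, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])

# Crude memorization check: shared prefix length with the real continuation.
overlap = 0
for a, b in zip(continuation, known_rest):
    if a != b:
        break
    overlap += 1
print(f"verbatim overlap: {overlap} chars of {len(known_rest)}")
```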
You can restore shredded documents with enough time and effort. And if you did that and started making photocopies, even if they are incomplete, you would run afoul of copyright law.
Bittorrent is a relevant example because it shows that shredding doesn't destroy copyright.
Remember, copyright is about the right to copy something. Simply shredding or destroying a thing isn't applicable to copyright. Nor is giving that thing away. What's applicable is when you start to actually copy the thing.
EDIT: I'm not saying that neural networks can't rote-learn extensive passages (it's an effect of data duplication). I'm saying that they are not designed to do that, and that it's possible to prevent it (as demonstrated by the latest models).
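One plausible way that prevention works, given that duplicated passages are what get rote-learned, is deduplicating the training data. A minimal exact-shingle sketch of the idea; real pipelines reportedly use MinHash or suffix arrays, and nothing here is any particular lab's actual method:

```python
# Minimal exact-duplicate filter of the kind used to curb memorization:
# passages that repeat across the corpus are what models rote-learn, so drop
# documents whose shingles (here, 8-word windows) have been seen before.

def shingles(text: str, n: int = 8) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def dedup(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        sh = shingles(doc)
        if sh & seen:          # overlaps something already kept -> drop
            continue
        seen |= sh
        kept.append(doc)
    return kept
```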
The way I arrive at that is: imagine you add just one pixel of static to a video; that'd still be a copyright violation. Now imagine you keep adding random pixels. Eventually the whole video is just static, but at some point along the way it wasn't.
Now, would any media company or court sue over that? Probably not. But I believe it still falls under copyright (though maybe fair use?).
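The thought experiment is easy to make literal: blend a frame with static in steps and watch a crude "how much copy is left" measure fall, with no obvious point where the legal line should sit. A numpy sketch, with a random array standing in for the frame:

```python
import numpy as np

# Blend a frame with static, one step at a time. At alpha=0 it's plainly a
# copy; at alpha=1 it's pure noise; copyright has to draw its line somewhere
# on this continuum.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(720, 1280, 3), dtype=np.uint8)  # stand-in frame
static = rng.integers(0, 256, size=frame.shape, dtype=np.uint8)

for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    blended = ((1 - alpha) * frame + alpha * static).astype(np.uint8)
    # Mean absolute difference from the original; no legal threshold
    # corresponds to any particular value.
    print(alpha, np.abs(blended.astype(int) - frame.astype(int)).mean())
```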
The issue with neural networks is that they aren't people. Even when you point your LLM at a website and say "summarize this", the output of that summarization would be owned by the website itself, by nature of it being a machine-transformed work.
Remember, it's not just rote recitation that violates the law; any transformation counts as well. The fact that AI companies are preventing recitation doesn't really solve the problem that they are, in fact, transforming multiple copyrighted works into their responses.
What would violate copyright is if you took that rendered page, turned it into a JPEG, and then hosted that JPEG from your own servers. That's the copying that would run afoul of copyright law.
I have seen LLMs do all sorts of crap which was clearly reproduction of training material.
This is also why people are most impressed with how much better it is at reproducing boilerplate than at producing, say, imaginative new ideas.