> AI training follows the same principle.
If you really believe that then we can't have a meaningful conversation about this. That's not even ELI5 territory; it's just disconnected. You should be asking questions, not telling people how it works.
In fact we could make this concrete: use the model as the prediction stage in a compressor, and compress gcc with it. The residual is the extent to which it doesn't contain gcc.
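To make the residual idea concrete, here's a toy sketch of my own (a character-bigram predictor standing in for the model, and short made-up strings standing in for gcc): the residual is the number of bits an ideal arithmetic coder would spend encoding the target given the model's next-character probabilities. If the model "contains" the target, the residual is small; if not, it approaches the target's raw size.

```python
from collections import Counter
from math import log2

def residual_bits(train: str, target: str) -> float:
    """Bits needed to encode `target` using a bigram predictor fit on `train`.
    An ideal arithmetic coder spends -log2 p(next | current) per character."""
    pairs = Counter(zip(train, train[1:]))
    totals = Counter(train[:-1])
    vocab = sorted(set(train + target))

    def p(cur, nxt):
        # Laplace smoothing so unseen transitions still get nonzero probability.
        return (pairs[(cur, nxt)] + 1) / (totals[cur] + len(vocab))

    bits = log2(len(vocab))  # encode the first character uniformly
    for cur, nxt in zip(target, target[1:]):
        bits += -log2(p(cur, nxt))
    return bits

# A target the model has effectively memorized vs. one it has never seen:
memorized = residual_bits("call me ishmael " * 50, "call me ishmael")
novel = residual_bits("call me ishmael " * 50, "moby dick sank the pequod")
print(round(memorized), round(novel))  # memorized residual is far smaller
```

The same measurement works with any predictor in the prediction slot; the residual, not the raw output, is what quantifies how much of the work the model actually contains.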
https://osyuksel.github.io/blog/reconstructing-moby-dick-llm...
I see a test where one model managed an 85% reproduction of a paragraph, given 3 input paragraphs, less than 50% of the time.
So it can't even produce 1 paragraph given 3 as input, and it can't even get close half the time.
"Contains Moby Dick" would be something like you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that when given passages can do an okay job at predicting a sentence or two, but otherwise quickly diverges.
Getting close less than half the time given three paragraphs as input still sounds like red-handed copyright infringement to me.
If I sample a copyrighted song in my new track, clip it, slow it down, and decimate the bit rate, a court would not let me off the hook.
It doesn't matter how much context you push into these things. If I feed them 50% of Moby Dick and they produce the next word, and I can repeatedly do that to produce the entire book (I'm pretty sure the number of attempts is wholly irrelevant: we're impossibly far from monkeys on typewriters) then we can prove the statistical model encodes the book. The further we are from that (and the more we can generate with less) then the stronger the case is. It's a pretty strong case!
> If I feed them 50% of Moby Dick and they produce the next word and I can repeatedly do that to produce the entire book... then we can prove the statistical model encodes the book.
It can't because it doesn't. That's what it means to say it diverges.
The "number of attempts" is you cheating. You're giving it the book when you let it try again word by word until it gets the correct answer, and then claiming it produced the book. That's exactly the residual that I said characterizes the extent to which it doesn't know the book. Trivially, no matter how bad the model is, if you give it the residual, it can losslessly compress anything at all.
If you had a simple model that just predicts the next word given the current word (trained on word-pair frequency across all English text, or even all text excluding Moby Dick), and then gave it retries until it got the current word right, it would also quickly produce the book. That's because your retry policy encoded the book, not the model. Without that policy, it gets it wrong within a few words, just like these models do.
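That argument can be demonstrated with a deliberately dumb model (a toy of my own, using a tiny stand-in corpus): a bigram next-word predictor diverges from the target almost immediately when run greedily, yet "reproduces" any target you like once you allow ranked retries, because the retry counts themselves encode the target.

```python
from collections import Counter, defaultdict

# Toy next-word model: rank followers of the current word by frequency,
# fit on a corpus that is NOT the target text.
corpus = "the quick brown fox jumps over the lazy dog and the cat sat on the mat".split()
follow = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    follow[cur][nxt] += 1

def ranked_guesses(word):
    # Candidate next words, most frequent first; fall back to the rest of the vocab.
    cands = [w for w, _ in follow[word].most_common()]
    return cands + [w for w in dict.fromkeys(corpus) if w not in cands]

target = "the dog sat on the fox".split()

# Greedy generation (no retries): diverges from the target almost immediately.
out, cur = [target[0]], target[0]
for _ in range(len(target) - 1):
    cur = ranked_guesses(cur)[0]
    out.append(cur)

# "Retry until correct": the oracle, not the model, supplies the text.
retries = 0
for cur, nxt in zip(target, target[1:]):
    retries += ranked_guesses(cur).index(nxt) + 1

print(" ".join(out))  # not the target
print(retries)        # the sequence of retry counts is an encoding of the target
```

Given the list of per-word retry counts, anyone with the same model can replay the target exactly; the information lives in the counts, not the model.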
If you had access to a model's top-p selection, I'd bet the book is in there consistently for every token. Is that statistically significant? It might be!
I'm not cheating because the number of attempts is so low it's irrelevant.
If I were to take a copyrighted work and chunk it up into 1000 pieces and encrypt each piece with a unique key, and give you all the pieces and keys, would it still be the copyrighted work? What if I shave off the last bit of each key before I give them to you, so you have a 50% chance of guessing the correct key for each piece? What if I shave two bits? What if it's a million pieces? When does it become transformative or no longer infringing for me to distribute?
The answer might surprise you.
Consider a password consisting of random words, each chosen from a 4k-word dictionary. Say you choose 10 words. Then your password has log2(4000) × 10 ≈ 120 bits of entropy.
Now consider a validator that tells you when you get a word right. Then you can guess one word at a time, and your password strength drops to log2(4000 × 10) ≈ 15 bits. Exponentially weaker.
You're constructing the second scenario and pretending it's the first.
Also, in your 50% probability scenario each piece is worth 1 bit, and even 50-100 bits is unguessable. A key of 1000 pieces at 1 bit each would be absurdly strong.
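The entropy arithmetic above can be checked directly; the 4000-word dictionary and 10-word password are the numbers from the analogy:

```python
from math import log2

words, n = 4000, 10

# Guessing all words at once: the full keyspace.
independent = log2(words) * n   # ≈ 119.7 bits

# With a per-word validator you attack one word at a time,
# so the attack cost is words * n guesses, not words ** n.
with_oracle = log2(words * n)   # ≈ 15.3 bits

# The chunked-file scenario without a per-piece oracle:
# 1000 pieces at 1 shaved bit each is a flat 1000-bit search.
pieces_bits = 1 * 1000

print(round(independent, 1), round(with_oracle, 1), pieces_bits)
```

The gap between the two password figures is the whole point: the validator turns an exponential search into a linear one, which is exactly what a retry-until-correct policy does.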
I wonder what the results would be if I spent the time to train a model from scratch without any such constraints. I'm much too busy with other stuff right now, but it would be an interesting challenge.
These companies just don't want to deal with people complaining that it reproduces something, when those people don't understand that they're literally giving it the answer.
For a fan-fiction episode that differs from all official episodes, you can keep your fingers crossed.
For a remake of one of the episodes with a different camera angle and similar dialogue, I expect you'll run into problems.