But it does encode it! Each subsequent token's probability space encodes the next word(s) of the book with a non-zero probability that is significantly higher than random noise.

If you had access to a model's top-p selection, then I'd bet the book is in there consistently for every token. Is it statistically significant? Might be!

I'm not cheating because the number of attempts is so low it's irrelevant.

If I were to take a copyrighted work and chunk it up into 1000 pieces and encrypt each piece with a unique key, and give you all the pieces and keys, would it still be the copyrighted work? What if I shave off the last bit of each key before I give them to you, so you have a 50% chance of guessing the correct key for each piece? What if I shave two bits? What if it's a million pieces? When does it become transformative or no longer infringing for me to distribute?

The answer might surprise you.

reply
Your test is more like the following:

Consider a password consisting of random words, each chosen from a 4k-word dictionary. Say you choose 10 words. Then your password has log_2(4000) * 10 ≈ 120 bits of entropy.

Now consider a validator that tells you when you get a word right. Then you can guess one word at a time, and your password strength is only log_2(4000 * 10) ≈ 15 bits. Exponentially weaker.

You're constructing the second scenario and pretending it's the first.

Also, in your 50% probability scenario, each piece contributes 1 bit, and without a validator even 50-100 bits is unguessable. A 1000-piece key where each piece provides 1 bit would be absurdly strong.
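A toy calculation (assuming the 4k-word dictionary and the 1000-piece scenario above) makes the gap concrete:

```python
import math

DICT_SIZE = 4000
NUM_WORDS = 10

# No validator: the whole password must be guessed at once.
full_entropy = math.log2(DICT_SIZE) * NUM_WORDS     # ~119.7 bits

# Per-word validator: words can be brute-forced one at a time,
# so the worst case is only DICT_SIZE * NUM_WORDS guesses.
oracle_entropy = math.log2(DICT_SIZE * NUM_WORDS)   # ~15.3 bits

# The 1000-piece, 1-shaved-bit-per-piece scenario, with no
# per-piece validity check: all combinations at once.
shaved_key_bits = 1000 * 1                          # 2**1000 combinations

print(f"no validator:       {full_entropy:.1f} bits")
print(f"per-word validator: {oracle_entropy:.1f} bits")
print(f"1000 shaved bits:   {shaved_key_bits} bits")
```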

reply
You're still missing the point. The numbers don't matter because it's copyright infringement as long as I can get the book out. As long as I know the key, or the seed, I can get the book out. In court, how would you prove it's not infringement?
reply
Because you put the book in. Again, this is measurable: compress the book with the model as the predictor. The residual is you having to give it the answer; it's literally you telling it the book.
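The "model as predictor" measurement can be sketched with a toy character-level model (a hedged illustration; a real test would use the LLM's actual token probabilities). The residual is the number of bits needed to encode the text given the model's predictions, i.e. its cross-entropy: a text the model has effectively memorized costs almost nothing to encode, while a text it never saw costs far more.

```python
import math
from collections import Counter, defaultdict

def residual_bits(text, model_text, alphabet):
    """Bits needed to encode `text` using a Laplace-smoothed bigram
    model trained on `model_text` (a stand-in for an LLM predictor)."""
    counts = defaultdict(Counter)
    for a, b in zip(model_text, model_text[1:]):
        counts[a][b] += 1
    bits = 0.0
    vocab = len(alphabet)
    for a, b in zip(text, text[1:]):
        total = sum(counts[a].values())
        p = (counts[a][b] + 1) / (total + vocab)  # add-one smoothing
        bits += -math.log2(p)
    return bits

book = "the quick brown fox jumps over the lazy dog " * 50
other = "colorless green ideas sleep furiously tonight " * 50
alphabet = set(book) | set(other)

memorized = residual_bits(book, book, alphabet)  # model "contains" the text
novel = residual_bits(other, book, alphabet)     # model never saw this text

print(f"residual for memorized text: {memorized:.0f} bits")
print(f"residual for unseen text:    {novel:.0f} bits")
```

The memorized text compresses to a small residual; the unseen one doesn't, which is the asymmetry the measurement relies on.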
reply
The point is that the AIs themselves and their backers are on the record saying that their models could reproduce copyrighted works in their entirety, but that there are countermeasures in place to stop them from doing so.

I wonder what the results would be if I spent the time to train a model from scratch without any such constraints. I'm much too busy with other stuff right now, but that would be an interesting challenge.

reply
Yeah, just like a star could appear inside the Earth from quantum pair production at any given moment. But realistically, it can't. And you can't even show a test where any model gets more than a few tokens in a row correct.
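The star analogy can be put in numbers (a toy calculation, assuming an optimistic per-token probability): the chance of reproducing a long text unprompted is the product of the per-token probabilities, which collapses even when each individual token is very likely.

```python
import math

book_tokens = 100_000   # rough length of a novel in tokens
p_per_token = 0.99      # optimistic chance the correct token is sampled

# log2-probability of getting every token right in a row
log2_p = book_tokens * math.log2(p_per_token)
print(f"P(whole book) = 2^{log2_p:.0f}")  # astronomically small
```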

These companies just don't want to deal with people complaining that it reproduces something when they don't understand that they're literally giving it the answer.

reply
You do realize you are now arguing against your own case don't you?
reply