You’re missing the point. An LLM is not going to memorise a whole book just because it’s seen a few copies. An LLM might be able to memorise the Bible in particular simply because Bible quotes are everywhere. There is a vast difference between being able to find a handful of copies online and having a text constantly quoted everywhere that humans communicate. Bible quotes get literally everywhere. People put them on bumper stickers, tattoo them on themselves, put them in their email signatures, etc. Bible quotes are so omnipresent that they have become part of our language: a lot of idioms people use every day come from the Bible.

The Bible isn’t just a book, it’s been a massive part of human culture for millennia, to the point of it shaping language itself. LLMs might be able to memorise the Bible, but it’s not because they can memorise books, it’s because the Bible is far more than just a book.

I went to check, and it seems to work fine for plenty of other public domain books: The Picture of Dorian Gray, Pride and Prejudice, and what have you. I can ask for x number of paragraphs from a specific part of the book and such.

I doubt every part of those books gets quoted everywhere by number, the way Bible verses might be. For books that only recently entered the public domain, it seems to be overly cautious because of the retroactively applied filtering: it refuses if it suspects there might be a single country where copyright still applies.

I can’t reproduce that. What model were you using and what prompt?