by JimDabell 225 days ago

by dns_snek 225 days ago

> That’s clearly below the proportionality threshold for copyright to matter.

This type of reasoning keeps coming up with seemingly zero consideration for why copyright actually exists. The goal of copyright, under US law, is "To promote the progress of science and useful arts".

The goal of companies creating these LLMs is to supersede the use of source material they draw from, like books. You use an LLM because it has all the answers without having to spend the money compensating the original authors, or put in the work digesting it yourself; that's their entire value proposition.

Their end game is to create a product so good that nobody has a reason to ever buy a book again. A few hours after you publish your book, the LLM will gobble it up and distribute the insights contained within to all of their users for free. "It's fair use", they say. There won't be any economic incentive to write books at that point, and so "the progress of science and useful arts" will crawl to a halt. Copyright defeated.

If LLM companies are allowed to produce market substitutes of original works then the goal of copyright is being defeated on a technicality and this ought to be a discussion about whether copyright should be abolished completely, not a discussion about whether big tech should be allowed to get away with it.

by JimDabell 225 days ago

> The goal of companies creating these LLMs is to supersede the use of source material they draw from, like books.

Nobody is going to stop buying Harry Potter books because they can get an LLM to spit out ~50 words from one of the books. The proportionality factor is very clearly relevant here.

> If LLM companies are allowed to produce market substitutes of original works

Did Meta publish a book written by an LLM?

> The goal of copyright, under US law, is "To promote the progress of science and useful arts".

I would consider training LLMs to be very much in line with those goals.

by dns_snek 220 days ago

> Nobody is going to stop buying Harry Potter books because they can get an LLM to spit out ~50 words from one of the books.

Not yet, but they'll stop buying books on niche technical subjects.

> Did Meta publish a book written by an LLM?

They don't need to publish a book to substitute original works. They substitute the original work every time they generate a response that is based on the book they substituted.

> I would consider training LLMs to be very much in line with those goals.

Because you're misunderstanding the premise. Original works are the ones that advance art and science. Those are the ones that are supposed to be protected by copyright.

by happa 225 days ago

Quoting Judge Alsup from his recent ruling in Bartz v. Anthropic.

> Instead, Authors contend generically that training LLMs will result in an explosion of works competing with their works — such as by creating alternative summaries of factual events, alternative examples of compelling writing about fictional events, and so on. This order assumes that is so (Opp. 22–23 (citing, e.g., Opp. Exh. 38)). But Authors’ complaint is no different than it would be if they complained that training schoolchildren to write well would result in an explosion of competing works. This is not the kind of competitive or creative displacement that concerns the Copyright Act. The Act seeks to advance original works of authorship, not to protect authors against competition.

by dns_snek 220 days ago

That's unrelated to the reasoning that I provided.

by mattigames 225 days ago

Copyright was built to protect the artist from unauthorized copying by a human, not by a machine (a machine wildly beyond their imagination at the time, I mean), so the input and output limitations of humans were absolutely taken into account when writing such laws. If LLMs were treated in a similar fashion, authors would have had a say in whether their works can be used as inputs to such models or whether they forbid it.

by JimDabell 225 days ago

This reply doesn’t seem to relate to either of the points I made.

by mattigames 225 days ago

Yes it does; the spirit of the law matters in many cases. A fair ruling would have declared that authors must be able to forbid the usage of their work as training data for any given model because the "transformative" processes that are being executed are wildly beyond what the writers of the law knew were even possible at the time of the writing of such laws.

by Ukv 225 days ago

> Copyright was built to protect the artist from unauthorized copying by a human, not by a machine (a machine wildly beyond their imagination at the time, I mean), so the input and output limitations of humans were absolutely taken into account when writing such laws

Copyright law was spurred by the spread of the printing press, a machine which has the ability to output full replicas. It does not assume human-like input/output limitations.

> A fair ruling would have declared that authors must be able to forbid the usage of their work as training data for any given model because the "transformative" processes that are being executed are wildly beyond what the writers of the law knew were even possible

Copyright's basis in the US is "To promote the Progress of Science and useful Arts". Declaring a transformative use illegal because it's so novel would seem to run directly counter to that.

To my understanding it's generally the opposite (a pre-existing use with an established market that the rightsholder had expected to exploit) that would weigh against a finding of fair use.

by StackRanker3000 225 days ago

The spirit of the law matters, but there are limits to how much existing statutes can be stretched to cover novel scenarios. Seems to me like new laws may be necessary to keep up (whatever the people would prefer them to be).

by JimDabell 225 days ago

I made two points:

- It is not accurate to describe training as “encoding works into the model”.

- A model cannot recreate a Harry Potter book.

Neither of these has anything to do with “the spirit of the law”.

by modo_mario 225 days ago

Can it not recreate a book?

I kind of assumed I could ask it for verses from the Bible one by one until I have the full book?

When I ask ChatGPT for a specific page or so from HP, I get the impression that the model would be perfectly capable of doing so but is hindered by extra work OpenAI put in to prevent the answer specifically because of copyright. In which case the question is: what if someone manages to do some prompt trickery again to get past it? Are they then responsible?

by JimDabell 225 days ago

No, it can’t recreate a book. Well, maybe it could get most of the way for the Bible. That is an exceptional case because its adherents are constantly quoting verses religiously. I expect it’s the most reproduced, quoted, and translated book in history by a very significant margin. It’s also not copyrighted.

Can you do this for the general case? No, not even for extremely popular books. People might quote Harry Potter a lot, but they don’t quote the entire thing over and over, chapter and verse, on hundreds of thousands of different websites. The number of times Bible verses appear in the training data is going to absolutely dwarf the number of times Harry Potter quotes appear, and people aren’t quoting all parts of Harry Potter, just the interesting parts.

> When I ask ChatGPT for a specific page or so from HP, I get the impression that the model would be perfectly capable of doing so but is hindered by extra work OpenAI put in to prevent the answer specifically because of copyright.

They do put extra work in to filter this stuff out, but even if they didn’t the model wouldn’t be able to reproduce entire chapters, let alone entire books.

You can test this for yourself. Remember, this lawsuit isn’t against OpenAI, it’s against Meta. Download Llama and try to get it to reproduce Harry Potter. There won’t be any guardrails imposed on top of the model if you run it locally.

by modo_mario 225 days ago

> People might quote Harry Potter a lot, but they don’t quote the entire thing over and over, chapter and verse, on hundreds of thousands of different websites.

I'm fairly certain I could find the entire thing in plain text in multiple places online. A quick Google search gives The Philosopher's Stone as the second result, in PDF format on the Internet Archive, but I'm sure with a bit of looking I'd bump into a lot of plaintext copies.

They might have taken measures to prevent this from being anywhere in their training data (I think it would be fairly easy and something they'd likely do), but if they at any point failed for a book or so that they didn't consider, wouldn't my original question stand?

by JimDabell 225 days ago

You’re missing the point. An LLM is not going to memorise a whole book just because it’s seen a few copies. An LLM might be able to memorise the Bible in particular simply because Bible quotes are everywhere. There is a vast difference between being able to find a handful of copies online and having it constantly quoted everywhere that humans communicate. Bible quotes get literally everywhere. People put them on bumper stickers, tattoo themselves with it, put it in their email signatures, etc. Bible quotes are so omnipresent, they have become part of our language – a lot of idioms people use every day come from the Bible.

The Bible isn’t just a book, it’s been a massive part of human culture for millennia, to the point of it shaping language itself. LLMs might be able to memorise the Bible, but it’s not because they can memorise books, it’s because the Bible is far more than just a book.

by modo_mario 225 days ago

I went to check and it seems like it works fine for plenty of other public domain books: The Picture of Dorian Gray, Pride and Prejudice and what have you. I can ask for x amount of paragraphs from a specific chapter and such.

I doubt every part of those books gets quoted everywhere on a numbered basis like the Bible might be. For only recently public domain books it seems to be overly cautious through the retroactively applied filtering, where it refuses if it suspects there might be a single country where copyright still applies.

by JimDabell 225 days ago

I can’t reproduce that. What model were you using and what prompt?

by modo_mario 225 days ago

I don't have access to the account I was using before right now, but using the ChatGPT free tier, which I believe is GPT-4o, I at first thought I got it right again.

I decided to ask it: Can you give me the first 4 paragraphs of chapter 3 of the book The picture of Dorian Grey?

It gave me something that looked alright to me. It read right, and I went to Project Gutenberg and glanced over it. The first lines of each paragraph seemed correct, but only the short paragraphs were correct throughout. The first paragraph, which was longer, suddenly had an entire section after the opening lines randomly replaced with hallucination.

A follow-up asking it not to hallucinate had it search the web to fetch the correct text, which isn't valid in this context.

I suspect it starts hallucinating once the bit of text gets long, so I asked for specific sentences of chapters (and to do so without web search): the 1st, 2nd, 3rd and such.

It managed to not outright hallucinate lines then, but it did sometimes get the chapter I asked for wrong. I presume that with sufficiently careful prompting one can get the book out properly in sequential order with a lot of prompts, but it takes quite some effort to get there. But that's where my curiosity ends for the night. My bed calls.
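
For anyone wanting to repeat this kind of check without eyeballing it, here is a minimal sketch in Python. It assumes you already have the model's answer and the reference passage (for example from Project Gutenberg) as plain strings. The function names `longest_verbatim_run` and `verbatim_fraction` and the `min_run` cutoff of 5 words are made up for illustration, not any standard measure of memorisation.

```python
import re
from difflib import SequenceMatcher


def words(text: str) -> list[str]:
    """Lowercase and split into word tokens so punctuation and spacing are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def longest_verbatim_run(model_output: str, reference: str) -> list[str]:
    """Return the longest run of consecutive words that appears in both texts."""
    a, b = words(model_output), words(reference)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


def verbatim_fraction(model_output: str, reference: str, min_run: int = 5) -> float:
    """Fraction of the model's words that fall inside aligned shared runs of at least min_run words."""
    a, b = words(model_output), words(reference)
    if not a:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Short matches are skipped: a few common words lining up is not reproduction.
    covered = sum(blk.size for blk in matcher.get_matching_blocks() if blk.size >= min_run)
    return covered / len(a)
```

Calling `verbatim_fraction(answer, gutenberg_paragraph)` on one paragraph at a time should make a hallucinated middle section show up as a lower fraction, even when the opening line matches. `SequenceMatcher` can get slow on long inputs, so this is meant for paragraph- or page-sized excerpts, not whole books.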

by JimDabell 225 days ago

> I presume that with sufficiently careful prompting one can get the book out properly

You failed to get it to reproduce one paragraph. Why on earth would you presume you can do it for the entire book‽

by modo_mario 221 days ago

Did you read what I said? I got plenty of correct paragraphs. They just had to be short. Breaking up the big paragraphs seems to help the issue.

by mattigames 225 days ago

> proportionality threshold for copyright to matter.

This is the part I have a problem with: that threshold was put there for humans based on their capabilities. It's an extremely dishonest assessment that the same threshold must apply to an LLM and its outputs. Those works were created to be read by humans, not by a for-profit statistical inference machine, and the derivative works were also expected to come from the former, not the latter. So the judge should have admitted that the context of the law is insufficient and that copyright must include the power to forbid the use of one's work in such a model for copyright to continue fulfilling its intended purpose (or moved the case to the Supreme Court, I guess).

by JimDabell 225 days ago

> that threshold was put there for humans based on their capabilities

It wasn’t. It’s there because a small proportion being reproduced doesn’t harm the copyright holder in the same way a full reproduction does.

Nobody is going to stop buying Harry Potter books because they can get an LLM to spit out ~50 words from the book. This is entirely in line with the spirit of the law. This is exactly why proportionality is a factor in fair use.
