Every LLM company can afford to spin up a new subscriber account every day, proxying through IPs from all sorts of ASNs to appear as different users, do some crawling until the account gets banned, and then do it again, and again, and again.
What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.
Locking a door (or publishing a robots.txt rule) is how one establishes mens rea for those who bypass the barrier.
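As an aside, the "lock" here is literally machine-readable: a crawler that wanted to respect the barrier could check it with Python's standard library before fetching anything. A minimal sketch (the user agent, domain, and paths are made up for illustration):

```python
# Hypothetical example: a well-behaved crawler consulting robots.txt
# before fetching, i.e. acknowledging the "lock on the door".
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt that fences off /archive/ for everyone.
robots_txt = """\
User-agent: *
Disallow: /archive/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The parser answers per-URL: disallowed paths come back False.
print(rp.can_fetch("MyCrawler", "https://example.com/archive/page1"))  # False
print(rp.can_fetch("MyCrawler", "https://example.com/public/page2"))   # True
```

A crawler that ignores what `can_fetch` tells it can't later claim it never saw the lock.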
The actual root cause is that we're allowing LLM companies to completely disregard copyright law for their profit. Whether the LLM companies scrape the Web Archive or the original source doesn't change the copyright infringement implications in any way, and cutting off the Web Archive doesn't practically change anything (because, as I understand it, LLM scraping is already prolific all over the web).
Which means LLMs have a zillion sources to get the story from. Removing any given subset isn't going to keep the information out of the training data; all it does is prevent that subset from being archived for future humans.