The only reason "others are rewarded with profit" in cases like these is that pinkie-promise-style obligations don't bind players too small or too shadowy to be worth litigating against.
I think you're looking at the wrong end of the spectrum there. It's some of the biggest players who flout the rules.
"Several AI companies said to be ignoring robots dot txt exclusion, scraping content without permission: report" (2024) https://www.tomshardware.com/tech-industry/artificial-intell...
Even if you believe what the AI companies are doing is or should be a copyright violation, the Internet Archive is redistributing in a more direct manner.
User-agent: archive.org_bot
Disallow: /

I wonder how archive.org_bot behaves when <meta name="robots" content="noindex, noarchive, nocache" /> is present.
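For reference, here's a minimal sketch of how a well-behaved crawler could honor both signals. The robots.txt check uses Python's urllib.robotparser; the meta-tag check is a hypothetical helper built on html.parser, and the bot name is taken from the robots.txt snippet above (I don't know how archive.org_bot itself actually handles the meta tag).

  # Sketch only: assumes a crawler that wants to respect both robots.txt
  # and <meta name="robots"> directives. Bot name comes from the snippet above.
  import urllib.request
  import urllib.robotparser
  from urllib.parse import urlsplit
  from html.parser import HTMLParser

  BOT_NAME = "archive.org_bot"

  def allowed_by_robots(url: str) -> bool:
      """Check the site's robots.txt for our user agent."""
      parts = urlsplit(url)
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
      rp.read()
      return rp.can_fetch(BOT_NAME, url)

  class RobotsMetaParser(HTMLParser):
      """Collect directives from <meta name="robots" content="..."> tags."""
      def __init__(self):
          super().__init__()
          self.directives = set()
      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "meta" and attrs.get("name", "").lower() == "robots":
              self.directives.update(
                  d.strip().lower() for d in attrs.get("content", "").split(","))

  def archivable(url: str) -> bool:
      """True only if robots.txt allows the fetch and the page carries no noarchive/noindex directive."""
      if not allowed_by_robots(url):
          return False
      html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
      parser = RobotsMetaParser()
      parser.feed(html)
      return not ({"noarchive", "noindex"} & parser.directives)

The point of the sketch is that both mechanisms are purely advisory: nothing stops a crawler from skipping either check, which is exactly the complaint upthread.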
Just out of curiosity, why don't you want your public blog archived? Not questioning it, just trying to understand the logic/motivation.
Also, I think you're being unfairly downvoted.
> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org).
Of course not, did you ignore the lines right after? “As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.”
The announcement is from 9 years ago. I already mentioned they ignored the robots.txt for my own blog.
Be a pirate, because a pirate is free...
All of the LLMs would be massively less useful if it weren't for scraping the latest news.
Every LLM company can afford to spin up a new subscriber account every day, proxy through different IPs across all sorts of ASNs to look like new visitors, do some crawling until the account gets banned, and then do it again, and again, and again.
What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.
Locking a door (or robots.txt) is how one can establish mens rea for those who bypass the barrier.
The actual root cause is that we're allowing LLM companies to completely disregard copyright law for their own profit. Whether the LLM companies scrape the Internet Archive or the original source doesn't change the copyright infringement implications in any way, and cutting off the Internet Archive doesn't practically change anything (as I understand it, LLM scraping is already prolific all over the web).
Which means LLMs have a zillion sources to get the story from. Removing any given subset isn't going to keep the information out of the training data; all it does is prevent that subset from being archived for future humans.