The only reason "others are rewarded with profit" in cases like these is that pinkie-promise-style obligations don't bind players too small or too shadowy to be worth litigating against.
I think you're looking at the wrong end of the spectrum there. It's some of the biggest players who flout the rules.
"Several AI companies said to be ignoring robots dot txt exclusion, scraping content without permission: report" (2024) https://www.tomshardware.com/tech-industry/artificial-intell...
Even if you believe what the AI companies are doing is or should be a copyright violation, the Internet Archive is redistributing in a more direct manner.
User-agent: archive.org_bot
Disallow: /

I wonder how archive.org_bot behaves when <meta name="robots" content="noindex, noarchive, nocache" /> is present.
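For reference, here's a minimal sketch of how a well-behaved crawler could honor both signals. The robots.txt check uses Python's urllib.robotparser; the meta-tag check is a hypothetical helper built on html.parser, and the bot name is taken from the robots.txt snippet above (I don't know how archive.org_bot itself actually handles the meta tag).

  # Sketch only: assumes a crawler that wants to respect both robots.txt
  # and <meta name="robots"> directives. Bot name comes from the snippet above.
  import urllib.request
  import urllib.robotparser
  from urllib.parse import urlsplit
  from html.parser import HTMLParser

  BOT_NAME = "archive.org_bot"

  def allowed_by_robots(url: str) -> bool:
      """Check the site's robots.txt for our user agent."""
      parts = urlsplit(url)
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
      rp.read()
      return rp.can_fetch(BOT_NAME, url)

  class RobotsMetaParser(HTMLParser):
      """Collect directives from <meta name="robots" content="..."> tags."""
      def __init__(self):
          super().__init__()
          self.directives = set()
      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "meta" and attrs.get("name", "").lower() == "robots":
              self.directives.update(
                  d.strip().lower() for d in attrs.get("content", "").split(","))

  def archivable(url: str) -> bool:
      """True only if robots.txt allows the fetch and the page carries no noarchive/noindex directive."""
      if not allowed_by_robots(url):
          return False
      html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
      parser = RobotsMetaParser()
      parser.feed(html)
      return not ({"noarchive", "noindex"} & parser.directives)

The point of the sketch is that both mechanisms are purely advisory: nothing stops a crawler from skipping either check, which is exactly the complaint upthread.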
Just out of curiosity, why don't you want your public blog archived? Not questioning it, just trying to understand the logic/motivation.
Also, I think you're being unfairly downvoted.
> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org).
Of course not, did you ignore the lines right after? “As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.”
The announcement is from 9 years ago. I already mentioned they ignored the robots.txt for my own blog.
Be a pirate, because a pirate is free...
All of the LLMs would be massively less useful if it weren't for scraping the latest news.
Every LLM company can afford to spin up a new subscriber account every day, proxy through different IPs across all sorts of ASNs to look like new visitors, do some crawling until the account gets banned, and then do it again, and again, and again.
What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.
Locking a door (or robots.txt) is how one can establish mens rea for those who bypass the barrier.
The actual root cause is that we're allowing LLM companies to completely disregard copyright law for their own profit. Whether the LLM companies scrape the Internet Archive or the original source doesn't change the copyright infringement implications in any way, and cutting off the Internet Archive doesn't practically change anything (as I understand it, LLM scraping is already prolific all over the web).
Which means LLMs have a zillion sources to get the story from. Removing any given subset isn't going to keep the information out of the training data; all it does is prevent that subset from being archived for future humans.