I'm not sure how to articulate my thoughts on this exactly, other than to say it's disappointing that doing the right thing (i.e. respecting robots.txt) is rewarded with the burden of soliciting responses to a petition while at the same time others are rewarded with profit for ignoring those same directives.
The only reason "others are rewarded with profit" in cases like these is that pinkie-promise-style obligations don't affect players too small or shadowy to be worth litigating against.
I think you're looking at the wrong end of the spectrum there. It's some of the biggest players who flout the rules.
"Several AI companies said to be ignoring robots dot txt exclusion, scraping content without permission: report" (2024) https://www.tomshardware.com/tech-industry/artificial-intell...
User-agent: archive.org_bot
Disallow: /

I wonder how archive.org_bot behaves when <meta name="robots" content="noindex, noarchive, nocache" /> is present.
Just out of curiosity, why don't you want your public blog archived? Not questioning, just trying to understand the logic/motivations.
Also, I think you're being unfairly downvoted.
> A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to info@archive.org).
Of course not. Did you ignore the lines right after? “As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.”
The announcement is from 9 years ago. I already mentioned they ignored the robots.txt for my own blog.
Be a pirate, because a pirate is free...
All of the LLMs would be massively less useful if it weren't for scraping the latest news.
Every LLM company can afford to spin up a new subscriber account every day, proxy through different IPs from all sorts of ASNs, do some crawling until the account gets banned, and then do it again, and again, and again.
What's the conclusion from this train of thought? Just because some burglars can pick locks doesn't mean you should leave your front door unlocked.
Locking a door (or robots.txt) is how one can establish mens rea for those who bypass the barrier.
The actual root cause is that we're allowing LLM companies to completely disregard copyright laws for their profit. Whether the LLM companies scrape the Web Archive or the original source doesn't change the copyright infringement implications in any way, and cutting off the web archive doesn't practically change anything (because as I understand, LLM scraping is already prolific all over the web).
Which means LLMs have a zillion sources to get the story. Removing any given subset isn't going to prevent it from having the information in the training data, all it does is prevent that subset from being archived for future humans.
You can cryptographically verify a timestamp, though, by piggybacking on Bitcoin the way OpenTimestamps does.
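Roughly how that works, as a minimal sketch rather than the actual OpenTimestamps format (the real protocol uses calendar servers, proper left/right ordering in the Merkle tree, and an OP_RETURN commitment inside a Bitcoin transaction): hash the document, fold that hash up a Merkle path to a root that gets committed on-chain, then later recompute the same path and check it against the committed root, which inherits the timestamp of the block that confirmed it. The function names and sibling hashes below are made up for illustration.

  import hashlib

  def sha256(data: bytes) -> bytes:
      return hashlib.sha256(data).digest()

  def build_proof(doc: bytes, siblings: list[bytes]) -> bytes:
      # Fold the document hash up a (simplified) Merkle path; the result is
      # the root that would be committed in a Bitcoin transaction.
      node = sha256(doc)
      for sib in siblings:
          node = sha256(node + sib)
      return node

  def verify_proof(doc: bytes, siblings: list[bytes], committed_root: bytes) -> bool:
      # Recompute the same path from the document we hold and compare it with
      # the root recorded on-chain; a match proves the document existed no
      # later than the block containing the commitment.
      node = sha256(doc)
      for sib in siblings:
          node = sha256(node + sib)
      return node == committed_root

  article = b"archived article text"
  path = [sha256(b"other-doc-1"), sha256(b"other-doc-2")]  # hypothetical sibling hashes
  root = build_proof(article, path)  # in reality this root would be looked up in a Bitcoin block
  print(verify_proof(article, path, root))             # True
  print(verify_proof(b"tampered article", path, root)) # False

The real ots client does essentially this, except the proof is a standardized .ots file and the root is checked against Bitcoin block headers rather than a value you already hold.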
In the end, we settled on agreeing that making such stuff available after 30 days, and possibly with access restrictions (can’t be pulled more than N times a day, in case it becomes relevant in the future) struck the right balance.
To my knowledge, the Internet Archive hasn’t done any outreach on this issue. In addition to pressuring the publications, I’d put some pressure on them to negotiate.
Is the Internet Archive regularly used as a paywall workaround? Generally it's archive.is, which has no connection to the IA.
In case it "becomes relevant." Wouldn't that benefit you either way? It makes you wonder if they have a dashboard of unfortunate digital statistics on display somewhere, and worship of those numbers has replaced the underlying spirit of journalism.
It's flipped right now. There's no single source of ground truth, but data and information are abundant. Yes, that abundance includes false data and lies, but it is still abundance.
The work The New York Times and The Atlantic do on their best days, i.e. their investigative journalism, adds to this world, but they try to hide / cloister that work away even though the journalists themselves want to make it accessible.
In an ideal world, every child would learn how to read English via the NYT and The Atlantic, they'd grow up with these sources of record, learn from them, and watch the world through them. But the current model doesn't allow for that.
I think a patronage model mixed with a Wikimedia-style foundation might be a better fit. Readers who love the institution and its mission are invited to pay as much as they want, with scaling benefits (say you love the NYT so much that you want to give $10k/mo for their work; you should get commensurate access / get to ask questions). And these contributions flow into an endowment, which is invested, and the returns are distributed as part of the operating budget.
I don't think classical journalism can survive an information-abundant world without a patronage-based approach.
Maybe. The alternative is most people simply aren’t going to engage with long-form journalism. Keeping the analysis behind subscriptions while video summaries make ad revenue on YouTube and Twitter might be the best fit.
Too often they’ve been caught selectively reporting details and quotes, or reporting facts from an unreliable source that turned out to be outright false. In the latter case they quietly retract the article, so most readers continue believing the lie (maybe that’s why they don’t want to be archived).
Even posting a small blog is better, if it has original thought, supports an individual, and doesn't have ads, even though it too can be biased and untrustworthy. Although the amount of obvious LLM blogs submitted here is another issue.
The primary source of investigative journalism is the newspaper.
If a NY Times article is corroborated or even paraphrased itself by a more trustworthy organization, or has direct links to multiple primary sources, I wouldn’t mind. Except the NY Times article is still paywalled, and there may be a source that’s not, in which case I still think that source should be submitted instead.
A pie chart showing the times I used the wayback machine to read an old NYT article vs the times I visited it due to a highly upvoted top HN comment linking to a relatively new article so we all can bypass the paywall is a solid circle.
That’s how I signed up to The Atlantic. I wanted to read the Signalgate reporting. There are other publications which get upvoted here frequently that have the paywall workarounds. I generally click around their paywall.
It becomes a research resource. It also creates a high-friction interface for potential subscribers.
I wound up subscribing to Le Monde Diplo because of a HN comment referencing a paywalled article. I didn't want to sign up just for one article. So I bypassed using one of the circumvention sites (I think outline was popular then). The article was compelling enough that I signed up for the paper, and remain subscribed to this day.
The work of independent journalists is more important than ever before.
They have a robust paying subscriber base that supports them and don't have an owner whose last name rhymes with Pesos who can axe a story just because he doesn't like what it says.
NYT had $2.82B in revenue in 2025.
I recommend you actually go and read those fiches. The press was not historically high quality. Mass media has had the same problems for decades.
What it used to have was genuine independent competition.
The NYT is of course guilty itself. It did not investigate the possible murder of its star witness Suchir Balaji and is too reserved in examining the consequences of AI in general.
If they don't fulfill their journalistic and societal obligations, their own journalists will soon be replaced by AI bullet-point slop like Axios.
I'm grown up now, I understand how things work, and I'd rather see Tide and Coke ads than pay $20/mo to 8 different orgs, while maintaining that ad free option for those who want it.
The children of the internet probably won't sign a truce, so let's just cut them out and let intellectually honest people have a decent internet.
I dunno. That seems like a pretty big fuck you to a paying customer already when all they have to do is provide a sub for a few more bucks a month. But I guess I'm a child of the Internet.
How much faster would consumer software be if adware was made illegal? How much faster would our devices be if we didn't have half the code base supporting malware?
Acting like an ad enabled internet was the only option is extremely foolish, especially when the ad enabled internet was fully chosen and pushed onto the public by very specific people (thanks Newt Gingrich!).
That era vastly predates the Internet, let alone the (relatively) ad-free pre-1980s Internet, neither of which we can return to in any meaningful fashion.
Ah, so, take the money out of it completely? No subscriptions, and no ads? Sounds like a good idea to me.
Nope, two problems:
1- Ads are a privacy issue, not only a convenience issue. Targeted ads should not be normalized.
2- Companies figured out that even paying doesn't mean you don't get ads. You are probably a bigger target in that case, with more disposable income than average.
…why would they go under if the people who don’t pay for news stop reading them?
The paywalls were one thing, but disallowing archival is practically suicide.
The Times alone pulls a multiple of the Internet Archive’s visitors [1][2].
If posting the link instead implies that the 97% of people not currently willing to subscribe can't read it, then people will instead post a link to a publication their audience can read, in which case the first publication actually gets 0%.
I guess I don't really care. As soon as it becomes unworkable to view these publications through archivers I'll just stop viewing them altogether. I don't see this helping their bottom line though.
They also preserved old books. But now I guess they're becoming middlemen for access to limited ebook platforms that ensure books disappear when publishers lose interest.
The "Information Age" is proving to be the setup for a dark age, when nonprofitable things are just thrown out and efforts to preserve them are actively fought.