Worse, the constant AI scraping is actually costing content providers additional money for no return. At least Google/Bing/Yahoo scraping would then be used to provide links back to your content.
reply
How do you distinguish Google/MS scraping for Gemini/Copilot vs Google Search/Bing? In the case of Google, the UA is the same and you are entirely at their mercy to honor the Google-Extended instructions in robots.txt

Google has further complicated it with its new search announcement, blurring the line between regular search and AI search. And AI companies tend not to honor any licenses or instructions when they are hungry for training material.

It is once again an example of Google using its dominant position to abuse and promote cross functional products.
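
For anyone unfamiliar with how the opt-out is expressed, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are made up for illustration. It only shows what a well-behaved crawler would conclude from the file; nothing here forces a crawler to run this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow ordinary crawling, opt out of the
# Google-Extended token mentioned above.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def can_fetch(user_agent_token: str, url: str) -> bool:
    """Return what a compliant crawler would decide for this token and URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent_token, url)


search_allowed = can_fetch("Googlebot", "https://example.com/article")
training_allowed = can_fetch("Google-Extended", "https://example.com/article")
print(search_allowed, training_allowed)
```

The file is purely advisory: the decision is made on the crawler's side, which is exactly why you are at their mercy.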

reply
If companies like Meta are downloading pirated books etc. to train their AI, surely they will honor robots.txt.
reply
Not only costing money. Constant AI scraping constitutes a denial-of-service attack that has brought down websites.
reply
> At least Google/Bing/Yahoo scraping would then be used to provide links back

That doesn't work anymore. Google provides an AI-generated summary, so nobody looks at the original site.

reply
About a year ago OpenAI crawled the company I work for at DDoS levels. This happened even though our robots.txt did not allow it, and despite the reCAPTCHA we managed to put in place in time.

We found our data in the outputs of their models but who can do anything about it...

reply
> We found our data in the outputs of their models but who can do anything about it...

If the crawlers refuse to voluntarily respect your robots.txt, then you are well within your rights to poison their data.

reply
robots.txt seems like it should be a legally-binding terms of service which would make them outright copyright infringing.

Sue for $180,000 per infringement which should be calculated for each illegal API call.

reply
Was your robots.txt written by a lawyer? Does it hold up in court?
reply
It doesn't matter. Robots.txt is not a license, it's a set of computer parsable directives of how programs should access your site. The actual license doesn't have to be written for computers to parse to be legally binding.

A person should be able to write on a terms of use or license page on their website: "Do not include any content from this website in your AI training data. If you do, you will be billed $100 billion." And it should be enforceable. It just turns out that nerds like to say "oh that would be too hard or too expensive, so we're going to ignore it."

reply
Why hasn't your company sued OpenAI and tried to argue they're violating the Computer Fraud and Abuse Act? Would it really be impossible to argue this?

Unauthorized access, system damage, and maybe even extortion all apply here.

reply
Lawyers can. As long as that data is actually yours I mean, in a strictly legal sense.
reply
I mean, did you check the IPs and make sure they’re from OpenAI? Obviously a fly-by-night AI company is going to set their User Agent to be from a big player.
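A rough sketch of that check, assuming the operator publishes its crawler IP ranges as CIDR blocks. The ranges below are RFC 5737 documentation placeholders, not anyone's real crawler ranges, and `is_from_published_range` is a hypothetical helper name.

```python
import ipaddress

# Placeholder CIDR blocks (RFC 5737 documentation ranges). In practice you
# would load whatever ranges the crawler operator actually publishes.
PUBLISHED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_from_published_range(remote_ip: str) -> bool:
    """True if the connecting IP falls inside one of the published ranges."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in network for network in PUBLISHED_RANGES)


# A request claiming a big-name User-Agent from outside the published
# ranges is probably someone else borrowing the name.
print(is_from_published_range("192.0.2.17"))
print(is_from_published_range("203.0.113.5"))
```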
reply
>Why look at a website when it's all in AI?

well, at least in the case of google, I'm pretty sure that's the point. Or at least, they are doing things that would seem to be moving towards being an oracle with all the answers and not the signpost that points you in the right direction. The destination rather than the gateway.

reply
remember AMP?
reply
It's actually costing them money/time! A friend of mine is a sysadmin at a university and he constantly has to deal with AI crawlers DDoS-ing his servers. He said Anthropic is actually one of the worst offenders.

These AI companies are really just a gross example of the motto "Socialize the costs, privatise the profits". It's disgusting!

reply
Is it possible to host your website in a way that it can't be found via search engines (and thus, I hope, wouldn't be crawlable)?

I know this has repercussions on findability, but if that wasn't a concern, I'm curious how one might circumvent getting crawled.

reply
Sure, it depends on how accessible to people you want it to be.

Most legit search engines are going to honor robots.txt and you can disallow access.

Next level would be using something like rate limiting controls and/or Cloudflare's bot fight mode to start blocking the bad bots. You start to annoy some people here.

Next would be putting the content behind some form of auth.

reply
I don't know why we are trusting cloudflare when they are the one creating crawlers.

https://developers.cloudflare.com/browser-run/quick-actions/...

reply
Possible, yes; probable, not likely. The moment you're issued a certificate, your domain shows up in the Certificate Transparency logs, which are constantly monitored by anyone who wants to find new sites.
reply
....Yet another vector through which "security experts" have caused a waterbed problem. Let's secure the Internet, oh no! We made a centralized list of operating domains for hostile actors to guide attacks with!
reply
robots.txt is a way of leaving the door unlocked but kindly asking bots to stay outside.
reply
Which in a law-abiding society should be enough. It's also how we do things in the real world in many cases, e.g. here you can just write on your mailbox "no ads" and companies have to respect that.

Even when we do actually put physical locks on things, they are mostly there to show that someone breaking in did so intentionally, and are not at all designed to prevent motivated attackers.

reply
> here you can just write on your mailbox "no ads" and companies have to respect that

Where do you live? In the US it’s actually illegal for anyone except the USPS to deliver to a mailbox.

reply
You might be interested to know that entering an unlocked door into a space you do not have permission to be in is still illegal.
reply
You might be interested to know that the “illegality” depends on the intent. If I rest on your unlocked door handle, it opens, I enter, it’s an accident.
reply
Sorry, what? In this scenario are you claiming that you accidentally fell inside the restricted area because you were leaning on the door? Or are you claiming that you accidentally opened the door and then walked through intentionally? In the former case, you are guilty of breaking and entering in most US jurisdictions if you don’t promptly get out. Any sane court would likely agree an accidental trespass is probably not a criminal act, but it’s not an accident if you stay. In the latter case, you’re clearly trespassing illegally.

Also this has gotten pretty far away from the web scraping scenario. There’s no door accidentally opening here.

reply
Oops, I just accidentally fell into every website. Don't know how that happened ...
reply
You could just put your website content behind its own chat interface. The crawler would just see a form input for a prompt.
reply
If you really want to, and are happy with plain text and limited styling, I recommend trying other protocols, such as hosting a Gemini or Gopher site. I don't think scraping happens on anywhere near the same scale there as on conventional websites.

That being said, your users would need to download a Gemini/Gopher-compatible browser.

reply
It's never been a problem with people ad-blocking for the last 20 years, why is it suddenly a problem now?

We've been celebrating denying creators revenue for decades...

Maybe this is just the internet hypocrisy of "When I do it, it's good, when they do it, it's bad".

reply
Total sleight of hand.

Ad blocking has always been a problem for creators but it's aimed at big corps - non-creators. The creators asked people to support them other ways or turn off the blocking. And it's not like the little independent creators wanted this version of commercialized internet in the first place.

The AI marketing teams are spinning everything they can, but no: AI companies are the culprits, the vultures. No question about it.

reply
The conversion from viewer to donor is around 1%. This is true from Wikipedia, to Twitch, to podcasts.

The number of people who will not ever load your ads is around 30%.

I can tell you that creators talk about this a lot in private, but will not publicly because the internet has a mass delusion about how creation and compensation work. It's like trying to convince Christians that Jesus obviously didn't come back from the dead days later, despite there being no logical system available that would explain it.

If we were to try and map out a functional internet where everyone wins, users and creators, there is no example where ad blocking is anything other than net harmful. You either get volunteer net where 0.01% share hobby posts on their own dime for the other 99.99%, or you get IRC where 99% of the population doesn't really benefit (à la 1993).

reply
People usually point at the scale when this discussion comes up, in my experience. These companies are doing something at a huge scale spending tons of money to do it so the potential harm is greater.

People can easily justify their own piracy because it’s small scale. Even when they organize, create a whole software and tooling ecosystem around pirating media to stick into jellyfin or plex. AI still did it bigger and worse and is bad, what I’m doing is not so bad because I wasn’t going to buy the movie anyway, etc.

reply
On the whole, about 35% of internet users are ad-blocking. In the tech space it's upwards of 70%.

It's in no way, shape, or form "small scale", and has fundamentally changed the very nature of the internet for the worse (opinions/views of ad blocking people don't matter).

reply
Don't forget that the money being spent to do said scraping has, in great sums, come from subsidies paid by taxes from public coffers.
reply
I am in favor of severely limiting both copyright and advertising, but for the benefit of everyone, not just for the benefit of a few "AI" companies.
reply
And you will not get it. As the AI companies pump money into lawyers and politicians, they will be the ones profiting from copyright. Total regulatory capture as US AI companies make it illegal to train AI on their output.
reply
The answer is to simply pay for stuff.

There is no viable model where "have stuff but not pay for it" works out.

reply
Choosing not to look at something is not denying anyone anything.
reply
Choosing not to look at an ad and blocking it are different things. One is totally OK; the other inflicts a monetary loss on the creator. Those services aren't free to run, and the content doesn't take zero time to create. It also incentivizes creating content focused on those who cannot figure out ad blocking.
reply
There is more to life than money.

Many of the websites I read do not collect any appreciable amount of money from ads, or have no ads at all (one example: news.ycombinator.com :) ). They want recognition, or to share knowledge, or community, or they are building their brand... And AI is destroying all of this: the first result for "zx80" is an AI overview with a link to Wikipedia and some YouTube videos. If a person stops there, they will never get to the computinghistory.org.uk link, and won't see any related information about the variants and models.

reply
This website is an ad for Y Combinator. It's in no way, shape, or form a charity place for devs to hang out. It's a feeding ground to lure tech people into a mega VC's pastures.

When you click "news.ycombinator.com" you are clicking on the ad.

:)

reply
Interesting. I suppose the main difference is that we’re ants compared to an 800 pound gorilla.
reply
I’ve been thinking of a proof-of-work scheme for accessing content where you effectively need to mine some crypto for the author, but, this idea might not fly today
reply
reply
Yes, but:

> Although Anubis could be altered to mine cryptocurrency to serve as proof of work, Iaso has rejected this idea: "I don't want to touch cryptocurrency with a 20 foot pole."

Which in my mind is a shame. Crypto is an absolute mess, yes, but this seems like an elegant way to get something back for putting things out there.

reply
Mining crypto doesn't materialize money. You have to exchange it for real money, which means taking a private individual's money in exchange for scam tokens.

This is the problem crypto fans refuse to acknowledge. The money doesn't magically appear, you're taking it from someone else and letting them hold the bag when whatever cryptocurrency you choose inevitably blows up, fails, or rug-pulls. It's unethical to engage with at all because you're still participating in scamming real money out of private individuals

reply
The problem is that much of the cost is borne by humans accessing the sites. People generally get real mad when they find out you’re using their computers to mine crypto.
reply
But that will be a hassle for human visitors as well. A web that requires proof-of-work to browse will be a disaster for phones with their limited batteries, etc.
reply
To be specific, it would be more of a hassle for human visitors than for the AI companies with infinite money and specialized browsers.
reply
The idea would be that AI companies would still be forced to do this proof of work. Anubis proved the idea
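For reference, a hashcash-style sketch of what such a proof of work looks like. This is an assumption-laden illustration, not Anubis's actual implementation: the challenge string, the `challenge:nonce` format, and the difficulty are all made up here.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Cheap server-side check: a single hash."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Expensive client-side search: about 2**difficulty hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


nonce = solve("example-challenge", 12)
print(verify("example-challenge", nonce, 12))
```

The asymmetry is the whole point: the server checks with a single hash, while every visitor, human or crawler, pays for the search. That is also why the battery complaint above applies.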
reply
or you know, just charge for your content if you believe it to be valuable enough for the fee being charged.
reply
Yes, but that tends to limit the reach of your content. Hence why a lot of people reach for ads.

Between seeing ads and doing a little bit of proof-of-work for the author, I'd choose the latter.

reply
I agree with this wholeheartedly. What's the point of even having copyright law at this point?

What's even crazier to think about is that to use the latest versions of these models for which you supplied training data, you have to pay hundreds of dollars a month. I would love to get a settlement check proportional to my model weights. Even if it's $0.10, at least everyone out there will get what they're owed.

reply
From my perspective, everybody trains on the knowledge and experience of those who came before. AI just does the same thing at scale.

I do not value copyright. All it does is give you standing to sue if somebody reproduces your work. It does not differentiate or account for parallel creation. I cannot count how many times I have "created" something, only to find it in a research paper later.

Part of the reason I think copyright has no value is that, in general, individual copyright owners don't have the deep pockets necessary to sue someone who violates their copyright. If anyone is violating the spirit of copyright, it's corporations that insist you assign your work over to them as a work for hire, or outright ignore your copyright. (looking at you, Disney's Atlantis).

A significant benefit of AI that doesn't get talked about enough is that AI has a much greater reach over all the information it was trained on and can draw connections that would be invisible to someone operating at the human scale.

reply
The fact that these companies are making money off of it negates your argument.
reply
I don't think anyone's "making money" yet. We have a race to build up hardware for AI, and one to train models. There are some profits in there, but who's making money from the work AI performs? Nobody, because any advantage some company claims with AI is quickly replicated by competitors and profit dries up.

Today you can put a coding agent to migrate an existing application to another language (like chardet). Even if you don't have the code, if you can run the app you can still clone it, using it as an oracle for replication. That is why there will be very little profits in AI usage.

reply
No, you don’t have to. There are open weight models you can download and use for free. Many people choose the subscription model but it’s not necessary. And latest doesn’t mean greatest, it’s just most up-to-date.
reply
Perhaps we should go back to when the internet was about sharing information you liked, not about credit or making money on "content".
reply
You are there today, but some are unhappy that others don’t share the same sentiment.
reply