upvote
I just threw up a public Forgejo instance for some lightweight collaboration. About 2 minutes after the certificate was issued, bots started going through every commit and so on in the two repositories I had added; I'm guessing they picked the instance up from the certificate transparency logs.

Watched it for a while, thinking eventually it'd end. It didn't; ClaudeBot and GPTBot (which were the only two I saw, but could have been forged) went over the same URLs over and over again. They also tried a bunch of search queries at the same time.

The day after, I got tired of seeing it and added a robots.txt forbidding any indexing. Waited a few hours, saw they were still doing the same thing, so I threw up basic authentication with `wiki:wiki` as the username:password, wrote the credentials on the page where I linked the instance, and as expected they stopped trying after that.

They don't seem to try to bypass anything; whatever you put in front will basically defeat them. The exception is blocking by user-agent, where they just switch to a browser-like user-agent instead, which is why I went the "trivial basic authentication" route.
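
For reference, the server side of that can be as simple as something like the following (a sketch assuming an nginx reverse proxy, which may not be what's in use here; the paths, realm, and upstream port are my assumptions):

```nginx
# Inside the server {} block that fronts the instance.
# Create the (deliberately public) credential file first, e.g.:
#   htpasswd -bc /etc/nginx/wiki.htpasswd wiki wiki
location / {
    auth_basic           "wiki";                    # realm shown in the login prompt
    auth_basic_user_file /etc/nginx/wiki.htpasswd;
    proxy_pass           http://127.0.0.1:3000;     # Forgejo's default port; adjust to taste
}
```

Since the credentials are published right next to the link, this obviously isn't security, just a speed bump the crawlers apparently don't bother to climb over.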

Wasn't really an issue, just annoying when they try to masquerade as normal users. Had the same problem with a wiki instance: added rate limits, and they eventually backed off even further than my limits were set to, so I guess they got the hint. Just checked the logs and it seems they've stopped trying completely.

It seems like the people paying for their hosting by usage (which never made sense to me) are the ones hit hardest by this. I'm hosting my stuff on a VPS and don't understand what the big issue is; worst case I'd add more aggressive caching and it wouldn't be an issue anymore.

reply
I had the same issue when I first put up my Gitea instance. The bots found the domain through the certificate logs within minutes, before there were any backlinks: GPTBot, ClaudeBot, PerplexityBot, and others.

I added a robots.txt with explicit UAs for known scrapers (they seem to ignore wildcards), and after a few days the traffic died down completely and I've had no problem since.
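
Roughly what "explicit UAs" looks like in practice (a sketch; the bot names are the ones mentioned in this thread, not a complete or authoritative list):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Kept as well, even though several of these crawlers seem to ignore the wildcard section:
User-agent: *
Disallow: /
```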

Git frontends are basically a tarpit, so they're uniquely vulnerable to this, but I wonder if these folks actually tried a good robots.txt? I know it's wrong that the bots ignore wildcards, but an explicit one does seem to solve the issue.

reply
Where does one find a good robots.txt? Are there any well-maintained ones out there?
reply
Cloudflare actually has this as a free-tier feature, so even if you don't want to use it for your site, you can set up a throwaway domain on Cloudflare and periodically copy the robots.txt it generates from your scraper allow/block preferences, since they keep it up to date with all the latest crawlers.
reply
I will second a good robots.txt. Just checked my metrics: < 100 requests total to my git instance in the last 48 hours. The instance is completely public; most repos are behind a login, but there are a couple that are public and linked.
reply
> I wonder if these folks actually tried a good robots.txt?

I suspect that some of these folks are not interested in a proper solution. Being able to vaguely claim that the AI boogeyman is oppressing us has turned into quite the pastime.

reply
> Being able to vaguely claim that the AI boogeyman is oppressing us has turned into quite the pastime.

FWIW, you're literally in a comment thread where GP (me!) says "don't understand what the big issue is"...

reply
Since you had the logs for this, can you confirm the IP ranges they were operating from? You mention "ClaudeBot and GPTBot", but I'm guessing this is based on the user-agent presented by the scrapers, which could easily be faked to shift blame. I genuinely doubt Anthropic and the like would be running scrapers this badly written/implemented; it doesn't make economic sense. I'd love to see some of the web logs from this if you'd be willing to share! I feel like this is just some of the old scraper bots now advertising themselves as AI bots to shift the blame onto the AI companies.
reply
There are a bit too many IPs to list, but in my logs they're always of the form 74.7.2XX.* for GPTBot, matching OpenAI's published IP ranges[0].

So yes, they are definitely running scrapers that are this badly written.

Also, old scraper bots trying to disguise themselves as GPTBot seems wholly unproductive; they try to imitate users, not bots.
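
If anyone wants to check their own logs against those published ranges, a rough sketch of the idea (I'm assuming the JSON at [0] is a "prefixes" array of "ipv4Prefix"/"ipv6Prefix" entries; adjust if the actual schema differs):

```python
# Sketch: check whether logged client IPs fall inside OpenAI's published GPTBot ranges.
# Assumes gptbot.json looks like {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}, ...]}.
import ipaddress
import json
import sys
import urllib.request

with urllib.request.urlopen("https://openai.com/gptbot.json") as resp:
    prefixes = json.load(resp).get("prefixes", [])

networks = [
    ipaddress.ip_network(p.get("ipv4Prefix") or p.get("ipv6Prefix"))
    for p in prefixes
]

for line in sys.stdin:  # feed it client IPs, one per line
    ip = ipaddress.ip_address(line.strip())
    status = "in published GPTBot ranges" if any(ip in net for net in networks) else "NOT in published ranges"
    print(f"{ip}\t{status}")
```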

[0] https://openai.com/gptbot.json

reply
> but I'm guessing this is based on the user-agent presented by the scrapers, which could easily be faked to shift blame

Yes, hence the "which were the only two I saw, but could have been forged".

> I'd love to see some of the web logs from this if you'd be willing to share!

Unfortunately not; I delete any logs from the server after one hour and don't even log the full IP. I took a look now, and none of the logs that still exist are from a user agent that looks like one of those bots.
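
(For anyone wondering how the "don't log the full IP" part is typically done: a common nginx pattern is to map the client address to a truncated one and log that instead. A sketch, not necessarily the exact setup here:)

```nginx
# Sketch: log only a truncated client address instead of the full IP.
map $remote_addr $remote_addr_anon {
    ~(?P<ip>\d+\.\d+\.\d+)\.    $ip.0;     # IPv4: drop the last octet
    ~(?P<ip>[^:]+:[^:]+):       $ip::;     # IPv6: keep only the first two groups
    default                     0.0.0.0;
}

log_format anon '$remote_addr_anon - [$time_local] "$request" $status "$http_user_agent"';
access_log /var/log/nginx/access.log anon;
```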

reply
Huh, I had a Gitea instance on the public web on one of my netcup VPSes. I didn't set up any logging and was using Cloudflare tunnels (with a custom bash script that makes a CF tunnel expose PORT on SUBDOMAIN).

Maybe it's time for me to go ahead and start it again with logging enabled to see what shows up.

Maybe I'll test all three setups: 1) CF tunnels + AI block, 2) only CF tunnels, 3) directly on a static IP. Maybe you can try the experiment too and we can compare our findings. (I'm also saying this because I'm lazy: I had misconfigured that CF tunnel, and when it quit I was too lazy to restart the VPS, since I just use it as a playground for self-hosting. But maybe I'll do it again now.)

reply
I would love to understand this.

Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice.

"It's for AI" feels like lazy reasoning for me... but what IS it for?

One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers?

reply
The bar to ingest unstructured data into something usable was lowered, causing more people to start doing it.

It used to be that you needed to implement some papers to do something like sentiment analysis, which was a reasonably high barrier to entry. Now anyone can do it, and the result is more people scraping, with less competent scrapers too.

reply
I would say there are a couple of aspects to this.

The crawlers for the big famous names in AI are all less well behaved and more voracious than, say, Googlebot. This is somewhat muddied by the companies that ran the former "good" crawlers now also being in the AI business and sometimes trying to piggyback on people having allowed or whitelisted their search-crawling User-Agent. Mostly this has settled a little, with Googlebot separated from GoogleOther, facebookexternalhit from meta-externalagent, etc. This was an earlier "wave" of increased crawling that was obviously attributable to AI development; in some cases it's still problematic, but it's generally more manageable.

The other stuff, the ones that are using every User-Agent under the sun and a zillion datacenter IPs and residential IPs and rotate their requests constantly so all your naive and formerly-ok rate-based blocking is useless... that stuff is definitely being tagged as "for AI" on the basis of circumstantial evidence. But from the timing of when it seemed to start, the amount of traffic and addresses, I don't have any problem guessing with pretty high confidence that this is AI. To your question of "who are the customers"... who's got all the money in the world sloshing around at their fingertips and could use a whole bunch of scraped pages about ~everything? Call it lazy reasoning if you'd like.

How much this ultimately traces back to the big familiar brand names vs. would-be upstarts, I don't know. But a lot of sites are blocking the crawlers that admit who they are, so would I be surprised to learn they're also paying some shady subcontractors for scrapes and don't particularly care about the methods? Not really.

reply
For whatever reason, the law is lax right now if you claim the purpose of the scraping is AI training, even for copyrighted material.

Maybe everyone is trying to take advantage of the situation before the law eventually catches up.

reply
> For whatever reason, the law is lax right now if you claim the purpose of the scraping is AI training, even for copyrighted material

I think the reason is that America and China are, for the most part, in an AI arms race combined with an AI bubble, and neither side wants to lose any perceived advantage, no matter the cost to others.

Also, there is an immense lobbying effort against senators who propose stricter AI regulation.

https://www.youtube.com/watch?v=DUfSl2fZ_E8 [What OpenAI doesn't want you to know]

It's actually a great watch. Highly recommended, because a lot of the talk about regulation feels to me like smoke and mirrors.

reply
> Does anyone know what's the deal with these scrapers, or why they're attributed to AI?

You don't really need to guess; it's obvious from the access logs. I realize not everyone runs their own server, so here are a couple of excerpts from mine to illustrate:

- "meta-externalagent/1.1 +https://developers.facebook.com/docs/sharing/webmasters/craw...)"

- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"

- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"

- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"

- [...] (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

And to give a sense of scale, my cgit instance received 37 212 377 requests over the last 60 days, >99% of which are bots. The access.log from nginx grew to 12 GiB in those 60 days. They scrape everything they can find, indiscriminately, including endpoints that have to do quite a bit of work, leading to a baseline 30-50% CPU utilization on that server right now.

Oh, and of course, almost nothing of what they are scraping actually changed in the last 60 days; it's literally just a pointless waste of compute and bandwidth. I'm actually surprised that the hosting companies haven't blocked all of them yet, since this has to increase their energy bills substantially.

Some bots also seem better behaved than others; OpenAI alone accounts for 26 million of those 37 million requests.
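
Getting a per-bot breakdown like that out of the access log is straightforward; a rough sketch (assuming the default "combined" log format, where the user agent is the last double-quoted field):

```python
# Sketch: tally requests per crawler by matching user-agent substrings in an nginx access.log.
from collections import Counter

BOTS = ["GPTBot", "ClaudeBot", "Amazonbot", "PetalBot", "meta-externalagent"]

counts = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        # In the "combined" format the user agent is the content of the last quoted field.
        ua = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
        for bot in BOTS:
            if bot in ua:
                counts[bot] += 1
                break

for bot, n in counts.most_common():
    print(f"{bot}\t{n}")
```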

reply
Following your link above, https://openai.com/gptbot

> ChatGPT-User is not used for crawling the web in an automatic fashion. Because these actions are initiated by a user, robots.txt rules may not apply.

So, not AI training in this case, nor any other large-batch scraping, but rather inference-time Retrieval Augmented Generation, with the "retrieval" happening over the web?

reply
Likely, at least for some. I've caught various chatbots/CLI harnesses more than once inspecting a GitHub repo file by file (often multiple times, because of context rot).

But the sheer volume makes it unlikely that's the only reason. It's not like everybody constantly has questions about the same tiny website.

reply
Using an LLM to reason about each request is way too costly and slow. Much easier to just take the shotgun approach: fire off a lot of requests and deal with whatever bothers to respond.

This, btw, is nothing new. Way back when I still used WordPress, it was quite common to see your server logs fill up with bots probing endpoints for commonly compromised PHP thingies. Probably still a thing, but I don't spend a lot of time looking at logs. If you run a public server, dealing with maliciously intended but relatively harmless requests like that is just what you have to do. Stuff like that is as old as running things on public ports.

And the offending parties writing sloppy code that barely works is also nothing new.

The AI gold rush has certainly added a bit of opportunistic bot and scraper traffic, but it doesn't actually change the basic threat model in any fundamental way. Previously, version control servers were relatively low-value things to scrape; code has just become interesting for LLMs to train on.

Anyway, having anything responding on any port just invites opportunistic attempts to poke around. Anything that can be abused for DoS purposes might get abused for exactly that. If you don't like that, don't run stuff on public servers, or protect them properly. Yes, this is annoying and not necessarily easy. Cloud-based services exist that take some of that pain away.

Logs filling up with 404, 401, or 400 responses should not kill your server. You might want to implement some logic that tells repeat offenders 429 (Too Many Requests); a bit heavy-handed, but why not. But if you are going to run something that can be used to DoS your server, don't be surprised if somebody does exactly that.
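
The 429 part is cheap to bolt on with nginx's built-in request limiting, along these lines (a sketch; the rate, zone size, and burst are made-up numbers to tune for your traffic):

```nginx
# Sketch: per-client rate limiting that answers repeat offenders with 429.
# In the http {} block:
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    limit_req        zone=perip burst=20 nodelay;
    limit_req_status 429;
    # ... rest of the server configuration
}
```

Per-IP limits help less once the traffic rotates through huge residential pools, as mentioned elsewhere in the thread, but they do keep the better-behaved bots in check.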

reply
I think it's a) the volume of scrapers, b) the desire for _all_ content instead of particular content, and c) the fact that the scrapers are new and don't have the decades of patches Googlebot et al. do.

Five years ago there were few people with an active interest in scraping Forgejo instances and personal blogs. Now there are a bajillion companies and individuals grabbing data to train a model or throw into RAG or whatever.

Having a better scraper means more data, which means a better model (handwavily) so it’s a competitive advantage. And writing a good, well-behaved distributed scraper is non-trivial.

reply
> why they're attributed to AI?

I don’t think they mean scrapers necessarily driven by LLMs, but scrapers collecting data to train LLMs.

reply
I'm hazarding a guess that there are many AI startups focused on building datasets with the aim of selling them. It still doesn't make total sense, since doing it badly would only hurt them, but maybe they don't really care about the product or outcome; they're just capturing their bit of the AI gold rush?
reply
There's value to be had in ripping the copyright off your stuff so someone else can pass it off as their stuff. LLMs have no technical improvements left, so all their makers can do is throw more and more stolen data at them and hope that, somehow, they cross a nebulous "threshold" where they suddenly become actually profitable to use and sell.

It's a race to the bottom. What's different is we're much closer to the bottom now.

reply
deleted
reply
I stopped trying to understand. Encountering a 404 on my site leads directly to a one-year ban.
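
One way to implement that kind of rule is a fail2ban jail watching the web server's access log; a rough sketch of the idea (the filter regex, file paths, and the one-year bantime are my assumptions, not the poster's actual setup):

```ini
# /etc/fail2ban/filter.d/nginx-404.conf (sketch)
[Definition]
failregex = ^<HOST> .* "(GET|POST|HEAD)[^"]*" 404
ignoreregex =

# /etc/fail2ban/jail.d/nginx-404.conf (sketch)
[nginx-404]
enabled  = true
port     = http,https
filter   = nginx-404
logpath  = /var/log/nginx/access.log
maxretry = 1
# roughly one year, in seconds
bantime  = 31536000
```
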
reply
Damn, as someone who sometimes navigates by guessing URLs and rewriting them manually in the address bar, I hope more people don't start doing this; I probably see at least one self-inflicted 404 per day.
reply
Why would you do that?
reply
Faster. Wanna know the pricing? $domain/pricing. What's this company about? $domain/about. Switch to another Google account? Change the 1 to a 2 in the URL. I guess ultimately it's mostly about avoiding the mouse.
reply
Sounds like you're keeping all your URLs alive forever? Commendable!
reply
Not so many...

And there are tools to scan for dead links.

reply
Can you scan my bookmarks? :) edit: i.e. if someone has a bookmark to a page on your site and it goes 404, then they are blocked for a year. You have no ability to scan it because it's a file on their local system.
reply
Oh, now I understand.

I never removed anything, but I'll keep this in mind for the future.

reply
They're rotating through huge pools of residential IP addresses.
reply
The 2 GB of RAM didn't fill up with banned addresses, but YMMV.
reply
I'm guessing, but I think a big portion of AI requests now comes from agents pulling data specifically to answer a user's question. I don't think that data is collected mainly for training anymore; it's mostly retrieved and fed into LLMs so they can generate the response. Hence so many repeated requests.
reply
> If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors.

Right, this is exactly what they are.

They're written by people who a) think they have a right to every piece of data out there, b) don't have time (or shouldn't have to bother spending time) to learn any kind of specifics of any given site, and c) don't care what damage they do to anyone else as long as they get the data they crave.

(a) means that if you have a robots.txt, they will deliberately ignore it, even if it's structured to allow their bots to scrape all the data more efficiently. Even if you have an API, following it would require them to pay attention to your site specifically, so by (b), they will ignore that too—but they also ignore it because they are essentially treating the entire process as an adversarial one, where the people who hold the data are actively trying to hide it from them.

Now, of course, this is all purely based on my observations of their behavior. It is possible that they are, in fact, just dumb as a box of rocks...and also don't care what damage they do. (c) is clearly true regardless of other specific motives.

reply
I don't think it has anything to do with LLMs.

I think the big cloud companies (AWS) figured out that they could scrape compute-intensive pages in order to drive up their customers' spend. Getting hammered? Upgrade to more-expensive instances. Not using cloud yet? We'll force you to.

The other possibility is Cloudflare punishing anybody who isn't using it.

Probably a combination of these two things. Whoever's behind this has ungodly supplies of cheap bandwidth -- more than any AI company does. It's a cloud company.

reply