The current behavior of those scrapers tells me that "they don't plan", period.
Looks like they hired a bunch of excavators and are digging 2 meters deep across whole fields, looking for nuggets of gold, and piling the dirt into a huge mountain.
Once they realize the field was bereft of any gold but full of silver? Or that the gold was actually 2.5 meters deep?
They have to go through everything again.
Scrapers seem to be exceedingly careless in using public resources. The problem is often not even DDoS (as in overwhelming bandwidth usage) but rather DoS through excessive hits on expensive routes.
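To make "expensive routes" concrete: a search endpoint that hits the database costs orders of magnitude more than a cached page, so even a crude per-IP limit on just the costly paths blunts this kind of DoS. A rough sketch (Flask and the in-memory counter are my own illustration, not anything a particular victim runs):

    # Sketch: throttle only the expensive route, leave cheap ones alone.
    import time
    from collections import defaultdict
    from flask import Flask, request, abort

    app = Flask(__name__)
    WINDOW = 60              # seconds
    LIMIT = 10               # expensive hits allowed per IP per window
    hits = defaultdict(list) # ip -> request timestamps (demo only, not thread-safe)

    def throttle(ip):
        now = time.time()
        hits[ip] = [t for t in hits[ip] if now - t < WINDOW]
        if len(hits[ip]) >= LIMIT:
            abort(429)  # Too Many Requests
        hits[ip].append(now)

    def run_expensive_query(q):
        return f"results for {q!r}"  # stand-in for a costly DB / full-text search

    @app.route("/search")
    def search():
        throttle(request.remote_addr)  # guard only the costly route
        return run_expensive_query(request.args.get("q", ""))

    @app.route("/page/<name>")
    def page(name):
        return f"cheap content for {name}"  # cheap route, left unthrottled

In production you'd do this at the edge or with a shared store, but the asymmetry is the point: the cheap routes can absorb a crawler, the expensive ones can't.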
No need to ask anything, I can tell you exactly why: because they have no regard for anything but their own profit.
Let me give you an example involving this mom-and-pop shop known as Anthropic.
You see, they have this thing called ClaudeBot, and at least initially it scraped by iterating through IPs.
Now you have these things called shared hosting servers, typically running 1,000-10,000 domains of actual low-volume websites on 1-50 or so IPs.
Guess what happens when it's your network's turn to bend over? The whole hosting company's infrastructure goes down, as each server has hundreds of ClaudeBot instances crawling hundreds of vhosts at the same time.
This happened for months. It's the reason they're banned in WAFs by half the hosting industry.
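For the curious, the ban itself is usually just a User-Agent match returning 403. A toy sketch in stdlib Python (assuming the bot honestly sends "ClaudeBot" in its UA string; real WAF rules live at the edge, and spoofers walk right past this):

    # Sketch of an app-level User-Agent block; edge WAFs do the same by rule.
    from wsgiref.simple_server import make_server

    BLOCKED_UA_SUBSTRINGS = ("ClaudeBot",)  # assumed substring, for illustration

    def ua_block(app):
        def wrapped(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "")
            if any(s in ua for s in BLOCKED_UA_SUBSTRINGS):
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"blocked\n"]
            return app(environ, start_response)
        return wrapped

    def site(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"hello\n"]

    if __name__ == "__main__":
        make_server("", 8000, ua_block(site)).serve_forever()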
The fact that 30%+ of the web relies on their caching, routing, and DDoS protection services is the main pull.
Their DNS is really only there for data collection and to front as "goodwill".
30% of the web might use their caching services. 'Relies on' implies that it wouldn't work without them, which I doubt is the case.
It might be the case for the biggest 1% of that 30%. But not the whole lot.
Last time Cloudflare went down, their dashboard was also unavailable, so you couldn't turn off their proxy service anyway.
And forget about crawling. If you have a less reputable IP (basically every IP in a third world country is less reputable, for instance), you can be CAPTCHA'd to no end by Cloudflare even as a human user on the default setting, so plenty of site owners with more reputable home/office IPs don't even know what they subject a subset of their users to.
[1] E.g. https://www.wired.com/robots.txt to pick an example high up on the HN front page.
> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".
You don't need any scraping countermeasures for crawlers like those.
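Mechanically, "respects robots.txt including crawl-delay" is checkable with nothing but the stdlib. A rough sketch using the wired.com file from the footnote (the crawler name and article URL are made up):

    # Sketch: how a well-behaved crawler consults robots.txt before fetching.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.wired.com/robots.txt")
    rp.read()  # fetches and parses the file

    ua = "ExampleCrawler"  # hypothetical crawler name
    url = "https://www.wired.com/story/some-article/"  # illustrative URL
    print(rp.can_fetch(ua, url))  # False would map to "status": "disallowed"
    print(rp.crawl_delay(ua))     # Crawl-delay in seconds, or None if unset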
Given that malicious bots are allegedly spoofing real user agents, "another user agent you have to add to your list" seems like the least of your problems.
If I need to treat Cloudflare's bots the same as malicious bots, that undermines their claim.
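Incidentally, you don't have to take a User-Agent at its word. Google documents forward-confirmed reverse DNS for verifying Googlebot, and the same trick generalizes to any crawler that publishes its hostnames. A quick sketch (the googlebot.com/google.com suffixes are Google's documented ones; everything else here is illustrative):

    # Sketch: forward-confirmed reverse DNS for an IP claiming to be Googlebot.
    import socket

    def is_verified_googlebot(ip: str) -> bool:
        try:
            host, _, _ = socket.gethostbyaddr(ip)  # reverse lookup
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]  # forward-confirm
        except (socket.herror, socket.gaierror):
            return False

    # e.g. run this before trusting a request whose UA claims to be Googlebot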
https://robindev.substack.com/p/cloudflare-took-down-our-web...
HN Discussion:
Like there's a difference between dozens of drunk teenagers thrashing the city streets in an illegal street race vs. a taxi driver.
They also use their dominant position to apply political pressure when they don’t like how a country chooses to run things.
So yeah, we've created another megacorp monster that will hurt us for years to come.
Because I'm pretty sure they are not in fact wrong.
On the other hand when a page is small and static enough that it's basically just a flyer, I also care a lot less about who hosts it.