> Most crawls are not that latency sensitive, certainly not the ai ones.

They certainly behave like they are. We constantly see crawlers trying to do cache busting on pages that haven't changed in days, if not weeks. It's hard to tell where the bots are coming from these days, as most have taken to simply lying and claiming to be Chrome.

I'd agree that respecting robots.txt makes this a non-starter for the problematic scrapers. These are bots that will hammer a site into the ground; they don't respect robots.txt, especially if it tells them to go away.

All of this would be much less of a problem if the authors of the scrapers actually knew how to code, understood how the Internet works, and had just the slightest bit of respect for others. But they don't, so now all scrapers are labeled as hostile, meaning that only the very largest companies, like Google, get special access.

> We constantly see crawlers trying to do cache busting

Do you have a source for this? Not saying you're wrong, I'd just like to know more.

Not really, given that the work we do in that direction isn't exactly public. You can recreate the scenario though. Spin up a wiki of some sort, scrapers love wikis, ideally enable some form of caching, and just sit back and watch scrapers throw random shit in the URL parameters.
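A cache sitting in front of such a wiki can blunt this kind of cache busting by normalizing the cache key instead of keying on the raw URL. A minimal sketch, assuming the site knows which query parameters it actually uses (the `KNOWN_PARAMS` allow-list below is hypothetical, not from any particular wiki software):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical allow-list of query parameters the wiki actually honors.
KNOWN_PARAMS = {"title", "action", "oldid"}

def cache_key(url: str) -> str:
    """Build a cache key that ignores unrecognized query parameters,
    so junk like ?cb=1699991234 maps to the same entry as the clean URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k in KNOWN_PARAMS]
    kept.sort()  # canonical parameter order, so reordering can't split the cache
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))
```

With this, a scraper appending random parameters still hits the cached copy rather than forcing a fresh render on every request.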