As someone dealing with this on a self-hosted Mercurial instance, I will say that the big names (OpenAI included, but not only them) generally do at least send proper user agents and respect robots.txt, but they are still needlessly aggressive compared to traditional search indexers.

There are also scrapers that are hiding behind normal browser user agents. When I looked at IP ranges, at least some of them seemed to be coming from data centers in China.

Stop just making up excuses for these companies. Other comments on this story have shown the bots are using OpenAI user agents and making requests from OpenAI-owned IP ranges.