As someone with a self-hosted Mercurial instance dealing with this, I will say that the big names (OpenAI included, though not only them) generally at least use proper user agents and respect robots.txt, but they are still needlessly aggressive compared to traditional search indexers.
There are also scrapers hiding behind normal browser user agents. When I looked at the IP ranges, at least some of them appeared to be coming from data centers in China.
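For anyone curious, spotting those mostly comes down to grouping the access log by IP: a real browser user almost never generates thousands of repo-history requests in a row. Here's a rough sketch of the kind of thing I run, assuming an nginx/Apache combined log format (the log path, the request threshold, and the bot/browser string lists are just placeholders to adapt):

    import re
    from collections import Counter

    # Combined log format: IP ident user [date] "request" status size "referer" "user-agent"
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

    BROWSER_HINTS = ("Mozilla/", "Chrome/", "Safari/")            # claims to be a normal browser
    KNOWN_BOTS = ("GPTBot", "ClaudeBot", "Googlebot", "bingbot")  # declared crawlers, easy to filter

    hits = Counter()
    agents = {}

    with open("/var/log/nginx/access.log") as f:  # path is just an example
        for line in f:
            m = LOG_LINE.match(line)
            if not m:
                continue
            ip, ua = m.groups()
            hits[ip] += 1
            agents[ip] = ua

    # Flag high-volume IPs whose user agent looks like a regular browser
    # but which are clearly not a person clicking around a repo.
    for ip, count in hits.most_common(20):
        ua = agents.get(ip, "")
        if count > 1000 and any(h in ua for h in BROWSER_HINTS) and not any(b in ua for b in KNOWN_BOTS):
            print(f"{ip:15s} {count:6d}  {ua}")

From there, a whois lookup on the top offenders tells you whether the ranges belong to hosting providers rather than residential ISPs, which is how the data-center origin stood out.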