I added a robots.txt with explicit UAs for known scrapers (they seem to ignore wildcards), and after a few days the traffic died down completely and I've had no problem since.
Git frontends are basically a tarpit, so they're uniquely vulnerable to this, but I wonder if these folks actually tried a good robots.txt? I know it's wrong that the scrapers ignore wildcards, but it does seem to solve the issue.
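Roughly the kind of thing I mean, as a sketch (GPTBot is the only UA actually named in this thread; CCBot and ClaudeBot are just examples of other commonly seen crawler UAs, so swap in whatever actually shows up in your logs):

    # a wildcard rule alone seems to get ignored, so name the scrapers explicitly
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /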
I suspect that some of these folks are not interested in a proper solution. Being able to vaguely claim that the AI boogeyman is oppressing us has turned into quite the pastime.
FWIW, you're literally in a comment thread where GP (me!) says "don't understand what the big issue is"...
So yes, they are definitely running scrapers that are this badly written.
Also, old scraper bots trying to disguise themselves as GPTBot seems wholly unproductive; they try to imitate users, not bots.
Yes, hence the "which was the only two I saw, but could have been forged".
> I'd love to see some of the web logs from this if you'd be willing to share!
Unfortunately not; I delete any logs from the server after one hour, and I don't even log the full IP. I took a look now, and none of the logs that still exist are from a user agent that looks like one of those bots.
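(Rough sketch of the kind of setup I mean, not necessarily exactly what I run; assuming nginx here, and the variable names are just illustrative. The last octet gets dropped before anything is written, and a timer deletes anything older than an hour.)

    # keep only the first three octets of IPv4 addresses
    map $remote_addr $masked_addr {
        ~(?P<ip>\d+\.\d+\.\d+)\.\d+  $ip.0;
        default                      0.0.0.0;
    }
    log_format anonymized '$masked_addr [$time_local] "$request" $status "$http_user_agent"';
    access_log /var/log/nginx/access.log anonymized;

    # plus something like this on a timer to purge rotated logs after an hour:
    # find /var/log/nginx -name 'access.log.*' -mmin +60 -delete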
Maybe it's time for me to go ahead and start it again with logging enabled and see what shows up.
I'll maybe test it in all three setups: 1) with CF tunnels + AI Block, 2) with only CF tunnels, 3) on a static IP directly. Maybe you can try the experiment too and we can compare our findings. (I'm also saying this because I'm lazy: I had misconfigured that CF tunnel, and when it quit I was too lazy to restart the VPS, since I just use it as a playground and only wanted to play around with self-hosting. But maybe I'll do it again now.)