- Caching helps, but is nowhere near a complete solution. Of the 4M requests, I've observed 1.5M unique paths, so most of them still reach the backend and overload my server (a rough cache sketch follows the list).
- Limiting per-request processing time might work, but is more likely to just cause issues for legitimate visitors: 5ms is not a lot for cgit, but with a higher limit you are unlikely to keep up with the flood of requests.
- IP ratelimiting is useless. I've observed 2M unique IPs, and the top one from the botnet only made 400 well-spaced-out requests.
- GeoIP blocking does wonders - just 5 countries (VN, US, BR, BD, IN) are responsible for 50% of all requests. Unfortunately, blocking them also causes problems for legitimate users (see the geoip2 sketch after the list).
- User-Agent blocking can catch some odd requests, but I haven't been able to make much use of it besides adding a few static rules (examples after the list). Maybe it could do more with TLS request fingerprinting, but that doesn't seem trivial to set up on nginx.
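
For concreteness, here's roughly what I mean by caching in front of cgit; a minimal sketch assuming nginx's proxy cache, with placeholder paths, zone name, TTLs and backend address rather than my actual config:

    # Hypothetical nginx proxy cache in front of cgit; zone name, sizes,
    # TTLs and the backend address are placeholders.
    proxy_cache_path /var/cache/nginx/cgit levels=1:2 keys_zone=cgit:50m
                     max_size=2g inactive=1h;

    server {
        listen 80;
        server_name git.example.org;

        location / {
            proxy_cache cgit;
            proxy_cache_valid 200 10m;       # cache successful pages briefly
            proxy_cache_use_stale updating;  # serve stale while revalidating
            proxy_pass http://127.0.0.1:8080;
        }
    }

With 1.5M unique paths out of 4M requests, though, over a third of the traffic misses the cache no matter how it's tuned.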
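The GeoIP part, as a sketch: this assumes the third-party ngx_http_geoip2 module and a MaxMind GeoLite2 country database, with a 403 returned for the five countries above; the paths and variable names are made up, and as noted it also locks out plenty of legitimate users:

    # Hypothetical country blocking via the ngx_http_geoip2 module (assumes the
    # module is built in and a GeoLite2 database exists at this path).
    geoip2 /var/lib/GeoIP/GeoLite2-Country.mmdb {
        $geoip2_country_code country iso_code;
    }

    # Flag the five countries responsible for ~50% of the requests.
    map $geoip2_country_code $blocked_country {
        default 0;
        VN 1;
        US 1;
        BR 1;
        BD 1;
        IN 1;
    }

    server {
        location / {
            if ($blocked_country) {
                return 403;
            }
            proxy_pass http://127.0.0.1:8080;
        }
    }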
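And the static User-Agent rules are along these lines; the patterns below are illustrative examples, not my actual list:

    # Hypothetical static User-Agent rules; the patterns are examples only.
    map $http_user_agent $blocked_ua {
        default 0;
        ""                  1;  # empty User-Agent
        "~*python-requests" 1;
        "~*Scrapy"          1;
        "~*Go-http-client"  1;
    }

    # Enforced the same way as the GeoIP flag:
    # if ($blocked_ua) { return 403; }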
This is something that keeps happening, and I've seen so many HN posts like these (IIRC Anubis was created out of exactly this kind of frustration): git servers being scraped to the point where it's effectively a DDoS.
2026-01-28 21'460
2026-01-29 27'770
2026-01-30 53'886
2026-01-31 100'114 #
2026-02-01 132'460 #
2026-02-02 73'933
2026-02-03 540'176 #####
2026-02-04 999'464 #########
2026-02-05 134'144 #
2026-02-06 1'432'538 ##############
2026-02-07 3'864'825 ######################################
2026-02-08 3'732'272 #####################################
2026-02-09 2'088'240 ####################
2026-02-10 573'111 #####
2026-02-11 1'804'222 ##################

Thoughts on having an ssh server with https://github.com/charmbracelet/soft-serve instead?
Let's not forget that scrapers can be quite stupid. For example, if you have phpBB installed, which by default puts the session ID in a query parameter when cookies are disabled, many scrapers will scrape every URL numerous times, each with a different session ID. Caching doesn't help you here either, since the URLs are unique per visitor.
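
If an nginx proxy cache sits in front of such a forum, one workaround is to drop the session parameter from the cache key so all the sid variants collapse into a single entry. A rough sketch, assuming phpBB's default "sid" parameter and an nginx proxy cache zone named "forum"; the regex and variable names are mine and untested:

    # Strip phpBB's per-visitor "sid" query parameter out of the cache key, so
    # /viewtopic.php?t=42&sid=aaa and ...&sid=bbb share one cached response.
    map $args $cache_args {
        default $args;
        "~^(?<before>.*?)&?sid=[0-9a-f]+(?<after>.*)$" $before$after;
    }

    proxy_cache_path /var/cache/nginx/forum keys_zone=forum:10m;

    server {
        location / {
            proxy_cache forum;
            # cache key ignores sid but keeps the rest of the query string
            proxy_cache_key "$scheme$request_method$host$uri$cache_args";
            proxy_pass http://127.0.0.1:8081;
        }
    }

Whether the pages are safe to cache at all once sessions are ignored is a separate question (logged-in views obviously aren't), so this only makes sense for anonymous traffic.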