Weird part #1 is that most of the traffic isn't shaped like crawler traffic. It's incredibly bursty and heavily redundant, missing even the most obvious low-hanging-fruit optimizations.
It could be that someone is using residential proxies to wrap AI agents' web traffic, but even so, there are a lot of pieces that don't really make sense, like why the traffic pattern is like being hit by a shotgun. It isn't just one request, but anywhere between 40 and 100 redundant requests.
A popular theory is that this is down to sloppy coding and AI companies being too rich to care, but that doesn't really add up. This isn't a minor inefficiency: if it is "just" bad coding, they stand to gain monumental efficiency improvements by fixing it, in the sense of getting the data much faster, which is a clear competitive edge.
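For anyone who wants to check their own logs for this shotgun pattern, here is a rough Python sketch. It assumes you have already parsed your access log into (timestamp_in_seconds, url) pairs; the function name, the 10-second window and the threshold of 40 are all made up for illustration, not taken from any real tool.

```python
from collections import defaultdict

def redundant_bursts(hits, window=10.0, threshold=40):
    """Return {url: peak_count} for URLs requested at least `threshold`
    times within any `window`-second span. `hits` is an iterable of
    (timestamp_seconds, url) pairs (an assumed, pre-parsed log format)."""
    by_url = defaultdict(list)
    for ts, url in hits:
        by_url[url].append(ts)

    bursts = {}
    for url, times in by_url.items():
        times.sort()
        start = 0
        peak = 0
        # Sliding window: advance `start` until the span fits in `window`.
        for end, t in enumerate(times):
            while t - times[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= threshold:
            bursts[url] = peak
    return bursts
```

Note that this only groups by URL. If the burst arrives through a residential proxy pool, grouping by client IP would hide it, which is why the sketch ignores the source address entirely.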
Really weird.
My unsubstantiated guess is the residential proxy/botnet is very unreliable, and that's why they fire so many requests. Makes sense if it's sold as a service.
My website contains ~6000 unique data points in effectively infinite combinations on effectively infinite pages. Some of those combinations are useful for humans, but the AI-scrapers could gain a near-infinite efficiency improvement by just identifying as a bot and heeding my robots.txt and/or rel="nofollow" hints to access the ~500 top level pages which contain close to everything which is unique. They just don't care. All their efficiency attempts are directed solely toward bypassing blocks. (Today I saw them varying the numbers in their user agent strings: X15 rather than X11, Chrome/532 rather than Chrome/132, and so on...)
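The mangled user agents are at least easy to spot mechanically. A minimal Python sketch of that check, assuming the two mutations mentioned above (a platform token other than X11, and an implausibly high Chrome major version); the function name and the cutoff of 200 are my own guesses, and the cutoff would need bumping as real Chrome versions climb:

```python
import re

# Assumed upper bound on a believable Chrome major version; adjust over time.
MAX_PLAUSIBLE_CHROME = 200

def ua_anomalies(ua):
    """Return a list of reasons a user agent string looks mutated."""
    problems = []

    # Real Linux desktop browsers send "(X11; ...", not X15 or similar.
    m = re.search(r"\((X\d+);", ua)
    if m and m.group(1) != "X11":
        problems.append(f"unexpected platform token {m.group(1)}")

    # A Chrome major version far above anything released is a giveaway.
    m = re.search(r"Chrome/(\d+)", ua)
    if m and int(m.group(1)) > MAX_PLAUSIBLE_CHROME:
        problems.append(f"implausible Chrome version {m.group(1)}")

    return problems
```

This only catches the lazy mutations. A scraper that sticks to plausible values sails straight through, so treat it as one signal among several rather than a block rule.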
I can substantiate this a bit. Verified traffic from Amazonbot is too dumb to do anything with 429s. They will happily slam your site with more traffic than you can handle, and will completely ignore the fact that over half the responses are useless rate limits.
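For contrast, doing something sensible with a 429 is not hard. Here is a sketch in Python of the decision a polite client would make, using only the standard library. The function name and the 60-second fallback are assumptions of mine, and it expects the headers as a plain dict with the exact key "Retry-After":

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_delay(status, headers, now=None, default=60.0):
    """Seconds to wait before retrying; 0.0 if the response was not a 429."""
    if status != 429:
        return 0.0

    value = headers.get("Retry-After")
    if value is None:
        return default  # assumed fallback when the server gives no hint

    # Retry-After is either a number of seconds or an HTTP date.
    if value.strip().isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)

    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A crawler that slept for this long before its next request to the same host would already behave far better than what is described above.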
They say they honor REP (the Robots Exclusion Protocol), but Amazonbot will still hit you pretty persistently even with a full disallow directive in robots.txt.
The root sources of the traffic from residential proxies get murky very quickly.
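To be concrete about what a full disallow means, here is a small Python sketch using the standard library's urllib.robotparser. The robots.txt content and the example.com URL are illustrative, not copied from my site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt with a full disallow for one crawler.
ROBOTS_TXT = """\
User-agent: Amazonbot
Disallow: /
"""

def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed("Amazonbot", "https://example.com/any/page"))
print(allowed("SomeOtherBot", "https://example.com/any/page"))
```

A compliant Amazonbot would get a "no" for every path here, so under REP it should not be requesting anything beyond robots.txt itself.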
It's easy to follow the chain partway for some traffic, e.g. "Why are we receiving all this traffic from Digital Ocean? ... oh, it's their hero client Firecrawl, using a deceptive UserAgent" ... but it still leaves the obvious question about who the Firecrawl client is.
Res proxy traffic is insane these days. There are also plenty of grey-market snowshoe IPs available for the right price, from a handful of ASNs. I regularly see unified crawling missions by unknown agents using 1000+ "clean" IP addresses an hour.
I bet a lot of companies want to provide search results to AI agents.