
I agree it's more than a bit handwavy. The common consensus seems to be that AI companies are driving this, but it's really hard to conclusively prove who or what is behind the attacks.

Weird part #1 is that the traffic isn't for the most part shaped like crawler traffic. It's incredibly bursty, and heavily redundant, missing even the most obvious low hanging fruit optimizations.

Could be someone is using residential proxies to wrap AI agents' web traffic, but even so, there are a lot of pieces that don't really make sense, like why the traffic pattern is like being hit by a shotgun. It isn't just one request per URL, but anywhere between 40 and 100 redundant requests.

A popular theory is that this is because of sloppy coding, AI companies are too rich to care, but then again that doesn't really add up. This isn't just a minor inefficiency. If it is "just" bad coding, they stand to gain monumental efficiency improvements by fixing the issues: they would get the data much faster, which is a clear competitive edge.

Really weird.

My unsubstantiated guess is that the residential proxy/botnet is very unreliable, and that's why they fire so many requests. Makes sense if it's sold as a service.

by gamesieve 2 hours ago

I suspect the redundant requests are primarily designed to weed out poisoned data served on otherwise valid URLs. I've also seen the redundant requests increase massively the more sources I blocked at the firewall level, so it feels like they're pre-emptively overcompensating for some percentage of requests being blocked.

My website contains ~6000 unique data points in effectively infinite combinations on effectively infinite pages. Some of those combinations are useful for humans, but the AI-scrapers could gain a near-infinite efficiency improvement by just identifying as a bot and heeding my robots.txt and/or rel="nofollow" hints to access the ~500 top level pages which contain close to everything which is unique. They just don't care. All their efficiency attempts are directed solely toward bypassing blocks. (Today I saw them varying the numbers in their user agent strings: X15 rather than X11, Chrome/532 rather than Chrome/132, and so on...)
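The mangled user agent strings mentioned above (X15, Chrome/532) can in principle be flagged with a simple plausibility check. What follows is only a rough sketch: the function name `suspicious_ua` and the `max_chrome_major` ceiling are made up for illustration, the ceiling goes stale as Chrome releases new versions, and a determined scraper will of course just send a plausible string instead.

```python
import re

def suspicious_ua(ua: str, max_chrome_major: int = 150) -> bool:
    """Heuristic: flag user agents with implausible platform or version tokens.

    max_chrome_major is an assumed ceiling, not an authoritative value;
    it has to be raised as real Chrome versions catch up with it.
    """
    # Real Linux desktop user agents say "X11"; anything like "X15" is made up.
    x_tokens = re.findall(r"\bX(\d+)\b", ua)
    if any(token != "11" for token in x_tokens):
        return True

    # A Chrome major version far beyond anything released is also a giveaway.
    match = re.search(r"\bChrome/(\d+)", ua)
    if match and int(match.group(1)) > max_chrome_major:
        return True

    return False
```

This only catches the lazy variations; it says nothing about who is behind the traffic.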

by oasisbob 2 hours ago

> A popular theory is that this is because of sloppy coding, AI companies are too rich to care, but then again that doesn't really add up

I can substantiate this a bit. Verified traffic from Amazonbot is too dumb to do anything with 429s. They will happily slam your site with more traffic than you can handle, and will completely ignore the fact that over half the responses are useless rate limits.
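For contrast, this is roughly what "doing something with 429s" would look like on the crawler side. It is a minimal sketch, not anyone's actual implementation: `next_delay` and its `base`/`cap` defaults are hypothetical, and it only understands the delta-seconds form of the `Retry-After` header (the HTTP-date form falls through to plain exponential backoff).

```python
from typing import Optional

def next_delay(status: int, retry_after: Optional[str], attempt: int,
               base: float = 1.0, cap: float = 300.0) -> float:
    """Return how many seconds to wait before hitting the same host again.

    status      -- HTTP status code of the last response
    retry_after -- raw Retry-After header value, or None if absent
    attempt     -- number of consecutive throttled responses so far (0-based)
    """
    # Only back off on explicit throttling / overload responses.
    if status not in (429, 503):
        return 0.0

    # Prefer the server's own hint when it is given in whole seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return min(float(retry_after.strip()), cap)

    # Otherwise fall back to capped exponential backoff.
    return min(base * (2 ** attempt), cap)
```

A crawler that slept for `next_delay(...)` seconds after each response would stop wasting half its requests on rate-limit errors.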

They say they honor REP (the Robots Exclusion Protocol), but Amazonbot will still hit you pretty persistently even with a full disallow directive in robots.txt.
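To be concrete about what honoring a full disallow means, here is a small sketch using Python's standard `urllib.robotparser`. The helper `allowed` and the `FULL_DISALLOW` constant are illustrative names, and the URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A "full disallow" robots.txt: every agent, every path.
FULL_DISALLOW = "User-agent: *\nDisallow: /\n"

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network fetch is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With `FULL_DISALLOW`, `allowed(FULL_DISALLOW, "Amazonbot", "https://example.com/page")` is `False`, so a compliant crawler should not request anything beyond robots.txt itself.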

by marginalia_nu 2 hours ago

How do you know it's Amazonbot?

by oasisbob 2 hours ago

User Agent, SWIPed IP space, and the PTR records resolving to an Amazon-controlled crawl zone.
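The PTR part of that check is usually done as forward-confirmed reverse DNS: resolve the IP to a hostname, confirm the hostname sits under the operator's crawl zone, then resolve the hostname back and confirm it returns the same IP. Here is a hedged sketch; `is_verified_crawler` is a made-up name, `allowed_suffix` has to come from the operator's own documentation, and the resolvers are injectable so the logic can be exercised without DNS. It does not cover the SWIP/WHOIS side, and the default forward lookup is IPv4-only.

```python
import socket
from typing import Callable, List

def default_reverse(ip: str) -> str:
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]

def default_forward(host: str) -> List[str]:
    # A lookup: hostname -> list of IPv4 addresses.
    return socket.gethostbyname_ex(host)[2]

def is_verified_crawler(ip: str, allowed_suffix: str,
                        reverse: Callable[[str], str] = default_reverse,
                        forward: Callable[[str], List[str]] = default_forward) -> bool:
    """Forward-confirmed reverse DNS check against an expected crawl zone."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror / socket.gaierror are OSError subclasses
        return False

    # The PTR name must be the zone itself or a subdomain of it.
    suffix = allowed_suffix.lower().strip(".")
    if not (host == suffix or host.endswith("." + suffix)):
        return False

    # Anyone can set a PTR record, so confirm the name maps back to the IP.
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The forward confirmation is the important step, since a PTR record alone is controlled by whoever owns the IP block.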

by oasisbob 3 hours ago

I want more data too.

The root sources of the traffic from residential proxies get murky very quickly.

It's easy to follow the chain partway for some traffic, e.g. "Why are we receiving all this traffic from Digital Ocean? ... oh, it's their hero client Firecrawl, using a deceptive User-Agent" ... but it still leaves the obvious question about who the Firecrawl client is.

Res proxy traffic is insane these days. There are also plenty of grey-market snowshoe IPs available for the right price, from a handful of ASNs. I regularly see unified crawling missions by unknown agents using 1000+ "clean" IP addresses an hour.

by ghywertelling 4 hours ago

https://parallel.ai/

I bet a lot of companies want to provide search results to AI agents.