The bar to ingest unstructured data into something usable was lowered, causing more people to start doing it.

Used to be you needed to implement some papers to do sentiment analysis. Reasonably high bar to entry. Now anyone can do it. The result: more people doing scraping (and with less competent scrapers, too).

reply
I would say there are a couple of aspects.

The crawlers for the big famous names in AI are all less well behaved and more voracious than, say, Googlebot. This is somewhat muddied because the companies that ran the former "good" crawlers are all also in the AI business, and some of them tried to piggyback on people having allowed or whitelisted their search-crawling User-Agent. Mostly this has settled a little, with them separating Googlebot from GoogleOther, facebookexternalhit from meta-externalagent, etc. This was an earlier "wave" of increased crawling that was obviously attributable to AI development. In some cases it's still problematic, but it is generally more manageable.
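To make the User-Agent split concrete, here is a minimal sketch of sorting requests by which self-declared crawler token they carry. The four token names are the ones mentioned above; the "established" and "split-off" labels and the function name are assumptions made up for this illustration, and since User-Agent is self-reported, this only helps with crawlers that admit who they are.

```python
# Tokens are the crawler names from the comment above. The labels are this
# sketch's own (hypothetical) grouping, not an official classification.
CRAWLER_TOKENS = {
    "googlebot": "established",
    "facebookexternalhit": "established",
    "googleother": "split-off",
    "meta-externalagent": "split-off",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the label of the first known crawler token found, else 'unknown'."""
    ua = user_agent.lower()
    for token, label in CRAWLER_TOKENS.items():
        if token in ua:
            return label
    return "unknown"
```

A site could then keep allowing the "established" group while rate limiting or blocking the "split-off" group, which is roughly what the separation makes possible.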

The other stuff, the ones that are using every User-Agent under the sun and a zillion datacenter IPs and residential IPs, rotating their requests constantly so all your naive and formerly-ok rate-based blocking is useless... that stuff is definitely being tagged as "for AI" on the basis of circumstantial evidence. But from the timing of when it seemed to start, and the amount of traffic and addresses, I don't have any problem guessing with pretty high confidence that this is AI. To your question of "who are the customers"... who's got all the money in the world sloshing around at their fingertips and could use a whole bunch of scraped pages about ~everything? Call it lazy reasoning if you'd like.
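For anyone wondering why rotation makes rate-based blocking useless, here is a rough sketch of the kind of naive per-IP sliding-window limiter I mean. The class name, the limits, and the documentation-range IP addresses are all made up for illustration; real setups usually do this in the web server or a proxy rather than in application code.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Naive limiter: at most max_requests per IP within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(limiter: PerIpRateLimiter, requests) -> int:
    """Count how many (ip, timestamp) requests get through the limiter."""
    return sum(limiter.allow(ip, t) for ip, t in requests)


# 100 requests in one second, all from a single (example) address.
single_ip = [("203.0.113.7", i * 0.01) for i in range(100)]
# The same 100 requests, each from a different (example) address.
rotating_ips = [(f"198.51.100.{i}", i * 0.01) for i in range(100)]
```

With a limit of 5 requests per 60 seconds, the single address gets 5 requests through and the rotating set gets all 100 through, because every address stays under its own limit.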

How much this traces back ultimately to the big familiar brand names vs. would-be upstarts, I don't know. But a lot of sites are blocking their crawlers that admit who they are, so would I be surprised to see that they're also paying some shady subcontractors for scrapes and don't particularly care about the methods? Not really.

reply
For whatever reason, legislation is lax right now if you claim the purpose of scraping is for AI training even for copyrighted material.

Maybe everyone is trying to take advantage of the situation before the law eventually catches up.

reply
> For whatever reason, legislation is lax right now if you claim the purpose of scraping is for AI training even for copyrighted material

I think the reason is that America and China are, for the most part, in an AI arms race combined with an AI bubble, and neither side wants to lose any perceived advantage, no matter the cost to others.

Also, there is an immense lobbying effort against senators who propose stricter AI regulation.

https://www.youtube.com/watch?v=DUfSl2fZ_E8 [What OpenAI doesn't want you to know]

It's actually a great watch. Highly recommended, because a lot of the talk about regulation does feel like smoke and mirrors to me.

reply