I doubt it's OpenAI. Maaaybe somebody who sells to OpenAI, but probably not. I think they're big enough to do this mostly in-house and properly. Before AI, only big players would want a scrape of the entire internet, and they could write quality bots, cooperate, behave themselves, etc. Now every third-tier lab wants that data and a billion startups want to sell it, so it's a wild west of bad behavior and bad implementations. They do use residential IP sets as well.

by mikepavone28 minutes ago|

As someone with a self-hosted Mercurial instance dealing with this, I will say that the big names (OpenAI included, but not exclusively them) generally at least use proper user-agents and respect robots.txt, but they are still needlessly aggressive compared to traditional search indexers.

There are also scrapers that are hiding behind normal browser user agents. When I looked at IP ranges, at least some of them seemed to be coming from data centers in China.

by reppap5 hours ago|

Stop just making up excuses for these companies. Other comments on this story have shown the bots are using OpenAI user agents and making requests from OpenAI-owned IP ranges.

by esseph17 hours ago|

The dirty secret is a lot of them come through "residential proxies", aka backdoored home routers, IoT devices with shitty security, etc. Basically the scrapers, who are often also third parties, go to these "companies" and buy access to these "residential proxies". Some are more... considerate than others.

Why? Data. Every bit of it might be valuable. And not to sound tin foil hatty, but we are getting closer to a post-quantum time (if we aren't already).

by tigerlily14 hours ago|

How can I detect if my router is backdoored, or being used as a residential proxy?

by mzajc9 hours ago|

I'm dealing with such an attack, so if you'd like, you can send me IPv4 addresses, and I'll grep my logs for them. Email address is on the website linked on my profile.

As for what you can do on your own, it really depends on your network. OpenWRT routers can run tcpdump, so you can check for suspicious connections or DNS requests, but it gets really hard to tell if you have lots of cloud-tethered devices at home. IoT, browser extensions, and smartphone applications are the usual suspects.
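A rough sketch of what sifting through that tcpdump output could look like, in Python. It assumes the usual one-line text format that `tcpdump -nn` prints for IPv4 TCP/UDP packets and a LAN prefix of `192.168.1.`; the function name `count_outbound` and the prefix are illustrative assumptions, so adjust them for your network.

```python
import re
from collections import Counter

# Matches the usual `tcpdump -nn` text line for IPv4 TCP/UDP, e.g.
# "12:00:01.000000 IP 192.168.1.10.54321 > 203.0.113.5.443: Flags [S], ..."
# Lines without ports (ARP, ICMP, IPv6) simply do not match and are skipped.
LINE_RE = re.compile(
    r"\bIP (\d{1,3}(?:\.\d{1,3}){3})\.(\d+) > (\d{1,3}(?:\.\d{1,3}){3})\.(\d+):"
)


def count_outbound(lines, lan_prefix="192.168.1."):
    """Count packets per (LAN source, remote destination, destination port).

    lan_prefix is an assumed string prefix for the local network; traffic
    that stays inside the LAN or originates outside it is ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        src, _src_port, dst, dst_port = match.groups()
        if src.startswith(lan_prefix) and not dst.startswith(lan_prefix):
            counts[(src, dst, int(dst_port))] += 1
    return counts
```

Saving a capture to a text file and passing its lines to `count_outbound(...)`, then looking at `.most_common(20)`, gives a short list of which device talks to which remote address most often. It will not tell you what is malicious, only where to start looking.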

by thesuitonym4 hours ago|

The most surefire way would be to put a device between your router and your ONT/modem to capture the packets and see what requests are being sent. It's not complicated but it IS a lot of information to sift through.

Your router may have the ability to log requests, but many don't, and even if yours does, if you're concerned the device may be compromised, how can you trust the logs?

BUT, with all that said, these attacks are typically not very sophisticated. Most of the time they're searching for routers at 192.168.1.1 with admin/admin as the login credentials. If you have anything else set, you're probably good from 97% of attackers (This number is entirely made up, but seriously that percentage is high). You can also check for security advisories on your model of router. If you find anything that allows remote access, assume you're compromised.

---

As a final note, it's more likely these days that the devices running these bots are IoT devices and web browsers with malicious JavaScript running.

by 12_throw_away5 hours ago|

> How can I detect if my router is backdoored, or being used as a residential proxy?

Aside from the obvious smoke tests (are settings changing without your knowledge? Does your router expose access logs you can check?), I'm not sure there's any general purpose way to check, but 2 things you can do are:

1. Search for your router's model number to see if it's known to be vulnerable, and replace it with a brand-new reputable one if so (and don't buy it from Amazon).

2. There are vendors out there selling "residential proxy IP databases" (e.g., [1]). No idea how good they are, but if you have a stable public IP address you could check whether you're in one.

[1] https://ipinfo.io/data/residential-proxy
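If you do get hold of such a list, the check itself is simple. Here is a minimal Python sketch, assuming the list can be reduced to plain IP addresses or CIDR ranges as strings. The actual format of any vendor's database is not something I know, and `ip_in_ranges` is just an illustrative name.

```python
import ipaddress


def ip_in_ranges(ip, ranges):
    """Return True if `ip` falls inside any entry of `ranges`.

    `ranges` is assumed to be an iterable of strings, each either a single
    address ("198.51.100.9") or a CIDR block ("203.0.113.0/24").
    """
    addr = ipaddress.ip_address(ip)
    for entry in ranges:
        # strict=False tolerates entries with host bits set, e.g. "203.0.113.7/24".
        net = ipaddress.ip_network(entry, strict=False)
        if addr.version == net.version and addr in net:
            return True
    return False
```

For example, `ip_in_ranges("203.0.113.7", ["203.0.113.0/24"])` is true, while an address outside every listed range gives false. The addresses here are documentation-only example ranges, not real proxy data.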

by kimos13 hours ago|

If it’s legit you can ask your ISP if they sell use of your hardware. Or just don’t use the provided hardware and instead BYO router or modem or media converter or whatever.

But I think what OP is implying is insecure hardware being infected by malware and access to that hardware sold as a service to disreputable actors. For that buy a good quality router and keep it up to date.

by teeklp9 hours ago|

So you don't know? Why respond?

by oblio8 hours ago|

Don't be rude.

by the_biot8 hours ago|

Has this actually been investigated and proven to be true? I see allegations, but no facts really.

It seems to me to be just as likely that people are installing LLM chatbot apps that do the occasional bit of scraping work on the sly, covered by some agreed EULA.

by Symbiote6 hours ago|

Another likely source is "free" VPN tools, or tools for streaming TV (especially football or other pay-to-view stuff). The tool can make a little money proxying requests at the same time.

I can't provide evidence as it's close to impossible to separate the AI bots using residential proxies from actual users, and their IPs are considered personal data. But as the other reply shows, it's easy enough to find people selling this service.

by esseph7 hours ago|

Seriously, go to Google.

Search for: "residential proxy" ai data scraping.

Start reading through thousands of articles.

by the_biot2 hours ago|

That's the worst thing I've seen all week. The DDoS networks of 20 years ago, now out in the open and presented as real business.

Thanks for the info, wish I didn't know :-(

by karel-3d4 hours ago|

It isn't that hard to just buy a bunch of SIM cards, put them in a modem, and use that. It's good enough as a residential proxy. Source: I did that before, when I worked on a Plaid-like thing.

by wseqyrku11 hours ago|

> this is pretty much the reason why Cloudflare even exists,

You said it yourself. If you're selling a cure, you might as well start a plague.