It might work against people who just use their Mac mini with OpenClaw to summarize the news every morning, but it certainly won't work against Google.

More centralized web ftw.

by hexage1814 8 hours ago

It also probably won't work if the person actually wants your content and is checking whether the thing they scraped actually makes sense or is just noise. None of these are new things. Site owners have been sending junk/fake data to web scrapers ever since web scraping was invented.

by otherme123 8 hours ago

In my experience, Google (among others) plays nice. Just put "User-agent: *" followed by "Disallow: /" in your robots.txt, and they won't bother you again.

My current problem is OpenAI, which scans massively while ignoring every limit, 426, 444, and whatever else you throw at it, plus botnets from East Asia that use one IP per scrape but thousands of IPs.
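For anyone who wants to sanity-check a robots.txt before relying on it, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, and the example.com URL are assumptions made up for illustration. It only shows what a well-behaved crawler would conclude; it says nothing about crawlers that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that allows everything (empty Disallow).
ALLOW_ALL = """\
User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print(is_allowed(BLOCK_ALL, "Googlebot", url))
    print(is_allowed(ALLOW_ALL, "Googlebot", url))
```

The first call should report that the fetch is not allowed and the second that it is, since an empty `Disallow:` places no restriction.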

by LaGrange 8 hours ago

> It might work against people who just use their Mac mini with OpenClaw to summarize the news every morning,

Good enough for me.

> More centralized web ftw.

This ain't got anything to do with "centralized web." This kind of epistemological vandalism can't be shunned enough.

by sd9 10 hours ago

Even if it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make any material difference. I'm tired.

by 20k 10 hours ago

I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and it's very difficult to filter it by hand.

by lucasfin000 8 hours ago

The asymmetry is what makes this very interesting. The cost to inject poison is basically zero for the site owner, but the cost to detect and filter it at scale is significant for the scraper. That math gets a lot worse for them as more sites adopt it. It doesn't solve the problem, but it changes the economics.

by xyzal 7 hours ago

About two years ago, I made up a reference to a nonexistent Python library and put code "using" it in just 5 GitHub repos. Several months later the free ChatGPT picked it up. So IMO it works.

by logicprog 7 hours ago

Via websearch? Or training?

by spiderfarmer 8 hours ago

There are hundreds of bots using residential proxies. That is not free. Make them pay.

by bediger4000 7 hours ago

The search engine crawlers are sophisticated enough, but Meta's are not. Neither is Anthropic's Claude crawler. Source: personal experience trying garbage generators on Yandex's, Blexbot's, Meta's, and Anthropic's crawlers.

I'm completely uncertain that the unsophisticated garbage I generated makes any difference, much less "poisons" the LLMs. A fellow can dream, can't he?

by m00dy 9 hours ago

It won't work, especially on Gemini. Googlebot is very experienced when it comes to crawling. It might work for OpenAI and others, maybe.

by nubg 10 hours ago

What kind of mitigations? How would you detect the poison fountain?

by avereveard 9 hours ago

style="display: none;" aria-hidden="true" tabindex="1"

Many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soup.
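To make that concrete, here is a rough sketch of the kind of check a scraper could run, written with Python's standard `html.parser`. The class and function names are made up for illustration, and it only looks at inline attributes on the `<a>` tag itself. It would miss links hidden through CSS classes, external stylesheets, or hidden parent elements, so treat it as an assumption-laden toy, not what any real crawler does.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs from <a> tags, separating ones marked as hidden."""

    def __init__(self) -> None:
        super().__init__()
        self.visible_links: list[str] = []
        self.hidden_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        attr = {name: (value or "") for name, value in attrs}
        href = attr.get("href")
        if not href:
            return
        style = attr.get("style", "").replace(" ", "").lower()
        hidden = (
            "display:none" in style
            or attr.get("aria-hidden", "").lower() == "true"
        )
        (self.hidden_links if hidden else self.visible_links).append(href)


def split_links(html: str) -> tuple[list[str], list[str]]:
    """Return (visible_links, hidden_links) found in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.visible_links, collector.hidden_links


if __name__ == "__main__":
    page = (
        '<a href="/real">article</a>'
        '<a href="/trap" style="display: none;" aria-hidden="true" '
        'tabindex="1">more</a>'
    )
    print(split_links(page))
```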

by m00dy 9 hours ago

Google will give your website a penalty for doing this.

by phplovesong 5 hours ago

You don't have to use this. You can keep it visible but hide it from humans with other easy tricks.

by cuu508 5 hours ago

Scrapers can work around those other easy tricks too.

by GaggiX 9 hours ago

Because the internet is noisy and not up to date, all recent LLMs are trained using Reinforcement Learning with Verifiable Rewards. If a model has learned the wrong signature of a function, for example, it would be apparent when executing the code.
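As a toy illustration of what "verifiable" could mean here: a reward that simply executes a generated snippet against the real function and scores it 1.0 if it runs and 0.0 if it raises. Everything below (`resize`, `verifiable_reward`, the snippets) is hypothetical and not how any particular lab's pipeline works. A real setup would run untrusted code in a sandbox and check outputs too, not just whether it ran.

```python
def resize(width, height, keep_aspect=True):
    """Stand-in for a real library function with a fixed signature."""
    return (width, height, keep_aspect)


def verifiable_reward(snippet: str) -> float:
    """Return 1.0 if the snippet runs against the real function, else 0.0.

    WARNING: exec() on untrusted text is unsafe; this is a sketch only.
    """
    namespace = {"resize": resize}
    try:
        exec(snippet, namespace)
    except Exception:
        # Wrong signatures surface as TypeError; bad syntax as SyntaxError.
        return 0.0
    return 1.0


if __name__ == "__main__":
    # Correct call: matches the real signature.
    print(verifiable_reward("resize(640, 480)"))
    # Poisoned call: uses a keyword argument that does not exist.
    print(verifiable_reward("resize(640, 480, quality=90)"))
```

A snippet that learned a made-up keyword argument from poisoned data fails with a `TypeError` here, which is the signal the comment is pointing at.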

by phoronixrly 9 hours ago

It does work, on two levels:

1. Simple, cheap, easy-to-detect bots will scrape the poison, and feed links to expensive-to-run browser-based bots that you can't detect in any other way.

2. Once you see a browser visit a bullshit link, you insta-ban it, as you can now see that it is a bot because it has been poisoned with the bullshit data.

My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.
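The two-step trap above could be sketched roughly like this. It is a minimal in-memory version with a hypothetical `/trap/` prefix that only appears in links on the poisoned pages. The class and names are assumptions for illustration and say nothing about how iocaine works internally.

```python
# Hypothetical URL prefix that is only ever linked from poisoned pages,
# so no human visitor should request it.
TRAP_PREFIX = "/trap/"


class HoneypotBanList:
    """Ban any client that requests a trap URL (in-memory sketch)."""

    def __init__(self, trap_prefix: str = TRAP_PREFIX) -> None:
        self.trap_prefix = trap_prefix
        self.banned: set[str] = set()

    def should_serve(self, client_ip: str, path: str) -> bool:
        """Return False for banned clients; ban on first trap request."""
        if client_ip in self.banned:
            return False
        if path.startswith(self.trap_prefix):
            self.banned.add(client_ip)
            return False
        return True


if __name__ == "__main__":
    bans = HoneypotBanList()
    print(bans.should_serve("203.0.113.7", "/blog/post"))  # normal request
    print(bans.should_serve("203.0.113.7", "/trap/abc"))   # follows poison link
    print(bans.should_serve("203.0.113.7", "/blog/post"))  # now banned
```

Banning by IP alone is the weak point of this sketch. Against a botnet using one IP per request, you would presumably need to key on something else, such as a session or fingerprint.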