Miasma: A tool to trap AI web scrapers in an endless poison pit

(github.com)

243 points

by LucidLynx 10 hours ago

by bobosola 5 hours ago

I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.

Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say: "We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all."

So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.

[0] https://developers.google.com/search/docs/essentials/spam-po...

by trinsic2 5 hours ago

>I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced

If you are automating it, I don't see why not. Kitboga, a YouTuber, kept scam callers in AI call-center loops, tying up their resources so they can't use them on unsuspecting victims. [0]

That's a guerrilla tactic, similar to warfare: when you steal resources from an enemy, you get stronger and they get weaker. It's pretty effective.

[0]: https://www.youtube.com/watch?v=ZDpo_o7dR8c

by phplovesong 3 hours ago

Pretty easy. Get a paid number and have the phone scammers / marketers call that. I know a guy who made a decent side hustle from this. The marketers slowly blocked his number though; not sure if he still has this thing going on, as it was more of an experiment.

by yareally 2 hours ago

Was he picking up the phone and telling them to call him back on the other number?

by bdangubic 4 hours ago

more and more scammers are automating their side as well so soon the loop will be just bots talking to bots

by rogerrogerr 2 hours ago

> gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.

It’s one of the best time investments I’ve ever made. They just don’t call me anymore.

I think they have two lists: the “do not call” list, and the “unprofitable to call” list. You want to be on the latter list.

by ordu 2 hours ago

> it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.

In the 2000s there was a company in Russia selling English courses. It spammed so much that people were really pissed off. To make a long story short, the company disappeared from the public space when Golden Telecom joined the party of retaliatory "spam" calls and had a computer call the company using the Golden Telecom modem pool.

So, yeah, you kinda can achieve something this way, but to be sure of it you should lease a modem pool.

by xyzal 5 hours ago

One would assume legit spiders obey robots.txt.

by lolc 4 hours ago

This, to me, is the strongest argument to offer these slop generators. It provides an incentive to follow the robots.txt.

by bugfix 3 hours ago

I really don't get it. Wouldn't you be wasting a lot of resources feeding the bots like this?

by chongli 5 hours ago

> Also, inserting hidden or misleading links is specifically a no-no for Google Search [0]

Depending on your goals, this may be a pro or a con. I, personally, would like to see a return of "small web" human-centric communities. If there were tools that include anti-scraping, anti-Google (and other large search crawlers) as well as a small web search index for humans to find these sites, this idea becomes a real possibility.

by maxrmk 3 hours ago

It’s easy to opt out of being indexed by Google.

by cdrini 3 hours ago

Exactly. Identifying crawlers like Google and Bing aren't the issue. They obey robots.txt and can easily be blocked by user agent checks. Non-identifying crawlers, which present humanlike user agents and are usually distributed so that they get around IP-based rate limits, are the ones that are challenging to deal with.
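For anyone who hasn't looked at what "obeying robots.txt" amounts to on the crawler side, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The bot name `ExampleBot`, the paths, and the rules are all made up for illustration; a real crawler would fetch the file from the site rather than embed it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, embedded so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /trap/

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
```

A crawler that identifies itself would call `may_fetch(parser, "ExampleBot", url)` before every request and skip anything disallowed. Nothing enforces this, which is the point being made: it only works for crawlers that choose to ask.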

by iririririr 4 hours ago

yes, it works.

phone scammers have a very high personnel cost, which is why some resort to human trafficking.

if everyone picked up the phone and wasted a few seconds, it would be enough to make their whole enterprise worthless. but since most people who would not fall for it hang up right away, they have the best ROI of any industry. they don't even pay for the first seconds of a call.

by throw10920 2 hours ago

> I’m not convinced.

Is this how low we've sunk - that even below taking a single personal anecdote and generalizing it to everything - now we're taking zero experience and dismissing things based on vibes?

I've seen lots of LLM-slop-lovers doing the same thing. Maybe it's a pattern.

by phplovesong 3 hours ago

Who TF cares about Google? This is mostly for personal tech stuff (just the stuff AI steals for training). I'd say it's pretty welcome that it is not shown in Google results.

by tasuki 7 hours ago

> If you have a public website, they are already stealing your work.

I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!

by margalabargala 5 hours ago

The problem I have is that they hammer my site so hard they take it down.

The content is for everyone. They can have it. Just don't also take it away from everybody else.

by ethmarks 4 hours ago

Unintentional denial-of-service attacks from AI scrapers are definitely a problem, I just don't know if "theft" is the right way to classify them. They shouldn't get lumped in with intellectual property concerns, which are a different matter. AI scrapers are a tragedy of the commons problem kind of like Kessler syndrome: a few bad actors can ruin low Earth orbit for everyone via space pollution, which is definitely a problem, but saying that they "stole" LEO from humanity doesn't feel like the right terminology. Maybe the problem with AI scrapers could be better described as "bandwidth pollution" or "network overfishing" or something.

by oasisbob 1 hour ago

Theft isn't far off, it seems closer to me than using the word for IP violations.

When a crawler aggressively crawls your site, they're permanently depriving you of the use of those resources for their intended purpose. Arguably, it looks a lot like conversion.

by margalabargala 4 hours ago

Yes I completely agree.

by FeepingCreature 4 hours ago

you're totally right about not being theft, but we have a term. you used it yourself, "distributed denial of service". that's all it is. these crawlers should be kicked off the internet for abuse. people should contact the isp of origin.

by ethmarks 4 hours ago

Firstly, since this argument is about semantic pedantry anyways, it's just denial-of-service, not distributed denial-of-service. AI scraper requests come from centralized servers, not a botnet.

Secondly, denial-of-service implies intentionality and malice that I don't think is present from AI scrapers. They cause huge problems, but only as a negligent byproduct of other goals. I think that the tragedy of the commons framing is more accurate.

EDIT: my first point was arguably incorrect because some scrapers do use decentralized infrastructure and my second point was clearly incorrect because "denial-of-service" describes the effect, not the intention. I retract both points and apologize.

by goodmythical 49 minutes ago

ah, no fun, I was going to continue the semantic deconstruction with a whole bunch of technicalities about how you're not quite precisely accurate and you gotta go do the right thing and retract your statements.

boo. took all the fun out of it ;)

by FeepingCreature 3 hours ago

Sufficiently advanced negligence is indistinguishable from malice. There is a point you no longer gain anything from treating them differently.

by cdrini 3 hours ago

The first is incorrect; these scrapers are usually distributed across many IPs, in my experience. I usually refer to them as "distributed, non-identifying crawlers (DNCs)" when I want to be maximally explicit. (The worst I've seen is some crawler/botnet making exactly one request per IP -_-)

by aduwah 3 hours ago

I think the second is incorrect too. DDoS is a DDoS no matter what the intent is.

by pmlnr 3 hours ago

Been there recently. Rate limit on nginx and anti-syn flood on pf solved it.
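As a rough illustration of what per-IP rate limiting does, here is a small token-bucket sketch in Python. This is not nginx's implementation (nginx documents `limit_req` as a leaky bucket; the token bucket is a close relative swapped in here for brevity), and the class name, `rate`, and `burst` values are assumptions chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class PerIpLimiter:
    """Token bucket per client IP: `rate` tokens/second, at most `burst` stored."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the behaviour can be checked without sleeping
        self.buckets: Dict[str, Tuple[float, float]] = {}  # ip -> (tokens, last_seen)

    def allow(self, ip: str) -> bool:
        """Spend one token for this IP if available; otherwise reject the request."""
        now = self.clock()
        tokens, last_seen = self.buckets.get(ip, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last_seen) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = (tokens - 1.0, now)
            return True
        self.buckets[ip] = (tokens, now)
        return False
```

Because the bucket is keyed by IP address, a limiter like this does little against traffic that arrives from a very large number of addresses with only a request or two each.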

by spiderfarmer 1 hour ago

I'm being hit with 300 req/s 24/7 from hundreds of thousands of unique IPs from residential proxies. I can't rate limit any further without hurting the real users.

by oasisbob 58 minutes ago

Yeah, IP-based rate limits are nearly ineffective these days.

by coldpie 5 hours ago

I agree theft isn't a good analogy, but there is something similar going on. I put my words out into the world as a form of sharing. I enjoy reading things others write and share freely, so I write so others might enjoy the things I write. But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet. They are using my work in a way I don't want it to be used. It makes me not want to share anymore.

by gruez 5 hours ago

>but there is something similar going on [...]

No, what you're basically describing is "I shared something but then I didn't like how it ended up being used". If you put stuff out in public for anyone to use, then find out it's used in a way you don't like, it's your right to stop sharing, but it's not "similar" to stealing beyond "I hate stealing"

by Hendrikto 5 hours ago

> If you put stuff out in public for anyone to use, then find out it's used in a way you don't like

Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court, and for which they already had to pay billions in fines.

Just because something is publicly accessible, that does not mean everybody is entitled to abuse it for everything they see fit.

by gruez 5 hours ago

>Nope. Copyright is a thing, licenses are a thing. Both are completely ignored by LLM companies, which was already proven in court,

...the same courts that ruled that AI training is probably fair use? Fair use trumps whatever restrictions an author puts in their "licenses". If you're an author and it turned out that your book was pirated by AI companies, then fair enough, but "I put my words out into the world as a form of sharing" strongly implied that's not what was happening, e.g. it was a blog on the open internet or something.

by FromTheFirstIn 4 hours ago

I never understand why anyone wants authors to not be able to enforce copyright and licensing laws for AI training. Unless you are Anthropic or OAI it seems like a wild stance to have. It’s good when people are rewarded for works that other people value. If trainers don’t value the work, they shouldn’t train on it. If they do, they should pay for it.

by FeepingCreature 4 hours ago

My own view is, I thought we were all agreed that the idea that Microsoft can restrict Wine from even using ideas from Windows, such that people who have read the leaked Windows source cannot contribute to Wine, was a horrible abuse of the legal system that we only went along with under duress? Now when it's our data being used, or more cynically when there's money to be made, suddenly everyone is a copyright maximalist.

No. Reading something, learning from it, then writing something similar, is legal; and more importantly, it is moral. There is no violation here. Copyright holders already have plenty of power; they must not be given the power to restrict the output of your brain forever more for merely having read and learnt. Reading and learning is sacred. Just as importantly, it's the entire damn basis of our profession!

If you do not want people to read and learn from your content, do not put it on the web.

by FromTheFirstIn 1 hour ago

If you want people to read and learn from each other, you should incentivize people to make content worth reading and learning from. Making LLM training a viable loophole for copyright law means there won’t be incentives to produce such work.

by goodmythical 26 minutes ago

I don't think that's the case.

People getting better at writing is only going to increase the quality of the output.

Increasing both competition and tooling (by providing every writer with the world's greatest encyclopedia/thesaurus/line-editor/brainstormer/planner/etc) is only going to make writers better.

Will there be lots of people who misuse the system? Are there lots of people who use thesaurus words without knowing what they're talking about? Can't you tell the difference?

I see in LLMs a lowering of the ground floor making it easier for people to get in. This will increase the total availability of content.

I also see in LLMs a raising of the top bar making it harder to be the best. If more people are writing and more people are trying to be the best, the best is going to get better.

Consider chess. Have we suddenly stopped playing chess now that a phone can beat 95+% of people? No. The market is stronger than ever and still growing. The greatest players in the world use the chess algorithms to refine their play, and the play keeps expanding in new and interesting ways.

In both writing and chess, yes, there is an explosion of low and middling play. But since when have we not always had people producing content and playing chess that when compared to the masters of the field is generally viewed as substandard?

But here's the kicker. Some people's favorite genre is badly edited fanfic. Some people genuinely derive actual pleasure from things that you or I might call garbage. And what's wrong with that? Who am I to say that you can't love klutzy firecop loves suburban housewife paperbacks? Or Zelda/Harry Potter crossfics or whatever.

by FromTheFirstIn 1 hour ago

Re-reading your comment, I think we’re both generally anti-corporate-fuckery. I view the current batch of copyright pearl clutching to be an argument about if VCs are allowed to steal books to make their chatbots worth talking to, and the Wine/MSoft debate about if it should be legal to engage in anticompetitive behavior by restrictive use of copyright. In both of these cases the root of the issue isn’t really the copyright as an abstract- it’s the bludgeoning of the person with less money by use of overwhelming legal costs to have a day in court.

by gruez 4 hours ago

>I never understand why anyone wants authors to not be able to enforce copyright and licensing laws for AI training.

Fair use is part of "copyright and licensing laws".

by grumbelbart 3 hours ago

Would using an actor's face and voice as training data be fair use?

What if the model then creates a virtual actor that is very close to the real actor?

by gruez 3 hours ago

>What if the model then creates a virtual actor that is very close to the real actor?

"Likeness" is a separate concept from copyrights

https://en.wikipedia.org/wiki/Personality_rights

by hparadiz 1 hour ago

I wish I lived in the alternative timeline where open source folks didn't look a gift horse in the mouth and actually used these tools to copyleft the shit out of software to the point where proprietary closed source software has no advantage.

But instead we've got people posting "honey pots" that an LLM will immediately detect and route around.

by goodmythical 22 minutes ago

I bet we'd cure all cancers in a month if everyone whining about slop actually went and did something about it.

by Lerc 1 hour ago

It sounds like you wanted to believe you were sharing freely while sharing conditionally.

by tasuki 5 hours ago

> But now the things I write and share freely are being used to put money in the bank accounts of the worst people on the planet.

I don't think that's the case. I'm not even arguing they aren't the worst people on the planet - might as well be. But all I see them doing is burning money all over the place.

by FromTheFirstIn 5 hours ago

They’re getting the money to burn, though

by kmeisthax 4 hours ago

If you want a good analogy, try the enclosure of the commons in the British countryside. Communally managed grasslands were destroyed by noblemen with massive herds of cattle overgrazing the land, kickstarting a land grab that effectively forced people to enclose or be left behind themselves. Property is a virus that destroys all other forms of allocation.

by kseniamorph 2 hours ago

> nothing but thieves!

cool band btw

by spiderfarmer 6 hours ago

If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?

by drfloyd51 6 hours ago

Odd thing about cookies… they disappear after one serving.

Websites are an endless stream of cookies.

The analogy doesn’t hold.

by ghywertelling 6 hours ago

If copying content from one hard drive to another is theft, then so is DNA copying itself.

Everything is a Remix culture. We should promote remix culture rather than hamper it.

Everything is a Remix (Original Series) https://youtu.be/nJPERZDfyWc

by subscribed 2 hours ago

Fine.

Me and my 9 friends stand around the cookie-serving person blocking everyone else.

It's taking all the cookies over a period of time.

The analogy was good.

by GeoAtreides 5 hours ago

how about this analogy: I created a most tasty cookie recipe. I give it out for free, and all copies have my name because I am a vain person who likes to be known far and wide as the best baking chef ever. Is it ok to get the recipe, remove my name, and write in LLM-Codex as the creator? again, i'm ok with giving the recipe for free, i just want my name out there.

by gruez 4 hours ago

>Is it ok to get the recipe, remove my name, and write in LLM-Codex as the creator? again, i'm ok with giving the recipe for free, i just want my name out there.

From a legal perspective, it's a pretty clear "no". The instructions in recipes aren't copyrightable. The moral question is more ambiguous, but it's still pretty weak. Most recipes are uncredited, and it's unclear why someone can force everyone to attribute the recipe to them when all they realistically did was tweak the dish a bit. In the example above, I doubt you invented cookies.

by GeoAtreides 3 hours ago

i'm curious, do you honestly think the argument was about recipes and cookies? maybe it was an analogy? looking back up the comment tree, it does seem to be an analogy, not a discussion about ACTUAL cookies and ACTUAL recipes.

by gruez 3 hours ago

>maybe it was an analogy?

In that case it's a terrible analogy because if you can't get people to agree on the cookies case, what hope do you have to extend it to the case you're trying to apply the analogy to? It's like saying "You wouldn't pirate a movie, why would you pirate a blog post", because most people would pirate movies.

by GeoAtreides 2 hours ago

oh man.

my comment was about the very human need to be recognized for something created, made, or thought by a person. People are ok with writing blog posts, they're ok with writing software, and they're ok with giving it all away for free, but they want their name attached and their contribution recognized.

by gruez 2 hours ago

>my comment was about the very human need to be recognized for something created, made, or thought by a person.

And I specifically addressed that aspect:

>The moral question is more ambiguous, but it's still pretty weak. Most recipes are uncredited, and it's unclear why someone can force everyone to attribute the recipe to them when all they realistically did was tweak the dish a bit. In the example above, I doubt you invented cookies.

The cookies analogy was terrible because recipes are rarely credited, but even ignoring the terrible analogy the "recognition" argument still fails. If you wrote a blog post on how to set up kubernetes (or whatever), then it's fair enough that you get recognized for that specific blog post. If my friend asked me how to set up kubernetes, it wouldn't be cool for me to copy paste your blog post and send it over.

However similar to copyright, the recognition you deserve quickly drops off once it moves beyond that specific work. If I absorbed the knowledge from your blog post, then wrote another guide on setting up kubernetes, perhaps updated for my use case, it's unreasonable to require that you be credited. It might be nice, and often times people do, but it's also unreasonable if you wrote an angry letter demanding that you be credited. You weren't the inventor of kubernetes, and you probably got your knowledge of kubernetes from elsewhere (eg. the docs the creators made), so why should everyone have to credit you in perpetuity?

by GeoAtreides 2 hours ago

your ability to not address my argument's main point is something to behold. can't tell if you're doing it on purpose or not.

if humans read my blog posts and then write things without credit that would be fine. i like human eyeballs and i like them on my content. that's exactly the purpose of the blog post (_in this particular example_), to get human eyeballs on the content.

by gruez 2 hours ago

>your ability to not address my argument's main point is something to behold. can't tell if you're doing it on purpose or not.

Or maybe you're just terrible at writing.

>if humans read my blog posts and then write things without credit that would be fine.

I'm not sure how I (or anyone) was supposed to come away with this conclusion when you were writing stuff like:

"i'm ok with giving the recipe for free, i just want my name out there"

"the very human need to be recognized for something created"

"they want their name attached and their contribution recognized".

by GeoAtreides 1 hour ago

there is nothing contradictory in what i said, and if you weren't favoring a very literal interpretation of my argument you would agree.

but, in the spirit of critical reading education, what i meant is: human attention good, machine ingestion bad.

by z3c0 6 hours ago

Digital information may be our first post-scarce resource. It's interesting, and sad, to see so many attempt to fit it within scarcity-based economic models.

by Terretta 5 hours ago

> digital information may be our first post-scarce resource

… browses memory and storage prices on NewEgg …

Hmm.

But the word digital is distracting us.

The word information is the important one. The question isn't where information goes. It's where information comes from.

Is new information post scarcity?

Can it ever be?

by lou1306 4 hours ago

Bandwidth and compute constraints make websites anything but an endless stream though.

by spiderfarmer 1 hour ago

That's exactly it. It costs me real time and money to serve the 97% of fake traffic that just takes without giving me anything in return.

by throwaway613746 6 hours ago

[dead]

by bengale 6 hours ago

It’s interesting to see twists on the old anti-piracy arguments recycled for anti-ai.

by gruez 5 hours ago

Turns out many (most?) people on the internet were never anti-copyright in the first place. They were just anti-copyright (or at least, refused to challenge the anti-copyright people) because they wanted free movies and/or hated corporations.

by subscribed 2 hours ago

Many of these people live in countries where downloading for personal use is lawful, since they pay a copyright levy precisely to cover that.

They don't have to hate the copyright.

by falcor84 6 hours ago

That really depends, but the quick answer is that according to our human social contract, we'd just ask "how many can I take?". Until now, the only real tool to limit scrapers has been throttling, but I don't see any reason for there not to be a similar conversational social contract between machines.

by volemo 6 hours ago

Isn’t robots.txt such a “social contract between machines”? But AI scrapers couldn’t care less.

by GaggiX 6 hours ago

I will copy the supermarket and paste it somewhere else.

I'm also going to download a car.

by Bender 2 hours ago

> If someone hands out cookies in the supermarket, are you allowed to grab everything and leave?

Depends on the trust level of the society where the store resides.

The internet is a cesspool of vagrants, thieves, mentally unstable, people and software with no impulse control, pirates and that is just talking about corporations. It gets so much worse with individuals.

by pbasista 6 hours ago

This is a dishonest analogy. In your example, there is only a limited number of cookies available, while there is no practical limit on the number of times a given piece of digital media can be viewed.

You are allowed to take one cookie. But you are allowed to view a public website multiple times if you so want.

by spiderfarmer 5 hours ago

Multiple AI scrapers are downloading every page of my 6M page website as we speak. They don’t care about the fact that I have dedicated 20 years to building it, nor that I have to maintain multiple VPSes just to serve it to them.

If I can poison them and their families, I will.

by joquarky 2 hours ago

> If I can poison them and their families, I will.

Don't post anything online that you don't want to be brought up in court later.

by spiderfarmer 1 hour ago

Like the OP's solution it was about scrapers and the models they share their data with.

by ImPostingOnHN 4 hours ago

Wow, how did you manually hand-write 6 million web pages? That is impressive. It would take me a while to even monotonically count that high.

by subscribed 2 hours ago

You're trying to use a quite unfunny "sarcasm" to move the goalpost to the strawman (they never claimed they handcrafted these pages) and quickly gloss over the fact that it's 20 years of work, so why not?

by throwaway613746 6 hours ago

[dead]

by hollow-moe 6 hours ago

There sure is a limit on the load that the server you're DDoSing can take, and on the will of people to post new, worthy content in public. The supply is limited, just not at the first degree. Let's make a small edit: are you allowed to take all the cookies and then sell them with a small ribbon with your name on it?

by spiderfarmer 5 hours ago

There is no arguing with pirates. They’ll take what’s yours and forget about you while you tend to the ashes.

by CrzyLngPwd 5 hours ago

Way back in the day I had a software product, with a basic system to prevent unauthorised sharing, since there was a small charge for it.

Every time I released an update, a new crack would appear. For the next six months I worked on improving the anti-copying code until I stumbled across an article by a coder in the same boat as me.

He realised he was now playing a game with some other coders where he made the copy protection better, but the cracker would then have fun cracking it. It was a game of whack-a-mole.

I removed the copy protection, as he did, and got back to my primary role of serving good software to my customers.

I feel like trying to prevent AI bots, or any bots, from crawling a public web service, is a similar game of whack-a-mole, but one where you may also end up damaging your service.

by Cpoll 4 hours ago

> the cracker would then have fun cracking it.

I wonder if you could've won by making the cracking boring. No new techniques, bare minimum changes to require compiling a new crack, and just enough to make it difficult to automate. I.e. turn the cracking into a job.

But in reality, there are other community-driven motivations to put out cracks.

by gruez 4 hours ago

>No new techniques, bare minimum changes to require compiling a new crack, and just enough to make it difficult to automate.

From a practical perspective you also have to have a steady stream of features for the newer versions to be worth cracking. Otherwise why use v1.09 when v1.01 works fine? Moreover spending less effort into improving the DRM is still playing at the cat and mouse game, albeit with less time investment. If you're making minimal changes, the cracker also has to spend minimal time updating the crack.

by joquarky 2 hours ago

So many problems could be solved by letting go.

Unfortunately, social media and snowballing copyright maximalism have inflated egos to the point where more and more people think they need to control everything.

by CrzyLngPwd 1 hour ago

If only I could go back in time 26 years and let myself know I was right to focus on my customers.

by aldousd666 6 hours ago

This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped. The bottom has always been threatening to fall out of the ads-for-eyeballs market, and nobody could anticipate the trigger for the downfall. Looks like we found it.

by subscribed 2 hours ago

So, if at the end of the day instead of clicking EVERY single link in the repository they just check it out and parse locally...... I would consider it a win.

by aldousd666 6 hours ago

To be clear, I mean AI is going to be the downfall of ad supported content. But let's face it. We have link farms and spam factories as a result of the ad supported content market. I think this is going to eventually do justice for users because it puts a premium on content quality that someone will want to pay a direct licensing fee to scrape for your AI bots as opposed to tricking somebody into clicking on a link and looking at an impression for something they won't buy.

by johneth 6 hours ago

> This is ultimately just going to give them training material for how to avoid this crap.

> The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped.

So we should all just do nothing and accept the inevitable?

reply

upvote

by ninjagoo5 hours ago|

[-]

> So we should all just do nothing and accept the inevitable?

I daresay rate-limiting will result in better outcomes than well-poisoning with hidden links that are against the policies of search engines.

Lots of potential for collateral damage, including your own websites' reputations and search visibility, with the well-poisoning approach.
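
For anyone weighing the rate-limiting route, here is a minimal sketch of a per-client token bucket in Python. It is only an illustration: the `TokenBucket` name, the `rate`/`capacity` parameters, and keying by client IP are my assumptions, not something taken from Miasma or any tool in this thread.

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` requests per second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the sketch can be exercised without sleeping
        self.buckets = {}   # client id -> (tokens left, timestamp of last request)

    def allow(self, client: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(client, (float(self.capacity), now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[client] = (tokens - 1.0, now)
            return True
        self.buckets[client] = (tokens, now)
        return False
```

A server would call `allow(client_ip)` per request and answer 429 when it returns `False`. As others in the thread point out, this only helps when a scraper reuses addresses.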

reply

upvote

by xantronix4 hours ago|

[-]

The README.md specifically states how to allow for nice robots to proceed unhindered. The people behind these efforts, I would imagine, don't particularly care about their sites' reputations in the cases people use LLMs for search.

reply

upvote

by ddtaylor4 hours ago|

[-]

To be honest, who cares about Google search anymore? It's pretty useless these days.

reply

upvote

by ninjagoo3 hours ago|

[-]

The small non-profit I volunteer with finds Google ads to be surprisingly effective, and much more cost-effective than FB for what they do, so there's at least some Google search usage in the demographic that they serve.

reply

upvote

by Apocryphon6 hours ago|

[-]

Tech is just a series of arms races

reply

upvote

by Art96815 hours ago|

[-]

Can't we simply parse and remove any elements with style="display: none;", aria-hidden="true", and tabindex="1" attributes before the text is processed and get around this trick? What am I missing?
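
Roughly what I have in mind, as a hedged sketch using only Python's standard `html.parser`. The helper names (`looks_hidden`, `visible_links`) are made up for illustration, it only checks inline markers, and it assumes reasonably well-formed markup; links hidden via stylesheets, tiny fonts, or off-screen positioning would slip through.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def looks_hidden(attrs: dict) -> bool:
    """Heuristic on inline attributes only; external CSS is not consulted."""
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return (
        "display:none" in style
        or "visibility:hidden" in style
        or (attrs.get("aria-hidden") or "").lower() == "true"
        or "hidden" in attrs
    )


class VisibleLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self.stack = []  # one bool per open element: is it, or an ancestor, hidden?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        hidden = (self.stack[-1] if self.stack else False) or looks_hidden(attrs)
        if tag == "a" and not hidden and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID_TAGS:
            self.stack.append(hidden)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def visible_links(html: str) -> list:
    """Return hrefs of anchors that are not hidden by inline markers."""
    parser = VisibleLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```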

reply

upvote

by hoistbypetard4 hours ago|

[-]

If you do that and don't follow robots.txt, you are blocked. If you do that and follow robots.txt, fine. That's all we wanted you to do anyway. Just follow the instructions that well-behaved scrapers are meant to follow.

reply

upvote

by phplovesong3 hours ago|

[-]

Just have the link visible, but CSS it so that it's either small as hell, or just off screen. Google / bots will follow it, real people will never see it.

reply

upvote

by madeofpalk8 hours ago|

[-]

Is there any evidence or hints that these actually work?

It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.

reply

upvote

by raincole6 hours ago|

[-]

It might work against people who just use their Mac Mini with OpenClaw to summarize news every morning, but it certainly won't work against Google.

More centralized web ftw.

reply

upvote

by hexage18146 hours ago|

[-]

It also probably won't work if the person actually wants your content and is checking whether the thing they scraped actually makes sense or is just noise. Like, none of these are new things. Site owners have been sending junk/fake data to web scrapers since web scraping was invented.

reply

upvote

by otherme1236 hours ago|

[-]

In my experience, Google (among others) plays nice. Just put "Disallow: /" in your robots.txt, and they won't bother you again.

My current problem is OpenAI, which scans massively, ignoring every limit, 426, 444 and whatever you throw at them, and botnets from East Asia, using one IP per scrape, but thousands of IPs.

reply

upvote

by LaGrange6 hours ago|

[-]

> It might work against people who just use their Mac Mini with OpenClaw to summarize news every morning,

Good enough for me.

> More centralized web ftw.

This ain't got anything to do with "centralized web," this kind of epistemological vandalism can't be shunned enough.

reply

upvote

by sd98 hours ago|

[-]

Even if it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make any material difference. I'm tired.

reply

upvote

by 20k8 hours ago|

[-]

I definitely get this. The thing that gives me hope is that you only need to poison a very small % of content to damage AI models pretty significantly. It helps combat the mass scraping, because a significant chunk of the data they get will be useless, and it's very difficult to filter it by hand.

reply

upvote

by lucasfin0006 hours ago|

[-]

The asymmetry is what makes this very interesting. The cost to inject poison is basically zero for the site owner, but the cost to detect and filter it at scale is significant for the scraper. That math gets a lot worse for them as more sites adopt it. It doesn't solve the problem, but it changes the economics.

reply

upvote

by xyzal5 hours ago|

[-]

About two years ago, I made up reference to a nonexistent python library and put code "using" it in just 5 GitHub repos. Several months later the free ChatGPT picked it up. So IMO it works.

reply

upvote

by logicprog5 hours ago|

[-]

Via websearch? Or training?

reply

upvote

by bediger40005 hours ago|

[-]

The search engine crawlers are sophisticated enough, but Meta's are not. Neither is Anthropic's Claude crawler. Source: personal experience trying garbage generators on Yandex, Blexbot, Meta's and Anthropic's crawlers.

I'm completely uncertain that the unsophisticated garbage I generated makes any difference, much less "poisons" the LLMs. A fellow can dream, can't he?

reply

upvote

by spiderfarmer6 hours ago|

[-]

There are hundreds of bots using residential proxies. That is not free. Make them pay.

reply

upvote

by m00dy7 hours ago|

[-]

It won't work, especially on Gemini. Googlebot is very experienced when it comes to crawling. It might work for OpenAI and others maybe.

reply

upvote

by nubg8 hours ago|

[-]

What kind of mitigations? How would you detect the poison fountain?

reply

upvote

by avereveard7 hours ago|

[-]

style="display: none;" aria-hidden="true" tabindex="1"

Many scrapers already know not to follow these, as it's how sites used to "cheat" PageRank by serving keyword soups.

reply

upvote

by m00dy7 hours ago|

[-]

Google will give your website a penalty for doing this.

reply

upvote

by phplovesong3 hours ago|

[-]

You don't have to use this. You can have it visible but hide it from humans with other easy tricks.

reply

upvote

by cuu5083 hours ago|

[-]

Scrapers can work around those other easy tricks too.

reply

upvote

by GaggiX7 hours ago|

[-]

Because the internet is noisy and not up to date, all recent LLMs are trained using Reinforcement Learning with Verifiable Rewards. If a model has learned the wrong signature of a function, for example, that becomes apparent when the code is executed.

reply

upvote

by phoronixrly7 hours ago|

[-]

It does work, on two levels:

1. Simple, cheap, easy-to-detect bots will scrape the poison, and feed links to expensive-to-run browser-based bots that you can't detect in any other way.

2. Once you see a browser visit a bullshit link, you insta-ban it, as you can now see that it is a bot because it has been poisoned with the bullshit data.

My personal preference is using iocaine for this purpose though, in order to protect the entire server as opposed to a single site.
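
The second step can be sketched in a few lines of Python. This is a simplified illustration, not how iocaine or Miasma actually do it: the `TrapBanList` class is hypothetical, the `/bots` prefix is borrowed from the robots.txt example in the Miasma README, and a real setup would serve the poison first and ban at the firewall or reverse proxy.

```python
class TrapBanList:
    """Ban any client that requests a path under the trap prefix."""

    def __init__(self, trap_prefix: str = "/bots"):
        self.trap_prefix = trap_prefix.rstrip("/")
        self.banned = set()

    def _is_trap(self, path: str) -> bool:
        # Match "/bots" and "/bots/...", but not unrelated paths like "/botswana".
        return path == self.trap_prefix or path.startswith(self.trap_prefix + "/")

    def handle(self, client: str, path: str) -> int:
        """Return the HTTP status a server would send for this request."""
        if client in self.banned:
            return 403
        if self._is_trap(path):
            # No human ever sees these links, so the visitor has outed itself as a bot.
            self.banned.add(client)
            return 403
        return 200
```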

reply

upvote

by eliottre5 hours ago|

[-]

The data poisoning angle is interesting. Models trained on scraped web data inherit whatever biases, errors, and manipulation exist in that data. If bad actors can inject corrupted data at scale, it creates a malign incentive structure where model training becomes adversarial. The real solution is probably better data provenance -- models trained on licensed, curated datasets will eventually outcompete those trained on the open web.

reply

upvote

by kristopolous5 hours ago|

[-]

I did a related approach:

A toll charging gateway for llm scrapers: a modification to robots.txt to add price sheets in the comment field like a menu.

This was for a hackathon by forking certbot. Cloudflare has an enterprise version of this but this one would be self hosted

I think it has legs but I think I need to get pushed and goaded otherwise I tend to lose interest ...

It was for the USDC company btw so that's why there's a crypto angle - this might be a valid use case!

I'm open to crypto not all being hustles and scams

Tell me what you think?

https://github.com/kristopolous/tollbot

reply

upvote

by ctoth3 hours ago|

[-]

This is literally what HTTP 402 is for -- there's a whole buncha work going on ... but please, please, please don't let Cloudflare become another bloody gatekeeper. Please.

reply

upvote

by effnorwood5 hours ago|

[-]

certainly don't allow anyone to access your content. perhaps shut the site down just to be safe.

reply

upvote

by aduwah2 hours ago|

[-]

Accessing the shop by going through the wall with a tank is not the same as walking in the door. Hosting costs money. These botnets should be charged for the costs they incur

reply

upvote

by storus1 hours ago|

[-]

I am failing to see how this stops pre-training scraping? It still looks like legit code, playing nicely with the desired pre-training distribution. Obviously nobody is going to use it for SFT/DPO/GRPO later.

reply

upvote

by bluepeter5 hours ago|

[-]

A related technique used to work so well for search engine spiders. I had some software I wrote called 'search engine cloaker'... this was back in the early 2000s... one of the first if not the first to do the shadowy "cloaking" stuff! We'd spin dummy content from lists of keywords and it was just piles and piles. We made it a bit smarter using Markov chains to make the sentences somewhat sensible. We'd auto-interlink and get 1000s of links. It eventually stopped working... but it took a long while for that to happen. We licensed the software to others. I rationalized it because I felt, hey, we have to write crappy copy for this stupid "SEO" thing, so let's just automate that and we'll give the spiders what they seem to want.
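
For readers who haven't seen the Markov chain trick, a word-level version looks roughly like the Python sketch below. To be clear, this is not the original cloaker code; the function names and the toy corpus in the usage note are invented for illustration.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict:
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain: dict, start: str, length: int, seed: int = 0) -> str:
    """Walk the chain from `start`; stop early at a word with no known successor."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    out = [start]
    while len(out) < length and out[-1] in chain:
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)
```

Feeding it something like "cheap flights to paris cheap hotels in paris cheap flights to rome" and calling `generate(chain, "cheap", 8)` yields text in which every adjacent word pair occurred somewhere in the corpus, which is why the output reads as locally plausible.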

reply

upvote

by ctoth3 hours ago|

[-]

You didn't 'give the spiders what they seem to want.' You exploited a naive ranking algorithm to inject garbage into search results that real people were trying to use. That you rationalized it at the time is human. That you're still rationalizing it decades later is something else.

reply

upvote

by ninjagoo5 hours ago|

[-]

Isn't this a trope at this point? That AI companies are indiscriminately training on random websites?

Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input?

Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input?

Isn't it also, potentially, the case that the ai-scrapers are mostly looking for content based on user queries, rather than as training data?

If the answers to the questions lean a particular way (yes to most), then isn't the solution rate-limiting incoming web-queries rather than (presumed) well-poisoning?

Is this a solution in search of a problem?

reply

upvote

by xantronix4 hours ago|

[-]

You do raise an interesting point. The poison fountains would probably be more effective if their outputs more closely resembled whatever the most popular problem spaces are at any given point.

reply

upvote

by theandrewbailey6 hours ago|

[-]

Or you can block bots with these (until they start using them) https://developer.mozilla.org/en-US/docs/Glossary/Fetch_meta...

reply

upvote

by hmokiguess5 hours ago|

[-]

Could this lead to something like the Streisand effect? I imagine these bots work at a scale where humans in the loop only act when something deviates from the standard, so, if a bot flags something up with your website then you’re now in a list you previously weren’t. Now don’t ask me what they do with those lists, but I guess you will make the cut.

reply

upvote

by holysoles5 hours ago|

[-]

If anyone is looking for a tool to actually send traffic to a tool like this, I wrote a Traefik plugin that can block or proxy requests based on useragent.

https://github.com/holysoles/bot-wrangler-traefik-plugin

reply

upvote

by dwa35924 hours ago|

[-]

Love it. Thanks for doing this work. Not sure why people are criticizing this. Also, an insane amount of work has been done to improve scraping, which in my mind is just absolutely bonkers, and I didn't see people complaining about that.

reply

upvote

by iFire37 minutes ago|

[-]

I for one welcome everyone to the tarpit where a normal person is seen as a robot in an endless poison pit and sounds like a Black Mirror television episode.

reply

upvote

by meta-level8 hours ago|

[-]

Isn't posting projects like this the most visible way to report a bug and have it fixed as soon as possible?

reply

upvote

by suprfsat8 hours ago|

[-]

"disobeys robots.txt" is more of a feature

reply

upvote

by nosmokewhereiam6 hours ago|

[-]

My asthmar

I'm assuming this is a reference to Lord of the flies

reply

upvote

by cwnyth5 hours ago|

[-]

Miasma is bad or poisonous air. It's a Greek word.

reply

upvote

by jackdoe1 hours ago|

[-]

rage against the dying of the light

reply

upvote

by ninjagoo6 hours ago|

[-]

This is essentially machine-generated spam.

The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications. How long before people start sharing ai-spam lists, both pro-ai and anti-ai?

Just like with email, at some point these share-lists will be adopted by the big corporates, and just like with email will make life hard for the small players.

Once a website appears on one of these lists, legitimately or otherwise, what'll be the reputational damage hurting appearance in search indexes? There have already been examples of Google delisting or dropping websites in search results.

Will there be a process to appeal these blacklists? Based on how things work with email, I doubt this will be a meaningful process. It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides.

This project's selective protection of the major players reinforces that effect; from the README:

" Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Slurp
User-agent: SomeOtherNiceBot
Disallow: /bots
Allow: / "

reply

upvote

by snehesht8 hours ago|

[-]

Why not simply blacklist or rate limit those bot IPs?

reply

upvote

by Bender2 hours ago|

[-]

> Why not simply blacklist or rate limit those bot IPs?

Many bots cycle through short DHCP leases on LTE Wi-Fi devices. One would have to accept blocking all cell phones, which I have done for my personal hobby crap, but most businesses will not do this. Another big swath of bots comes from Amazon EC2 and Google Cloud, which I will also happily block on my hobby crap, but most businesses will not.

Some bots are easier to block, as they do not use real web clients and are missing some TCP/IP headers, making them ultra easy to block. Some also do not spoof their user-agent and are easy to block. Some will attempt to access URLs not visible to real humans, thus blocking themselves. Many bots cannot do HTTP/2.0, so they are also trivial to block. Pretty much anything not using headless Chrome is easy to block.

reply

upvote

by xprnio7 hours ago|

[-]

If you have real traffic and bot traffic, you still need to identify which is which. On top of that, bots very likely don’t reuse the same IPs over and over again. I assume if we knew all the IPs used only by bots ahead of time, then yeah it would be simple to blacklist them. But although it’s simple in theory, the practice of identifying what to blacklist in the first place is the part that isn’t as simple

reply

upvote

by snehesht7 hours ago|

[-]

You wouldn’t permanently block them, it’s more like a rolling window.

You can use security challenges as a mechanism to identify false positives.

Sure bots can get tons of proxies for cheap, doesn’t mean you can’t block them similar to how SSH Honeypots or Spamhaus SBL work albeit temporarily.

reply

upvote

by phyzome7 hours ago|

[-]

Because punishment for breaking the robots.txt rules is a social good.

reply

upvote

by nextlevelwizard2 hours ago|

[-]

Point is to kill or at least hinder AI progress

reply

upvote

by arbol6 hours ago|

[-]

The AI companies are using virtually unlimited "clean" residential IPs so this is not a valid strategy.

reply

upvote

by DaiPlusPlus6 hours ago|

[-]

How? They run their scraping and training infrastructure - and models themselves - from within those “AI datacenters”[1] we hear about in the news - and not proxying through end-users’ own pipes.

[1]: in quotes, because I dislike the term, because it’s immaterial whether or not an ugly block of concrete out in the sticks is housing LLM hardware - or good ol’ fashioned colo racks.

reply

upvote

by AyyEye5 hours ago|

[-]

Residential proxy networks.

reply

upvote

by aduwah7 hours ago|

[-]

There are way too many to do that

reply

upvote

by snehesht6 hours ago|

[-]

True, most of the blacklist systems today aren’t realtime like Amazon WAF or Cloudflare.

We need a crawler blacklist that can stream list deltas in realtime to a centralized list, from which local DBs can pull changes.

Verified domains can push suspected bot IPs, and this engine would run heuristics to see if there is a pattern across data sources and issue a temporary block with an exponential TTL.

There are many problems to solve here, but as with any OSS project it will evolve over time if there is enough interest in it.

The costs of running this system will be huge, though, and corporate sponsors may not work, but individual sponsors may be incentivized as it helps them reduce bandwidth and compute costs related to bot traffic.
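
To make the "temporary block with exponential TTL" part concrete, here is a rough local-only Python sketch. Everything in it is an assumption for illustration (the `TemporaryBlocklist` name, the doubling schedule, the default TTLs); the streaming of deltas and the cross-source heuristics are left out entirely.

```python
import time


class TemporaryBlocklist:
    """Blocks lapse on their own; each repeat report doubles the block duration."""

    def __init__(self, base_ttl: float = 60.0, max_ttl: float = 86400.0,
                 clock=time.monotonic):
        self.base_ttl = base_ttl
        self.max_ttl = max_ttl
        self.clock = clock        # injectable so the sketch can be tested without waiting
        self.strikes = {}         # ip -> number of reports so far
        self.blocked_until = {}   # ip -> timestamp at which the block lapses

    def report(self, ip: str) -> float:
        """Record a suspected-bot report and return the TTL (seconds) applied."""
        n = self.strikes.get(ip, 0)
        ttl = min(self.max_ttl, self.base_ttl * (2 ** n))
        self.strikes[ip] = n + 1
        self.blocked_until[ip] = self.clock() + ttl
        return ttl

    def is_blocked(self, ip: str) -> bool:
        return self.clock() < self.blocked_until.get(ip, float("-inf"))
```

Because blocks expire, a residential IP that gets recycled to a real person is only punished briefly, while a persistent offender quickly reaches the cap.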

reply

upvote

by pixl976 hours ago|

[-]

In the real-time spam market the lists worked well with honest groups for a bit, but started falling apart once good lists got taken over by actors who realized they could use their position to make more money. It's a really difficult trap to avoid.

reply

upvote

by xyzal5 hours ago|

[-]

For the lulz

reply

upvote

by superkuh5 hours ago|

[-]

Of course Googlebot, Bingbot, Applebot, Amazonbot, YandexBot, etc from the major corps are HTTP useragent spiders that will have their downloaded public content used by corporations for AI training too. Might as well just drop the "AI" and say "corporate scrapers".

reply

upvote

by rob6 hours ago|

[-]

"/brainstorming git checkout this miasma repo source code and implement a fix to prevent the scraper from not working on sites that use this tool"

reply

upvote

by foxes6 hours ago|

[-]

Wonder if you can just avoid hiding it to make it more believable

Why not have a Library of Babel-esque labyrinth visible to normal users on your website,

Like anti surveillance clothing or something they have to sift through

reply

upvote

by imdsm8 hours ago|

[-]

Applied model collapse

reply

upvote

by Imustaskforhelp8 hours ago|

[-]

I wish there were some regulation that could force companies who scrape for profit to reveal who they are to the end websites. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.

reply

upvote

by joquarky2 hours ago|

[-]

Yep, they are already working on de-anonymizing the internet.

reply

upvote

by jijji4 hours ago|

[-]

why not just try to block them at the door instead of feeding them poisoned food...

reply

upvote

by rvz8 hours ago|

[-]

> Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

Can't the LLMs just ignore or spoof their user agents anyway?

reply

upvote

by phoronixrly7 hours ago|

[-]

Well-behaved agents will obey robots.txt and not fall into the trap.
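
As a sanity check, Python's standard `urllib.robotparser` shows what an obedient crawler would conclude from rules like the ones in the README. This is just a sketch: the two-bot rule set is trimmed from the README example and the `may_fetch` helper is my own naming.

```python
from urllib.robotparser import RobotFileParser

# Trimmed version of the README rules: listed bots are told to stay out of /bots.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /bots
Allow: /
"""


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if robots.txt permits `user_agent` to request `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

With these rules, `may_fetch("Googlebot", "/bots/page")` is `False` while `may_fetch("Googlebot", "/about")` is `True`. An agent that is not listed gets no restriction at all from this snippet, so whether it ends up in the pit depends on whether it follows the hidden links.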

reply

upvote

by obsidianbases14 hours ago|

[-]

I know there are real-world problems to deal with, but at least I got one over on that evil OpenClaw instance /s

reply

upvote

by GaggiX8 hours ago|

[-]

These projects are the new "To-Do List" app.

reply

upvote

by obsidianbases17 hours ago|

[-]

Why do this though?

It's like if someone was trying to "trap" search crawlers back in the early 2000s.

Seems counterproductive

reply

upvote

by bilekas7 hours ago|

[-]

Because of bots that don't respect robots.txt.

If you want an AI bot to crawl your website while you pay for that bandwidth, then you won't use the tool.

reply

upvote

by obsidianbases14 hours ago|

[-]

If bandwidth cost is a concern, then maybe you should reconsider how you publish your site.

Like, what if you actually post something that gains traction, is it going to bankrupt you or something?

reply

upvote

by bilekas3 hours ago|

[-]

It's not just financial, you're taking up a lot of bandwidth, resources etc.

It's not just some light bump in traffic. It's a headache that shouldn't need to be dealt with if they would respect robots.txt. Quite simple really.

reply

upvote

by integralid6 hours ago|

[-]

Search crawlers used to bring people TO your site. LLM bots are used to keep people OUT of your site, because knowledge is indexed and distributed by corporations.

reply

upvote

by obsidianbases15 hours ago|

[-]

So if your site is dependent on ads, and since the only way for people to see those ads is coming to your site, then yes, you lose.

If your site exists to share information, then the information gets disseminated, whether via LLM or some browser, it doesn't make a difference to me

reply

upvote

by lelanthran4 hours ago|

[-]

Those are not the only two options.

Why are you presenting the latter option as if it were mainstream? It's such a small percentage of use cases that it probably isn't even a rounding error.

People who want to disseminate information also want the credit.

I'd still like to know why you are presenting this false dichotomy. What reason do you have for presenting a use case that has fractions of a percentage as if it were a standard use case? What is your motivation behind this?

reply

upvote

by obsidianbases14 hours ago|

[-]

My only motivation is that it pains me to see smart capable people working on insignificant problems.

Maybe I don't understand the problem as well as I should, and I'm open to hearing what it is you think that I'm missing.

But from my perspective, this is a solution for a non-problem, which in my eyes is a problem itself.

reply

upvote

by lelanthran4 hours ago|

[-]

You misunderstand: I am asking what is your motivation for presenting a 0.0001% use case as a 50% use case.

The use case you present is so small it can be ignored as an option, yet you present it as the only other option.

reply

upvote

by joquarky2 hours ago|

[-]

> People who want to disseminate information also want the credit.

This is psychological projection.

reply

upvote

by lelanthran1 hours ago|

[-]

> This is psychological projection.

You don't know what that means.

In any case, people who want to disseminate information with credit can do so without standing up a blog (any place that allows posting of comments, such as Reddit, HN, etc).

In the context of this discussion, we're talking about site owners; people who put up a blog.

reply

upvote

by aarjaneiro4 hours ago|

[-]

You don't get attribution for your work if it merely feeds into its training data

reply

upvote

by obsidianbases14 hours ago|

[-]

That assumes the AI bots are scraping for training data and not simple retrieval/ RAG (which would likely provide attribution)

reply

upvote

by Forgeties797 hours ago|

[-]

Web crawlers didn’t routinely take down public resources or use the scraped info to generate facsimiles that people are still having ethical debates over. Their presence didn’t even register, and their indexing helped the sites. It isn’t remotely the same thing.

https://www.libraryjournal.com/story/ai-bots-swarm-library-c...

reply

upvote

by obsidianbases14 hours ago|

[-]

AI bots must've taken down that link you shared, it won't load :/

And search crawlers/results have been producing snippets that prevent users from clicking to the source for well over a decade.

Edit: it loaded. I don't see how the problem isn't simply solved by an off-the-shelf solution like Cloudflare. In the real world, you wouldn't open up a space/location if you couldn't handle the throughput. Why should online spaces/locations get special treatment?

reply

upvote

by Forgeties792 hours ago|

[-]

Why should everyone else pay the price for VC-funded, private companies? They should incur the cost.

This is no different than saying “robbers aren’t causing any problems, you just need to lock your doors, buy and set up sensors on every point of potential ingress, and pay a monthly cost for an alarm system. That’s on you.”

reply

upvote

by splitbrainhack8 hours ago|

[-]

-1 for the name

reply

upvote

by QuantumNomad_8 hours ago|

[-]

https://en.wikipedia.org/wiki/Miasma_theory

Seems a clever and fitting name to me. A poison pit would probably smell bad. And at the same time, the theory that this tool would actually cause “illness” (bad training data) in AI is not proven.

reply

upvote

by jstanley6 hours ago|

[-]

If you want to ruin someone's web experience based on what kind of thing they are, rather than the content of their character, consider that you might be the baddies.

reply

upvote

by mrweasel6 hours ago|

[-]

If you're constantly being harassed by someone and despite your best efforts, nothing is being done to help you, quite the opposite in fact, tons of people cheer your assailant on in the name of profit and progress, it's only natural that you lash out.

It's not all that productive, it's an act of desperation. If you can't stop the enemy, at least you can make their action more costly.

One positive outcome I could see is AI companies becoming more critical of their training data.

reply

upvote

by lifeformed4 hours ago|

[-]

What "content of character" do you ascribe to a web scraper?

reply

upvote

by jstanley4 hours ago|

[-]

You don't, that's why it's unethical to block them.

If you keep getting harassed by people wearing black hoodies, would it be ethical to start taking countermeasures against all people who wear black hoodies?

reply

upvote

by lelanthran4 hours ago|

[-]

If they are coming to my door to harass me, then yes, it makes sense to take countermeasures against all black-hoodie wearers when I see them at the door.

reply

upvote

by Apocryphon2 hours ago|

[-]

You’re gonna have to try harder to sneak in the a priori assumption that LLMs have any character beyond which corporation deployed them.

reply