> without a legislative solution and an enforcement mechanism

If there's one thing people, especially HN users, should've learned by now, it's that there's no enforcement mechanism worth a damn for Internet legislation when incentives don't align.

reply
> how can we make scraping less of a burden on individual hosts?

Isn't this basically what content-addressable storage is for? Have the site provide the content hashes rather than the content and then put the content on IPFS/BitTorrent/whatever where the bots can get it from each other instead of bothering the site.

Extra points if you can get popular browsers to implement support for this, since it also makes it a lot harder to censor things and a decent implementation (i.e. one that prefers closer sources/caches) would give most of the internet the efficiency benefits of a CDN without the centralization.
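The core mechanism here is simple enough to sketch. A minimal illustration of content addressing, using a plain SHA-256 digest as a stand-in for IPFS CIDs or BitTorrent infohashes (function names are illustrative, not any real API):

```python
import hashlib

def content_address(data: bytes) -> str:
    # A content address is just a hash of the bytes themselves,
    # so any peer can serve them and any client can verify them.
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected: str) -> bool:
    # The origin site only needs to publish `expected`; the actual
    # bytes can come from IPFS, BitTorrent, or any untrusted cache,
    # since tampering changes the hash.
    return hashlib.sha256(data).hexdigest() == expected

page = b"<html>...</html>"
addr = content_address(page)
assert verify(page, addr)
assert not verify(b"tampered copy", addr)
```

The point is that trust moves from the transport to the data: the site serves a few dozen bytes of hash instead of the full content, and any peer, cache, or bot can serve the rest.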

reply
If you don't publish content to the public web anymore, you don't have to worry about traffic or scraping or bots.

Maybe it'll just be cheaper for CDNs or whoever to sell the data they serve directly, instead of everyone taking the extra step of scraping it.

reply
I think this is what will happen. That the public internet will become the place you go to seed the data you want to the scrapers and you will use a private internet for everything else. Private sites, private feeds, mesh networks, etc. We're basically going back in time similar to when AOL and friends had their own private networks for their members.
reply
The only answer is WebDRM.

It's easy to pretend you're human; it's hard to pretend you have a valid cryptographic signature from Google attesting that your hardware is Google-approved.

Crawling is the price we pay for the web's openness.

reply
It's not hard to bypass attestation. It's actually very easy, and it's done right now at scale: there are giant click farms with phones on racks.

They don't modify the devices at all, so they will pass whatever attestation you throw at them.

reply
I don't see this as a permanent problem. Right now there must be thousands of well-funded AI companies trying to scrape the entire internet. Eventually the AI equity bubble will pop and there will be consolidation. If every player left has already scanned the web, will they need to keep constantly re-scanning it? Seems like no. Even if they do, there will be a lot fewer of them.
reply
The current trend is that it's getting cheaper and easier to roll out your own AI on your own computer, so more and more people will do it as a hobby. Even if the big players die out, some dude with a decent gaming PC could decide to start scraping everything pertaining to their interests just for the hell of it. Every government with a budget and someone capable of doing the job will surely get in on it as well.
reply
> some dude with a decent gaming PC could decide to start scraping everything pertaining to their interests just for the hell of it.

Not from their single residential IP, they are not.

If they do succeed[1], it is not going to be at the hundreds or thousands of requests per second that the current AI scrapers bombard servers with. Some dude at home will, at best, be putting 4-6 orders of magnitude less strain on a limited set of servers.

1. Scraping is an arms race: if you're just "some dude" at the skill floor, you're going to have a bad time whether you're scraping or defending against scrapers.

reply
> anyone who has conceded that there is no way to stop AI scrapers at this point and what that means for how we maintain public information on the internet in the future.

Bloat and bandwidth costs are the real problems here. Everyone seems to have forgotten the basics of engineering and accounting.

reply
You're going to hate this, but one answer might be blockchain: a cryptographically strong, attestable public record of information appended to a shared repository. Combined with cryptographic signatures for humans, it's basically a secure, open git repository for human knowledge.
reply
> Combined with cryptographic signatures for humans

What happens when the human gives an agent access to said signature? Then you fall back on traditional anti-bot techniques and you're right back where you started.

reply
DNA/biometrics are the only secure future!

I joke, but there are those out there who don’t.

reply
You'd spend less compute just serving the crawlers than maintaining the blockchain.

Like, three orders of magnitude less compute, by a conservative estimate.

reply
Sounds interesting, but I guess I'm a little unsure of how to connect the dots? Are you suggesting that websites would be hosted on a blockchain and browsed by human-signed browsers? Or more like there would be a blockchain authority, which server hosts could query to determine if a signature, provided by their browser, is human? Would you mind painting the picture in a little more detail?
reply
You can have cryptographically signed data caches without the need for a blockchain. What a blockchain can add is the ability to say that a particular piece of data must have existed before a given date, by including the hash of that data somewhere in the chain.
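That existence-proof idea is easy to demonstrate. A toy sketch, assuming a SHA-256 hash chain where each block carries a timestamp and a set of data hashes (all structures and names here are illustrative, not any real chain's format):

```python
import hashlib
import json

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

chain = []

def append_block(data_hashes, timestamp):
    # Each block commits to the previous block's hash, the time,
    # and the hashes of the data being anchored -- not the data itself.
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = {"prev": prev, "time": timestamp, "data": sorted(data_hashes)}
    block = dict(body, hash=h(json.dumps(body, sort_keys=True).encode()))
    chain.append(block)
    return block

def existed_by(data: bytes, t: int) -> bool:
    # Proof of existence: some block at or before time t
    # contains this data's hash.
    return any(b["time"] <= t and h(data) in b["data"] for b in chain)

doc = b"public record"
append_block([h(doc)], timestamp=1700000000)
assert existed_by(doc, 1700000001)
assert not existed_by(b"later forgery", 1700000001)
```

Note what the chain actually buys you: only the ordering/timestamp guarantee. The signed caching and integrity checking work fine with plain hashes and signatures, no chain required.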
reply
We're rarely going to need to attest that anything is "real" or "human". It's basically only going to matter in civil and criminal court, and in identity verification (IDV).

We don't need to attest that signals are analogue rather than digital. The world is going to adapt to the use of generative AI in everything. The future of art, communications, and productivity will all be rooted in these tools.

reply