undefined

points

[-]

> I figured that they have found an (automated) way to imitate Googlebot really well.

If a site (or the WAF in front of it) knows what it's doing then you'll never be able to pass as Googlebot, period, because the canonical verification method is a DNS lookup dance which can only succeed if the request came from one of Googlebots dedicated IP addresses. Bingbot is the same.

by xurukefi3 hours ago|

parent|

[-]

There are ways to work around this. I've just tested this: I've used the URL inspection tool of Google Search Console to fetch a URL from my website, which I've configured to redirect to a paywalled news article. Turns out the crawler follows that redirect and gives me the full source code of the redirected web site, without any paywall.

That's maybe a bit insane to automate at the scale of archive.today, but I figure they do something along the lines of this. It's a perfect imitation of Googlebot because it is literally Googlebot.

by jsheard3 hours ago|

parent|

[-]

I'd file that under "doesn't know what they're doing" because the search console uses a totally different user-agent (Google-InspectionTool) and the site is blindly treating it the same as Googlebot :P

Presumably they are just matching on *Google* and calling it a day.

by xurukefi2 hours ago|

parent|

[-]

Sure, but maybe there are other ways to control Googlebot in a similar fashion. Maybe even with a pristine looking User-Agent header.

by Aurornis2 hours ago|

parent|

prev|

[-]

> which I've configured to redirect to a paywalled news article.

Which specific site with a paywall?

by elzbardico3 hours ago|

prev|

[-]

> which is, of course, ridiculous.

Why? in the world of web scrapping this is pretty common.

by xurukefi3 hours ago|

parent|

[-]

Because it works too reliably. Imagine what that would entail. Managing thousands of accounts. You would need to ensure to strip the account details form archived peages perfectly. Every time the website changes its code even slightly you are at risk of losing one of your accounts. It would constantly break and would be an absolute nightmare to maintain. I've personally never encountered such a failure on a paywalled news article. archive.today managed to give me a non-paywalled clean version every single time.

Maybe they use accounts for some special sites. But there is definetly some automated generic magic happening that manages to bypass paywalls of news outlets. Probably something Googlebot related, because those websites usually give Google their news pages without a paywall, probably for SEO reasons.

by mikkupikku3 hours ago|

parent|

[-]

Using two or more accounts could help you automatically strip account details.

by xurukefi3 hours ago|

parent|

[-]

That's actually a really neat idea.

by behringer16 minutes ago|

parent|

prev|

[-]

Replace any identifiers like usernames and emails with another string automatically.

by Aurornis3 hours ago|

prev|

[-]

> I've seen people claiming that they have a bunch of paid accounts that they use to fetch the pages, which is, of course, ridiculous.

The curious part is that they allow web scraping arbitrary pages on demand. So if a publisher could put in a lot of arbitrary requests to archive their own pages and see them all coming from a single account or small subset of accounts.

I hope they haven't been stealing cookies from actual users through a botnet or something.

by xurukefi3 hours ago|

parent|

[-]

Exactly. If I was an admin of a popular news website I would try to archive some articles and look at the access logs in the backend. This cannot be too hard to figure out.

by coppsilgold1 hours ago|

parent|

prev|

[-]

You don't even need active measures. If a publisher is serious about tracing traitors there are algorithms for that (which are used by streamers to trace pirates). It's called "Traitor Tracing" in the literature. The idea is to embed watermarks following a specific pattern that would point to a traitor or even a coalition of traitors acting in concert.

It would be challenging to do with text, but is certainly doable with images - and articles contain those.

by tonymet3 hours ago|

prev|

[-]

I’m an outsider with experience building crawlers. You can get pretty far with residential proxies and browser fingerprint optimization. Most of the b-tier publishers use RBC and heuristics that can be “worked around” with moderate effort.

by quietsegfault3 hours ago|

parent|

[-]

.. but what about subscription only, paywalled sources?

by tonymet2 hours ago|

parent|

[-]

many publisher's offer "first one's free".

For those that don't , I would guess archive.today is using malware to piggyback off of subscriptions.

by layer83 hours ago|

prev|

[-]