Well, the conversion process into the JSON representation is going to take CPU, and then you have to store the result, in essence doubling your cache footprint.

Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

Cache footprint management is a huge factor in the cost and performance for a CDN, you want to get the most out of your storage and you want to serve as many pages from cache as possible.

I know from my experience working for a CDN that we were doing all sorts of things to try to maximize the hit rate for our cache. In fact, one of the easiest and most effective techniques for increasing cache hit rate is to do the OPPOSITE of what you are suggesting: instead of pre-caching content, you do ‘second hit caching’, where you only store a copy in the cache if a piece of content is requested a second time. The idea is that a lot of content is requested only once by one user and then never again, so it is a waste to store it in the cache. If you wait until it is requested a second time before you cache it, those single-use pages stay out of your cache, and overall performance doesn’t suffer much, because the content that is most useful to cache is requested a lot, and you only pay one extra origin request per hot object.
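
The policy is simple enough to sketch in a few lines (a toy illustration, not any real CDN's code; the class and all names are made up):

```python
# Toy sketch of "second-hit caching": only store an object after it has
# been requested twice, so single-use pages never enter the cache.

class SecondHitCache:
    def __init__(self):
        self.cache = {}         # url -> content, only for "hot" objects
        self.seen_once = set()  # urls requested exactly once so far

    def get(self, url, fetch_from_origin):
        if url in self.cache:
            return self.cache[url]           # cache hit
        content = fetch_from_origin(url)     # miss: go to origin
        if url in self.seen_once:
            self.cache[url] = content        # second request: worth caching now
            self.seen_once.discard(url)
        else:
            self.seen_once.add(url)          # first request: remember, don't store
        return content
```

A real implementation wouldn't keep an unbounded `seen_once` set; something like a Bloom filter or a TTL on the "seen" entries does the same job in constant space.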

reply
> Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

Isn't this solving a slightly, but very significantly different problem?

You could serve the very same data in two different ways: one to present to users and one to hand over to scrapers. Of course, some sites would be too difficult or costly to transform into a common underlying cache format, but people who WANT their sites accessible to scrapers could easily help the process along a bit, or serve their site in the necessary format in the first place.

But the key is:

A tool using a "pre-scraped" version of a site very likely has very different requirements for how a CDN caches that site. And this could easily be customizable by those using this endpoint.

Want a free version? Ok, give us the list of all the sites you want, then come back in 10min and grab everything in one go, the data will be kept ready for 60s. Got an API token? 10 free near-real-time requests for you, and they'll recharge at a rate of 2 per hour. Want to play nice? Ask the CDN to have the requested content ready in 3 hours. Got deep pockets? Pay for just as many real-real-time requests as you need.

What makes this so different is that unless customers are willing to hand over a lot of money, you don't need to cache anything at all to serve requests. Potentially not even later, if you have enough capacity to serve the data for scheduled requests from the storage network directly.

You just generate an immediate promise response to the request telling them to come back later. And depending on what you put into that promise, you've got quite a lot of control over the schedule yourself.

- Got a "within 10min" request but your storage network has plenty of capacity in 30s? Just tell them to come back in 30s.

- A customer is pushing new data into your network around 10am, and many bots interested in getting their hands on it as soon as possible are making requests between 10:00 and 10:05? Just bundle their requests.

- Expected data still not around at 10:05? Unless the bots set an "immediate" flag (or whatever) indicating that they want whatever state the site is in right now, just reply with a second promise when they come back. And a third if necessary... and so on.
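
The promise flow described above could look something like this (a hypothetical sketch; the JSON shape and all field names are made up just to make the idea concrete):

```python
# Hypothetical "promise" response: instead of content, immediately return
# a retry time, letting the CDN batch scrapers onto its own schedule.
import json

def promise_response(url, deadline_s, ready_in_s):
    """Tell the client when to come back instead of serving content now.

    deadline_s: how long the client said it is willing to wait.
    ready_in_s: when we expect the storage network to have the data.
    """
    # Serve as soon as we expect to have the data, but never promise a
    # time later than the client's own deadline.
    retry_in = min(deadline_s, ready_in_s)
    return json.dumps({
        "status": "pending",
        "url": url,
        "retry_after_s": retry_in,
    })
```

If the data still isn't there when the client returns, the endpoint can simply hand out another promise, as described above.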

reply
Not the same thing, but they have something close (it's not on-by-default, yet) [1]:

> Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.

[1] https://blog.cloudflare.com/markdown-for-agents/
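
Per the quoted post, the request side is plain HTTP content negotiation. A minimal sketch (the URL is a placeholder, and whether markdown actually comes back depends on the zone having the feature enabled):

```python
# Sketch: an agent expressing a preference for markdown via the standard
# Accept header, as the blog post describes. example.com is a placeholder.
import urllib.request

req = urllib.request.Request(
    "https://example.com/article",
    headers={"Accept": "text/markdown"},  # prefer markdown over text/html
)
# resp = urllib.request.urlopen(req)  # uncomment to actually fetch
```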

reply
Interesting - it sounds like this could be combined with some creative cache parsing on their side to provide this feature to sites that want it.
reply
> I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy

It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.

reply
How would they know the content hasn’t changed without hitting the website?
reply
They wouldn't. Well, there's ETag and the like, but that's still a layer-7 round trip to the origin. The more common pattern, though, is for the origin to say in the response headers how long the content stays fresh, and for the CDN to cache it for that duration. For example, a bitcoin pricing aggregator might say a page is good for 60 seconds (with a disclaimer on the page that this isn't market data), while My Little Town News might say an article is good for an hour (to allow updates) and the homepage is good for 5 minutes, so a breaking news article doesn't appear too far behind.
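
That freshness pattern is just Cache-Control on the origin's responses. A toy sketch using the TTLs from the examples above (the paths, table, and helper are made up):

```python
# Toy sketch: the origin declares per-page freshness via Cache-Control,
# and the CDN only revalidates (e.g. with an ETag) once the TTL expires.
CACHE_POLICY = {
    "/btc-price": "max-age=60",       # pricing page: fresh for 60 seconds
    "/news/article": "max-age=3600",  # article: fresh for an hour
    "/": "max-age=300",               # homepage: 5 min, so breaking news surfaces
}

def response_headers(path):
    # Anything without an explicit policy falls back to no caching.
    return {"Cache-Control": CACHE_POLICY.get(path, "no-cache")}
```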
reply
Keeping track of when content changes is literally the primary function of a CDN.
reply
Caching headers?

(Which, on Akamai, are by default ignored!)

reply
Based on the post, it seems likely that they'd just delay per the robots.txt policy no matter what, and do a full browser render of the cached page to get the content. Probably overkill for lots and lots of sites. An HTML fetch + readability is really cheap.
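
The robots.txt-delay part, at least, is cheap to honor: Python's stdlib parser already reads Crawl-delay (the policy lines are inline here purely for illustration):

```python
# Reading a crawl delay from a robots.txt policy with the stdlib parser.
# The policy content is inline for illustration; a crawler would fetch
# the site's real /robots.txt instead.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
])
delay = rp.crawl_delay("*")  # seconds to wait between fetches, or None
```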
reply
Offering wholesale cache dumps blows up every assumption about origin privacy and copyright. Suddenly you are one toggle away from someone else automatically harvesting and reselling your work with Cloudflare as the unwitting middle tier.

You could try to gate this behind access controls but at that point you have reinvented a clunky bespoke CDN API that no site owner asked for, plus a fresh legal mess. Static file caches work because they only ever respond to the original request, not because they claim to own or index your content.

It is a short path from "helpful pre-scraped JSON" to handing an entire site to an AI scraper-for-hire with zero friction. The incentives do not line up unless you think every domain on Cloudflare wants their content wholesale exported by default.

reply
I think Common Crawl already offers this, although it's free: https://commoncrawl.org/
reply
That was my first thought when I read the headline. It would make perfect sense, and would allow some websites to have best of both worlds: broadcasting content without being crushed by bots. (Not all sites want to broadcast, but many do).
reply
That would prolly work for simple sites, but you still need the dedicated scraping service with a browser to render sites that are more complex (i.e. SPAs)
reply
But think about poor phishers and malware devs protected by Cloudflare.
reply
It’s a bit more complicated than that. This is their Browser Rendering product, which runs a real browser that loads the page and executes JavaScript. It’s a bit more involved than simple curl scraping.
reply
So does that mean it can replace serpapi or similar?
reply