> Doing it on demand still utilizes their cached version, so it saves a trip to the origin, but doesn’t require doubling the cache size. They can still cache the results if the same site is scraped multiple times, but this saves having to cache things that are never going to be requested.

Isn't this solving a subtly, but very significantly, different problem?

You could serve the very same data in two different ways: one to present to the users and one to hand over to scrapers. Of course, some sites would be too difficult or costly to transform into a common underlying cache format, but people who WANT their sites accessible to scrapers could easily help the process along a bit, or serve their site in the necessary format in the first place.

But the key is:

A tool using a "pre-scraped" version of a site very likely has very different requirements for how a CDN caches that site. And this could easily be made customizable by those using this endpoint.

Want a free version? Ok, give us the list of all the sites you want, then come back in 10min and grab everything in one go; the data will be kept ready for 60s. Got an API token? 10 free near-real-time requests for you, and they'll recharge at a rate of 2 per hour. Want to play nice? Ask the CDN to have the requested content ready in 3 hours. Got deep pockets? Pay for just as many real-real-time requests as you need.
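The API-token tier above is essentially a token bucket. A minimal sketch of that one tier, assuming the hypothetical numbers from the comment (10 requests up front, recharging at 2/hour) — class and method names are illustrative, not any real CDN's API:

```python
import time

class ApiQuota:
    """Hypothetical token bucket for the 'API token' tier:
    10 near-real-time requests, recharging at 2 per hour."""

    CAPACITY = 10
    RECHARGE_PER_HOUR = 2

    def __init__(self, now=None):
        self.tokens = float(self.CAPACITY)
        self.last = now if now is not None else time.time()

    def try_consume(self, now=None):
        now = now if now is not None else time.time()
        # Accrue tokens for the elapsed time, capped at capacity.
        elapsed_hours = (now - self.last) / 3600.0
        self.tokens = min(self.CAPACITY,
                          self.tokens + elapsed_hours * self.RECHARGE_PER_HOUR)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # serve this one near-real-time
        return False      # out of quota: fall back to a scheduled response
```

A request that returns False wouldn't be rejected outright; it would just get the slower, scheduled treatment of the free tiers.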

What makes this so different is that unless customers are willing to hand over a lot of money, you don't need to cache anything at all to serve requests. Potentially not even later, if you have enough capacity to serve the data for scheduled requests from the storage network directly.

You just respond immediately with a promise telling them to come back later. And depending on what you put into that promise, you've got quite a lot of control over the schedule yourself.

- Got a "within 10min" request but your storage network has plenty of capacity in 30s? Just tell them to come back in 30s.

- A customer is pushing new data into your network around 10am, and many bots are interested in getting their hands on it as soon as possible, making requests between 10:00 and 10:05? Just bundle their requests.

- Expected data still not available at 10:05? Unless the bots set an "immediate" flag (or whatever) indicating that they want whatever state the site is in right now, just reply with a second promise when they come back. And a third if necessary... and so on.
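The three bullets above can be sketched as one response handler. Everything here is hypothetical — `storage_eta_s` stands in for a real capacity check, `data_ready` for a real cache lookup, and the tuple-shaped responses for whatever wire format you'd actually use:

```python
from dataclasses import dataclass

@dataclass
class Promise:
    """A 'come back later' response instead of content."""
    ticket: str
    retry_after_s: int  # client should come back in this many seconds

class PromiseScheduler:
    def __init__(self):
        # url -> tickets of all requests bundled for the same content
        self.bundles = {}

    def respond(self, url, ticket, deadline_s, storage_eta_s,
                data_ready, immediate=False):
        if data_ready:
            # Serve everyone bundled for this URL in one go.
            waiting = self.bundles.pop(url, [])
            return ("content", waiting + [ticket])
        if immediate:
            # Caller wants whatever state the site is in right now.
            return ("content", [ticket])
        # Bundle requests arriving in the same window for the same URL.
        self.bundles.setdefault(url, []).append(ticket)
        # If storage has capacity well before the deadline (e.g. 30s
        # instead of 10min), promise the earlier time; if the data still
        # isn't there on retry, this simply issues another promise.
        return ("promise", Promise(ticket, min(deadline_s, storage_eta_s)))
```

The nice part is that the scheduler never has to cache anything for the non-immediate paths: it only tracks tickets until the content shows up.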
