Turns out it's using another project by the same author: https://github.com/tamnd/ascii-gif
The script used for the demo is at https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63... and has a comment showing how to run it:
ascii-gif render docs/demo/kage.tape -o docs/static/demo.gif
Looks like it's an opinionated wrapper around https://github.com/charmbracelet/vhs
Cool!
It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.
Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
Basically I'm looking for something like the old-school .chm files on Windows, where you could pack a bunch of HTML documents into a single archive and open it without needing to embed a full browser engine.
This would have the advantage of keeping the file sizes really small. And you don't have to worry about the browser engine becoming outdated and potentially becoming an attack vector.
Epub would also be a great target.
So something like SingleFileZ https://github.com/gildas-lormeau/SingleFileZ or Gwtar https://gwern.net/gwtar ?
Won't comment on a project (though idea seems interesting) but this in README is a tell for me ;)
If the result is static why does it need a server? Isn't it possible to make it so that it can simply be opened by the browser? Like:
$ firefox $HOME/data/kage/paulgraham.com
Then the result would be usable on machines without kage installed.
Actually, Kage has two parts: a crawler that crawls pages and converts them to clean HTML by capturing the DOM after rendering in Chrome/Chromium, and a pack/serve component that packages the result as either a ZIM file for Kiwix or an executable file.
Related WHATWG discussion: https://github.com/whatwg/html/issues/3099
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
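To make the base64 packing concrete, here is a minimal Node.js sketch of inlining a binary asset as a data: URI. The helper names (toDataUri, inlineAssets) and the shape of the assets map are made up for illustration; this is not SingleFile's actual code, just the general idea.

```javascript
// Hypothetical helper: turn raw bytes into a data: URI that can stand in for a URL.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Hypothetical helper: replace each quoted asset URL in the HTML with its data: URI.
// `assets` maps a URL to { bytes, mimeType }; this shape is an assumption.
function inlineAssets(html, assets) {
  let result = html;
  for (const [url, { bytes, mimeType }] of Object.entries(assets)) {
    result = result.split(`"${url}"`).join(`"${toDataUri(bytes, mimeType)}"`);
  }
  return result;
}
```

Base64 makes each asset roughly a third larger than the raw bytes, which is the price of getting a single file that opens anywhere.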
They also offer a CLI powered by Puppeteer. [1]
What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
I think the misunderstanding stems from the browser's "Save As" reference in the description. It is misleading. You use "Save As" to save a single page, not an entire website.
Also, the description lacks a clear explanation of the project's purpose. It would be helpful to include a sentence explaining that the program downloads an entire website, not just a single page.
I highly recommend reading the singlefile source or https://archiveweb.page/ to see how they handle closed shadow DOMs, cross-origin iframes, websockets, media urls, deduping large assets, etc.
Not the same thing, but I made a clone of pg’s website which can be used for exactly that: https://github.com/shawwn/pg
If you want to read all essays, just clone the repo and open any of the .html files. Or any of the .page files which generated them.
That said, Kage looks promising if OP can combine SingleFile's reproduction quality with the HTTrack spidering approach. SPAs are kinda tricky to archive, and I do wonder how well Kage would handle them.
For some reason it displays better in IE, but I don't recall seeing this option in Chrome or Firefox recently.
That way, the page is self-contained as it is, but requires no bundled binary code to serve the site. It is actually safer security-wise.
The vendored script can be as simple as this:
const site = {
  "path-1": "<!DOCTYPE html><html> ... </html>",
  "path-2": "<!DOCTYPE html><html> ... </html>",
  // More paths
}
function attachListeners() {
  for (const [path, html] of Object.entries(site)) {
    // Quote the attribute value; a page may link to a path several times, or not at all
    for (const link of document.querySelectorAll(`a[href="${CSS.escape(path)}"]`)) {
      link.onclick = (event) => {
        event.preventDefault() // do not follow the real link
        // outerHTML cannot be set on the root element, so replace its contents instead
        document.documentElement.innerHTML = html
        attachListeners()
      }
    }
  }
}
document.addEventListener("DOMContentLoaded", attachListeners)
Let's say you have a site that fetches content from a database. If you Save As, then at best you'll get a local copy of an HTML page with JS that loads the content from the same remote database. It might not work (since the local copy has a different origin), or if it does, it requires you to be online, which defeats half of the purpose.
What this project, and SingleFile, both do is save a snapshot of what the rendered page actually looks like at that moment in time. The scripts are stripped out so it runs locally and has no external dependencies.
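As a rough illustration of the "scripts are stripped out" step, here is a naive string-level sketch. The function name stripScripts is made up, and the regex approach is an assumption for brevity: real tools work on the rendered DOM, and this does not touch inline handlers like onclick.

```javascript
// Hypothetical helper: drop <script> elements from an already-rendered HTML snapshot.
// A regex is only good enough for a sketch; a real archiver would walk the DOM.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}
```

The point is that the snapshot is taken after the page has rendered, so removing the scripts afterwards leaves the visible content in place.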
https://wiki.openzim.org/wiki/Build_your_ZIM_file
EDIT: https://get.kiwix.org/en/solutions/applications/kiwix-reader...
The executable file is mostly for people who don't have Kiwix installed yet, or just want to run the archive directly.
In any case, cool stuff :)
https://github.com/tamnd/kage/blob/main/Dockerfile
Btw, let me think of a way to enable this only when running inside Docker.
Thanks for the nice trick.
It's one of the reasons I've become a bigger fan of RSS over time. A feed from 10-ish years ago is often more usable today than a carefully preserved (application) website.
But will look into this now, see if we can swap some stuff out. We’ve really liked the idea of an offline mirror, makes a lot of collaboration use cases simpler
Compared to that is there anything kage does better?
By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli
For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!
Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There are also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!
pandoc --from html --to epub --output /PATH/TO/FILE.epub https://example.com
For an entire website of many pages, though, I can see this being useful.
Have you even read the first line of the readme of the project you're commenting on?
I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages, I do not remember exactly. The output ended up being very large.
By the way, I forgot to add zstd compression support to my ZIM reader/writer. I will implement that in the next version.
I would recommend an add-on or new feature to detect and remove cookie banners / annoying popups that open on load (eg. sign up to my mailing list).
Listing a few examples from fastText could help you.
You might also have the opposite problem though: some websites have content in the base html (so it's searchable by Google and they get views) and remove it on load (so you have to pay).
Capturing the initial html and comparing it to the final version could give you some hints and allow you to repair the removed content.
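A minimal sketch of that comparison, assuming you already have both HTML strings in hand. The function names and the regex-based paragraph extraction are illustrative only; real pages would need a proper HTML parser.

```javascript
// Naive paragraph extraction: fine for a sketch, not for arbitrary real-world HTML.
function extractParagraphs(html) {
  const paragraphs = [];
  const pattern = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  for (const match of html.matchAll(pattern)) {
    // Drop nested tags and collapse whitespace so the two versions compare cleanly.
    const text = match[1].replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
    if (text) paragraphs.push(text);
  }
  return paragraphs;
}

// Paragraphs present in the server-sent HTML but missing after scripts have run.
function removedParagraphs(initialHtml, finalHtml) {
  const kept = new Set(extractParagraphs(finalHtml));
  return extractParagraphs(initialHtml).filter((text) => !kept.has(text));
}
```

Anything this returns is a candidate for content that was removed on load, which the crawler could then put back into the saved copy.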
Best of luck with the project!
https://github.com/jart/cosmopolitan
https://justine.lol/cosmopolitan/index.html
(Certificates just expired for justine's website, just ignore the warning.)
I did something like that a very long time ago (Of course, I have forgotten)
I'd rather have platform specific minimal binaries than a single binary with hacks.
Installing packages is a solved problem
It's fine if you don't personally find it useful for your workflow, but I think it's mad cool, especially since you can zip together multiple binaries into one, along with data.
Is the code also AI slop?