I don't think you can provide a URL to a WARC that can be clicked to view its content directly in your browser.
reply
At the very least, WARC could have been used as the container ("tar") format after the preamble of Gwtar. But even there, given that this format doesn't work without a web server (unlike SingleFile, mentioned in the article), I feel like there's a lot to gain by separating the "viewer" (Gwtar's JavaScript) from the content, such that the viewer can be updated over time without changing the archives.

I certainly could be missing something (I've thought about this problem for all of a few minutes here), but surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" and "mycoolwarc.cdx" with little to no loss of convenience, and call it a day?

reply
You could potentially use WARC instead of Tar as the appended container, sure, but that's a lot of complexity. WARC doesn't serialize the rendered page (so what is the greater 'fidelity' actually getting you?), SingleFile doesn't support WARC, and I don't see a specific advantage that a Gwtar using WARC would have. The page rendered what it rendered.

And if you choose to require separate files and break single-file, then you have many options.

> surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwarc.cdx"

I'm not familiar with warcviewer.js and Googling isn't showing it. Are you thinking of https://github.com/webrecorder/wabac.js ?

reply
I should have been a bit more verbose as I didn't mean to send anyone on a wild goose chase. The "warcviewer.{html,js}" part was just a hypothetical viewer to illustrate having a static client-side "web app" that functions much like Gwtar, but separately from payloads.

To expand on what I have in mind, it'd be a script like Gwtar, except it loads WARCs through URLs to CDX files. Alternatively, it might also load WARC files fully into memory, where an index could be constructed on the fly. In the latter case, that would allow the same viewer to be used with or without a web server. Though, I can imagine that loading archives without a web server was probably out-of-scope for Gwtar, otherwise something could have been figured out (e.g., putting the tar in a <textarea>'s RCDATA; do browsers support "binary" data in there correctly?).
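
As a rough illustration of the "index constructed on the fly" idea, here is a minimal Python sketch. It assumes an uncompressed WARC already loaded into memory, and it relies only on the basic record layout (a `WARC/` version line, named header fields including `Content-Length`, a blank line, the content block, then two CRLFs). The function name `build_warc_index` is hypothetical, and this is not how Gwtar or any existing WARC viewer is implemented.

```python
def build_warc_index(data: bytes) -> dict[str, tuple[int, int]]:
    """Map each response record's target URI to (block_offset, block_length).

    Hypothetical helper: handles uncompressed WARC bytes only.
    """
    index: dict[str, tuple[int, int]] = {}
    pos = 0
    while pos < len(data):
        # The record header ends at the first blank line (CRLF CRLF).
        header_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError(f"no WARC version line at byte {pos}")

        # Parse "Name: value" header fields (names are case-insensitive).
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()

        length = int(headers["content-length"])
        block_start = header_end + 4

        # Only index responses; some writers wrap the URI in angle brackets.
        uri = headers.get("warc-target-uri")
        if uri is not None and headers.get("warc-type") == "response":
            index[uri.strip("<>")] = (block_start, length)

        # Skip the content block and the two CRLFs that close the record.
        pos = block_start + length + 4
    return index
```

A viewer could then slice `data[offset:offset + length]` for a requested URI, which is roughly the lookup a CDX file would otherwise provide. Note that for a response record the block still contains the HTTP status line and headers ahead of the payload.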

While the WARC specs are a mess (sometimes quite ambiguous), I've never had much trouble reading or writing them. As for why WARC, having the option to preserve request/response metadata, as well as having interoperability with anything else in the WARC ecosystem, would be nice. Also, a separate viewer would naturally be updateable without changing the archive files themselves.

reply
I see. You could probably build something on top of wabac.js... But you'd need some sort of multi-file setup to support the indirection, I suppose.

> I imagine that loading archives without a web server was probably out-of-scope for Gwtar

More that it's just not important to us. I don't even look at the archives 'locally'. They are all archives of public web pages, which I just rehost publicly. When I want to look at them, I open them on Gwern.net like anyone else!

And if I really needed to, for some reason, it's literally a Bash one-liner (already provided inside the Gwtar as well as my writeup) to turn them back into a normal multi-file HTML. (This is a lot more than you can say for a WARC...) So my reaction to the complaints about lacking local viewing is mostly just ¯\_(ツ)_/¯

> (e.g., putting the tar in a <textarea>'s RCDATA; I wonder how well browsers support "binary" data in there?)

I don't know the details, but you can just base-encode them, so I suppose that's an option, as long as you rewrite the ranges appropriately, maybe?
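
To make "rewriting the ranges" concrete, here is a minimal Python sketch, assuming the encoding is standard Base64 with no line breaks (an assumption; the comment above does not say which base encoding is meant). Every 3 raw bytes become 4 encoded characters, so a raw byte range can be mapped to an aligned character range, decoded, and trimmed. The function names are hypothetical.

```python
import base64


def encoded_range(start: int, end: int) -> tuple[int, int]:
    """Return the Base64 character range [lo, hi) covering raw bytes [start, end)."""
    lo = (start // 3) * 4      # round down to the start of a 3-byte group
    hi = -(-end // 3) * 4      # round up to the end of a 3-byte group
    return lo, hi


def read_raw_range(encoded: bytes, start: int, end: int) -> bytes:
    """Recover raw bytes [start, end) by decoding only the covering characters."""
    lo, hi = encoded_range(start, end)
    chunk = base64.b64decode(encoded[lo:hi])
    skip = start % 3           # bytes decoded before `start` in the first group
    return chunk[skip:skip + (end - start)]
```

For example, raw bytes 4 to 10 fall in 3-byte groups 1 through 3, so only encoded characters 4 to 16 need to be fetched. The cost is the usual one-third size overhead of Base64.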

(Also worth noting that you can go the other way: if you really desperately want to preserve the raw header responses, you can just use the flexibility of Gwtar to append the WARC to the end of the file. As long as the range requests work, users won't download that part. The duplication is not so great for long-term storage, but you can just XZ them and that should remove duplication and overhead.)

reply
WARC is mentioned, with a very specific reason for why it isn't good enough: "WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display)."
reply