As a feature request, would it be possible for your pipeline to also create an EPUB? Then people could easily access and search through the document even if your site goes down. EPUB uses compression by default, so the file size might not even be too bad for the full encyclopedia.
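For what it's worth, here is a rough sketch of what an EPUB export step could look like, using only Python's standard library. I don't know how the site's pipeline is actually built, so the input shape (a list of heading/text pairs), the function names, and the placeholder identifier are all assumptions for illustration. An EPUB is a ZIP archive whose first entry is an uncompressed `mimetype` file; everything else can be deflated, which is where the size savings come from.

```python
import zipfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def xhtml_page(title, body_html):
    """Wrap an HTML fragment in a minimal XHTML document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">'
        f"<head><title>{escape(title)}</title></head>"
        f"<body><h1>{escape(title)}</h1>{body_html}</body></html>"
    )


def build_epub(path, title, articles):
    """articles: list of (heading, plain_text) pairs (hypothetical input shape)."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    items, refs, links = [], [], []
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry must come first and must be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        for i, (heading, text) in enumerate(articles):
            name = f"a{i}.xhtml"
            zf.writestr(f"OEBPS/{name}",
                        xhtml_page(heading, f"<p>{escape(text)}</p>"))
            items.append(f'<item id="a{i}" href="{name}" '
                         'media-type="application/xhtml+xml"/>')
            refs.append(f'<itemref idref="a{i}"/>')
            links.append(f'<li><a href="{name}">{escape(heading)}</a></li>')
        # Navigation document (table of contents) expected by EPUB 3 readers.
        nav_body = f'<nav epub:type="toc"><ol>{"".join(links)}</ol></nav>'
        zf.writestr("OEBPS/nav.xhtml", xhtml_page("Contents", nav_body))
        opf = (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<package xmlns="http://www.idpf.org/2007/opf" version="3.0" '
            'unique-identifier="uid">'
            '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
            # Placeholder identifier, not a real one.
            '<dc:identifier id="uid">urn:example:eb1911-sketch</dc:identifier>'
            f"<dc:title>{escape(title)}</dc:title><dc:language>en</dc:language>"
            f'<meta property="dcterms:modified">{stamp}</meta></metadata>'
            '<manifest><item id="nav" href="nav.xhtml" '
            'media-type="application/xhtml+xml" properties="nav"/>'
            f'{"".join(items)}</manifest>'
            f'<spine>{"".join(refs)}</spine></package>'
        )
        zf.writestr("OEBPS/content.opf", opf)
```

Something like `build_epub("eb1911.epub", "Encyclopaedia Britannica (1911)", articles)` would then write one file per article. I haven't run this output through a validator, and a real export would also need cross-reference links, images, and probably a split into volumes.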
A few things... when I click an article and try to jump to a new topic, the top search box (labeled "Search titles and full text...") doesn't work. Second, when I first came to the site, I was a bit stuck. It took a bit of time to realize I need to click on "Articles" or even "Topics" to start browsing. Not sure why, maybe I expected the image to let me enter the site somehow...?
Another reason would be to be able to keep running/using it even if the main site were to go down for whatever reason eventually; or to operate a mirror of it, for redundancy (linking back to the original, of course).
The underlying text (1911 edition) is public domain, but the structured version here — the parsing, reconstruction, and linking — is something I put together for this site. Right now there isn’t a bulk download available. I’m considering exposing structured access (API or dataset) in some form, but haven’t decided exactly how that will work yet.
If you have a specific use case in mind (especially for training), I’d be interested to hear more.
Separately, I've fine-tuned the Gemma 4 model[2]. It was very quick (just 90 seconds), so I think it could be interesting to train it to talk like the 1911 Encyclopedia Britannica.
I would use the entries as training data and train it to talk in the same style. There isn't a specific use case for why, I just think it would be interesting. For example, I could see how it writes about modern concepts in the style of 1911 Britannica.
[1] https://stateofutopia.com/encyclopedia/
[2] To talk like a pirate! https://www.youtube.com/live/WuCxWJhrkIM
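To make the training-data idea a bit more concrete, here is a small sketch of turning entries into a JSONL file, one record per article. The `prompt`/`completion` field names and the prompt wording are hypothetical; whatever fine-tuning tool is used would dictate the actual schema, and the entries would have to come from the site owner or another source.

```python
import json


def entries_to_jsonl(entries, out_path):
    """Write (title, text) pairs as JSONL; return the number of records written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for title, text in entries:
            # Collapse runs of whitespace left over from parsing.
            text = " ".join(text.split())
            if not text:
                continue  # skip empty entries
            record = {
                # Hypothetical schema; adjust to the trainer's expected format.
                "prompt": f"Write an encyclopedia entry on: {title}",
                "completion": text,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```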
The underlying text is public domain, but the structured version here is something I put together for the site. I haven’t released a bulk dataset yet.
If you end up experimenting with it, I’d love to hear how it turns out — and I’m still figuring out what structured access might look like.
What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.
For casual or small-scale use there’s no issue at all. For bulk use (e.g. dataset / training / redistribution), I’d prefer people get in touch so I can figure out a sensible way to support that.
If you live in the U.S., I recommend you read No Sweat of the Brow Copyright: https://www.gutenberg.org/help/no_sweat_copyright.html
They only release books that are in the public domain.
Not necessarily. Project Gutenberg does provide some works still under US copyright, such as F. P. Walter’s 1999 translation of Twenty Thousand Leagues Under the Seas: https://gutenberg.org/ebooks/2488