by asciimoo21 hours ago |

by jodoherty2 hours ago|

Beautiful! Thank you for making this.

I've been trying to find something to use for enriching my own self-hosted LLMs and agentic tools with information I find useful. Metasearch tools like SearXNG make it less likely you'll get blocked by bot detection tools when finding information, but usually it's something I've already found, read, or seen that I want to incorporate into my tooling.

I came to the conclusion that a self-hosted content storage system with a search engine and a browser extension that can extract and save web page content and metadata was the ideal setup for me, preferably with some sort of federated content sharing and the ability to import Creative Commons content like Wikipedia and Project Gutenberg.

This looks almost exactly like what I wanted.

It'll be a few weeks before I have time to audit the code and deploy it, but I'm really looking forward to trying it out.

by ydj16 hours ago|

Hister sounds like something I wanted for a while, but never got around to building. Searching stuff I’ve seen before is most of what I do with a search engine, so having it local and fast would be amazing. Eager to give it a try.

by phrotoma6 hours ago|

And the number of times I've searched for something that I saw a while ago but is now gone is way too damned high.

by zeroq19 hours ago|

I'm sorry for not taking the time to read the docs, but I have a question.

Some 20 years ago a friend of mine set up a local proxy (Python, if I'm not mistaken) that gathered all his web traffic and served as his long-term memory. The proxy had a web interface and allowed him to quickly find something he saw ca. 10 days ago, or that specific algorithm he recalled but couldn't remember the name of.

For years I've been collecting links to different work-related trivia, which I use on a daily basis as a rabbit-from-a-hat solution to answer random questions from friends and coworkers. For example, someone randomly asked me for a color palette idea for data charts and I could immediately give them a scientific study of color palettes. Or an obscure algorithm.

But with time the collection has grown substantially and it's really cumbersome to find the proper things.

Would your project be a good fit for my problem?

by asciimoo19 hours ago|

Absolutely, this is a great example where Hister can shine.

I started Hister as a proxy as well, but quickly switched to the current extension-based approach, because intercepting HTTPS traffic requires a MiTM proxy, which is much more painful to set up than installing a browser extension.

by zeroq18 hours ago|

would it be possible to gdrive/rsync/git the data between machines and then use the data on an online server for retrieval (given that I would handle data sync myself)?

also what exactly are you using for search? does it support trigrams? how do you sort results?

by sunshine-o10 hours ago|

I found Hister a few months ago and was amazed by it.

Now for many of us the browser extension approach is not possible (mobile usage, security, etc.)

My feeling is for a lot of users there is really a third way apart from the MiTM proxy or Browser extension approach. I actually do not want my "personal" / "logged in" pages to be indexed. This is a bit like the MS recall nightmare (self hosted version).

Any way to get the list of visited URLs (with something like Privoxy [0], or maybe one of those popular ad blockers like Pi-hole, but I guess they just get DNS queries?) and then import it with some filtering rules in a nightly batch job would be good enough for a lot of people.
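
To make the "filtering rules" idea concrete, here is a minimal Python sketch of the filtering step such a nightly job might run before handing URLs to an importer. Everything in it is an assumption for illustration: the rule lists, `is_indexable` and `filter_visited` are hypothetical names, and nothing here is part of Hister or Privoxy.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hypothetical deny rules: glob patterns matched against the host name.
DENY_HOSTS = ["mail.*", "*.bank.example", "accounts.*"]
# Hypothetical deny rules: path substrings that suggest a logged-in page.
DENY_PATH_PARTS = ["/login", "/account", "/settings"]


def is_indexable(url: str) -> bool:
    """Return True if the URL passes the deny rules above."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname or ""
    if any(fnmatch(host, pattern) for pattern in DENY_HOSTS):
        return False
    path = parts.path.lower()
    return not any(part in path for part in DENY_PATH_PARTS)


def filter_visited(urls: list[str]) -> list[str]:
    """Drop rejected URLs and duplicates, keeping first-seen order."""
    seen: set[str] = set()
    kept: list[str] = []
    for url in urls:
        if url not in seen and is_indexable(url):
            seen.add(url)
            kept.append(url)
    return kept
```

A deny list like this only approximates "personal" pages; an allow list of known-safe domains would be the stricter choice.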

The browser import [1] is great but I guess hard to use with mobile...

- [0] https://www.privoxy.org/

- [1] https://hister.org/docs/importing-browser-history

by asciimoo8 hours ago|

Thanks for the kind words =]

There is already an ongoing discussion about the topic: https://github.com/asciimoo/hister/issues/387

The currently discussed solution relies on the browser extension, but mobile Firefox has extension support.

by justusthane19 hours ago|

Also very interested in this. I was playing around with doing the same thing with YaCy. I want the proxy aspect so that I can proxy my phone traffic through it as well.

by asciimoo19 hours ago|

Unfortunately mobile Chrome browsers don't support browser extensions, but our extension works well on mobile Firefox.

by left-struck16 hours ago|

Would you mind sharing these links? Or a subset? I want to grow my collection which is tiny because I started way too late

by Leonard_of_Q5 hours ago|

Interesting, a local search option. I made the Recoll engine for SearX and now SearXNG, and still use it daily over a rather large archive of journal articles and other non-fiction texts. Recoll's indexer can extract text from just about anything I throw at it; it also extracts and indexes metadata. Would Hister serve the same purpose, and if so, is there a SearXNG engine to integrate it into the result stream?

by exiguus11 hours ago|

YaCy has a proxy mode that automatically indexes your web surfing. In my experience, the index grows very fast and reaches ~100GB or more. How does the index size of Hister compare to that?

by asciimoo9 hours ago|

Hister stores only the text content of HTML/PDF pages. 1000 documents require around 80-100MB of storage and there is still plenty of room to optimize for storage space.

I've been using it for 6-7 months and my index size is below 1GB with almost 10k pages.
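
As a back-of-the-envelope check on those numbers, the short Python sketch below scales the quoted 80-100MB per 1000 documents linearly. The linear-growth assumption and the `estimate_index_mb` helper are illustrative only; real index size depends on page length and index settings.

```python
# Figures quoted above: roughly 80-100 MB per 1000 documents.
MB_PER_1000_DOCS = (80, 100)


def estimate_index_mb(num_docs: int) -> tuple[float, float]:
    """Return a (low, high) index size estimate in MB, assuming linear growth."""
    low, high = MB_PER_1000_DOCS
    return (num_docs / 1000 * low, num_docs / 1000 * high)


print(estimate_index_mb(10_000))  # (800.0, 1000.0)
```

That range of 800-1000MB for 10k pages is consistent with an index that stays just under 1GB at almost 10k pages.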

Also, a downside of the proxy approach: it does not properly handle JS-based websites and cannot identify dynamic content changes. Our extension periodically checks whether the content of the browser tabs has changed and automatically updates the index when a change is detected.
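
One common way to implement this kind of periodic change check is to keep a digest of the text last seen for each URL and re-index only when the digest differs. The Python sketch below shows that general technique under stated assumptions; it is not Hister's actual code, and `TabIndexTracker` and `needs_reindex` are hypothetical names.

```python
import hashlib


class TabIndexTracker:
    """Remembers a digest of each tab's text to detect content changes."""

    def __init__(self) -> None:
        # Maps URL -> SHA-256 hex digest of the text seen at the last check.
        self._digests: dict[str, str] = {}

    def needs_reindex(self, url: str, text: str) -> bool:
        """Return True when the text is new or differs from the last check."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._digests.get(url) == digest:
            return False
        self._digests[url] = digest
        return True
```

A timer would call `needs_reindex` with each open tab's extracted text and send the page to the indexer only when it returns `True`, so unchanged tabs cost one hash per check.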

by BrunoBernardino8 hours ago|

Hister is a great idea and the creator is a really nice person, please give it an honest look and consider supporting them (I'm Uruky's co-founder and we sponsored them)!

by scritty-dev5 hours ago|

this is really cool, first time hearing about this, is there any org level model for this so you can promote individual's indexed websites into an organization/team owned model?

by asciimoo4 hours ago|

Multiple users can use a shared instance and collect their indexed content in a central place. Hister has user handling and a "public mode" as well: https://hister.org/posts/public-search

by MrDrMcCoy19 hours ago|

Always excited to see new things like Hister in the search space. What are the scaling limits, as far as you can tell in terms of how much can it hold before queries start breaking down or become too slow to be useful? Could it evolve into a general internet search engine if, say, enough trusted members of a geo-distributed YugabyteDB cluster and an army of crawlers built a sufficient index?

by asciimoo18 hours ago|

> What are the scaling limits, as far as you can tell in terms of how much can it hold before queries start breaking down or become too slow to be useful?

There have been no stress tests in this regard. The indexer lib Bleve [1] can handle millions of documents according to its documentation.

> Could it evolve into a general internet search engine if, say, enough trusted members of a geo-distributed YugabyteDB cluster and an army of crawlers built a sufficient index?

My long-term goal is exactly this. I'd like to add federation/P2P features [2][3] to evolve from being a private search companion. I'd appreciate any help designing the system.

[1] https://blevesearch.com/docs/Home/

[2] https://github.com/asciimoo/hister/discussions/432

[3] https://hister.org/posts/public-search

by derrida18 hours ago|

Wow! That looks like a bit of software I have been dreaming about for a while - will definitely check it out! You're at least doing something right in communicating the reasons why and the appeal, for starters! All the best!

by Abishek_Muthian15 hours ago|

This is great. Like many others, I've been thinking of something like Hister, but only for bookmarked web pages. I presume it should be straightforward to do that with Hister?

All the best!

by asciimoo12 hours ago|

It is possible. The automatic website indexing can be turned off in the extension, and manual indexing can be triggered via the command line tool, the extension popup or hotkeys.

by chrisss39519 hours ago|

I love your idea and wondered why saving and indexing browser-visited pages was not being done. Does this handle large amounts of local files, for example 10-20TB across file types like PowerPoint, Excel, Word, and PDF?

by asciimoo19 hours ago|

In its current form it cannot handle this amount of data efficiently (and doesn't support PowerPoint/Excel/Word yet), but this is a valid use case. I've added a TODO item to experiment with it.

by blackqueeriroh18 hours ago|

Oh thank god. There used to be several tools like this and they slowly went away. I’ve been wanting this to return.

by kristianpaul20 hours ago|

Is this similar to fastcrw ?

by asciimoo20 hours ago|

Both are search engines, but that's where the similarity ends. Hister has a traditional crawler, but its biggest strength is automatically indexing browser tabs as they are rendered. This way it bypasses authentication, Cloudflare, captchas and most of the annoying limitations of traditional crawlers. Hister also provides full offline result previews. Check out the small read-only demo: https://demo.hister.org/

by nickpsecurity16 hours ago|

I was considering paying someone to build something like this at some point. With two jobs, I eventually had no time to even organize what I find. It's just piles of links in text files.

Can I give your software a huge list of URLs to index? Or do I need to use browser automation to open them a few at a time with it caching and indexing them?

by asciimoo9 hours ago|

I accept donations ;)

Hister has a built-in crawler with standard HTTP lib and browser-based backends; you can feed your link collection to it. Also, Hister supports importing your existing browser history automatically using either of the mentioned backends.

by operatingthetan20 hours ago|

I installed this a while back and honestly I almost never touch it. It turns out that for me searching my history doesn't really replace a search engine at all. The built-in extractor list is pretty limited and adding more seems like too much of an ordeal for me to bother.

by asciimoo19 hours ago|

Sure, it cannot fully replace web search engines (yet), but it can reduce the dependence on these services more and more as your index grows. Hister is designed to support quickly falling back to traditional search engines with a single hotkey if no results are found.

I agree, we should add more extractors [1]. Can you recommend extractors you missed?

[1] https://github.com/asciimoo/hister/issues/305
