Hister is a full text indexer for websites and local files which automatically saves all the visited pages rendered by your browser. Storing full page content allows serving offline result previews and the full page content via MCP.
Take a look at how the MCP can be utilized: https://hister.org/posts/give-your-ai-assistant-a-private-me...
I've been trying to find something to use for enriching my own self-hosted LLMs and agentic tools with information I find useful. Metasearch tools like SearXNG make it less likely you'll get blocked by bot detection tools when finding information, but usually it's something I've already found, read, or seen that I want to incorporate into my tooling.
I came to the conclusion that a self-hosted content storage system with a search engine and a browser extension that can extract and save web page content and metadata was the ideal setup for me. Preferably with some sort of federated content sharing and the ability to import Creative Commons content like Wikipedia and Gutenberg.
This looks almost exactly like what I wanted.
It'll be a few weeks before I have time to audit the code and deploy it, but I'm really looking forward to trying it out.
Some 20 years ago a friend of mine set up a local proxy (Python, if I'm not mistaken) that gathered all his web traffic and served as his long-term memory. The proxy had a web interface and let him quickly find something he had seen ca. 10 days ago, or that specific algorithm he recalled but whose name he couldn't remember.
For years I've been collecting links to different work-related trivia, which I use on a daily basis as a rabbit-from-a-hat solution to answer random questions from friends and coworkers. For example, someone randomly asked me for an idea for a color palette for data charts, and I could immediately give them a scientific study on color palettes. Or an obscure algorithm.
But over time the collection has grown substantially, and it's really cumbersome to find the right thing.
Would your project be a good fit for my problem?
I started Hister as a proxy as well, but quickly switched to the current extension-based approach, because intercepting HTTPS traffic requires a MiTM proxy, which is much more painful to set up than installing a browser extension.
also what exactly are you using for search? does it support trigrams? how do you sort results?
Now for many of us the browser extension approach is not possible (mobile usage, security, etc.)
My feeling is that for a lot of users there is a third way apart from the MiTM proxy and the browser extension approach. I actually do not want my "personal" / "logged in" pages to be indexed. That is a bit like the MS Recall nightmare (self-hosted version).
Any way to get the list of visited URLs (with something like Privoxy, or maybe one of those popular ad blockers like Pi-hole, though I guess they only see DNS queries?) and then import it with some filtering rules in a nightly batch job would be good enough for a lot of people.
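To make the "filtering rules in a nightly batch job" idea concrete, here is a minimal sketch in Python. The blocked hosts and path fragments are made-up placeholders, and nothing here is part of Hister, Privoxy, or Pi-hole; it only shows how a URL list could be deduplicated and stripped of "logged in" pages before being handed to an indexer.

```python
from urllib.parse import urlsplit

# Hypothetical filtering rules; replace with your own.
BLOCKED_HOSTS = {"mail.example.com", "bank.example.com"}
BLOCKED_PATH_PARTS = ("/login", "/account", "/settings")


def should_index(url: str) -> bool:
    """Return True if the URL looks public enough to index."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if not host or host in BLOCKED_HOSTS:
        return False
    path = parts.path.lower()
    return not any(part in path for part in BLOCKED_PATH_PARTS)


def filter_urls(urls):
    """Deduplicate while preserving order, dropping filtered URLs."""
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        if url and url not in seen and should_index(url):
            seen.add(url)
            kept.append(url)
    return kept
```

A cron job could read the day's URL log, pass it through `filter_urls`, and feed only the surviving URLs to whatever importer is in use.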
The browser import [1] is great but I guess hard to use with mobile...
- [0] https://www.privoxy.org/
There is already an ongoing discussion about the topic: https://github.com/asciimoo/hister/issues/387
The currently discussed solution relies on the browser extension, but mobile Firefox has extension support.
I've been using it for 6-7 months and my index size is below 1GB with almost 10k pages.
Also, a downside of the proxy approach: it does not properly handle JS-based websites and cannot identify dynamic content changes. Our extension periodically checks whether the content of the browser tabs has changed and automatically updates the index when a change is detected.
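For readers wondering what periodic change detection can look like, here is a rough sketch of one common technique: hashing the extracted page text and comparing it to the last indexed digest. This is an illustration in Python with hypothetical names, not Hister's actual extension code.

```python
import hashlib


class ChangeTracker:
    """Remembers a digest of the last indexed text for each URL."""

    def __init__(self):
        self._digests = {}

    def needs_reindex(self, url: str, text: str) -> bool:
        # Collapse whitespace so trivial layout changes do not trigger a reindex.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if self._digests.get(url) == digest:
            return False
        self._digests[url] = digest
        return True
```

Calling `needs_reindex` on a timer for every open tab would send a page to the indexer only when its text really differs from the previous snapshot.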
There have been no stress tests in this regard. The indexer library Bleve [1] can handle millions of documents according to its documentation.
> Could it evolve into a general internet search engine if, say, enough trusted members of a geo-distributed YugabyteDB cluster and an army of crawlers built a sufficient index?
My long-term goal is exactly this. I'd like to add federation/P2P features [2][3] to evolve beyond being a private search companion. I'd appreciate any help designing the system.
[1] https://blevesearch.com/docs/Home/ [2] https://github.com/asciimoo/hister/discussions/432 [3] https://hister.org/posts/public-search
All the best!
Can I give your software a huge list of URLs to index? Or do I need to use browser automation to open them a few at a time so that it caches and indexes them?
Hister has a built-in crawler with standard HTTP library and browser-based backends, so you can feed your link collection to it. Hister also supports importing your existing browser history automatically using either of the mentioned backends.
I agree, we should add more extractors [1]. Can you recommend extractors you missed?
The nice thing about using your own backend is that you can prioritize it in the results. For example, if I crawl the small web and other sites that are important to me, these sites come up first in the results.
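As a rough illustration of what "prioritizing your own backend" means, here is a simplified sketch in Python. It is not SearXNG's actual scoring; the result keys (`url`, `engine`, `position`) and the weights are assumptions made up for the example. Each result gets a score of engine weight divided by its rank, and scores for the same URL are added up.

```python
def rank_results(results, engine_weights, default_weight=1.0):
    """Return URLs sorted by summed (engine weight / position), best first.

    Each result is a dict with hypothetical keys "url", "engine"
    and a 1-based "position" within that engine's result list.
    """
    scores = {}
    for r in results:
        weight = engine_weights.get(r["engine"], default_weight)
        scores[r["url"]] = scores.get(r["url"], 0.0) + weight / r["position"]
    return sorted(scores, key=lambda url: scores[url], reverse=True)
```

With a weight of 3.0 for the self-hosted engine, its second hit (score 1.5) outranks the first hit of an engine left at the default weight (score 1.0).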
Same here
> with YaCY Backends and else as fallback.
Do you run your own "super fast" YaCy instance? or with specific settings?
My experience with YaCy is that it doesn't fit as a backend for SearX, since YaCy kind of slowly streams results for about 30 seconds...
I also have a local `kiwix-serve` serving ZIM files of Wikipedia, Wiktionary, Gutenberg, ArchWiki, etc., but it has the same problem: the Kiwix search engine [0] doesn't really fit as a backend for SearX, as it returns too many results and pollutes the SearX result page.
What I haven't done yet is try to plug SearX into a local Recoll instance [1]. Recoll doesn't support indexing ZIM files, but it could be useful for other archived HTML documents.
I would be curious to know more about a working setup since search is hard to get right.
- [0] https://kiwix-tools.readthedocs.io/en/latest/kiwix-serve.htm...
- [1] https://docs.searxng.org/dev/engines/online/recoll.html
    environment:
      JAVA_OPTS: >-
        -XX:+UseG1GC
        -XX:MaxGCPauseMillis=200
        -XX:+ParallelRefProcEnabled
        -XX:+UseStringDeduplication
        -XX:InitiatingHeapOccupancyPercent=45
        -XX:G1ReservePercent=15
        -Xms1024m
        -Xmx3072m
        -XX:MaxMetaspaceSize=256m
        -XX:MaxDirectMemorySize=256m
        -XX:+ExitOnOutOfMemoryError
        -XX:G1HeapWastePercent=10
        -XX:G1MixedGCCountTarget=4
    deploy:
      resources:
        limits:
          cpus: "4.2"
          memory: 5.2G
        reservations:
          cpus: "2"
          memory: 2.5G
    healthcheck:
      test: |
        /bin/bash -c '
        if ! timeout 55s wget --spider --no-verbose http://127.0.0.1:8090/yacysearch.html?query=exiguus; then
          exit 1
        fi
        if ! timeout 55s yacy_search_server/bin/checkalive.sh; then
          exit 1
        fi
        exit 0
        '
      interval: 120s
      timeout: 60s
      retries: 3
      start_period: 240s
That's the smallest configuration I got running mostly stable and self-healing with an index size of over 100 GB. I also avoid crawling with the built-in tasks and instead use the API and cron jobs for weekly feed importing, because I found that this kind of crawling eats up fewer resources than the usual approach. Overall, too many running crawlers make retrieving search results slow.
For production use, I suggest at least doubling the resources. If you do this, it becomes very stable.
Thanks for pointing out Kiwix. I'll give it a try.
> Thanks for pointing out Kiwix. I'll give it a try.
I see YaCy works with ZIM files [0] packaged by Kiwix so this is great.
In theory, if you run YaCy, Kiwix is not necessary, but they already package valuable sites like Wikipedia, iFixit, ArchWiki, etc. [0], so you do not have to worry about your crawler being blocked, and you have a local copy anyway [1]. That saves a lot of bandwidth and headaches.
- [0] https://github.com/yacy/yacy_search_server/tree/master/sourc...
I'm curious what setups folks use to provide this functionality.
Since the quantized 24B parameter Gemma model came out, I've had good luck with tool calling on a 4070 Ti Super.
Successful tool calling is what finally made the local experience useful.
I should note this is for the general and not coding specific context.
It's at the bottom of this page: https://docs.searxng.org/admin/settings/settings_search.html
I have a friend with a 4080 who wants to experiment with local models, and those cards should be similar enough. Can you give any more detail about your setup? Ty!
`gemma4-26b-a4b-it-qat.gguf`
https://huggingface.co/lmstudio-community/gemma-4-26B-A4B-it...
It is really great to use. As the poster above mentioned, my setup with SearXNG is the following, all through `llama.cpp`, which has a built-in webui with an MCP client:
* SearXNG in Docker — enable the JSON API (`search.formats: [html, json]`; off by default).
* `searxng-mcp` (FastMCP, native streamable-HTTP): `TRANSPORT=streamable-http HOST=127.0.0.1 PORT=8100 SEARXNG_URL=http://localhost:8888 uvx --from searxng-mcp --with fastmcp searxng-mcp`
* `llama-server` with `--webui-mcp-proxy`, then add the server in the webui.
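Before wiring up the MCP layer, it can help to check that the JSON API is actually enabled. Below is a minimal Python sketch that queries a SearXNG instance directly; it assumes the instance listens on `http://localhost:8888` as in the command above and that the response carries a `results` list with `title` and `url` fields. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str) -> str:
    """Build a /search URL that asks for JSON output."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def parse_results(payload: dict, limit: int = 5):
    """Extract (title, url) pairs from a SearXNG-style JSON response."""
    return [
        (r.get("title", ""), r.get("url", ""))
        for r in payload.get("results", [])[:limit]
    ]


def search(base_url: str, query: str, limit: int = 5):
    """Run one query against the instance and return the top results."""
    with urlopen(build_search_url(base_url, query), timeout=10) as resp:
        return parse_results(json.load(resp), limit)
```

If `search("http://localhost:8888", "hister")` fails with an HTTP error instead of returning results, the `json` format is probably still disabled in the instance settings.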
Some gotchas:
* `searxng-mcp` forgets to declare its own dep → `--with fastmcp`.
* Endpoint is `/mcp`, not the `/searxng-mcp/mcp` the docs claim.
* `--webui-mcp-proxy` only enables the CORS proxy; each MCP server entry still needs its "Use llama-server proxy" checkbox ticked, or the browser fetches direct and CORS-fails.
* Terminal clients (OpenCode etc.) skip the proxy — point them straight at `:8100/mcp`.
A couple interesting tidbits:
* There are temporal issues with search-related tool calls. The model trips out: 2026 results read to it as a "future-dated hallucination" because it doesn't know the date. There's an additional `--tools get_datetime` function that lets it ground itself in the real date.
* Snippets-only is enough for most "what's current" questions and keeps context tiny.
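Both tidbits can be sketched in a few lines of Python. The `current_datetime` function below is a hypothetical stand-in for a date tool, not the llama.cpp `get_datetime` implementation, and `to_snippets` assumes search results are dicts with `title`, `url`, and `content` keys.

```python
from datetime import datetime, timezone


def current_datetime(now=None) -> str:
    """Return the current UTC time as ISO 8601 so a model can ground dates."""
    now = now or datetime.now(timezone.utc)
    return now.isoformat(timespec="seconds")


def to_snippets(results, max_chars=200):
    """Keep only title, URL and a truncated snippet from each result."""
    return [
        {
            "title": r.get("title", ""),
            "url": r.get("url", ""),
            "snippet": r.get("content", "")[:max_chars],
        }
        for r in results
    ]
```

Returning the date as a tool result lets the model compare it with dates in the search hits, and trimming each hit to a short snippet keeps the context small.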
Let me know if you have any questions!
I won't say SearXNG can help you de-Google, because a metasearch engine calls other search engines, e.g., Google, to collect results. However, if you try SearXNG, you can at least get rid of things like AI reviews in no time.
Finally, thank you to the people behind the SearXNG project and the public instances.
I often have to go through the preferences to deselect search engines that don't work (often because the instance is blocked) or select those that do, because of reliability problems. Which engines work can differ between public instances, so even saving a preference hash doesn't always help.
Would be great if SearXNG did automatic adjustment of presented search engines (or offered the option) based on reliability.
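To illustrate the idea (this is not an existing SearXNG feature, and all names and thresholds are made up), here is a small Python sketch that tracks the success rate per engine and hides engines that fall below a threshold once enough samples have been collected.

```python
from collections import defaultdict


class EngineHealth:
    """Tracks per-engine success rates and filters out unreliable engines."""

    def __init__(self, min_success_rate=0.5, min_samples=5):
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        self._ok = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, engine: str, success: bool) -> None:
        self._total[engine] += 1
        if success:
            self._ok[engine] += 1

    def is_enabled(self, engine: str) -> bool:
        total = self._total.get(engine, 0)
        # Engines without enough samples get the benefit of the doubt.
        if total < self.min_samples:
            return True
        return self._ok.get(engine, 0) / total >= self.min_success_rate

    def enabled(self, engines):
        return [e for e in engines if self.is_enabled(e)]
```

A real implementation would probably also let old failures expire so that a temporarily blocked engine can come back.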
It also needs a SearXNG server to run, so I recently pivoted towards a self-contained solution: https://github.com/denysvitali/search-mcp
I actually included a recipe to deploy it to kubernetes in typekro, my TypeScript infrastructure-as-code project for kubernetes: https://typekro.run/api/searxng/
Image search is worse, I guess mostly for lack of a CDN, so it's slow, but whatever.