> SearXNG is my daily internet search now +5 years

Same here

> with YaCY Backends and else as fallback.

Do you run your own "super fast" YaCy instance, or one with specific settings?

In my experience YaCy doesn't fit well as a SearX backend, since it streams results slowly over about 30 seconds...

I also have a local `kiwix-serve` serving ZIM files of Wikipedia, Wiktionary, Gutenberg, ArchWiki, etc., but it has a similar problem: the kiwix search engine [0] doesn't really fit as a SearX backend, as it returns too many results and pollutes the SearX result page.

What I haven't tried yet is plugging SearX into a local Recoll instance [1]. Recoll doesn't support indexing ZIM files, but it could be useful for other archived HTML documents.

I would be curious to know more about a working setup, since search is hard to get right.

- [0] https://kiwix-tools.readthedocs.io/en/latest/kiwix-serve.htm...

- [1] https://docs.searxng.org/dev/engines/online/recoll.html

I run my own YaCy instances, three of them to be specific, because they are "super fast" and "reboot" often. With them I crawl the smallweb, smallcomic and smallyt sites, plus all feeds from my Miniflux instance, which I fetch via the Miniflux API. Besides that, I have other static entries that I crawl.

For Wikibooks and Wikipedia I also tried YaCy and still use it, but it uses a lot of resources, so it's only on one instance. I suggest >16GB RAM and 300GB+ HDD if you want to do this. To query Wikimedia, Gutenberg, ArchWiki or media.ccc.de directly, I also use SearXNG.

Usually it takes 1-3 seconds to get search results from YaCy in my setup. I run the instances in Docker on aarch64 with ~6GB of RAM and 200GB HDD. The VPS itself has 8GB RAM, 6 ARM cores and 250GB HDD. If YaCy hangs, I just restart it. These are the Docker deploy and Java settings I currently use, which work pretty well:

    environment:
      JAVA_OPTS: >-
        -XX:+UseG1GC
        -XX:MaxGCPauseMillis=200
        -XX:+ParallelRefProcEnabled
        -XX:+UseStringDeduplication
        -XX:InitiatingHeapOccupancyPercent=45
        -XX:G1ReservePercent=15
        -Xms1024m
        -Xmx3072m
        -XX:MaxMetaspaceSize=256m
        -XX:MaxDirectMemorySize=256m
        -XX:+ExitOnOutOfMemoryError
        -XX:G1HeapWastePercent=10
        -XX:G1MixedGCCountTarget=4
    deploy:
      resources:
        limits:
          cpus: "4.2"
          memory: 5.2G
        reservations:
          cpus: "2"
          memory: 2.5G
    healthcheck:
      test: |
        /bin/bash -c '
        if ! timeout 55s wget --spider --no-verbose http://127.0.0.1:8090/yacysearch.html?query=exiguus; then
          exit 1
        fi
        if ! timeout 55s yacy_search_server/bin/checkalive.sh; then
          exit 1
        fi
        exit 0
        '
      interval: 120s
      timeout: 60s
      retries: 3
      start_period: 240s
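
As a rough sanity check on the numbers above, here is a small Python sketch that adds up the memory pools the JVM flags cap explicitly and compares the sum with the container limit. It assumes Docker reads `5.2G` as GiB. The JVM also uses memory that none of these flags bound (thread stacks, code cache, GC bookkeeping), so treat the headroom as approximate.

```python
MIB_PER_GIB = 1024

# Caps copied from the JAVA_OPTS above, in MiB.
heap_max = 3072        # -Xmx3072m
metaspace_max = 256    # -XX:MaxMetaspaceSize=256m
direct_max = 256       # -XX:MaxDirectMemorySize=256m

# deploy.resources.limits.memory, ASSUMING "5.2G" means 5.2 GiB.
container_limit = 5.2 * MIB_PER_GIB


def jvm_capped_total(heap: int, metaspace: int, direct: int) -> int:
    """Sum of the memory pools that have an explicit upper bound."""
    return heap + metaspace + direct


capped = jvm_capped_total(heap_max, metaspace_max, direct_max)
headroom = container_limit - capped

print(f"explicitly capped JVM memory: {capped} MiB")
print(f"headroom under the container limit: {headroom:.0f} MiB")
```

Under that assumption the capped pools add up to 3584 MiB, leaving roughly 1.7 GiB under the 5.2G limit for everything the flags do not bound.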

The configuration above is the smallest on which I got it running mostly stable and self-healing with an index size of 100GB+. I also avoid crawling via the built-in tasks and instead use the API and cron jobs for weekly feed imports, because I found that this kind of crawling uses fewer resources than the usual way. Overall, too many running crawlers make retrieving search results slow. For production use, I suggest at least doubling the resources. If you do this, it becomes very stable.
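
A minimal sketch of the feed-export half of such a weekly cron job, in Python. It assumes the Miniflux REST API (`GET /v1/feeds` with an `X-Auth-Token` header, feed objects carrying `feed_url` and `disabled`). The function names are made up for illustration. How each URL is then handed to YaCy is not spelled out in this thread, so that step is deliberately left out.

```python
import json
import urllib.request


def fetch_miniflux_feeds(base_url: str, api_token: str) -> list[dict]:
    """Return the feed objects from Miniflux's GET /v1/feeds endpoint."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/feeds",
        headers={"X-Auth-Token": api_token},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def crawl_targets(feeds: list[dict]) -> list[str]:
    """Collect unique feed URLs in order, skipping disabled feeds."""
    seen = set()
    targets = []
    for feed in feeds:
        url = feed.get("feed_url")
        if not url or feed.get("disabled") or url in seen:
            continue
        seen.add(url)
        targets.append(url)
    return targets


# Offline example with made-up feed objects (no network access needed).
sample = [
    {"feed_url": "https://example.org/a.xml", "disabled": False},
    {"feed_url": "https://example.org/b.xml", "disabled": True},
    {"feed_url": "https://example.org/a.xml", "disabled": False},
    {"feed_url": "https://example.org/c.xml"},
]
for url in crawl_targets(sample):
    print(url)
```

A cron entry would call `fetch_miniflux_feeds(...)` with the real instance URL and token, pass the result through `crawl_targets`, and submit each URL to the indexer one at a time, so that only a few crawls run concurrently.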

Thanks for pointing out Kiwix. I'll give it a try.

Thank you so much, this is highly valuable information.

> Thanks for pointing out Kiwix. I'll give it a try.

I see YaCy works with ZIM files [0] packaged by Kiwix, so this is great.

In theory, if you run YaCy, Kiwix is not necessary, but they already package valuable sites like Wikipedia, iFixit, ArchWiki, etc. [0], so you don't have to worry about your crawler being blocked, and you have a local copy anyway [1]. That saves a lot of bandwidth and headaches.

- [0] https://github.com/yacy/yacy_search_server/tree/master/sourc...

- [1] https://browse.library.kiwix.org/#lang=eng
