I run my own YaCy instances. Three of them, to be specific, because they are "super fast" and "reboot" often. With them I crawl the smallweb, smallcomic and smallyt sites, plus all feeds from my Miniflux instance, which I fetch via the Miniflux API. Besides that, I have other static entries that I crawl. I also tried YaCy for Wikibooks and Wikipedia and still use it, but it uses a lot of resources, so that runs on only one instance. I suggest >16GB RAM and 300GB+ HDD if you want to do this. To access Wikimedia, Gutenberg, ArchWiki or media.ccc.de directly, I also use SearXNG. It usually takes 1-3 seconds to get search results from YaCy in my setup. I run them in Docker on aarch64 with ~6GB of RAM and 200GB HDD. The VPS itself has 8GB RAM, 6 ARM cores and 250GB HDD. If YaCy hangs, I just restart it. These are the Docker deploy and Java settings I currently use, which work pretty well:

    environment:
      JAVA_OPTS: >-
        -XX:+UseG1GC
        -XX:MaxGCPauseMillis=200
        -XX:+ParallelRefProcEnabled
        -XX:+UseStringDeduplication
        -XX:InitiatingHeapOccupancyPercent=45
        -XX:G1ReservePercent=15
        -Xms1024m
        -Xmx3072m
        -XX:MaxMetaspaceSize=256m
        -XX:MaxDirectMemorySize=256m
        -XX:+ExitOnOutOfMemoryError
        -XX:G1HeapWastePercent=10
        -XX:G1MixedGCCountTarget=4
    deploy:
      resources:
        limits:
          cpus: "4.2"
          memory: 5.2G
        reservations:
          cpus: "2"
          memory: 2.5G
    healthcheck:
      test: |
        /bin/bash -c '
        if ! timeout 55s wget --spider --no-verbose http://127.0.0.1:8090/yacysearch.html?query=exiguus; then
          exit 1
        fi
        if ! timeout 55s yacy_search_server/bin/checkalive.sh; then
          exit 1
        fi
        exit 0
        '
      interval: 120s
      timeout: 60s
      retries: 3
      start_period: 240s
That's the smallest setup I got running mostly stable and self-healing with an index size of 100GB+. I also avoid crawling via the built-in tasks and instead use the API and cron jobs for weekly feed importing, because I found that this kind of crawling eats fewer resources than the usual way. Overall, too many running crawlers make retrieving search results slow. For production use, I suggest at least doubling the resources. If you do this, it becomes very stable.
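A rough sketch of what such an API-driven feed import could look like, not my exact script. The Miniflux side uses its documented `GET /v1/feeds` endpoint with an `X-Auth-Token` header. The YaCy side is an assumption: it calls the `Crawler_p.html` crawl start page with the parameter names `crawlingstart`, `crawlingMode`, `crawlingURL` and `crawlingDepth` as I remember them, so check them against your YaCy version. The base URLs, credentials and function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def feed_urls(feeds):
    """Return the unique feed URLs from a Miniflux /v1/feeds response, in order."""
    seen = []
    for feed in feeds:
        url = feed.get("feed_url")
        if url and url not in seen:
            seen.append(url)
    return seen


def crawl_start_url(yacy_base, target, depth=1):
    """Build a crawl start request URL for one target.

    Assumption: parameter names follow YaCy's Crawler_p.html form;
    verify them against your own instance before relying on this.
    """
    params = {
        "crawlingstart": "1",
        "crawlingMode": "url",
        "crawlingURL": target,
        "crawlingDepth": str(depth),
    }
    return yacy_base.rstrip("/") + "/Crawler_p.html?" + urllib.parse.urlencode(params)


def fetch_miniflux_feeds(miniflux_base, token):
    """Fetch all feeds from Miniflux using an API token."""
    request = urllib.request.Request(
        miniflux_base.rstrip("/") + "/v1/feeds",
        headers={"X-Auth-Token": token},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def start_crawls(yacy_base, user, password, targets, depth=1):
    """Start one crawl per target, assuming the YaCy admin uses HTTP digest auth."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, yacy_base, user, password)
    opener = urllib.request.build_opener(urllib.request.HTTPDigestAuthHandler(manager))
    for target in targets:
        with opener.open(crawl_start_url(yacy_base, target, depth), timeout=60) as response:
            print(target, response.status)
```

A weekly cron job would then call something like `start_crawls("http://127.0.0.1:8090", user, password, feed_urls(fetch_miniflux_feeds(miniflux_base, token)))`. Starting the crawls one after another keeps the number of crawlers running at the same time low.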

Thanks for pointing out Kiwix. I'll give it a try.

Thank you so much, this is very valuable information.

> Thanks for pointing out Kiwix. I'll give it a try.

I see YaCy works with ZIM files [0] packaged by Kiwix so this is great.

In theory, if you run YaCy, Kiwix is not necessary, but they already package valuable sites like Wikipedia, iFixit, ArchWiki, etc. [0], so you do not have to worry about your crawler being blocked, and you have a local copy anyway [1]. That saves a lot of bandwidth and headache.

- [0] https://github.com/yacy/yacy_search_server/tree/master/sourc...

- [1] https://browse.library.kiwix.org/#lang=eng
