In the cloud or on prem I suspect folks are having better luck than I did, but also open to being wrong about this.
I have one. And it's managed. I don't think there's significant cost savings to going unmanaged, but maybe. Even so, why would I need a ton of them?
You can’t use cloud stuff on-prem, and the same goes if your clients have a server room of their own. Same for a homelab.
Also, it’s nice not to shift the pets attitude from servers to clusters and instead treat everything as cattle. Provided you have backups of persistent data, the config versioned in a Git repo, and maybe some Ansible in the mix, being able to recreate an environment in the case of a fuckup is nice and also helps against bit rot.
Disclaimer: I actually prefer Docker Swarm/Compose over K8s due to simplicity (which matches my deployments and scale), but in the cases where I had to use a variety of K8s, going for K3s was pretty okay.
Also, fun fact, k3sup is pronounced "ketchup" according to the README[0]
[0]: https://github.com/alexellis/k3sup/blob/master/README.md
It's a cool project, but I didn't think the K3s part was the hard part.
We have been running into a lot of issues with k3s in production. So I embarked on a journey of writing a Kubernetes-compliant and equivalent platform in Rust with the help of Claude [1]. It's a fun little project for now, and I'm still figuring things out. The idea is to keep it minimal: a single binary with everything embedded, including the CNI, and support for various runtimes like Docker and containerd, but also WASM, VMs, and the JVM.
Architecturally, where do you run Postgres? I assume it would be external to the cluster? (Doing it internally would create a circular dependency?)
If you want to do a quick setup, it creates a SQLite DB for the metadata.
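For reference, a sketch of the two modes. The Postgres connection string (host, user, database name) is illustrative, not something from this thread:

```shell
# Default single-server install: k3s keeps cluster state in an
# embedded SQLite database, no etcd and no external DB required.
curl -sfL https://get.k3s.io | sh -

# The SQLite state file lives here by default:
ls /var/lib/rancher/k3s/server/db/state.db

# For the external-Postgres setup (e.g. for HA), point the server at
# the database instead; credentials/host below are made up:
curl -sfL https://get.k3s.io | sh -s - server \
  --datastore-endpoint="postgres://k3s:secret@db.example.com:5432/kubernetes"
```

With an external datastore the server nodes become stateless-ish, which answers the circular-dependency concern: the Postgres instance sits outside the cluster it backs.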
With all respect, "building it because I want to" and "working toward making (it) production grade" doesn't inspire a ton of confidence. k3s has been part of the CNCF for many years and its developer Darren Shepherd was the founding CTO for both cloud.com and Rancher Labs, which were acquired by Citrix and SUSE. It looks like you're running your own B2B company and hoping to swap out k3s as the underlying engine for multitenancy. That's very risky. Surely Claude can help you understand and use k3s just as readily as help you write a replacement, and I'm sure SUSE sells professional services. I have no clue what they charge but typically you're talking like $300 an hour and you'd probably only need 40 hours.
There were many issues. Top of mind: after a DR drill in which a VM was booted, the node did not join the cluster. Apart from that, a bunch of issues due to etcd and Longhorn.
Another major one was that the CNI stopped working for a particular node. Garbage collection for images was another: we labelled the images, and it would still remove them from the node.
A bunch of these kinds of issues, when our requirements are fairly straightforward. Therefore we are working towards a stripped-down version.
There is a lot of operational complexity in general that most of us can do without.
I've found things more stable if you can give k3s a dedicated interface just for internal cluster communication. It can be a bridge interface on top of a VLAN interface - but not the VLAN interface itself, or some things will break in very interesting ways. Also, even when using IPv6, just stick with internal IPs and NAT everything - touching the internal IP ranges later is no fun. Plus, if there's any chance you'd ever want dual stack, set it up with internal v6 addresses from the start and just don't use them for now. There's a lot of unintuitive behaviour around dual-stack networking - and lots of areas where the documentation is just plain wrong.
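A minimal sketch of what that setup looks like as k3s server flags. The interface name, node IP, and CIDRs are assumptions for illustration, not values from this comment; the dual-stack comma syntax is the part worth noting, since you reserve the v6 ranges up front even if you only use v4 for now:

```shell
# br-k3s, 10.10.0.1 and the CIDRs below are made-up example values.
# Pin all cluster traffic to a dedicated bridge interface and declare
# dual-stack pod/service ranges from day one:
k3s server \
  --flannel-iface=br-k3s \
  --node-ip=10.10.0.1 \
  --cluster-cidr=10.42.0.0/16,fd00:42::/56 \
  --service-cidr=10.43.0.0/16,fd00:43::/112
```

Changing `--cluster-cidr` or `--service-cidr` on an existing cluster is exactly the "touching internal IP ranges" pain being described, which is why reserving them early helps.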
I'm scripting our stuff with Ansible. One of the more useful realisations was that in some areas, changes which shouldn't break anything can interrupt cluster communication - a very interesting thing to deal with, especially when you can't pin it to that change which didn't touch anything nearby and therefore shouldn't be responsible. I've learned, and have sprinkled in checks to make sure all members can still reach each other, so that when I break something with a change I at least know why right away.
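The kind of check described could be sketched as a plain shell loop (the node names are hypothetical, and 6443 is the Kubernetes API port; a real playbook would also probe the CNI ports, e.g. flannel's VXLAN port 8472/udp):

```shell
# Illustrative full-mesh reachability check: every node tries to reach
# every other node's API port. Node names below are placeholders.
NODES="node1 node2 node3"
for src in $NODES; do
  for dst in $NODES; do
    [ "$src" = "$dst" ] && continue
    # -z: just test the connection, -w 2: two-second timeout
    ssh "$src" "nc -z -w 2 $dst 6443" \
      || echo "WARN: $src cannot reach $dst on 6443"
  done
done
```

Running this right after each change (rather than at the end of a long playbook) is what makes the failing change obvious instead of a mystery discovered later.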
I cannot wait for the end of this month to leave that place.
We are hiring, btw.