Architecturally, where do you run Postgres? I assume it would be external to the cluster? (Running it inside the cluster would create a circular dependency?)
If you want to do a quick setup, it creates a SQLite DB for the metadata.
With all respect, "building it because I want to" and "working toward making (it) production grade" doesn't inspire a ton of confidence. k3s has been part of the CNCF for many years and its developer Darren Shepherd was the founding CTO for both cloud.com and Rancher Labs, which were acquired by Citrix and SUSE. It looks like you're running your own B2B company and hoping to swap out k3s as the underlying engine for multitenancy. That's very risky. Surely Claude can help you understand and use k3s just as readily as help you write a replacement, and I'm sure SUSE sells professional services. I have no clue what they charge but typically you're talking like $300 an hour and you'd probably only need 40 hours.
Now that I have embarked on the journey of building this from scratch, there are new, innovative ideas I can implement without being bound to any foundation or org.
PS: We do not sell this as a product. It is 100% free and open source under the MIT license.
There were many issues. Off the top of my head: after a DR drill in which a VM was booted, the node did not join the cluster. Apart from that, there were a bunch of issues due to etcd and Longhorn.
Another major one was that the CNI stopped working for a particular node. Garbage collection for images was another: we labelled the images, and it would still remove them from the node.
We hit a bunch of issues like these even though our requirements are fairly straightforward. That is why we are working towards a stripped-down version.
There is a lot of operational complexity in general, and most of us can do without it.
I've found things more stable if you can give k3s a dedicated interface just for internal communication. It can be a bridge interface on top of a VLAN interface - but not the VLAN interface itself, or some things will break in very interesting ways. Also, even when using IPv6, just stick with internal IPs and NAT everything - touching internal IP ranges is no fun. Plus, if there's a chance you'd ever want to use dual stack, set it up with internal v6 addresses, and just don't use the v6 addresses for now. There's also a lot of unintuitive behaviour around dual stack networking - and lots of areas where the documentation is just plain wrong.
I'm scripting our stuff with Ansible. One of the more useful realisations was that, in some areas, changes which shouldn't break anything can still interrupt cluster communication. That is a very interesting thing to deal with, especially when you can't pin it on the change, because the change didn't touch anything close to the network and therefore should not be responsible. I've learned from that and sprinkled in checks to make sure all members can still reach each other, so that when I break something with a change, I at least know directly why.
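For anyone who wants to do something similar, here is a rough sketch of that kind of check in Python. It is not what I actually run; the function names are made up for illustration, and the default port 10250 (kubelet) is an assumption about what is worth probing. It only tests TCP, so it will not catch a broken UDP overlay such as flannel VXLAN.

```python
import socket
import sys
from typing import Callable, Iterable, List


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def unreachable_peers(
    peers: Iterable[str],
    port: int = 10250,  # assumption: kubelet port, open on every node
    timeout: float = 2.0,
    probe: Callable[[str, int, float], bool] = can_connect,
) -> List[str]:
    """Return the peers that could not be reached, in input order."""
    return [host for host in peers if not probe(host, port, timeout)]


if __name__ == "__main__":
    # Hypothetical usage: pass the internal IPs of the other members as arguments.
    failed = unreachable_peers(sys.argv[1:])
    if failed:
        print("unreachable: " + ", ".join(failed))
        sys.exit(1)
```

Run from every node right after each change (for example via an Ansible `script` task), a non-zero exit code stops the play at the step that broke connectivity instead of much later.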
I cannot wait for the end of this month to leave that place.
We are hiring, btw.