My problem with microVMs was that they usually won't run Docker or Kubernetes. I work on apps that consist of whole Kubernetes clusters, and I want the sandbox to contain all of that.
Does your solution support running k3s for example?
Really appreciate the feedback!
That's the one feature of similar systems that always gets left out. I understand why: it's not a priority for "cloud native" workloads. The world, however, has workloads that are not cloud native, because that comes at a high cost, and it always will. So if you'd like a real value-add differentiator for your micro-VM platform (beyond what I believe you already have), there you go.
Otherwise this looks pretty compelling.
By what I assume is your definition, there are plenty of "non cloud native" workloads running on clouds that need live migration. Azure and GCP use LM behind the scenes to give the illusion of long uptime hosts. Guest VMs are moved around for host maintenance.
As does OCI, and (relatively recently) AWS. That's a lot of votes.
Use case: some legacy database VM needs to move because the host needs maintenance, the database storage (as opposed to the database software) is on an iSCSI/NFS/NVMe-oF array somewhere, and clients are just smart enough to transparently handle a brief disconnect/reconnect (which is built into essentially every such database connection pool stack today).
Use case: a web app platform (node/spring/django/rails/whatever) with a bunch of cached client state needs to move because the host needs maintenance. The developers haven't done all the legwork to make the state survive a restart, and they'll likely never get the time needed to do that. That's essentially the same use case as the previous one. It's also rampant.
Use case: a long-running batch process (training, etc.) needs to move because reasons, and ops can't wait for it to stop, and they can't kill it because time==money. It doesn't matter that it takes an hour to move because of a big heap, as long as the previous 100 hours aren't lost.
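To make the "brief disconnect/reconnect" part concrete, here is a minimal sketch of the retry logic a client or connection pool might apply while the VM is paused for the final switchover. FlakyServer, FlakyConnection and query_with_reconnect are hypothetical names made up for illustration, not any real driver's API; real pools usually ship their own version of this.

```python
import time


class FlakyServer:
    """Hypothetical stand-in for a database whose VM is briefly paused mid-migration."""

    def __init__(self, outages):
        self.outages = outages  # number of queries that fail before service resumes
        self.connects = 0       # how many connections clients have opened

    def connect(self):
        self.connects += 1
        return FlakyConnection(self)


class FlakyConnection:
    def __init__(self, server):
        self.server = server

    def query(self, sql):
        if self.server.outages > 0:
            self.server.outages -= 1
            raise ConnectionError("connection reset during migration")
        return f"ok: {sql}"


def query_with_reconnect(connect, sql, attempts=5, base_delay=0.01):
    """Run a query, reconnecting with exponential backoff on ConnectionError."""
    conn = connect()
    for attempt in range(attempts):
        try:
            return conn.query(sql)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
            conn = connect()  # drop the dead connection and open a fresh one


server = FlakyServer(outages=2)
result = query_with_reconnect(server.connect, "SELECT 1", base_delay=0.001)
```

As long as the pause is shorter than the total retry window, the application only sees a slow query rather than a failure.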
"as in how large the heap is"
That's an undecidable moving target, so let the user worry about it. Trust them to figure out what is feasible given the capabilities of their hardware and talent. They'll do fine if you provide the mechanism. I've been shuffling live VMs between hosts for 10+ years successfully, and QEMU/KVM has been capable of it for nearly 20, never mind VMware.
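For anyone who wants a feel for what "feasible" means here, below is a back-of-envelope sketch of the usual pre-copy model: each round re-sends whatever the guest dirtied during the previous round, and the VM is paused once the remainder fits in a downtime budget. The function name, parameters and the constant dirty rate are all simplifying assumptions of mine, not how any particular hypervisor computes it.

```python
def estimate_precopy(mem_gib, bandwidth_gib_s, dirty_gib_s,
                     max_downtime_s=0.3, max_rounds=30):
    """Rough estimate of iterative pre-copy live migration.

    Assumes a constant dirty rate and constant link bandwidth, which real
    workloads will not have. Treat the result as an order-of-magnitude guide.
    """
    to_send = mem_gib   # round 1 sends all of guest memory
    elapsed = 0.0
    for rounds_done in range(max_rounds):
        final_copy_s = to_send / bandwidth_gib_s
        if final_copy_s <= max_downtime_s:
            # Remainder is small enough: pause the guest and copy it.
            return {"converged": True, "rounds": rounds_done,
                    "total_s": elapsed + final_copy_s,
                    "downtime_s": final_copy_s}
        elapsed += final_copy_s
        # Memory dirtied while this round was in flight must be re-sent.
        to_send = min(mem_gib, dirty_gib_s * final_copy_s)
    return {"converged": False, "rounds": max_rounds,
            "total_s": elapsed, "downtime_s": to_send / bandwidth_gib_s}


# 64 GiB guest, 1 GiB/s link, guest dirties 0.1 GiB/s: converges in a few rounds.
calm = estimate_precopy(64, 1.0, 0.1)
# Same guest dirtying memory faster than the link can carry it: never converges.
busy = estimate_precopy(64, 1.0, 2.0)
```

The takeaway matches the comment above: heap size alone only sets how long the move takes, while dirty rate versus bandwidth decides whether it finishes at all, and the operator is the one who knows both.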
"CRIU"
Dormant, and still containers. Also, it's re-solving solved problems once you're running in a VM, but with more steps.
Thanks
virsh migrate --live GuestName DestinationURL
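If you want to script that, a thin wrapper could look like the sketch below. It only builds the same command line shown above (plus virsh's optional --verbose progress flag); the function names and the example destination URI are placeholders I made up, and it defaults to a dry run so nothing is migrated by accident.

```python
import subprocess


def virsh_migrate_argv(guest, dest_uri, verbose=False):
    """Build the argv for `virsh migrate --live <guest> <dest_uri>`."""
    argv = ["virsh", "migrate", "--live"]
    if verbose:
        argv.append("--verbose")  # ask virsh to display migration progress
    argv += [guest, dest_uri]
    return argv


def migrate(guest, dest_uri, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    argv = virsh_migrate_argv(guest, dest_uri)
    if not dry_run:
        # Raises CalledProcessError if virsh exits non-zero.
        subprocess.run(argv, check=True)
    return argv


# Placeholder guest name and destination URI, for illustration only.
cmd = migrate("GuestName", "qemu+ssh://dest.example/system")
```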
This is feasible when network storage is available, and useful when a host needs to be drained for maintenance. My team spent 4 months on our implementation of VM memory that let us do it, and it's still our biggest time suck. We were also able to make assumptions, like RDMA, that are not generally available.
All that to say — as someone not working on smolVMs — I am confident smolVMs and most other OSS sandbox implementations will get live migration via hypervisor upgrades in the next 12 months.
Until then, there are enterprise-y providers that have it and great OSS options that already solve this, like Cloud Hypervisor.
Not useful for things it hadn't been trained on before. But now I have the core functionality in place - it's been of great help.
I have been working on something similar but on top of firecracker, called it bhatti (https://github.com/sahil-shubham/bhatti).
I believe anyone with a spare Linux box should be able to carve it into isolated programmable machines, without having to worry about provisioning them or their lifecycle.
The documentation's still early, but I have been using it for orchestrating parallel work (with deploy previews), offloading browser automation for my agents, etc. An auction-bought Hetzner server is serving me quite well :)
also, yes, shuru was (and still is) a wrapper over the Virtualization.framework, but it now supports Linux too (wrapper over KVM lol)
Linux was built in the 90s. Hardware has improved more than 1000x since. Linux virtual machine startup times stayed relatively the same.
Turns out we kept adding junk to the Linux kernel + bootup operations.
So all I did was cut and remove unnecessary parts for as long as it still worked.
This also ended up getting boot times to under 1s. The kernel changes are the 10 commits I made; you can verify here: https://github.com/smol-machines/libkrunfw
There's probably more fat to cut to be honest.
WSL2 runs a Linux virtual machine. I need to take some time and care to wire that up, but it's definitely feasible.