undefined

points

[-]

It helps if you offer a concrete use case, as in how large the heap is, what kinda of blackout period you can handle, and whether the app can handle all of it's open connections being destroyed, etc. The more an app can handle resetting some of it's own state, the easier LM is going to be to implement. If your workload jives with CRIU https://github.com/checkpoint-restore/criu you could do this already.

By what I assume is your definition, there are plenty of "non cloud native" workloads running on clouds that need live migration. Azure and GCP use LM behind the scenes to give the illusion of long uptime hosts. Guest VMs are moved around for host maintenance.

by topspin6 hours ago|

parent|

[-]

"Azure and GCP use LM behind the scenes"

As does OCI, and (relatively recently) AWS. That's a lot of votes.

Use case: some legacy database VM needs to move because the host needs maintenance, the database storage (as opposed to the database software) is on a iSCSI/NFS/NVMe-oF array somewhere, and clients are just smart enough to transparently handle a brief disconnect/reconnect (which is built-in to essentially every such database connection pool stack today.)

Use case: a web app platform (node/spring/django/rails/whatever) with a bunch of cached client state needs to move because the host needs maintenance. The developers haven't done all the legwork to make the state survive restart, and they'll likely never get time needed to do that. That's essentially the same use case as previous. It's also rampant.

Use case: a long running batch process (training, etc.) needs to move because reasons, and ops can't wait for it to stop, and they can't kill it because time==money. It's doesn't matter that it takes an hour to move because big heap, as long as the previous 100 hours isn't lost.

"as in how large the heap is"

That's an undecidable moving target, so let the user worry about it. Trust them to figure out what is feasible given the capabilities of their hardware and talent. They'll do fine if you provide the mechanism. I've been shuffling live VMs between hosts for 10+ years successfully, and Qemu/KVM has been capable of it for nearly 20, never mind VMware.

"CRIU"

Dormant, and still containers. Also, it's re-solving solved problems once you're running in a VM, but with more steps.

by fqiao6 hours ago|

prev|

[-]

Really appreciate the suggestion! By "live migration", do you mean keeping the existing files and migrate them elsewhere with the vm?

Thanks

by topspin6 hours ago|

parent|

[-]

I mean making any given VM stop on host A and appear on host B; e.g. standard Qemu/KVM:

    virsh migrate --live GuestName DestinationURL

This is feasible when network storage is available and useful when a host needs to be drained for maintenance.

by benswerd1 hours ago|

parent|

[-]

Live migrations and the tech powering it was the hardest thing I ever built. Its something that I think will come naturally to projects like smolVM as more of the hypervisors build it in, but its a deeply challenging task to do in userspace.

My team spent 4 months on our implementation of vm memory that let us do it and its still our biggest time suck. We also were able to make assumptions like RDMA that are not available.

All that to say — as someone not working on smolVMs — I am confident smolVMs and most other OSS sandbox implementations will get live migration via hypervisor upgrades in the next 12 months.

Until then there are enterprise-y providers like that have it and great OSS options that already solve this like cloud hypervisor.

by fqiao6 hours ago|

parent|

prev|

[-]

I see. so right now smolvm can be stopped, and then "packed" (think of it as compressed), and restart on a different host. files in the disks are preserved, but memory snapshotting is still hard tbh

by sureglymop4 hours ago|

parent|

prev|

[-]

It's also feasible without network storage, --copy-storage-all will migrate all disks too.