ToDo/LiveMigration

Documentation

Migration documentation is far from mature. There is a lot to do here. For example, we should have a good page describing each of the migration features and parameters. Maybe we can generate something from qapi/migration.json.

We should have a unified place for migration documentation. The plan for now is to make it:

https://qemu.org/docs/master/devel/migration

We may want to move some wiki pages there and fill in the gaps. Currently this page is the sole place for keeping migration ToDo items.

New Features

Fixed RAM migration

This allows migration to a file/fd with constant file offsets for each page. This work is currently led by Fabiano Rosas <farosas@suse.de> upstream.
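To illustrate the idea (this is not QEMU's actual implementation), each guest page maps to a fixed offset in the output file, so a re-dirtied page can simply be rewritten in place with pwrite() instead of being appended to a stream. The function below is a minimal, hypothetical sketch of that offset calculation:

  /* Hypothetical sketch only: write one guest page at its fixed file offset.
   * ramblock_file_off would be wherever this RAM block's region starts in
   * the file; the real series also stores per-block headers and bitmaps. */
  #include <stdint.h>
  #include <unistd.h>

  static ssize_t write_page_at_fixed_offset(int fd, off_t ramblock_file_off,
                                            uint64_t page_index,
                                            size_t page_size,
                                            const void *host_page)
  {
      off_t off = ramblock_file_off + (off_t)(page_index * page_size);
      return pwrite(fd, host_page, page_size, off);   /* rewrite in place */
  }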

CPR VFIO migration

This allows fast hypervisor upgrade (including both kernel / qemu) even with VFIO device assignments. This work is currently led by Steven Sistare <steven.sistare@oracle.com> upstream.

Postcopy 1G page support

Postcopy can functionally work with 1G pages (x86_64), but it is mostly unusable when each page has to be requested in 1G chunks. We need a way to fault pages in at smaller sizes, such as 4K on x86_64.
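For background, the sketch below (not QEMU code) shows how a postcopy missing-page fault is resolved today with userfaultfd. For hugetlb-backed memory the kernel only accepts destination/length values aligned to the huge page size, so with 1G pages a single fault forces a full 1G transfer, which is exactly the problem described above:

  /* Minimal userfaultfd sketch: copy page data into the faulting range. */
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <stdint.h>
  #include <stddef.h>

  static int place_page(int uffd, void *dst, void *src, size_t page_size)
  {
      struct uffdio_copy copy = {
          .dst  = (uintptr_t)dst,
          .src  = (uintptr_t)src,
          .len  = page_size,   /* must equal the huge page size for hugetlb */
          .mode = 0,
      };
      return ioctl(uffd, UFFDIO_COPY, &copy);
  }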

Previous RFC works:

https://lore.kernel.org/r/20230117220914.2062125-1-peterx@redhat.com

That work requires a kernel feature called "hugetlb HGM", which was not accepted upstream due to multiple complexities. It may or may not go in that direction anymore, because KVM plans to propose a separate userfault interface given the evolution of guest_memfd. See:

https://lore.kernel.org/r/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com
https://lore.kernel.org/r/20240103174343.3016720-1-seanjc@google.com

Whatever kernel interface ends up being merged to support 1G pages, QEMU will need to support it to finally enable 1G postcopy.

Migration handshake

Currently QEMU relies on libvirt to properly set up the same migration capabilities / parameters on both sides. This may not be needed in the future if the source and destination QEMU can hold a conversation of their own, and possibly do more than that.

A summary of the procedures for such a handshake may include, but is not limited to, the following (see the sketch after the list):

  • Feature negotiation. We'll probably have a list of features, so the source can fetch the destination's list and decide what to enable and what to disable.
  • Sync migration configurations between source and destination and decide on the final configuration. E.g., the destination may lack some feature, and the source should know what's missing. If we pass this phase, both QEMUs should have the same migration setup and be ready to proceed.
  • Sync the channels used for migration (mostly covering multifd / postcopy preempt). Channels should be managed by the handshake core and named properly, so we know which channel is for whom.
  • Check the device tree on both sides, etc., to make sure the migration is applicable. E.g., we should fail early and clearly on any device mismatch.
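As a purely hypothetical sketch (none of these names exist in QEMU today), the first message of such a handshake could carry something like the following, covering the negotiation points listed above:

  /* Invented structure, for illustration only. */
  #include <stdint.h>

  typedef struct MigHandshakeHello {
      uint32_t handshake_version;     /* version of the handshake protocol itself */
      uint64_t feature_bitmap;        /* migration features the sender supports */
      uint32_t n_multifd_channels;    /* channels the peer should expect */
      uint32_t n_postcopy_channels;   /* e.g. the postcopy preempt channel */
      char     machine_type[64];      /* for early device/machine sanity checks */
  } MigHandshakeHello;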

KTLS support

Currently we do TLS live migration by leveraging gnutls. Is it possible to support kTLS? How can we further optimize TLS migration? Is it possible to leverage hardware acceleration? Etc.
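For reference, handing an established TLS session over to the kernel roughly looks like the sketch below. The key material would have to be exported from the gnutls session QEMU already negotiates; this is only an illustration of the kernel socket API, with error handling reduced to a minimum:

  /* Sketch: enable kTLS transmit offload on an established TCP socket. */
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <linux/tls.h>
  #include <sys/socket.h>
  #include <string.h>

  #ifndef SOL_TLS
  #define SOL_TLS 282            /* from the kernel's socket.h */
  #endif

  static int enable_ktls_tx(int sock, const unsigned char key[16],
                            const unsigned char iv[8],
                            const unsigned char salt[4],
                            const unsigned char rec_seq[8])
  {
      struct tls12_crypto_info_aes_gcm_128 ci = { 0 };

      ci.info.version     = TLS_1_2_VERSION;
      ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
      memcpy(ci.key, key, sizeof(ci.key));
      memcpy(ci.iv, iv, sizeof(ci.iv));
      memcpy(ci.salt, salt, sizeof(ci.salt));
      memcpy(ci.rec_seq, rec_seq, sizeof(ci.rec_seq));

      if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0) {
          return -1;
      }
      return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
  }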

Multifd+Postcopy

Currently the two features are not compatible: postcopy requires atomic updates of the pages, while multifd is so far designed to receive pages directly into guest memory buffers.

Another issue is that the multifd code is still not well tested with the postcopy VM switchover, so there can be bugs here and there around switchover, even though that shouldn't be a major design issue.

Adding support for multifd+postcopy in some form should increase migration bandwidth during postcopy. Currently, during the postcopy phase we mostly use a single channel for migrating pages, which can easily bottleneck on a single CPU (postcopy preempt mode is an exception; however, its extra channel currently only services faulted pages, so overall bandwidth may not improve).

Live snapshot for file memory (e.g. shmem)

We have background live snapshot support for anonymous memory. We don't yet support shared memory like shmem, because userfaultfd-wp did not support file-backed memory when the feature was introduced. Now that userfaultfd-wp supports shared memory, we can consider adding shmem support.

Note that multi-process shmem support will need extra work, as QEMU will also require the other processes to trap their writes to guest memory.
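As a rough illustration, write-protecting a shared (shmem) mapping with userfaultfd-wp would follow the same pattern already used for anonymous memory; the UFFDIO_API handshake and error handling are omitted here:

  /* Sketch: register a range for write-protect faults and arm protection. */
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <stdint.h>
  #include <stddef.h>

  static int wp_protect_range(int uffd, void *addr, size_t len)
  {
      struct uffdio_register reg = {
          .range = { .start = (uintptr_t)addr, .len = len },
          .mode  = UFFDIO_REGISTER_MODE_WP,
      };
      struct uffdio_writeprotect wp = {
          .range = { .start = (uintptr_t)addr, .len = len },
          .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
      };

      if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
          return -1;
      }
      return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
  }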

Optimizations

Device state downtime analysis and accountings

Currently live migration estimates the downtime _without_ considering device states. VFIO is a special case because it is migrated even during the iterative sections, and it reports its pending device state periodically, so when QEMU estimates the downtime that state can be accounted for.

However, 99% of devices don't do what VFIO does. For most devices we currently use a simplified model which assumes that device state is trivial to migrate, so it is not accounted as part of downtime. That may not always be the case.

It might be useful to find a way to investigate common cases where saving or loading device state can take time. A few examples:

What if there are one thousand vCPUs? Even if saving/loading one vCPU does not take much time, a thousand of them should be accounted as part of downtime.

Or, what if a device's pre_save()/post_load() takes a lot of time? We seem to have observed that with virtio devices already, where loading can take a relatively long time. See also the section "Optimize memory updates for non-iterative vmstates".
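One possible starting point (purely hypothetical instrumentation, not existing QEMU code) is to simply time each device's save/load routine so its cost can be reported against the downtime budget:

  /* Sketch: wrap a device save routine with a timer and report its cost. */
  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  static int64_t now_ns(void)
  {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
  }

  /* save_fn stands in for a device's vmstate save/load callback. */
  static void timed_device_save(const char *name,
                                void (*save_fn)(void *), void *opaque)
  {
      int64_t start = now_ns();
      save_fn(opaque);
      fprintf(stderr, "device %s: %" PRId64 " us\n",
              name, (now_ns() - start) / 1000);
  }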

Optimize memory updates for non-iterative vmstates

There is a potential opportunity to speed up device loads by optimizing QEMU memory updates. Refer to this thread for more information:

https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com

The work was proposed but not merged due to a lack of review. Still, the direction seems sound and is a possible path for future optimizations to shrink the downtime.

Optimize postcopy downtime

For postcopy there is currently an extra step of sending the dirty bitmap during switchover, which is extra overhead compared to precopy migration. The bitmap is normally small enough to ignore, but that may not be the case for a very large VM: 1TB of memory corresponds to a bitmap of roughly 32MB (1TB / 4KB pages = 256M pages, at one bit per page). Meanwhile, receiving the bitmap also requires punching holes for the "just dirtied" pages, which takes time too.

One thing we can do is allow the destination to start running even _before_ the bitmap is received. A new interface will be needed so the destination QEMU can ask the source QEMU whether a given page is up to date. Meanwhile, MINOR userfaults will be needed instead of MISSING userfaults, so that the destination QEMU can trap accesses even when the page already exists. It also means that with current Linux (as of v6.7) anonymous pages cannot be supported in this use case, as MINOR faults only support VM_SHARED.
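For illustration, the MINOR-fault flow would look roughly like the sketch below: the page contents already exist in the shmem/hugetlbfs object, so the destination only asks the kernel to map them with UFFDIO_CONTINUE once it has confirmed with the source that the copy is up to date:

  /* Sketch of MINOR-mode userfaultfd usage (shmem/hugetlbfs only). */
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <stdint.h>
  #include <stddef.h>

  static int register_minor(int uffd, void *addr, size_t len)
  {
      struct uffdio_register reg = {
          .range = { .start = (uintptr_t)addr, .len = len },
          .mode  = UFFDIO_REGISTER_MODE_MINOR,
      };
      return ioctl(uffd, UFFDIO_REGISTER, &reg);
  }

  /* Call this after confirming the local copy of the page is up to date. */
  static int resolve_minor_fault(int uffd, void *addr, size_t page_size)
  {
      struct uffdio_continue cont = {
          .range = { .start = (uintptr_t)addr, .len = page_size },
          .mode  = 0,
      };
      return ioctl(uffd, UFFDIO_CONTINUE, &cont);
  }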

Optimize migration bandwidth calculation

Bandwidth estimation is very important to migration, because it is the main factor used to decide when to switch the VM over from source to destination. However, the estimate can sometimes be so far off that in QEMU 8.2 we introduced a new migration parameter ("avail-switchover-bandwidth") just to allow an admin to specify the value when needed. However, it is not easy to specify that value correctly.

There may be other ways to remedy this, e.g., changing the estimation logic to provide an averaged bandwidth.
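As one example of such a change (a sketch only, with an arbitrary smoothing factor), an exponential moving average over observed transfer rates would dampen the spikes that currently mislead the switchover decision:

  /* Sketch: smooth the observed bandwidth with an exponential moving average. */
  #include <stdint.h>

  typedef struct {
      double ewma_bytes_per_sec;
      int    initialized;
  } BwEstimator;

  static void bw_update(BwEstimator *bw, uint64_t bytes, double interval_sec)
  {
      const double alpha = 0.2;             /* smoothing factor, arbitrary */
      double sample = bytes / interval_sec;

      if (!bw->initialized) {
          bw->ewma_bytes_per_sec = sample;
          bw->initialized = 1;
      } else {
          bw->ewma_bytes_per_sec =
              alpha * sample + (1.0 - alpha) * bw->ewma_bytes_per_sec;
      }
  }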

Move XBZRLE to multifd

When you are doing inter-data-center migration, any help you can get is welcome, and in such cases XBZRLE could probably help. So, why is this here? (a) Move it to multifd; we can implement it the same way as zlib or zstd. (b) We need to test it with a bigger cache; 25% to 50% of RAM is not out of the question, and the current 64MB cache is a joke for current workloads. (c) We need to measure whether it actually helps.
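For context, the core of XBZRLE is to compare the cached old copy of a page with the new one and transmit only the differing runs. The sketch below is a simplified illustration of that idea, not QEMU's actual encoder:

  /* Simplified delta-run finder: records (offset, length) of changed bytes. */
  #include <stddef.h>
  #include <stdint.h>

  typedef struct { size_t off; size_t len; } DeltaRun;

  static size_t find_changed_runs(const uint8_t *old_page,
                                  const uint8_t *new_page, size_t page_size,
                                  DeltaRun *runs, size_t max_runs)
  {
      size_t n = 0, i = 0;

      while (i < page_size && n < max_runs) {
          if (old_page[i] == new_page[i]) {   /* XOR byte would be zero */
              i++;
              continue;
          }
          runs[n].off = i;
          while (i < page_size && old_page[i] != new_page[i]) {
              i++;
          }
          runs[n].len = i - runs[n].off;
          n++;
      }
      return n;
  }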

Improve migration bitmap handling

Split bitmap use. We always use all bitmaps (VGA, CODE & MIGRATION) regardless of what we are doing. We could improve this as follows:

  • VGA: only add it to VGA framebuffers
  • MIGRATION: We only need to allocate/handle it during migration.
  • CODE: Only needed with TCG, no need at all for KVM

VFIO relevant

VFIO migration is now supported, but many aspects of it are not yet mature. TBD. One complexity of VFIO migration is that it can be vendor specific, and the kernel driver supporting migration might be closed-source, which may mean there is not a lot the community can do.

Cleanups

Create a thread for migration destination

Right now it is a coroutine. It might be good to start using a thread model just like the source side, if possible (and we already have multifd recv threads).

Rewrite QEMUFile for migration

The QEMUFile interface is currently pretty ugly. For example, the qemu_put_*() APIs do not report faults; faults have to be detected explicitly with a separate qemu_file_get_error() or similar API. A rewrite of that interface has long been wanted, but how exactly to do it still needs to be explored.
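As a hypothetical sketch of the direction (all names invented), each writer could return a negative errno so callers no longer need a separate polling call:

  /* Invented interface, for illustration only. */
  #include <stddef.h>
  #include <stdint.h>

  typedef struct QEMUFileV2 QEMUFileV2;    /* hypothetical type */

  int qemufile_put_byte(QEMUFileV2 *f, uint8_t v);      /* < 0 on error */
  int qemufile_put_be32(QEMUFileV2 *f, uint32_t v);     /* < 0 on error */
  int qemufile_put_buffer(QEMUFileV2 *f, const void *buf, size_t len);
  int qemufile_flush(QEMUFileV2 *f);                    /* < 0 on I/O error */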

Migration notifiers

We now have miscellaneous notifiers: postcopy has its own, and precopy passes MigrationState* into its notifier.

A cleanup can be done to (see the sketch after the list):

  • merge the two use cases as a generic migration notifier,
  • avoid passing in MigrationState*
  • allow notifiers to fail
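A hypothetical sketch of such a unified notifier (names invented) could look like this: a single callback type that receives an event instead of MigrationState* and returns a status the migration core can act on:

  /* Invented types, for illustration only. */
  #include <stdbool.h>

  typedef enum {
      MIG_EVENT_SETUP,
      MIG_EVENT_PRECOPY_DONE,
      MIG_EVENT_POSTCOPY_START,
      MIG_EVENT_FAILED,
  } MigEventType;

  /* Returning false would make the migration core fail/abort the migration. */
  typedef bool (*MigrationNotifier)(MigEventType event, void *opaque);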

Bugs

Detect errors when putting CPU registers during migration

This is a long-standing issue: we don't fail a migration even if e.g. ioctl(KVM_SET_XSAVE) fails. Failing to apply CPU state to KVM can cause mysterious guest issues later. Just imagine guest vCPUs being kicked off to run with corrupted CPU state. That is chaos. See this thread:

https://lore.kernel.org/all/20220617144857.34189-1-peterx@redhat.com/

We have already hit bugs caused by this, and noticed that the earlier fix was lost:

https://lore.kernel.org/all/ZQLOVjLtFnGESG0S@luigi.stachecki.net/

We need to revive it in one form or another.

Upstream reports

If you're looking for other migration bugs to fix, feel free to have a look at:

https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=Migration

Pick up whatever you want!