ToDo/LiveMigration

Documentation

Migration documentation is far from mature, and there is a lot to do here. For example, we should have a good page describing each of the migration features and parameters. Maybe we can have something generated from qapi/migration.json; see the qapi-doc extension and qapidoc.py. We may not want to generate documentation for the included JSON files, only for the primitives introduced in migration.json.

We should have a unified place for migration documentation in general. The plan for now is to make it:

https://qemu.org/docs/master/devel/migration

We may want to move some wiki pages there and fill in the gaps. Currently this page is the sole place to keep migration ToDo items.

New Features

Postcopy 1G page support

Postcopy works functionally with 1G (x86_64) pages, but it is mostly unusable in practice if each page needs to be requested in 1G chunks. We need a way to fault the pages in at smaller sizes, like 4K on x86_64.

Previous RFC work:

https://lore.kernel.org/r/20230117220914.2062125-1-peterx@redhat.com

That work requires a kernel feature called "hugetlb HGM", which was not accepted upstream due to multiple complexities. It may or may not go in that direction anymore, because KVM plans to propose a separate userfault interface in light of the evolution of guest_memfd. See:

https://lore.kernel.org/r/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com
https://lore.kernel.org/r/20240103174343.3016720-1-seanjc@google.com

Whichever kernel interface is eventually merged to support 1G pages, QEMU will need to support it to finally enable 1G postcopy.

Migration handshake

Currently QEMU relies on Libvirt to set up the same migration capabilities / parameters properly on both sides. This may not be needed in the future if the source and destination QEMU can have a conversation with each other, and the handshake could cover more than that.

A summary of the procedures for such a handshake may include, but is not limited to:

  • Feature negotiation. We'll probably have a list of features, so that src can fetch it from dst and see what to enable and what to disable.
  • Sync migration configurations between src/dst and decide the migration configuration. E.g., dst may lack some feature, in which case src should know what's missing. Once this phase passes, both QEMUs should have the same migration setup and be good to proceed.
  • Sync channels used for migration (mostly covering multifd / postcopy preempt). Channels should be managed by the handshake core and named properly, so we know which channel is for whom.
  • Check the device tree on both sides, etc., to make sure the migration is applicable. E.g., we should fail early and clearly on any device mismatch.
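
The negotiation and early-failure steps listed above could look roughly like the following sketch. It is only an illustration under stated assumptions: MigrationSetup and negotiate() are hypothetical names, not QEMU types or functions, and the device check is reduced to comparing two plain lists.

```python
from dataclasses import dataclass, field


@dataclass
class MigrationSetup:
    """Hypothetical view of one side's migration setup (not a QEMU type)."""
    supported: set[str]                                # features this binary implements
    requested: set[str] = field(default_factory=set)   # features the user asked for
    devices: list[str] = field(default_factory=list)   # simplified device list


def negotiate(src: MigrationSetup, dst: MigrationSetup) -> set[str]:
    """Return the feature set both sides agree on, or fail early and clearly."""
    missing = src.requested - dst.supported
    if missing:
        # src learns exactly what dst is missing before any data is sent
        raise ValueError(f"destination lacks features: {sorted(missing)}")
    if src.devices != dst.devices:
        raise ValueError("device mismatch between source and destination")
    return src.requested & src.supported & dst.supported
```

With such a step in place, both sides would proceed only with the returned set, so a mismatch shows up as a clear error before migration starts rather than as a failure in the middle of the stream.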

KTLS support

Currently we do TLS live migration leveraging gnutls. Is it possible to support KTLS? How can we further optimize TLS migration? Is it possible to leverage hardware acceleration? Etc.

Multifd+Postcopy

Currently the two features are not compatible, because postcopy requires pages to be updated atomically, while multifd is so far designed to receive pages directly into guest memory buffers.

Another issue is that the multifd code is still not well tested with postcopy VM switchover, so there can be bugs here and there on switchover, even though that shouldn't be a major design issue.

Adding support for multifd+postcopy in some form should increase migration bandwidth for postcopy. Currently, during the postcopy phase we can mainly use only one channel for migrating pages, and that can easily bottleneck on a single CPU (postcopy preempt mode is an exception, but the extra channel is for now only used for servicing faulted pages, so the bandwidth may not be improved by it).

Live snapshot for file memory (e.g. shmem)

We have background live snapshot support for anonymous memory. We don't yet support shared memory like shmem, because userfaultfd-wp did not support file-backed memory when it was introduced. Now userfaultfd-wp supports shared memory, and we can consider adding shmem support for live snapshot.

Note that multi-process shmem support will need extra work, as QEMU will also require the other processes to trap their writes to guest memory.

Optimizations

CPU register put() errors for migration

After commit 7191f24c7f ("accel/kvm/kvm-all: Handle register access errors"), CPU put()s can start to fail for migration (where they used to silently succeed). However, that's not enough. There are at least three more things we can do:

(1) kvm_arch_put_registers() needs better error reporting

Currently only errno is reported, which is too coarse and not easy to investigate when it hits. At the very least, we need to know exactly which KVM ioctl() failed.

(2) Postcopy cpu put() errors

It is not yet verified that postcopy can fail gracefully on a cpu put() error. If it works, that'll be perfect; otherwise we'll need to fix it.

(3) Report errors upwards rather than exit()

Hopefully the error can be reported via QMP rather than crashing QEMU, even on dst (so that it finally shows up in the UI in some form, rather than only in a QEMU log that we'll need to collect). Something like what was proposed in:

https://lore.kernel.org/all/20220617144857.34189-1-peterx@redhat.com/

Device state concurrency

Device state save() and load() are currently done sequentially. It means QEMU has only one thread to fetch device states and dump them onto the migration stream one by one. This may not be ideal because:

  • There can be devices that contain an extremely large amount of data, like VFIO for a vGPU
  • There can be too many devices, considering a large VM with hundreds of vCPUs
  • Some get()/put() are just slow, and if one device is slow, it blocks all the other devices. We already observe some extremely slow get()s on CPUs, which can contribute a significant portion of VM migration downtime.

A concurrent model might be good in this case to allow device states to be migrated in parallel. One idea is to leverage the multifd threads, so that they can send not only RAM pages but also device states.
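
As a rough illustration of the concurrent model, the sketch below collects device states with a pool of worker threads while still emitting them in a fixed order. Everything in it is an assumption made for illustration: save_device() stands in for a slow per-device get(), the delays are made up, and a generic thread pool replaces the multifd threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def save_device(name: str, delay_s: float) -> bytes:
    """Hypothetical per-device save: the sleep stands in for a slow get()."""
    time.sleep(delay_s)
    return f"[{name}]".encode()


def save_sequential(devices):
    """Current model: one thread walks the devices one by one."""
    return b"".join(save_device(name, delay) for name, delay in devices)


def save_concurrent(devices, workers=4):
    """Sketch: fetch states in parallel, keep the stream order stable."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so a slow device no longer
        # delays the fetching of the others, only the final join.
        blobs = list(pool.map(lambda dev: save_device(*dev), devices))
    return b"".join(blobs)
```

In this sketch the resulting stream is identical to the sequential one; the difference is that the total time approaches that of the slowest device rather than the sum over all devices.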

Device state downtime analysis and accountings

Currently live migration estimates the downtime _without_ considering device states. VFIO is a special case because it's migrated even during iterative sections, and it periodically reports its own pending device state so that it can be accounted for when QEMU estimates the downtime.

However, almost no other devices do what VFIO does. Most devices currently use a simplified model, which assumes that their state is trivial to migrate and therefore does not account for it as part of the downtime. That may not always be the case.

It might be useful to investigate some common cases where device state can take time to migrate. A few examples:

What if there are one thousand vCPUs? Even if saving/loading one vCPU does not take a lot of time, a thousand of them should be accounted for as part of the downtime.

Or, what if a device's pre_save()/post_load() takes a lot of time? We seem to have observed that in virtio devices already, where loading can take a relatively long time. See also the section "Optimize memory updates for non-iterative vmstates".
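
To make the accounting idea concrete, here is a minimal sketch of a downtime estimate that adds per-device costs on top of the remaining RAM. The function name and all numbers are hypothetical and only show how small per-vCPU costs add up; this is not QEMU's estimation logic.

```python
def estimate_downtime_ms(pending_ram_bytes, bandwidth_bytes_per_s, device_costs_ms):
    """Hypothetical estimate: time to send remaining RAM plus device state costs."""
    ram_ms = pending_ram_bytes / bandwidth_bytes_per_s * 1000
    return ram_ms + sum(device_costs_ms)


# Made-up figures: 128 MiB of dirty RAM left, 1 GiB/s link,
# and 1000 vCPUs costing 0.25 ms each to save and load.
ram_only = estimate_downtime_ms(128 << 20, 1 << 30, [])
with_vcpus = estimate_downtime_ms(128 << 20, 1 << 30, [0.25] * 1000)
print(ram_only, with_vcpus)
```

With these made-up inputs the device part (250 ms) is twice the RAM part (125 ms), which is the kind of gap the current model would not see.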

Optimize memory updates for non-iterative vmstates

There's a potential for speeding up device loads by optimizing QEMU memory updates. See this thread for more information:

https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com

The work was proposed but not merged due to insufficient review. Still, the direction seems fine, and it is a possible path for future optimizations to shrink the downtime.

Optimize postcopy downtime

For postcopy, there's currently one extra step to send the dirty bitmap during switchover, which is extra overhead compared to precopy migration. The bitmap is normally small enough to be ignored, but that may not be the case for a very large VM: 1TB of memory corresponds to a ~32MB bitmap (with 4K pages). Meanwhile, receiving the bitmap also requires punching holes in the "just dirtied" pages, which also takes time.
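
The ~32MB figure can be checked with a few lines of arithmetic, assuming one bit per page and a 4K page size (dirty_bitmap_bytes() is a helper written for this note only):

```python
def dirty_bitmap_bytes(ram_bytes: int, page_size: int = 4096) -> int:
    """Size of a one-bit-per-page bitmap, rounded up to whole bytes."""
    pages = -(-ram_bytes // page_size)   # ceiling division
    return -(-pages // 8)


one_tib = 1 << 40
print(dirty_bitmap_bytes(one_tib) // (1 << 20), "MiB")   # 1 TiB of 4K pages
print(dirty_bitmap_bytes(12 * one_tib) // (1 << 20), "MiB")
```

The bitmap grows linearly with guest memory, so a 12TB guest needs twelve times as much.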

One thing we can do here is allow the destination to start running even _before_ the bitmap is received. Some new interface will be needed to allow the destination QEMU to ask the source QEMU "whether this page is the latest". Meanwhile, MINOR userfault will be needed instead of MISSING userfault, so that the destination QEMU can trap accesses to pages even if the page already exists. It also means that with current Linux (as of today, v6.7) anonymous pages cannot be supported in this use case, as MINOR fault only supports VM_SHARED mappings.

Optimize migration bandwidth calculation

Bandwidth estimation is very important to migration, because it is the main factor we use to decide when to switch the VM over from source to destination. However, the estimate can sometimes be so wrong that in QEMU 8.2 we introduced a new migration parameter ("avail-switchover-bandwidth") just to allow an admin to specify that value when needed. It is not easy to specify that value correctly, though.

There can be other ways to remedy this case, e.g., we can change the estimation logic to provide an average bandwidth.
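
One possible shape for an averaged estimate is an exponentially weighted moving average over per-interval samples. This is only a sketch of that general technique, not what QEMU does today; the class name and the alpha value are assumptions.

```python
class BandwidthEstimator:
    """Hypothetical smoothed bandwidth estimate (bytes per second)."""

    def __init__(self, alpha: float = 0.25):
        self.alpha = alpha      # weight of the newest sample
        self.average = None     # no estimate until the first sample

    def update(self, bytes_sent: int, seconds: float) -> float:
        sample = bytes_sent / seconds
        if self.average is None:
            self.average = sample
        else:
            # New samples move the estimate only part of the way, so a
            # single unusually slow or fast interval has limited effect.
            self.average = self.alpha * sample + (1 - self.alpha) * self.average
        return self.average
```

A smaller alpha gives a steadier estimate at the cost of reacting more slowly to real changes in available bandwidth, which is the trade-off any averaging scheme would have to tune.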

Move XBZRLE to multifd

When you are doing inter-data-center migration, anything that can help is welcome. In these cases xbzrle 'could' probably help. So, why is this here?

  • Move it to multifd; we can implement it the same way as zlib or zstd.
  • We need to test it with more cache. I guess that 25% to 50% of RAM is not out of the question. The current cache of 64MB is a joke for current workloads.
  • We need to measure that it helps.

Improve migration bitmap handling

Split bitmap use. We always use all bitmaps (VGA, CODE & MIGRATION), independently of what we are doing. We could improve this with:

  • VGA: only add it to VGA framebuffers
  • MIGRATION: We only need to allocate/handle it during migration.
  • CODE: Only needed with TCG, no need at all for KVM

Thread-ify dirty bitmap scanning

This is only an issue on the source host, because it scans the dirty bitmap during precopy to send whichever pages are dirty. We clear the bits that are dirty, then send the pages.

This procedure does not scale with the size of the guest, especially from a memory POV. When the guest memory is large enough, and especially when the bitmap is sparsely set (i.e., mostly zeros), scanning the bitmap can easily become the bottleneck: the migration thread keeps spinning over the bitmap looking for rare 1s. We may want some way to thread-ify this procedure so that we can scan the bitmap concurrently.
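
The partitioning idea can be sketched as follows: split the bitmap into contiguous ranges and let each worker scan its own range. The function names are hypothetical, and the sketch only shows the work split; Python threads do not give a real speedup for this kind of loop, whereas native threads would.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(bitmap: bytes, start_byte: int, end_byte: int) -> list[int]:
    """Return the dirty page indexes found in bitmap[start_byte:end_byte]."""
    dirty = []
    for i in range(start_byte, end_byte):
        byte = bitmap[i]
        if byte == 0:
            continue                      # sparse bitmap: most bytes are zero
        for bit in range(8):
            if byte & (1 << bit):
                dirty.append(i * 8 + bit)
    return dirty


def scan_parallel(bitmap: bytes, workers: int = 4) -> list[int]:
    """Scan disjoint ranges of the bitmap in separate workers."""
    if not bitmap:
        return []
    step = -(-len(bitmap) // workers)     # ceiling division
    ranges = [(s, min(s + step, len(bitmap))) for s in range(0, len(bitmap), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: scan_chunk(bitmap, *r), ranges)
        return [page for part in parts for page in part]
```

Because the ranges are disjoint, the workers in this sketch need no locking among themselves; what happens to each dirty page once found is a separate design question.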

We may or may not want to introduce yet another pool of threads just to do this. If we choose not to, logically we can still rely on the multifd threads to achieve the concurrency, but then something needs to be done to enhance the capability of the multifd thread pools:

  • It needs to start taking workloads that have nothing to do with either the page[] array (vanilla multifd) or IOV[]
  • It needs a way to further enqueue a dirty page found during the scanning, by either:
    • Enqueuing this "send this page" request back into the multifd thread pools again, or,
    • Sending the pages in the thread that is doing the scanning.

It would also be good to verify this issue first. It has been observed that for huge VMs (12TB, for example) the downtime can be much larger than expected, and also that NIC bandwidth during the switchover is much lower than the throughput before the switchover, even if multifd is enabled.

VFIO relevant

VFIO migration is now supported, but a lot of things are not mature for VFIO migration.

The migration handlers of the VFIO subsystem in QEMU are device-agnostic. VFIO migration support for a device relies on a vfio-pci variant driver which implements the required ops for migration and dirty tracking. This driver implementation itself relies on firmware calls, and the availability of HW resources can be critical for migration to succeed. This introduces new constraints on the QEMU migration subsystem, because some code paths that were previously considered error-free no longer are.

Kernel 6.9 has support for hisilicon, mlx5, pds devices. Intel QAT should be queued for 6.10. Currently, NVIDIA vGPU uses the VFIO mdev framework to support migration. Newer SR-IOV based vGPUs will use a vfio-pci variant driver in the future.

One complexity of VFIO migration is that it can be vendor specific, and the kernel driver supporting migration might be closed-source. So there may not be a lot the community can do.

Cleanups

Create a thread for migration destination

Right now it is a coroutine. It might be good to start using a thread model just like the src if possible (and we already have multifd recv threads).

Rewrite QEMUFile for migration

The QEMUFile interface is currently pretty ugly. For example, none of the qemu_put_*() APIs allow fault reporting; faults need to be detected explicitly using a separate qemu_file_get_error() or similar API. Another issue is that migration currently uses two QEMUFile objects on each side to represent the main migration channel, with both objects connected to the same QIOChannel underneath. However, the two QEMUFiles are actually bound together internally: for example, qemu_fclose() on the first QEMUFile object will also close the QIOChannel of the QEMUFile for the other direction, which can be unwanted.

A rewrite of that interface has long been wanted, but how exactly to do it needs more exploration. Quoting Daniel on a potential direction, which one can consider [1]:

 In the ideal world IMHO, QEMUFile would not exist at all, and we would
 have a QIOChannelCached that adds the read/write buffering above the
 base QIOChannel.

[1] https://lore.kernel.org/r/ZcC5QTO3tmt9gaCf@redhat.com
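
To illustrate the general shape of "buffering layered above a base channel", here is a small sketch. It does not reflect any actual QIOChannelCached design; CachedChannel is a hypothetical name, and the raw channel is assumed to be any object with a write() method that raises OSError on failure.

```python
import io


class CachedChannel:
    """Hypothetical write buffer layered on top of a raw channel."""

    def __init__(self, raw, capacity: int = 32 * 1024):
        self.raw = raw
        self.capacity = capacity
        self.buf = bytearray()

    def write(self, data: bytes) -> None:
        self.buf += data
        if len(self.buf) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            # Any OSError from the raw channel propagates to the caller
            # here, instead of being stored for a later get_error() call.
            self.raw.write(bytes(self.buf))
            self.buf.clear()


raw = io.BytesIO()                # stands in for the underlying channel
chan = CachedChannel(raw, capacity=4)
chan.write(b"ab")                 # stays in the buffer
chan.write(b"cd")                 # reaches capacity, flushed to raw
```

The point of the sketch is the error path: a failure surfaces from the call that triggered the I/O, so callers do not need a second API to find out that something went wrong.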

Multifd threading

Multifd could benefit from a more standardized thread model like a thread pool or another abstraction already implemented in the QEMU codebase. Before we can start looking at that there are some cleanups and consolidation that need to happen.

First, there's the matter of the multifd threads doing accounting (e.g. total_normal_pages) and making a copy of the data (p->normal). These should not be the responsibility of the worker thread, but should either be on the (multifd) client side or be converted to raw bytes. The multifd thread should receive opaque data and send it without knowledge of its contents. The same goes for the packet header: multifd should obtain that information in opaque form.

A second step would be to finish moving the operations that are done on the migration thread into multifd, such as zero page detection, postcopy and any new features currently in flight.

[these^ two steps are in progress. Give us a ping on the mailing list for more information]

Finally, these cleanups should already give us enough information about the requirements of multifd to figure out whether the existing thread models in QEMU are adequate for our needs or if we need to build something from scratch.

Migration cancel concurrency

We could take a closer look at the ramifications of having migrate_cancel running concurrently with the rest of migration. That routine has side-effects which are not documented, and aside from the BQL, not explicitly protected in the code.

It is also unclear what communicates the cancelling to the rest of the code. Is it changing the state (racy, see commit 6f2b811a61)? Or is it shutting down the migration files (erases the distinction between clean cancel and error)? In any case, this mechanism could be improved by using a specific flag to communicate cancelling and a separate routine for cleanup/poking threads.
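
A minimal sketch of the "specific flag plus separate cleanup routine" idea follows. It assumes nothing about QEMU's actual code: MigrationJob, its states and the send callback are all hypothetical, and an Event object plays the role of the cancel flag.

```python
import threading


class MigrationJob:
    """Hypothetical job with an explicit cancel flag and one cleanup path."""

    def __init__(self, send):
        self.send = send                          # callable that may raise OSError
        self.cancel_requested = threading.Event()
        self.state = "active"
        self.cleaned_up = False

    def cancel(self):
        # Only sets the flag; it does not change state or close channels,
        # so a cancel cannot be mistaken for an I/O error.
        self.cancel_requested.set()

    def run(self, chunks):
        try:
            for chunk in chunks:
                if self.cancel_requested.is_set():
                    self.state = "cancelled"
                    return
                self.send(chunk)
            self.state = "completed"
        except OSError:
            self.state = "failed"
        finally:
            self._cleanup()                       # same routine for every outcome

    def _cleanup(self):
        self.cleaned_up = True
```

In this shape the final state records why the migration stopped, which keeps a clean cancel distinct from an error even though both go through the same cleanup.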

Tests

Device migration stream test framework

A major source of migration bugs is device changes to VMSDs that break migration from an old QEMU binary to a new one.

Such changes may not always copy the migration maintainers, and due to the limited bandwidth of the migration maintainers it may not be possible to review all of them anyway.

Currently, the major migration test we have in QEMU is still focused on the migration framework in general. It has no coverage of the migration stream compatibility of specific devices, so migration can still break when a specific device is configured in some VM setup, even if the migration tests all pass.

It would be great to have a device test framework in some form, so that there will be some coverage of, for example, a list of devices migrated from an older QEMU version to the new QEMU version (for example, the current branch), or even bi-directional migrations to allow backward migrations.

Note that even such a test framework may not cover 100% of device migration. First, it may be extremely hard (if not impossible) to cover all the devices, since some devices can be special in ways that make them unsuitable for the proposed test framework. Second, a device's VMSD stream can depend on the device state, so it may also change depending on guest behavior (e.g., the device VMSD may differ between the device being idle and active). However, the framework would still cover the major usage scenarios.

Bugs

Upstream reports

If you're looking for other migration bugs to fix, feel free to have a look at:

https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=Migration

Pick up whatever you want!