Documentation

Migration documentation is far from mature. There is a lot to do here. For example, we should have a good page describing each of the migration features and parameters. Maybe we can generate something from qapi/migration.json; see the qapi-doc extension and qapidoc.py. We may not want to generate documentation for included JSON files, but only for the primitives introduced in migration.json.

We should have a unified place for migration documentation in general. The plan for now is to make it:

https://qemu.org/docs/master/devel/migration

We may want to move some wiki pages there and fill in the gaps. Currently this page is the sole place where migration ToDo items are kept.

New Features

Postcopy 1G page support

Postcopy functionally works with 1G pages (x86_64); however, it is mostly unusable in practice when each page needs to be requested in 1G chunks, because the page fault latency is too high. We need a way to fault pages in at smaller sizes, such as 4K on x86_64.

Previous work

There is an old RFC series that should make 1G pages work with postcopy in 4K chunks:

https://lore.kernel.org/r/20230117220914.2062125-1-peterx@redhat.com

That work requires a kernel feature called "hugetlb HGM". Unfortunately, it was not accepted upstream due to multiple complexities. Both the kernel and QEMU sides of the work are postponed indefinitely.

Current work

In 2023, KVM introduced a VM-specific, memfd-like file descriptor type called guest-memfd. It works much like memfd, but it is only used in a virtual machine context.

Guest-memfd was initially designed mostly for confidential computing contexts. However, there are signs that it can also be used to support generic virtual machines. Either way, there is both an opportunity and a use case for supporting huge pages in guest-memfd. With that, VM huge page users may also move from hugetlbfs to guest-memfd.

The initial version of 1G page support for guest-memfd (based on the pKVM use case) was proposed as an RFC here:

https://lore.kernel.org/r/cover.1726009989.git.ackerleytng@google.com

Based on the pKVM work, there are two testing branches (one each for Linux and QEMU) that can boot a VM using guest-memfd with or without huge pages. The branches are PoC as of now, and can be found at:

https://github.com/xzpeter/linux/commits/peter-gmem-v0.2/
https://github.com/xzpeter/qemu/commits/peter-gmem-v0.2/

To start a VM using guest-memfd as the backend, one can create a guest-memfd in fully shared mode by specifying "guest-memfd=on" in the memory-backend-memfd object:

-object memory-backend-memfd,id=mem,size=${mem},share=on,guest-memfd=on

Then, to enable huge pages, one can use the same hugetlb parameters that the memory-backend-memfd object already takes, like:

-object memory-backend-memfd,id=mem,size=${mem},share=on,guest-memfd=on,hugetlb=on,hugetlbsize=1G

The above will create a fully shared guest-memfd object with 1G huge pages.

Guest-memfd manages the huge page pool the same way as hugetlbfs, so before starting the VM, the correct number of huge pages needs to be allocated and reserved, or QEMU may fail to allocate hugetlb memory.
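As a rough sizing aid, the number of huge pages to reserve can be computed from the guest memory size. The sketch below is only illustrative: the helper name and the round-up policy are assumptions, not something taken from QEMU or the PoC branches.

```python
def hugepages_needed(mem_bytes: int, hugepage_bytes: int = 1 << 30) -> int:
    """Number of huge pages needed to back mem_bytes, rounded up."""
    if mem_bytes <= 0 or hugepage_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-mem_bytes // hugepage_bytes)  # ceiling division


GiB = 1 << 30
MiB = 1 << 20
print(hugepages_needed(16 * GiB))           # 1G pages for a 16G guest
print(hugepages_needed(16 * GiB, 2 * MiB))  # 2M pages for the same guest
```

A 16G guest therefore needs 16 pages of 1G, or 8192 pages of 2M, to be available in the pool before QEMU starts.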

Next steps

allow guest-memfd huge pages to be split and collapsed
allow userfaultfd to register on top of a guest-memfd descriptor using offset ranges
support 4K granule faults on guest-memfd huge pages
more

Migration handshake

Currently QEMU relies on Libvirt to set up the same migration capabilities / parameters properly on both sides. This may not be needed in the future if the source and destination QEMUs can have a conversation with each other, which could also enable more than that.

A summary of the procedures for such a handshake may include, but is not limited to:

Feature negotiation. We'll probably have a list of features, so the src can fetch it from the dst and see what to enable and what to disable.

Sync migration configurations between src/dst and decide on the migration configuration. E.g., the dest may lack some feature, in which case the src should know what is missing. Once we pass this phase, we should be sure that both QEMUs have the same migration setup and are good to proceed.

Sync the channels used for migration (this mostly covers multifd / postcopy preempt). Channels should be managed by the handshake core and named properly, so we know which channel is for whom.

Check device tree on both sides, etc., to make sure the migration is applicable. E.g., we should fail early and clearly on any device mismatch.
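To make the idea concrete, here is a minimal sketch of the negotiation and early-failure steps. Everything in it is hypothetical: the `Endpoint` type, the `negotiate()` helper and the way features and devices are represented are assumptions for illustration, not an existing QEMU interface.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical description of one side of the migration."""
    features: set[str]
    devices: list[str] = field(default_factory=list)


def negotiate(src: Endpoint, dst: Endpoint, required: set[str]) -> set[str]:
    """Return the features to enable, or fail early and clearly."""
    missing = required - dst.features
    if missing:
        raise RuntimeError(f"destination lacks features: {sorted(missing)}")
    if src.devices != dst.devices:
        raise RuntimeError("device mismatch between source and destination")
    # Enable only what both sides understand.
    return src.features & dst.features


src = Endpoint({"multifd", "postcopy-ram", "xbzrle"}, ["virtio-net", "virtio-blk"])
dst = Endpoint({"multifd", "postcopy-ram"}, ["virtio-net", "virtio-blk"])
enabled = negotiate(src, dst, required={"multifd"})
print(sorted(enabled))
```

A real handshake would run over the migration channel and would also have to agree on channel naming; the sketch only shows the decision logic.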

Bi-directional migration support with VMSD versioning

This may or may not be a good idea, and the devil may be in the details, but let's keep a record of it in case it works out.

QEMU supports VMSD versioning, which allows defining structures with versioned fields. Such a structure should contain enough information to express "which version needs which fields", and this should apply on both sides. It means that with such versioning, logically we can migrate forward / backward, as long as the two QEMUs understand each other and the source QEMU can conditionally send whatever the destination QEMU supports.

Currently VMSD versioning does not work for bi-directional migration, only for forward migration (old->new QEMU), because the source QEMU always sends the latest version of the VMSD fields, so an old QEMU would be surprised to receive a newer version of a VMSD and would bail out.

If the source QEMU knows that the destination QEMU is an old one which only supports an old version of a VMSD, it can decide to send only the VMSD fields that are defined in _that_ version. Logically this could mean that we can do bi-directional migration with VMSD versioning alone, and this could free QEMU from relying on machine types to maintain strict migration compatibility.
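The field selection could look roughly like the sketch below. It is a toy model only: the `Field` type, the device layout and the `fields_to_send()` helper are made up for illustration and do not mirror QEMU's actual VMState structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    version_id: int  # first VMSD version that carries this field


def fields_to_send(fields: list[Field], dst_version: int) -> list[str]:
    """Pick only the fields defined at or before the destination's version."""
    return [f.name for f in fields if f.version_id <= dst_version]


# Hypothetical device whose VMSD grew one field in v2 and another in v3.
vmsd = [Field("regs", 1), Field("irq_state", 2), Field("ext_caps", 3)]

print(fields_to_send(vmsd, 3))  # new -> new: everything
print(fields_to_send(vmsd, 2))  # new -> old: skip what v2 does not know
```

The source needs `dst_version` for every device before it starts sending, which is why this depends on a handshake.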

This will rely on the above handshake work, because obviously the source QEMU will first need to learn from the destination QEMU about the device hierarchy and the VMSD versions for each device.

KTLS support

Currently we do TLS live migration leveraging gnutls. Is it possible to support KTLS? How can we further optimize TLS migration? Is it possible to leverage hardware acceleration? Etc.

Multifd+Postcopy

Currently the two features are not compatible, because postcopy requires pages to be updated atomically, while multifd has so far been designed to receive pages directly into guest memory buffers.

Another issue is that the multifd code is still not well tested with postcopy VM switchover, so there can be bugs here and there on switchover, even though that should not be a major design issue.

Adding support for multifd+postcopy in some form should increase the migration bandwidth for postcopy. Currently, during the postcopy phase we can mainly use only one channel for migrating pages, and that can easily bottleneck on a single CPU (postcopy preempt mode is an exception; however, the extra channel is for now only used for servicing faulted pages, so it may not improve bandwidth). We may need to teach multifd recv threads to use UFFDIO_COPY.

File-based guest memory optimization

This will need to be built on top of userfaultfd-minor faults first.

When multifd is ready to work with postcopy, we could further optimize file-based recv by using a second mapping for all guest pages and feeding pointers into the 2nd mapping to the recvmsg() syscalls. We need to make sure the 1st mapping of guest memory is always protected by minor faults, so that the vCPUs cannot see the updates from recvmsg(). With that, multifd recv threads can avoid using a temp buffer and UFFDIO_COPY; instead, they can recv data directly into the guest page cache and then, once the recv completes, issue UFFDIO_CONTINUE to install the page table entries for the 1st mapping.

Live snapshot for file memory (e.g. shmem)

We have background live snapshot support for anonymous memory. We do not yet support shared memory like shmem, because userfaultfd-wp did not support file memory when it was introduced. Now userfaultfd-wp supports shared memory, and we can consider adding shmem support.

Note that multi-process shmem support will need extra work, as QEMU will also require the other processes to trap their writes to guest memory.

Allow QMP command "migrate[_incoming]" to take capabilities and parameters

See:

https://lists.nongnu.org/archive/html/qemu-devel/2024-10/msg04838.html

It is cleaner for user applications if both "migrate" and "migrate_incoming" can take capabilities and parameters, so that they are set together when invoking the migration (or incoming migration) request.

After discussion with Dan: it may not have a direct benefit for Libvirt, but it could make the QEMU interface cleaner, so that after some years we may have a chance to obsolete the old way of setting capabilities / parameters.

Optimizations

Avoid page population when page is not populated

See the discussion here, which should contain the whole idea on both sides:

https://gitlab.com/qemu-project/qemu/-/issues/2839

On source side

On the source side, the migration thread scans guest pages even if they are missing. The scanning itself can cause guest pages to be populated.

One optimization we could do is to use mincore() to detect whether a page exists at all. For this, we need a kernel change to set bit 1 of mincore()'s result array to reflect that a swap entry exists (even if the page is in neither the page cache nor the swap cache, i.e. not resident in RAM). With that, the migration thread can avoid touching any page that mincore() reports as completely missing, and instead send the ZERO flag directly. This should work for both anonymous memory and shmem.

On dest side

In the precopy stage, when receiving a zero page, instead of scanning the page for zeros and calling memset(), we could simply discard the page. Postcopy should be fine (it already uses UFFDIO_ZEROPAGE if possible), but this is worth double-checking.

Put cpu register errors for migrations

After commit 7191f24c7f ("accel/kvm/kvm-all: Handle register access errors"), CPU put()s can now fail during migration (where they used to silently succeed). However, that is not enough. There are at least three more things we can do:

(1) kvm_arch_put_registers() needs to report better than now

Currently only errno is reported, which is too coarse and not easy to investigate when it hits. We need to know at least which exact KVM ioctl() failed.

(2) Postcopy cpu put() errors

It has not yet been verified that postcopy can fail gracefully on a cpu put() error. If it works, that is perfect; otherwise we will need to fix it.

(3) Report error upwards rather than exit()

Hopefully the error can be reported via QMP rather than crashing QEMU, even on the dst (so that it finally shows up in the UI in some form, rather than in a QEMU log that we would need to collect). Something like what was proposed in:

https://lore.kernel.org/all/20220617144857.34189-1-peterx@redhat.com/

Device state concurrency

Device state save() and load() are currently done sequentially. This means QEMU has only one thread to fetch device states and dump them onto the migration stream one by one. This may not be ideal because:

There can be devices that contain an extremely large amount of data, like VFIO for a vGPU
There can be too many devices, considering a large VM with hundreds of vCPUs
Some get()/put() are just slow, and if one device is slow, it blocks all the other devices. We have already observed some extremely slow get()s on CPUs, which can contribute a significant portion of VM migration downtime.

A concurrent model might be good in this case, to allow device states to be migrated in parallel. One idea is to leverage the multifd threads, so that they send not only RAM pages but also device states.
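The following sketch shows the general shape of such a model with a plain thread pool. It is an assumption-laden toy: `save_device()` stands in for a device save() handler, the sleep mimics a slow get(), and none of this reflects how multifd would actually carry device state.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def save_device(name: str, cost: float) -> bytes:
    """Stand-in for a device save() handler; sleeps to mimic a slow get()."""
    time.sleep(cost)
    return f"{name}-state".encode()


def save_all(devices: dict[str, float], workers: int) -> dict[str, bytes]:
    """Save every device state using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(save_device, name, cost)
                   for name, cost in devices.items()}
        # One slow device no longer delays the start of the others.
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical devices with made-up save costs in seconds.
devices = {"cpu0": 0.01, "cpu1": 0.01, "vfio-gpu": 0.03, "virtio-net": 0.01}
states = save_all(devices, workers=4)
print(sorted(states))
```

With one worker the total time is the sum of all costs; with enough workers it approaches the cost of the slowest device, which is the effect the concurrent model is after.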

Device state downtime analysis and accountings

Currently live migration estimates the downtime _without_ considering device states. VFIO is a special case because it is migrated even during iterative sections, and it reports its own pending device state periodically so that it can be accounted for when QEMU estimates the downtime.

However, 99% of devices do not do what VFIO does. The current model is a simplified one for most devices: it assumes that most device states are trivial to migrate and so are not accounted for as part of downtime. However, that may not always be the case.

It might be useful to find a way to investigate some common cases where device state can take time. A few examples:

What if there are one thousand vCPUs? Even if saving/loading one vCPU does not take much time, a thousand of them should be accounted for as part of downtime.

What if a device's pre_save()/post_load() takes a lot of time? We seem to have observed that in virtio devices already, where loading can take a relatively long time. See also the section "Optimize memory updates for non-iterative vmstates".

Optimize memory updates for non-iterative vmstates

There is potential for speeding up device loads by optimizing QEMU memory updates. One can refer to this thread for more information:

https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com

The work was proposed but not merged due to lack of review. Still, the direction seems fine, and it is a possible path for future optimizations to shrink the downtime.

Optimize postcopy downtime

For postcopy, there is currently one extra step of sending the dirty bitmap during switchover, which is extra overhead compared to precopy migration. The bitmap is normally small enough to be ignored, but that may not be the case for a very large VM: with 4K pages, 1TB of memory corresponds to a ~32MB bitmap. Meanwhile, receiving the bitmap also requires punching holes in the "just dirtied" pages, which also takes time.
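The bitmap size follows directly from one bit per guest page. A short worked calculation, assuming 4K pages (the helper name is made up for illustration):

```python
def dirty_bitmap_bytes(ram_bytes: int, page_size: int = 4096) -> int:
    """Size of a one-bit-per-page dirty bitmap, in bytes."""
    pages = -(-ram_bytes // page_size)  # round up to whole pages
    return -(-pages // 8)               # round up to whole bytes


TiB = 1 << 40
MiB = 1 << 20
print(dirty_bitmap_bytes(TiB) // MiB)       # 32
print(dirty_bitmap_bytes(12 * TiB) // MiB)  # 384
```

So the bitmap grows linearly with guest memory: 32MB per 1TB, which is no longer negligible for multi-terabyte guests.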

One thing we can do here is to allow the destination to start running even _before_ the bitmap is received. A new interface will be needed so that the destination QEMU can ask the source QEMU whether a given page is the latest version. Meanwhile, MINOR userfaults will be needed instead of MISSING userfaults, so that the destination QEMU can trap accesses to pages even if they already exist. It also means that with current Linux (as of today, v6.7) anonymous pages cannot be supported in this use case, as MINOR faults only support VM_SHARED mappings.

Optimize migration bandwidth calculation

Bandwidth estimation is very important to migration, because it is the main factor used to decide when to switch the VM over from source to destination. However, the estimate can sometimes be so wrong that in QEMU 8.2 we introduced a new migration parameter ("avail-switchover-bandwidth") just to allow an admin to specify that value when needed. However, it is not easy to specify that value correctly.

There can be other ways to remedy this case, e.g., we can change the estimation logic to provide an average bandwidth.
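One possible reading of "average bandwidth" is a sliding-window average over recent transfer samples, rather than relying on the most recent sample alone. The sketch below illustrates that idea only; the class, the window size and the sample values are assumptions, not QEMU's current estimation logic.

```python
from collections import deque


class BandwidthEstimator:
    """Average throughput over the last `window` samples (hypothetical)."""

    def __init__(self, window: int = 8):
        self.samples = deque(maxlen=window)  # (bytes, seconds) pairs

    def add(self, nbytes: int, seconds: float) -> None:
        if seconds <= 0:
            raise ValueError("interval must be positive")
        self.samples.append((nbytes, seconds))

    def bytes_per_sec(self) -> float:
        total_bytes = sum(b for b, _ in self.samples)
        total_secs = sum(s for _, s in self.samples)
        return total_bytes / total_secs if total_secs else 0.0


def expected_downtime(pending_bytes: int, bw: float) -> float:
    """Seconds needed to flush pending_bytes at bandwidth bw."""
    return pending_bytes / bw


est = BandwidthEstimator()
for nbytes in (50_000_000, 50_000_000, 5_000_000):  # last interval was idle-ish
    est.add(nbytes, 0.5)

last_only = 5_000_000 / 0.5
print(expected_downtime(35_000_000, last_only))            # from last sample
print(expected_downtime(35_000_000, est.bytes_per_sec()))  # from the average
```

A single quiet interval makes the last-sample estimate look far worse than the averaged one, which is the kind of misjudgement that can delay or wrongly trigger switchover.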

Move XBZRLE to multifd

For inter-data-center migration, anything that helps is welcome, and XBZRLE could probably help in this case. Things to do:

a. Move it to multifd. We can implement it the same way as zlib or zstd.
b. Test it with a larger cache. 25% to 50% of RAM is probably not out of the question; the current 64MB cache is far too small for current workloads.
c. Measure that it actually helps.

Improve migration bitmap handling

Split bitmap use. We always use all bitmaps: VGA, CODE & Migration, independently of what we are doing. We could improve it with:

VGA: only add it to VGA framebuffers
MIGRATION: We only need to allocate/handle it during migration.
CODE: Only needed with TCG, no need at all for KVM

Thread-ify dirty bitmap scanning

This is only an issue on the source host, because it scans the dirty bitmap during precopy to send whatever page is dirty. We clear the bit that is dirty, then send the pages.

This procedure does not scale with the size of the guest, especially in terms of memory. When guest memory is large enough, and especially when the bitmap is sparsely set (i.e., mostly zeros), scanning the bitmap can easily become the bottleneck: the migration thread keeps spinning over the bitmap looking for rare 1s. We may want to have some way to thread-ify this procedure so that we can scan the bitmap concurrently.

We may or may not want to introduce yet another pool of threads just to do this. If we choose not to, logically we can still rely on the multifd threads to achieve the concurrency, but then something needs to be done to enhance the capability of the multifd thread pools:

It needs to start taking workloads that have nothing to do with either the page[] array (vanilla multifd) or IOV[]
It needs a way to further enqueue a dirty page found during the scanning, by either:
- Enqueueing this "send this page" request back into the multifd thread pool, or
- Sending the page in the thread that is doing the scanning.

It would also be good to verify this issue first. It has been observed that for huge VMs (12TB, for example) the downtime can be much larger than expected, and that NIC bandwidth during the switchover is much lower than the bandwidth (mbps) seen before the switchover, even if multifd is enabled.
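The partitioning idea behind concurrent scanning can be sketched as follows. This is a toy model: the byte-array bitmap, the LSB-first bit order and both helper names are assumptions, and Python threads will not show a real speedup here; the point is only how the bitmap is split into independent ranges.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(bitmap: bytes, start: int, end: int) -> list[int]:
    """Return page indexes of set bits in bitmap[start:end] (byte offsets)."""
    dirty = []
    for byte_idx in range(start, end):
        byte = bitmap[byte_idx]
        if byte == 0:  # fast path: sparse bitmaps are mostly zero
            continue
        for bit in range(8):
            if byte & (1 << bit):
                dirty.append(byte_idx * 8 + bit)
    return dirty


def scan_bitmap(bitmap: bytes, workers: int = 4) -> list[int]:
    """Split the bitmap into contiguous chunks and scan them in a pool."""
    chunk = max(1, -(-len(bitmap) // workers))
    ranges = [(s, min(s + chunk, len(bitmap)))
              for s in range(0, len(bitmap), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: scan_chunk(bitmap, *r), ranges))
    return [page for part in parts for page in part]


bitmap = bytearray(1024)  # 8192 pages, almost all clean
bitmap[0] |= 1 << 3       # page 3 is dirty
bitmap[1000] |= 1 << 0    # page 8000 is dirty
print(scan_bitmap(bytes(bitmap)))
```

Here each worker only collects dirty page indexes; whether a worker then sends the page itself or enqueues it elsewhere is exactly the design choice listed above.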

Unify error reporting

When destination QEMU fails loading a migration stream, currently QEMU will by default crash, dumping error messages into stderr (or when with "exit-on-error" set to false, set the error in migration states so that it can be queried from "query-migrate" later).

A better way might be that when the destination QEMU fails, it reports the error to the source (instead of cutting the wire, which causes the source QEMU to see a generic channel error). Then the mgmt app can always query the source QEMU for errors as the single source of truth.

VFIO relevant

VFIO migration is now supported, but a lot of things are not mature for VFIO migration.

The migration handlers of the VFIO subsystem in QEMU are device-agnostic. VFIO migration support for a device relies on a vfio-pci variant driver which implements the required ops for migration and dirty tracking. This driver implementation itself relies on firmware calls, and the availability of HW resources can be critical for migration to succeed. This introduces new constraints on the QEMU migration subsystem, because some code paths that were previously considered error-free no longer are.

Kernel 6.9 has support for hisilicon, mlx5, pds devices. Intel QAT should be queued for 6.10. Currently, NVIDIA vGPU uses the VFIO mdev framework to support migration. Newer SR-IOV based vGPUs will use a vfio-pci variant driver in the future.

One complexity of VFIO migration is that it can be vendor specific, and the kernel driver supporting migration might be closed-source. So there may not be a lot the community can do.

Cleanups

Create a thread for migration destination

Right now it is a coroutine. It might be good to start using a thread model just like the src if possible (and we already have multifd recv threads).

Rewrite QEMUFile for migration

The QEMUFile interface is currently pretty ugly. For example, none of the qemu_put_*() APIs allow fault reporting; faults need to be detected explicitly using a separate qemu_file_get_error() or similar API. Another issue is that migration currently uses two QEMUFile objects on each side to represent the main migration channel, with both objects connected to the same QIOChannel underneath. However, the two QEMUFiles are internally bound together: for example, qemu_fclose() on the 1st QEMUFile object will also close the QIOChannel of the QEMUFile for the other direction, which can be unwanted.

A rewrite of that interface has long been wanted, but how exactly to do it needs more exploration. Quoting Daniel on a potential direction that one can consider [1]:

 In the ideal world IMHO, QEMUFile would not exist at all, and we would
 have a QIOChannelCached that adds the read/write buffering above the
 base QIOChannel.

[1] https://lore.kernel.org/r/ZcC5QTO3tmt9gaCf@redhat.com

Multifd threading

Multifd could benefit from a more standardized thread model, like a thread pool or another abstraction already implemented in the QEMU codebase. Before we can start looking at that, there are some cleanups and consolidation that need to happen.

First, there is the matter of the multifd threads doing accounting (e.g. total_normal_pages) and making a copy of the data (p->normal). These should not be the responsibility of the worker thread; they should either be on the (multifd) client side or be converted to raw bytes. The multifd thread should receive opaque data and send it without knowledge of its contents. The same goes for the packet header: multifd should obtain that information in opaque form.

A second step would be to finish moving the operations that are done on the migration thread into multifd, such as zero page detection, postcopy and any new features currently in flight.

[these^ two steps are in progress. Give us a ping on the mailing list for more information]

Finally, these cleanups should already give us enough information about the requirements of multifd to figure out whether the existing thread models in QEMU are adequate for our needs or if we need to build something from scratch.

Migration cancel concurrency

We could take a closer look at the ramifications of having migrate_cancel running concurrently with the rest of migration. That routine has side-effects which are not documented, and aside from the BQL, not explicitly protected in the code.

It is also unclear what communicates the cancelling to the rest of the code. Is it changing the state (racy, see commit 6f2b811a61)? Or is it shutting down the migration files (erases the distinction between clean cancel and error)? In any case, this mechanism could be improved by using a specific flag to communicate cancelling and a separate routine for cleanup/poking threads.
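A dedicated cancel flag could look roughly like this. The class and its fields are purely hypothetical and only model the separation being suggested: cancelling sets a flag, and the migration loop is the only place that turns it into a state change.

```python
import threading


class Migration:
    """Hypothetical model: cancel is a flag, state changes happen elsewhere."""

    def __init__(self):
        self.cancel_requested = threading.Event()
        self.state = "active"

    def cancel(self) -> None:
        # Only signal intent; do not touch state or channels here.
        self.cancel_requested.set()

    def run(self, iterations: int) -> None:
        for _ in range(iterations):
            if self.cancel_requested.is_set():
                self.state = "cancelled"  # clean cancel, distinct from error
                return
            # ... send one batch of pages ...
        self.state = "completed"


m = Migration()
m.cancel()
m.run(iterations=3)
print(m.state)
```

Because the flag is separate from channel shutdown, an I/O error and a user-requested cancel remain distinguishable.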

Migration error detection and reporting

Migration error detection and reporting are still not very clear.

Due to the demand of "keeping the last migration error query-able", we have MigrationState.error. However we still have plenty of places using other ways to detect error, like qemu_file_get_error().

OTOH, when an error happens, we still have tons of error_report() calls scattered all over, rather than properly reporting to the upper layers of the stack and updating the global MigrationState.error.

Some cleanup may be good in this regard. qemu_file_get_error() can only be removed for good if we refactor the QEMUFile API first, while most of the error_report() calls should be converted to propagate errors up the stack.

Tests

Device migration stream test framework

A major source of migration bugs is device changes to VMSDs, which can break migration from an old QEMU binary to a new one.

Such changes may not always Cc the migration maintainers, and due to the limited bandwidth of the migration maintainers it may not be possible to review all such changes anyway.

Currently, the main migration test we have in QEMU is still focused on the migration framework in general. This means it has no coverage of migration stream compatibility for specific devices, so migration can still break when a specific device is configured in some VM setup, even if the migration tests all pass.

It would be great to have a device test framework in some form, so that there will be some coverage of, for example, a list of devices migrated from an older QEMU version to the new QEMU version (for example, the current branch), or even bi-directional migrations to allow backward migrations.

Note that even such a test framework may not cover 100% of device migration. One reason is that it may be extremely hard (if not impossible) to cover all devices, as some devices can be special in ways that make them unsuitable for the proposed framework. Another is that a device's VMSD stream can depend on the device state, so it may change with guest behavior (e.g., the VMSD may differ between an idle and an active device). However, it would still aim to cover the major usage scenarios.

Ping-pong migrations

It would be useful to refactor the initialization of test objects in a way that allows us to do ping-pong migrations (i.e. A->B->A). Currently that is not possible because test_migrate_start() has some hardcoded assumptions around src & dst identifiers (src/dst_serial, src/dst_state) and test_migrate_end() always cleans up both objects at once. The start_args->only_target option is not very useful these days; we could remove it in favor of two separate init functions, one for each object.

Migration downtime performance test

QEMU has a bunch of migration qtests, so far most of them are not performance relevant but functional. It may make sense to provide one performance based qtest to provide some fundamental downtime measurements for a migration process.

Such a test can be built on top of the existing vmstate_downtime_* tracepoints. The tracepoints are now supported on both source and destination QEMUs.

To start simple (while hopefully still covering the basics), we can define such an initial test as follows:

It should support both precopy and postcopy, measuring downtime during switchover
The report should be flexible, reporting a summary (e.g. total downtime) but also portions of the downtime. We could start with very basic portions, e.g., iterable, non-iterable, total downtime.
We could start with reporting only source QEMU downtime, because:
- Only source has the complete picture of downtime, and
- Involving destination downtime can start to become challenging too, because some of the source downtime will overlap (happen concurrently) with the destination downtime.
It should be able to run a few rounds of the same measurement and report some statistics, rather than relying on a single test result.
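The reporting side of the requirements above could be sketched like this. The round data is made up for illustration, and the `summarize()` helper and the report layout are assumptions; how the per-round numbers would be extracted from the tracepoints is not shown.

```python
import statistics


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Basic statistics over several downtime measurements (milliseconds)."""
    return {
        "min": min(samples_ms),
        "max": max(samples_ms),
        "mean": statistics.mean(samples_ms),
        "stdev": statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0,
    }


# Hypothetical per-round results, split into the portions mentioned above.
rounds = [
    {"iterable": 40.0, "non-iterable": 12.0},
    {"iterable": 44.0, "non-iterable": 10.0},
    {"iterable": 42.0, "non-iterable": 14.0},
]

report = {part: summarize([r[part] for r in rounds]) for part in rounds[0]}
report["total"] = summarize([sum(r.values()) for r in rounds])

for part, stats in report.items():
    print(part, stats)
```

Reporting min/max/stdev alongside the mean makes it easier to tell a real regression from run-to-run noise, which matters if this ever feeds CI gating.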

With that, maybe at some point we could add CI gating that tries to detect obvious downtime regressions, but that is for later.

Bugs/Known Issues

Upstream reports

If you're looking for other migration bugs to fix, feel free to have a look at:

https://gitlab.com/qemu-project/qemu/-/issues/?label_name%5B%5D=Migration

Pick up whatever you want!

Instance ID mismatch after device unplug

We may have an issue with using instance_id to identify a device (along with SaveStateEntry.idstr[]). The issue is that after a device with id=1 is unplugged, the device with id=2 is left alone on the source, but when the VM is created on the destination with the same cmdline (minus the unplugged id=1 device), the leftover device will get id=1.
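The mismatch can be reproduced with a toy model of ID assignment. This is a simplification for illustration only: the `Machine` class is hypothetical, IDs start at 1 just to match the example above, and QEMU's real allocation logic may differ in detail.

```python
class Machine:
    """Toy model: each plugged device gets the next free instance ID."""

    def __init__(self, devices=()):
        self.instance_ids = {}
        self._next_id = 1  # start at 1 only to match the example in the text
        for name in devices:
            self.plug(name)

    def plug(self, name: str) -> None:
        self.instance_ids[name] = self._next_id
        self._next_id += 1

    def unplug(self, name: str) -> None:
        del self.instance_ids[name]


src = Machine(["dev-a", "dev-b"])  # dev-a -> 1, dev-b -> 2
src.unplug("dev-a")                # dev-b keeps instance ID 2
dst = Machine(["dev-b"])           # same cmdline minus dev-a: dev-b -> 1

print(src.instance_ids["dev-b"], dst.instance_ids["dev-b"])  # 2 1
```

The same device ends up with different IDs on the two sides, so the destination cannot match the incoming section to its device.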

Issue link: https://issues.redhat.com/browse/RHEL-56331

For the longer term we may want to use something more reliable to replace instance_id, so it can live across device unplugs, and still match with destination.

Network failure during postcopy can cause guest reset

In case a guest is blocked in an unresolved page fault for an extended amount of time, for example due to a network issue, the guest might get reset by a watchdog device. For the reset to happen, the following conditions need to be met:

a page fault in the guest memory occurs and it is not resolved before the configured watchdog timeout,
watchdog daemon running in the guest is blocked by the page-fault,
QEMU thread advancing the timers is not blocked by the page-fault.

Issue link: https://issues.redhat.com/browse/RHEL-60552

This cannot be correctly prevented in QEMU alone — the watchdog device is working as intended, resetting a machine that is unresponsive. In case the guest must not be reset, the guest watchdog daemon must disable the watchdog or increase its timeout. Possibly, the guest could be informed about a running migration by a PV interface (needs to be implemented), so it can prepare accordingly.