ToDo/LiveMigration

From QEMU
= Known Issues =


This section tracks long-standing known issues with live migration.  We need to fix them.


== Detect errors when putting CPU registers during migration ==


This is a long-standing issue: we do not fail a migration even if e.g. ioctl(KVM_SET_XSAVE) fails when applying CPU state to KVM on the destination.  Because the failure is silently ignored, it can cause mysterious guest issues later.


See this thread:
  https://lore.kernel.org/all/20220617144857.34189-1-peterx@redhat.com/
We need to revive it in one form or another.
== Documentation, documentation, documentation ==
Migration documentation is far from mature.  There are a lot of things to do here.  For example, we should have a good page describing each of the migration features and parameters.  Maybe we can have something generated from qapi/migration.json.
 
We should have a unified place for migration documentation. The plan for now is to make it:
 
  https://qemu.org/docs/master/devel/migration
 
We may want to move some wiki pages there, and fill up the holes.
 
We can keep using this page for migration relevant todo items, or we can move to some other place in the future.
 
= New Features =
 
== Migration handshake ==
 
Currently QEMU relies on Libvirt to set up the same migration capabilities / parameters properly on both sides.  This may not be needed in the future if the source and destination QEMU can have a conversation themselves, and perhaps do more than that.
 
A summary of procedures for such a handshake may contain, but is not limited to:
 
* Feature negotiation. We'll probably have a list of features, so the source can fetch it from the destination and see what to enable and what to disable.
 
* Sync migration configurations between src/dst and decide the final configuration. E.g., the destination can be missing some feature, and the source should know what's missing.  Once we pass this phase, both QEMUs should have the same migration setup and be all good to proceed.
 
* Sync the channels used for migration (mostly covering multifd / postcopy preempt).  Channels should be managed by the handshake core and named properly, so we know which channel is for what.
 
* Check the device tree on both sides, etc., to make sure the migration is applicable.  E.g., we should fail early and clearly on any device mismatch.
 
== KTLS support ==
 
Currently we do TLS live migration leveraging gnutls.  Is it possible to support KTLS?  How can we further optimize TLS migration?  Is it possible to leverage hardware acceleration?  Etc.
 
== Multifd+Postcopy ==
 
Currently the two features are not compatible, because postcopy requires atomic updates of pages, while multifd is so far designed to receive pages directly into guest memory buffers.
 
Another issue is that the multifd code is still not well tested with postcopy VM switchover, so there can be bugs here and there around switchover, even though that shouldn't be a major design issue.
 
Adding support for multifd+postcopy in some form should increase migration bandwidth for postcopy.  Currently, during the postcopy phase we mostly use a single channel for migrating pages, which can easily bottleneck on a single CPU (postcopy preempt mode is an exception; however, its extra channel is for now only used for servicing faulted pages, so overall bandwidth may not be improved by it).
 
== Live snapshot for file memory (e.g. shmem) ==
 
We have background live snapshot support for anonymous memory.  We don't yet support shared memory like shmem because userfaultfd-wp did not support file memory when it was introduced.  Now that userfaultfd-wp supports shared memory, we can consider adding shmem support for snapshots.
 
Note that multi-process shmem support will need extra work, as QEMU will also require the other processes to trap their writes to guest memory.
 
= Downtime optimizations =
 
There are a bunch of things we can consider doing to optimize migration downtime.
 
== Device state downtime analysis and accountings ==
 
Currently live migration estimates the downtime _without_ considering device states.  VFIO is a special case because it is migrated even during iterative sections, and it reports its own pending device state periodically, so QEMU can account for it when estimating the downtime.
 
However, 99% of devices don't do what VFIO does.  The current model is simplified for most devices, assuming that most device state is trivial to migrate and therefore not accounted as part of downtime.  However, that may not always be the case.
 
It might be useful to think of ways to investigate common cases where device state can take time.  A few examples:
 
For example, what if there are one thousand vCPUs?  Even if saving/loading one vCPU does not take long, a thousand of them should be accounted as part of downtime.
 
Or, what if a device's pre_save()/post_load() takes a lot of time?  We seem to have observed that with virtio devices already, where loading can take a relatively long time.  One can also refer to the section "Optimize memory updates for non-iterative vmstates".
 
== Optimize memory updates for non-iterative vmstates ==
 
There is a potential chance to speed up device loading by optimizing QEMU memory updates.  One can refer to this thread for more information:
 
  https://lore.kernel.org/r/20230317081904.24389-1-xuchuangxclwt@bytedance.com
 
The work was proposed but not merged due to lack of review.  Still, the direction seems fine and is a possible path for future optimizations to shrink the downtime.
 
== Optimize postcopy downtime ==
 
For postcopy, there is currently an extra step of sending the dirty bitmap during switchover, which is extra overhead compared to precopy migration.  The bitmap is normally small enough to ignore, but that may not be the case for a very large VM: 1TB of memory corresponds to a ~32MB bitmap.  Meanwhile, receiving the bitmap also requires punching holes in the "just dirtied" pages, which takes time as well.
 
One thing we can do here is allow the destination to start running even _before_ the bitmap is received.  A new interface would be needed so the destination QEMU can ask the source QEMU "whether this page is the latest".  Meanwhile, MINOR userfaults would be needed instead of MISSING userfaults, so that the destination QEMU can trap accesses even to pages that already exist.  It also means that with current Linux (as of today, v6.7) anonymous pages cannot be supported for this use case, as MINOR faults only support VM_SHARED.
 
== Optimize migration bandwidth calculation ==
 
Bandwidth estimation is very important to migration, because it is the main factor used to decide when to switch the VM over from source to destination.  However, it can sometimes be estimated so badly that in QEMU 8.2 we introduced a new migration parameter ("avail-switchover-bandwidth") just to let an admin specify the value when needed.  However, it is not easy to specify that value correctly.
 
There can be other ways to remedy this, e.g., changing the estimation logic to provide an average bandwidth.
 
= VFIO =
 
A lot of things are not mature for VFIO migration.

Revision as of 09:45, 9 January 2024
