Features/MicroCheckpointing

== Summary ==
Patches welcome! (A machine would be welcome too. We have limited hardware access).


This is an implementation of Micro Checkpointing for memory and cpu state. Also known as: "Continuous Replication" or "Fault Tolerance" or 100 other different names - choose your poison.
=== Contact ===
* '''Name:''' [[User:Michael Hines|Michael Hines]]
* '''Contact:''' http://michael.hinespot.com


Wiki: [http://wiki.qemu.org/Features/MicroCheckpointing http://wiki.qemu.org/Features/MicroCheckpointing]


Github: [https://github.com/hinesmr/qemu/tree/mc https://github.com/hinesmr/qemu/tree/mc], 'mc' branch


(not ready) Libvirt Support: [https://github.com/hinesmr/libvirt/tree/mc https://github.com/hinesmr/libvirt/tree/mc], 'mc' branch


Copyright (C) 2015 IBM Michael R. Hines <mrhines@us.ibm.com>


=== Introduction ===


Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a
running virtual machine (VM) with little or no runtime assistance from the guest
kernel or guest application software. Furthermore, Fault Tolerance
is one method of providing high availability to a VM such that, from the
perspective of the outside world (clients, devices, and neighboring VMs that
may be paired with it), the VM and its applications have not lost any runtime
state in the event of either a failure of the hypervisor/hardware to allow the
VM to make forward progress or a complete loss of power. This mechanism for
providing fault tolerance does *not* provide any protection whatsoever against
software-level faults in the guest kernel or applications. In fact, due to
the potentially extended lifetime of the VM because of this type of high
availability, such software-level bugs may in fact manifest themselves
more often than they otherwise ordinarily would, in which case you would need to
employ other forms of availability to guard against such software-level faults.


This implementation is also fully compatible with RDMA and has undergone special
optimizations to support the use of RDMA. (See docs/rdma.txt for more details).


== The Micro-Checkpointing Process ==


=== Basic Algorithm ===
Micro-Checkpoints (MC) work against the existing live migration path in QEMU,
and can effectively be understood as a "live migration that never ends".
As such, iteration rounds happen at the granularity of 10s of milliseconds
and perform the following steps:


  4. Resume the VM immediately so that it can make forward progress.
  5. Transmit the checkpoint to the destination.
  6. Repeat


Upon failure, load the contents of the last MC at the destination back

state that was committed by the virtual machine after failure. This is
possible because a checkpoint may diverge by N milliseconds of time and
commit state while the current MC is being transmitted to the destination.


To guard against this problem, first, we must "buffer" the TX output of the

at the source until the MC is transmitted. After transmission is complete,
those packets can be released. Similarly, in the case of disk I/O, we must
ensure that either the contents of the local disk are safely mirrored to a
remote disk before completing a MC or that the output to a shared disk,
such as iSCSI, is also buffered between checkpoints and then later released


For the network in particular, buffering is performed using a series of
netlink (libnl3) Qdisc "plugs", introduced by the Xen Remus implementation.
All packets go through netlink in the host kernel - there are no exceptions and no gaps.
Even while one buffer is being released (say, after a checkpoint has been
saved), another plug will have already been initiated to hold the next round
  3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
  4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
  5. Resume the VM immediately so that it can make forward progress (making use of Buffer B).
  6. Transmit the MC to the destination.
  7. Wait for acknowledgement.
  8. Acknowledged.
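The double-buffered round above can be sketched as a minimal simulation. This is illustrative only: the names (QdiscPlug, mc_round) are not QEMU's actual API, the earlier steps of the round are folded into the function, and the release of Buffer A stands in for the real Qdisc unplug operation.

```python
from collections import deque

class QdiscPlug:
    """Simulated netlink Qdisc 'plug': holds outbound packets in the host kernel."""
    def __init__(self):
        self.held = deque()

def mc_round(dirty_pages, released, buffer_a, transmit):
    """One simulated micro-checkpoint round (steps 3-8 above).

    dirty_pages: pages dirtied since the last round (the VM is paused here)
    released:    list receiving packets that are now safe to emit
    buffer_a:    plug holding the VM's output from the previous round
    transmit:    callable that sends the MC and returns True once acknowledged
    """
    staging = dict(dirty_pages)         # 3. copy dirty memory into local staging
    buffer_b = QdiscPlug()              # 4. a new plug buffers all *new* packets
    # 5. the VM resumes here; its new output lands in buffer_b
    if transmit(staging):               # 6./7./8. send the MC, wait for the ack
        released.extend(buffer_a.held)  # only now is buffer A's output committed
        buffer_a.held.clear()           # to the outside world
    return buffer_b                     # buffer B becomes buffer A next round
```

The key property the simulation preserves: if no acknowledgement arrives, buffer A is never released, so no output escapes that the destination's last checkpoint cannot explain.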


This implementation *currently* only supports buffering for the network.
(Any help on implementing disk support would be greatly appreciated).
Due to this lack of disk support, this requires that the VM's
root disk or any non-ephemeral disks also be
made network-accessible directly from within the VM. Until the aforementioned
buffering or mirroring support is available (ideally through drive-mirror),


Due to the high-frequency nature of micro-checkpointing, we expect
a new MC to be generated many times per second. Even missing just
a few MCs easily constitutes a failure. Because of the consistent
buffering of device I/O, this is safe because device I/O is not committed
to the outside world until the MC has been received at the destination.


Failure is thus assumed under two conditions:

1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This happens very early in the loss of the latest MC not only because a very large amount of bytes is typically being sequenced in a TCP stream but perhaps also because of the timeout in acknowledgement of the receipt of a commit message by the destination.

2. MC over RDMA: Since Infiniband does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed.

In both cases, either due to a failed TCP socket connection or a lost RDMA
keep-alive group, either the sender or the receiver can be deemed to have failed.

If the sender is deemed to have failed, the destination takes over immediately
using the contents of the last checkpoint.
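The keep-alive half of this failure model can be sketched as follows. This is a hedged simulation: the threshold of three missed messages and the helper names are assumptions for illustration, not the actual values or API of QEMU's RDMA keep-alive protocol.

```python
def watch_keepalives(recv_keepalive, max_missed=3):
    """Declare the peer failed after max_missed consecutive missed keep-alives.

    recv_keepalive: callable returning True if a keep-alive arrived within
    the expected interval, False on timeout (simulated here).
    """
    missed = 0
    while True:
        if recv_keepalive():
            missed = 0               # any keep-alive resets the counter
        else:
            missed += 1              # a single lost message is not yet failure
            if missed >= max_missed:
                return "peer-failed" # takeover (or re-arming) happens here
```

The counter reset on every received message is what makes the loss of *multiple consecutive* keep-alives, rather than sporadic drops, the failure condition.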


which may involve a third-party to identify a new destination host again to
use as a backup for the VM.


== Optimizations ==


=== Memory Management ===
MCs are typically only a few MB when idle. However, they can easily be very large during heavy workloads. In the *extreme* worst-case, QEMU will need double the amount of main memory than that of what was originally allocated to the virtual machine.


To support this variability during transient periods, a MC consists of a linked list of slabs, each of identical size. A better name would be welcome, as the name was only chosen because it resembles Linux memory allocation. Because MCs occur several times per second (a frequency of 10s of milliseconds), slabs allow MCs to grow and shrink without constantly re-allocating all memory in place during each checkpoint. During steady-state, the 'head' slab is permanently allocated and never goes away, so when the VM is idle, there is no memory allocation at all. This design supports the use of RDMA. Since RDMA requires memory pinning, we must be able to hold on to a slab for a reasonable amount of time to get any real use out of it.


Regardless, the current strategy taken will be:
  1. If the checkpoint size increases, then grow the number of slabs to support it.
  2. If the next checkpoint size is smaller than the last one, then that's a "strike".
  3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before).
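The grow-fast/shrink-slow policy above can be written down as a small function. This is a sketch: the strike threshold N is left as a parameter because the text does not fix it, and for simplicity a "strike" here compares the checkpoint against the current cache size rather than against the previous checkpoint.

```python
def resize_slab_cache(nr_slabs, strikes, slabs_needed, n_strikes=3):
    """Apply the three rules above to the slab cache.

    Returns the new (nr_slabs, strikes) pair. Growth is immediate; shrinking
    happens only after n_strikes consecutively smaller checkpoints, and never
    below the permanently allocated 'head' slab.
    """
    if slabs_needed > nr_slabs:
        return slabs_needed, 0              # 1. grow to fit the checkpoint
    if slabs_needed < nr_slabs:
        strikes += 1                        # 2. a smaller checkpoint is a "strike"
        if strikes >= n_strikes:
            return max(nr_slabs // 2, 1), 0 # 3. halve, minimum of 1 slab
    return nr_slabs, strikes
```

The asymmetry is the point: a burst of dirty memory grows the cache in one round, but the cache only shrinks after sustained evidence that the extra slabs (which may be RDMA-pinned) are no longer needed.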


As of this writing, the average size of a Linux-based Idle-VM checkpoint is under 5MB.


=== RDMA Integration ===
RDMA is instrumental in enabling better MC performance, which is the reason
why it was introduced into QEMU first.
RDMA is used for two different reasons:

  1. Checkpoint generation (RDMA-based memcpy)
  2. Checkpoint transmission


Checkpoint generation must be done while the VM is paused.
In the worst case, the size of the checkpoint can be equal in size
to the amount of memory in total use by the VM. In order to resume
VM execution as fast as possible, the checkpoint is copied consistently
locally into a staging area before transmission. A standard
memcpy() of potentially such a large amount of memory not only gets
no use out of the CPU cache but also potentially clogs up the CPU pipeline
which could otherwise be used by other neighbor VMs on the same
physical node that could be scheduled for execution. To minimize
the effect on neighbor VMs, we use RDMA to perform a "local" memcpy(),
bypassing the host processor. On more recent processors, a 'beefy'
enough memory bus architecture can move memory just as fast (sometimes
faster) than a pure-software CPU-only optimized memcpy() from libc.
However, on older computers, this feature only gives you the benefit
of lower CPU utilization at the expense of MC performance, so for
some time, most users with older memory speeds will want to leave this
feature disabled by default.


Checkpoint transmission can potentially also consume very large amounts of
both bandwidth and CPU utilization that could otherwise be used by
the VM itself or its neighbors. Once the aforementioned local copy of the
checkpoint is saved, this implementation makes use of the same RDMA
hardware to perform the transmission, exactly the same way that a live migration
happens over RDMA (see docs/rdma.txt).


== Usage ==


First, compile QEMU with '--enable-mc' and ensure that the corresponding
libraries for netlink (libnl3) are available. The netlink 'plug' support from the
Qdisc functionality is required in particular, because it allows QEMU to
direct the kernel to buffer outbound network packets between checkpoints
as described previously. Do not proceed without this support in a production
environment, or you risk corrupting the state of your I/O.


  $ git clone http://github.com/hinesmr/qemu.git
QEMU Monitor Command:

  $ migrate_set_capability mc on # disabled by default


Currently, only one network interface is supported, *and* currently you
if you simply want to get an understanding of the performance
penalties associated with this feature activated.

Currently required until testing is complete: there are some COLO
disk replication patches that I am testing, but they don't work yet,
so you have to explicitly set this:

QEMU Monitor Command:

  $ migrate_set_capability mc-disk-disable on # disk replication activated by default


Next, you can optionally disable network-buffering for additional test-only
execution. This is useful if you want to get a breakdown only of what the cost
of checkpointing the memory state is without the cost of checkpointing device state.


QEMU Monitor Command:
  $ migrate_set_capability mc-rdma-copy on # disabled by default


Additionally, you can tune the checkpoint frequency. By default it is set to
checkpoint every 100 milliseconds. You can change that at any time,
like this:

QEMU Monitor Command:

  $ migrate-set-mc-delay 100 # checkpoint every 100 milliseconds


Finally, if you are using QEMU's support for RDMA migration, you will want

  $ migrate_set_capability rdma-keepalive on # disabled by default


=== libnl / NETLINK compatibility ===

Unfortunately, you cannot just install any version of libnl, as we depend
on a recently introduced feature from Xen Remus into libnl called
"Qdisc Plugs" which perform the network buffering functions of
micro-checkpointing in the host Linux kernel.

As of today, the minimum version you would need from my Ubuntu system
would be the following packages (or their equivalents on Redhat/Fedora/Debian, etc.):

  libnl-3-200_3.2.16-0ubuntu1_amd64.deb
  libnl-3-dev_3.2.16-0ubuntu1_amd64.deb
  libnl-cli-3-200_3.2.16-0ubuntu1_amd64.deb
  libnl-cli-3-dev_3.2.16-0ubuntu1_amd64.deb
  libnl-genl-3-200_3.2.16-0ubuntu1_amd64.deb
  libnl-genl-3-dev_3.2.16-0ubuntu1_amd64.deb
  libnl-nf-3-200_3.2.16-0ubuntu1_amd64.deb
  libnl-nf-3-dev_3.2.16-0ubuntu1_amd64.deb
  libnl-route-3-200_3.2.16-0ubuntu1_amd64.deb
  libnl-route-3-dev_3.2.16-0ubuntu1_amd64.deb
  libnl-utils_3.2.16-0ubuntu1_amd64.deb

There have also been reports of failure on newer versions, so there may need to be some
extra work in case libnl is introducing backwards-incompatible changes.


=== Running ===


Now, install a Qdisc plug to the tap device using the same
naming convention as the tap device created by QEMU (it must
be the same, because QEMU needs to interact with the IFB device
and the only mechanism we have right now of knowing the name
of the IFB devices is to assume that it matches the tap device
numbering scheme):

  $ ip link set up ifb0 # <= corresponds to tap device 'tap0'


Numbers are still coming in, but without output buffering of network I/O,
the performance penalty of a typical 4GB RAM Java-based application server workload
using a 10 Gbps link (a good worst case for testing due to Java's constant
garbage collection) is on the order of 25%. With network buffering activated,
this can be as high as 50%.


Assuming that you have a reasonable 10G (or RDMA) network in place, the majority
of the penalty is due to the time it takes to copy the dirty memory into a staging
area before transmission of the checkpoint. Any optimizations / proposals to speed
this up would be welcome!


The remaining penalty, which comes from network buffering, is typically due to checkpoints
not occurring fast enough, since a typical "round trip" time between the request of
an application-level transaction and the corresponding response should ideally be


We believe that this effect is "amplified" due to the poor performance of
copying the dirty memory to staging: since an application-level RTT cannot
be serviced with more frequent checkpoints, network I/O tends to get held in
the buffer too long. This has the effect of causing the guest TCP/IP stack

way up to the application.


=== Libvirt Support ===

NOTE: This is not supported yet. Network buffering and disk replication simply do not exist. It is only for benchmarking.

If you want to contribute patches, you're more than welcome.

This does work if you checkout the branch mentioned at the beginning of this page, with one catch:

  $ virsh migrate --live --mc --mc-net-disable test qemu+tcp://ftdest/system

You must use the "mc-net-disable" option for now because the packet buffer support required by
netlink and the setup of the IFB device has not been written yet. As a result, this
option is only for performance testing until someone (probably me) has time to set up
the netlink commands inside libvirt properly.


== FAQ / Frequently Asked Questions ==
Several things about this implementation attempt are different from previous implementations:


1. This implementation is 100% compatible with RDMA.

2. Memory management is completely overhauled - malloc()/free() churn is reduced to a minimum.

3. This is not a port of Kemari - it is (yet another) re-write, focusing on performance.

4. Network I/O buffering is outsourced to the host kernel, using netlink code introduced by the Remus/Xen project.

5. We make every attempt to change as little of the existing migration call path as possible.

== TODO ==

1. The main bottleneck is to improve the performance of the local memory copy to staging memory. The faster we can copy, the faster we can flush the network buffer.

2. Integrate with disk replication from the COLO team.

3. Implement output commit buffering for shared storage.

4. TOO MANY COMMANDS! A conversation with Eric Blake has a nice recommendation which "someone" needs to implement:

Mailing list discussion:

>> We're building up a LOT of migrate- tunable commands.  Maybe it's time
>> to think about building a more generic migrate-set-parameter, which
>> takes both the name of the parameter to set and its value, so that a
>> single command serves all parameters, instead of needing a proliferation
>> of commands.  Of course, for that to be useful, we also need a way to
>> introspect which parameters can be tuned; whereas with the current
>> approach of one command per parameter (well, 2 for set vs. get) the
>> introspection is based on whether the command exists.
>
> I asked to have that.  My suggestion was that
>
> migrate_set_capability auto-throttle on
>
> So we could add it to new variables without extra change.
>
> And I agree that having a way to read them, and ask what values they
> have is a good idea.
>
> Luiz, any good idea about how to do it through QMP?

I'm trying to think of a back-compat method, which exploits the fact
that we now have flat unions (something we didn't have when
migrate-set-capabilities was first added). Maybe something like:

  { 'type': 'MigrationCapabilityBase',
    'data': { 'capability': 'MigrationCapability' } }
  { 'type': 'MigrationCapabilityBool',
    'data': { 'state': 'bool' } }
  { 'type': 'MigrationCapabilityInt',
    'data': { 'value': 'int' } }
  { 'union': 'MigrationCapabilityStatus',
    'base': 'MigrationCapabilityBase',
    'discriminator': 'capability',
    'data': {
      'xbzrle': 'MigrationCapabilityBool',
      'auto-converge': 'MigrationCapabilityBool',
      ...
      'mc-delay': 'MigrationCapabilityInt'
    } }

along with a tweak to query-migrate-capabilities for full back-compat:

  # @query-migrate-capabilities
  # @extended: #optional defaults to false; set to true to see non-boolean
  #   capabilities (since 2.1)
  { 'command': 'query-migrate-capabilities',
    'data': { '*extended': 'bool' },
    'returns': ['MigrationCapabilityStatus'] }

Now, observe what happens.  If an old client calls { "execute":
"query-migrate-capabilities" }, they get a return that lists ONLY the
boolean members of the MigrationCapabilityStatus array (good, because if
we returned a non-boolean, we would confuse the consumer when they are
expecting a 'state' variable that is not present) - what's more, this
representation is identical on the wire to the format used in earlier
qemu.  But new clients can call { "execute":
"query-migrate-capabilities", "arguments": { "extended": true } }, and
get back:

  { "capabilities": [
      { "capability": "xbzrle", "state": true },
      { "capability": "auto-converge", "state": false },
      ...
      { "capability": "mc-delay", "value": 100 }
    ] }

Also, once a new client has learned of non-boolean extended
capabilities, they can also set them via the existing command:

  { "execute": "migrate-set-capabilities",
    "arguments": [
        { "capability": "xbzrle", "state": false },
        { "capability": "mc-delay", "value": 200 }
    ] }

So, what do you think?  My slick type manipulation means that we need
zero new commands, just a new option to the query command, and a new
flat union type that replaces the current struct type.  The existence
(but not the type) of non-boolean parameters is already introspectible
to a client new enough to request an 'extended' query, and down the
road, if we ever gain full QAPI introspection, then a client also would
gain the ability to learn the type of any non-boolean parameter as well.

Stability wise, as long as we never change the type of a capability
once first exposed, then if a client plans on using a particular
parameter when available, it can already hard-code what type that
parameter should have without even needing full QAPI introspection (that
is, if libvirt is taught to manipulate mc-delay, libvirt will already
know to expect mc-delay as an int, and not any other type, and merely
needs to probe if qemu supports the 'mc-delay' extended capability).
And of course, this new schema idea can retroactively cover all existing
migration tunables, such as migrate_set_downtime, migrate_set_speed,
migrate-set-cache-size, and so on.

Latest revision as of 14:39, 20 July 2015

Summary

Patches welcome! (A machine would be welcome too. We have limited hardware access).

This is an implementation of Micro Checkpointing for memory and cpu state. Also known as: "Continuous Replication" or "Fault Tolerance" or 100 other different names - choose your poison.

Contact

Wiki: http://wiki.qemu.org/Features/MicroCheckpointing

Github: https://github.com/hinesmr/qemu/tree/mc, 'mc' branch

(not ready) Libvirt Support: https://github.com/hinesmr/libvirt/tree/mc, 'mc' branch

Copyright (C) 2015 IBM Michael R. Hines <mrhines@us.ibm.com>

Introduction

Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a running virtual machine (VM) with little or no runtime assistance from the guest kernel or guest application software. Furthermore, Fault Tolerance is one method of providing high availability to a VM such that, from the perspective of the outside world (clients, devices, and neighboring VMs that may be paired with it), the VM and its applications have not lost any runtime state in the event of either a failure of the hypervisor/hardware to allow the VM to make forward progress or a complete loss of power. This mechanism for providing fault tolerance does *not* provide any protection whatsoever against software-level faults in the guest kernel or applications. In fact, due to the potentially extended lifetime of the VM because of this type of high availability, such software-level bugs may in fact manifest themselves more often than they otherwise ordinarily would, in which case you would need to employ other forms of availability to guard against such software-level faults.

This implementation is also fully compatible with RDMA and has undergone special optimizations to support the use of RDMA. (See docs/rdma.txt for more details).

The Micro-Checkpointing Process

Basic Algorithm

Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:

1. After N milliseconds, stop the VM.
2. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
3. Resume the VM immediately so that it can make forward progress.
4. Transmit the checkpoint to the destination.
5. Repeat.

Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.

I/O buffering

Additionally, a MC must include a consistent view of device I/O, particularly the network, a problem commonly referred to as "output commit". This means that the outside world cannot be allowed to observe state committed by the virtual machine that would be lost on failure. Such divergence is possible because the VM may run ahead by N milliseconds, committing new state, while the current MC is still being transmitted to the destination.

To guard against this problem, first, we must "buffer" the TX output of the network (not the input) between MCs until the current MC is safely received by the destination. For example, all outbound network packets must be held at the source until the MC is transmitted. After transmission is complete, those packets can be released. Similarly, in the case of disk I/O, we must ensure that either the contents of the local disk are safely mirrored to a remote disk before completing a MC or that the output to a shared disk, such as iSCSI, is also buffered between checkpoints and then later released in the same way.

For the network in particular, buffering is performed using a series of netlink (libnl3) Qdisc "plugs", introduced by the Xen Remus implementation. All packets go through netlink in the host kernel - there are no exceptions and no gaps. Even while one buffer is being released (say, after a checkpoint has been saved), another plug will have already been initiated to hold the next round of packets simultaneously while the current round of packets are being released. Thus, at any given time, there may be as many as two simultaneous buffers in place.

With this in mind, here is the extended procedure for the micro checkpointing process:

1. Insert a new Qdisc plug (Buffer A).

Repeat Forever:

2. After N milliseconds, stop the VM.
3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
5. Resume the VM immediately so that it can make forward progress (making use of Buffer B).
6. Transmit the MC to the destination.
7. Wait for acknowledgement.
8. Acknowledged.
9. Release the Qdisc plug for Buffer A.
10. Qdisc Buffer B now becomes (is symbolically renamed to) the most recent Buffer A.
11. Go back to Step 2
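The plug/release bookkeeping in the steps above can be sketched as a toy simulation (a hypothetical model with made-up helper names; the real buffering happens in the host kernel via libnl Qdisc plugs, not in QEMU itself):

```python
import collections

class QdiscSim:
    """Toy stand-in for the kernel's Qdisc plug/release interface."""
    def __init__(self):
        self.buffers = collections.deque()  # oldest plug first

    def plug(self):
        self.buffers.append([])             # start a new buffer for new packets

    def send(self, pkt):
        self.buffers[-1].append(pkt)        # packets always land in the newest plug

    def release_oldest(self):
        return self.buffers.popleft()       # flush Buffer A to the wire

def mc_round(qdisc, stop_vm, checkpoint, resume_vm, transmit, wait_ack):
    """One micro-checkpoint iteration (steps 2-10 above)."""
    stop_vm()
    mc = checkpoint()    # copy dirty memory into the local staging area
    qdisc.plug()         # Buffer B: holds packets from the resumed VM
    resume_vm()
    transmit(mc)
    wait_ack()           # destination has the checkpoint; Buffer A is now safe
    released = qdisc.release_oldest()
    return released      # Buffer B is now the oldest plug, i.e. the new "Buffer A"
```

Note how there is never a gap: Buffer B is plugged before the VM resumes, so every packet the resumed VM emits is captured while the previous round's packets drain.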

This implementation *currently* only supports buffering for the network. (Any help implementing disk support would be greatly appreciated.) Due to this lack of disk support, the VM's root disk and any non-ephemeral disks must also be made network-accessible directly from within the VM. Until the aforementioned buffering or mirroring support is available (ideally through drive-mirror), the only "consistent" way to provide full fault tolerance for the VM's non-ephemeral disks is to construct a VM whose root disk boots directly from iSCSI, NFS, or similar, so that all disk I/O is translated into network I/O.

Buffering is performed using an IFB device attached to the KVM tap device, combined with a netlink Qdisc plug (exactly like the Xen Remus solution).

Failure Recovery

Due to the high-frequency nature of micro-checkpointing, we expect a new MC to be generated many times per second, so missing even a few MCs constitutes a failure. This is safe because of the consistent buffering of device I/O: device I/O is not committed to the outside world until the MC has been received at the destination.

Failure is thus assumed under two conditions:

1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This is detected very early in the loss of the latest MC, both because a very large number of bytes is typically in flight in the TCP stream and because the sender times out waiting for the destination to acknowledge receipt of a commit message.

2. MC over RDMA: Since Infiniband does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed.

In both cases, whether due to a failed TCP socket connection or lost RDMA keep-alives, either the sender or the receiver can be deemed to have failed.
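As a rough illustration, the RDMA keep-alive failure detector described above amounts to counting consecutive missed messages (the threshold and class name here are assumptions for the sketch, not QEMU's actual protocol constants):

```python
class KeepaliveMonitor:
    """Declare the peer failed after MAX_MISSED consecutive missed keep-alives."""
    MAX_MISSED = 3  # assumed threshold, not QEMU's actual value

    def __init__(self):
        self.missed = 0

    def on_keepalive(self):
        self.missed = 0  # any received keep-alive resets the counter

    def on_interval_elapsed(self):
        """Called each keep-alive interval with no message received.

        Returns True once the peer should be deemed failed."""
        self.missed += 1
        return self.missed >= self.MAX_MISSED
```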

If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.

If the destination is deemed to be lost, we perform the same action as a live migration: resume the sender normally and wait for management software to make a policy decision about whether or not to re-protect the VM, which may involve a third-party to identify a new destination host again to use as a backup for the VM.

Optimizations

Memory Management

Managing QEMU memory usage in this implementation is critical to the performance of any micro-checkpointing (MC) implementation.

MCs are typically only a few MB when idle. However, they can easily grow very large during heavy workloads. In the *extreme* worst case, QEMU will need twice the amount of main memory originally allocated to the virtual machine.

To support this variability during transient periods, a MC consists of a linked list of slabs, each of identical size. A better name would be welcome, as the name was only chosen because it resembles linux memory allocation. Because MCs occur several times per second (a frequency of 10s of milliseconds), slabs allow MCs to grow and shrink without constantly re-allocating all memory in place during each checkpoint. During steady-state, the 'head' slab is permanently allocated and never goes away, so when the VM is idle, there is no memory allocation at all. This design supports the use of RDMA. Since RDMA requires memory pinning, we must be able to hold on to a slab for a reasonable amount of time to get any real use out of it.

Regardless, the current strategy taken will be:

1. If the checkpoint size increases, then grow the number of slabs to support it.
2. If the next checkpoint size is smaller than the last one, then that's a "strike".
3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before).
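The grow/shrink policy above can be sketched as follows (a hypothetical model; the slab size, strike limit, and names are illustrative, not QEMU's actual constants, and the real code tracks checkpoint sizes rather than comparing slab counts directly):

```python
SLAB_SIZE = 4 * 1024 * 1024   # assumed slab granularity in bytes
STRIKE_LIMIT = 3              # "N strikes" before shrinking

class SlabCache:
    def __init__(self):
        self.nr_slabs = 1     # the 'head' slab is permanently allocated
        self.strikes = 0

    def fit_checkpoint(self, size_bytes):
        """Grow or shrink the slab cache for a checkpoint of the given size."""
        needed = max(1, -(-size_bytes // SLAB_SIZE))  # ceiling division
        if needed > self.nr_slabs:
            self.nr_slabs = needed            # grow to fit a larger checkpoint
            self.strikes = 0
        elif needed < self.nr_slabs:
            self.strikes += 1                 # smaller than the cache: a strike
            if self.strikes >= STRIKE_LIMIT:
                self.nr_slabs = max(1, self.nr_slabs // 2)  # halve the cache
                self.strikes = 0
        return self.nr_slabs
```

Keeping the head slab permanently allocated is what makes the idle case allocation-free, and holding slabs across strikes is what makes RDMA pinning worthwhile.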

As of this writing, the average size of a Linux-based Idle-VM checkpoint is under 5MB.

RDMA Integration

RDMA is instrumental in enabling better MC performance, which is the reason why it was introduced into QEMU first.

RDMA is used for two different reasons:

1. Checkpoint generation (RDMA-based memcpy):
2. Checkpoint transmission

Checkpoint generation must be done while the VM is paused. In the worst case, the checkpoint can be equal in size to the total memory in use by the VM. In order to resume VM execution as fast as possible, the checkpoint is consistently copied into a local staging area before transmission. A standard memcpy() of such a potentially large region not only gets no benefit from the CPU cache, but can also clog the CPU pipeline that neighboring VMs scheduled on the same physical node could otherwise use. To minimize the effect on neighbors, we use RDMA to perform a "local" memcpy(), bypassing the host processor. On recent processors, a 'beefy' enough memory bus architecture can move memory just as fast (sometimes faster) as a pure-software, CPU-optimized memcpy() from libc. On older machines, however, this feature only trades MC performance for lower CPU utilization, so for some time most users with older memory subsystems will want to leave this feature disabled, which is the default.

Checkpoint transmission can also consume very large amounts of both bandwidth and CPU that could otherwise be used by the VM itself or its neighbors. Once the aforementioned local copy of the checkpoint is saved, this implementation uses the same RDMA hardware to transmit it exactly the way a live migration happens over RDMA (see docs/rdma.txt).

Usage

BEFORE Running

First, compile QEMU with '--enable-mc' and ensure that the corresponding libraries for netlink (libnl3) are available. The netlink 'plug' support from the Qdisc functionality is required in particular, because it allows QEMU to direct the kernel to buffer outbound network packets between checkpoints as described previously. Do not proceed without this support in a production environment, or you risk corrupting your I/O state.

$ git clone http://github.com/hinesmr/qemu.git
$ cd qemu
$ git checkout mc
$ ./configure --enable-mc [other options]

Next, start the VM that you want to protect using your standard procedures.

Enable MC like this:

QEMU Monitor Command:

$ migrate_set_capability mc on # disabled by default

Currently, only one network interface is supported, *and* currently you must ensure that the root disk of your VM is booted either directly from iSCSI or NFS, as described previously. This will be rectified with future improvements.

For testing only, you can ignore the aforementioned requirements if you simply want to get an understanding of the performance penalties associated with activating this feature.

Currently required until testing is complete. There are some COLO disk replication patches under test, but they do not work yet, so you have to explicitly set this:

QEMU Monitor Command:

$ migrate_set_capability mc-disk-disable on # disk replication activated by default 

Next, you can optionally disable network-buffering for additional test-only execution. This is useful if you want to get a breakdown only of what the cost of checkpointing the memory state is without the cost of checkpointing device state.

QEMU Monitor Command:

$ migrate_set_capability mc-net-disable on # buffering activated by default 

Next, you can optionally enable RDMA 'memcpy' support. This is only valid if you have RDMA support compiled into QEMU and you intend to use the 'rdma' migration URI upon initiating MC as described later.

QEMU Monitor Command:

$ migrate_set_capability mc-rdma-copy on # disabled by default

Additionally, you can tune the checkpoint frequency. By default it is set to checkpoint every 100 milliseconds. You can change that at any time, like this:

QEMU Monitor Command:

$ migrate-set-mc-delay 100 # checkpoint every 100 milliseconds

Finally, if you are using QEMU's support for RDMA migration, you will want to enable RDMA keep-alive support to allow quick detection of failure. If you are using TCP/IP, this is not required:

QEMU Monitor Command:

$ migrate_set_capability rdma-keepalive on # disabled by default

libnl / NETLINK compatibility

Unfortunately, you cannot just install any version of libnl, as we depend on a recently introduced feature (contributed by Xen Remus) called Qdisc "plugs", which performs the network buffering functions of micro-checkpointing in the host Linux kernel.

As of this writing, the minimum versions needed from my Ubuntu system are the following packages (or their equivalents on RedHat/Fedora/Debian, etc.):

libnl-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-cli-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-cli-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-genl-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-genl-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-nf-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-nf-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-route-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-route-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-utils_3.2.16-0ubuntu1_amd64.deb

There have also been reports of failure on newer versions, so some extra work may be needed in case libnl has introduced backwards-incompatible changes.

Running

First, make sure the IFB device kernel module is loaded

$ modprobe ifb numifbs=100 # (or some large number)

Now, install a Qdisc plug to the tap device using the same naming convention as the tap device created by QEMU (it must be the same, because QEMU needs to interact with the IFB device and the only mechanism we have right now of knowing the name of the IFB devices is to assume that it matches the tap device numbering scheme):

$ ip link set up ifb0 # <= corresponds to tap device 'tap0'
$ tc qdisc add dev tap0 ingress
$ tc filter add dev tap0 parent ffff: proto ip pref 10 u32 match u32 0 0 action mirred egress redirect dev ifb0

(You will need a script to automate the part above until the libvirt patches are more complete).
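Until those libvirt patches land, a small helper can generate the setup commands for each tap/IFB pair (a sketch that only builds the command strings shown above, relying on the tap/IFB numbering convention; review them before piping to a shell):

```python
def ifb_setup_cmds(index):
    """Return the shell commands that pair tap<index> with ifb<index>."""
    tap, ifb = "tap%d" % index, "ifb%d" % index
    return [
        "ip link set up %s" % ifb,
        "tc qdisc add dev %s ingress" % tap,
        ("tc filter add dev %s parent ffff: proto ip pref 10 "
         "u32 match u32 0 0 action mirred egress redirect dev %s") % (tap, ifb),
    ]

if __name__ == "__main__":
    for cmd in ifb_setup_cmds(0):
        print(cmd)   # pipe to 'sh' once reviewed (requires root)
```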

Now that the network buffering connection is ready, MC can be initiated with exactly the same command as standard live migration:

QEMU Monitor Command:

$ migrate -d (tcp|rdma):host:port

Upon failure, the destination VM will detect a loss in network connectivity and automatically revert to the last checkpoint taken and resume execution immediately. There is no need for additional QEMU monitor commands to initiate the recovery process.

Performance

By far, the biggest cost is network throughput. Virtual machines are capable of dirtying memory well in excess of the bandwidth provided by a commodity 1 Gbps network link. When that happens, the MC process will always lag behind the virtual machine and forward progress will be poor. It is highly recommended to use at least a 10 Gbps link when using MC.

Numbers are still coming in, but without output buffering of network I/O, the performance penalty of a typical 4GB RAM Java-based application server workload on a 10 Gbps link (a good worst case for testing, due to Java's constant garbage collection) is on the order of 25%. With network buffering activated, this can be as high as 50%.

Assuming that you have a reasonable 10G (or RDMA) network in place, the majority of the penalty is due to the time it takes to copy the dirty memory into a staging area before transmission of the checkpoint. Any optimizations / proposals to speed this up would be welcome!

The remaining penalty comes from network buffering and is typically due to checkpoints not occurring fast enough: ideally, the "round trip" time between an application-level request and the corresponding response should be larger than the time it takes to complete a checkpoint. Otherwise, the response to the application within the VM will appear congested, since the VM's network endpoint may not even have released the application's TX request to the outside world yet.

We believe this effect is "amplified" by the poor performance of copying the dirty memory to staging: since an application-level RTT cannot be serviced with more frequent checkpoints, network I/O tends to be held in the buffer too long. This causes the guest TCP/IP stack to experience congestion, propagating the artificially created delay all the way up to the application.

Libvirt Support

NOTE: This is not supported yet. Network buffering and disk replication simply do not exist in these patches; they are only for benchmarking.

If you want to contribute patches, you're more than welcome.

This does work if you checkout the branch mentioned at the beginning of this page, with one catch:

$ virsh migrate --live --mc --mc-net-disable test qemu+tcp://ftdest/system

You must use the "--mc-net-disable" option for now because the packet-buffer setup required by netlink and the IFB device has not been written yet. As a result, this option exists only for performance testing until someone (probably me) has time to set up the netlink commands inside libvirt properly.

FAQ / Frequently Asked Questions

What happens if a failure occurs in the *middle* of a flush of the network buffer?

Micro-Checkpointing depends *heavily* on the correctness of TCP/IP. Thus, this is not a problem, because the network buffer holds packets only for the last *committed* checkpoint (meaning that the last micro checkpoint must have been acknowledged as successfully received by the backup host). With that understood, it is important to see how network buffering is repeated between checkpoints: *ALL* packets go through the buffer - there are no exceptions and no gaps. There is no situation in which newer packets slip through while the buffer is being flushed - that is not how it works. Please refer to the earlier "I/O buffering" section for a detailed description of how network buffering works.

Why is this not a problem?

Example: Let's say we have packets "A" and "B" in the buffer.

Packet A is sent successfully and a failure occurs before packet B is transmitted.

Packet A) This is acceptable. The guest checkpoint has already recorded delivery of the packet from the guest's perspective. The network fabric can deliver or drop it as it sees fit; the buffer simply has the same effect as an additional network switch, altering fault tolerance as viewed by the external world no more than another faulty hop in a traditional network architecture would. The packet will never get RE-generated, because the checkpoint corresponding to its transmission (from the perspective of the virtual machine) has already been committed at the destination. Any FUTURE packets generated while the VM resumes execution are *also* buffered, as described previously.

Packet B) This is acceptable. This packet will be lost. If packet B is an ACK, this results in a TCP-level timeout on the peer side of the connection; if it is a TCP PUSH, it results in a timeout on the guest side. Either way, the packet will get re-transmitted as soon as the virtual machine resumes execution, either because the data was never acknowledged or because it was never received.

What's different about this implementation?

Several things about this implementation attempt are different from previous implementations:

1. This implementation is 100% compatible with RDMA.

2. Memory management is completely overhauled - malloc()/free() churn is reduced to a minimum.

3. This is not a port of Kemari - it is (yet another) re-write, focusing on performance.

4. Network I/O buffering is outsourced to the host kernel, using netlink code introduced by the Remus/Xen project.

5. We make every attempt to change as little of the existing migration call path as possible.

TODO

1. The main bottleneck is the local copy of dirty memory into staging memory. The faster we can copy, the faster we can flush the network buffer.

2. Integrate with disk replication from COLO team.

3. Implement output commit buffering for shared storage.

4. TOO MANY COMMANDS! A conversation with Eric Blake produced a nice recommendation, which "someone" needs to implement:

Mailing list discussion:

>> We're building up a LOT of migrate- tunable commands. Maybe it's time
>> to think about building a more generic migrate-set-parameter, which
>> takes both the name of the parameter to set and its value, so that a
>> single command serves all parameters, instead of needing a proliferation
>> of commands. Of course, for that to be useful, we also need a way to
>> introspect which parameters can be tuned; whereas with the current
>> approach of one command per parameter (well, 2 for set vs. get) the
>> introspection is based on whether the command exists.
>
> I asked to have that. My suggestion was that
>
> migrate_set_capability auto-throotle on
>
> So we could add it to new variables without extra change.
>
> And I agree that having a way to read them, and ask what values they
> have is a good idea.
>
> Luiz, any good idea about how to do it through QMP?

I'm trying to think of a back-compat method, which exploits the fact that we now have flat unions (something we didn't have when migrate-set-capabilities was first added). Maybe something like:

{ 'type': 'MigrationCapabilityBase',
  'data': { 'capability': 'MigrationCapability' } }
{ 'type': 'MigrationCapabilityBool',
  'data': { 'state': 'bool' } }
{ 'type': 'MigrationCapabilityInt',
  'data': { 'value': 'int' } }
{ 'union': 'MigrationCapabilityStatus',
  'base': 'MigrationCapabilityBase',
  'discriminator': 'capability',
  'data': {
    'xbzrle': 'MigrationCapabilityBool',
    'auto-converge': 'MigrationCapabilityBool',
...
    'mc-delay': 'MigrationCapabilityInt'
  } }

along with a tweak to query-migrate-capabilities for full back-compat:

# @query-migrate-capabilities
# @extended: #optional defaults to false; set to true to see non-boolean
#            capabilities (since 2.1)
{ 'command': 'query-migrate-capabilities',
  'data': { '*extended': 'bool' },
  'returns': ['MigrationCapabilityStatus'] }

Now, observe what happens. If an old client calls { "execute": "query-migrate-capabilities" }, they get a return that lists ONLY the boolean members of the MigrationCapabilityStatus array (good, because if we returned a non-boolean, we would confuse the consumer when they are expecting a 'state' variable that is not present) - what's more, this representation is identical on the wire to the format used in earlier qemu. But new clients can call { "execute": "query-migrate-capabilities", "arguments": { "extended": true } }, and get back:

{ "capabilities": [
   { "capability": "xbzrle", "state": true },
   { "capability": "auto-converge", "state": false },
...
   { "capability": "mc-delay", "value": 100 }
  ] }

Also, once a new client has learned of non-boolean extended capabilities, they can also set them via the existing command:

{ "execute": "migrate-set-capabilities",
  "arguments": [
     { "capability": "xbzrle", "state": false },
     { "capability": "mc-delay", "value": 200 }
  ] }

So, what do you think? My slick type manipulation means that we need zero new commands, just a new option to the query command and a new flat union type that replaces the current struct type. The existence (but not the type) of non-boolean parameters is already introspectible to a client new enough to request an 'extended' query, and down the road, if we ever gain full QAPI introspection, a client would also gain the ability to learn the type of any non-boolean parameter.

Stability wise, as long as we never change the type of a capability once first exposed, then if a client plans on using a particular parameter when available, it can already hard-code what type that parameter should have without even needing full QAPI introspection (that is, if libvirt is taught to manipulate mc-delay, libvirt will already know to expect mc-delay as an int, and not any other type, and merely needs to probe if qemu supports the 'mc-delay' extended capability). And of course, this new schema idea can retroactively cover all existing migration tunables, such as migrate_set_downtime, migrate_set_speed, migrate-set-cache-size, and so on.