Features/MicroCheckpointing

Summary

Patches welcome! (A machine would be welcome too. We have limited hardware access).

This is an implementation of Micro Checkpointing for memory and CPU state, also known as "Continuous Replication", "Fault Tolerance", or 100 other names - choose your poison.

Contact

Wiki: http://wiki.qemu.org/Features/MicroCheckpointing

Github: https://github.com/hinesmr/qemu/tree/mc, 'mc' branch

(not ready) Libvirt Support: https://github.com/hinesmr/libvirt/tree/mc, 'mc' branch

Copyright (C) 2015 IBM Michael R. Hines <mrhines@us.ibm.com>

Introduction

Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a running virtual machine (VM) with little or no runtime assistance from the guest kernel or guest application software. Fault Tolerance, in turn, is one method of providing high availability to a VM such that, from the perspective of the outside world (clients, devices, and neighboring VMs that may be paired with it), the VM and its applications have not lost any runtime state in the event of either a failure of the hypervisor/hardware to allow the VM to make forward progress or a complete loss of power. This mechanism does *not* provide any protection whatsoever against software-level faults in the guest kernel or applications. In fact, because this type of high availability can extend the lifetime of the VM, such software-level bugs may manifest themselves more often than they otherwise would, in which case you would need to employ other forms of availability to guard against them.

This implementation is also fully compatible with RDMA and has undergone special optimizations to support the use of RDMA. (See docs/rdma.txt for more details).

The Micro-Checkpointing Process

Basic Algorithm

Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:

1. After N milliseconds, stop the VM.
2. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
3. Resume the VM immediately so that it can make forward progress.
4. Transmit the checkpoint to the destination.
5. Repeat.

Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.

I/O buffering

Additionally, a MC must include a consistent view of device I/O, particularly the network, a problem commonly referred to as "output commit". This means that the outside world cannot be allowed to experience duplicate state that was committed by the virtual machine after a failure. Such duplication is possible because the VM may run ahead of its last checkpoint by N milliseconds and commit state while the current MC is still being transmitted to the destination.

To guard against this problem, first, we must "buffer" the TX output of the network (not the input) between MCs until the current MC is safely received by the destination. For example, all outbound network packets must be held at the source until the MC is transmitted. After transmission is complete, those packets can be released. Similarly, in the case of disk I/O, we must ensure that either the contents of the local disk are safely mirrored to a remote disk before completing a MC or that the output to a shared disk, such as iSCSI, is also buffered between checkpoints and then later released in the same way.

For the network in particular, buffering is performed using a series of netlink (libnl3) Qdisc "plugs", introduced by the Xen Remus implementation. All packets go through netlink in the host kernel - there are no exceptions and no gaps. Even while one buffer is being released (say, after a checkpoint has been saved), another plug will have already been initiated to hold the next round of packets simultaneously while the current round of packets are being released. Thus, at any given time, there may be as many as two simultaneous buffers in place.

With this in mind, here is the extended procedure for the micro checkpointing process:

1. Insert a new Qdisc plug (Buffer A).

Repeat Forever:

2. After N milliseconds, stop the VM.
3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
5. Resume the VM immediately so that it can make forward progress (making use of Buffer B).
6. Transmit the MC to the destination.
7. Wait for acknowledgement.
8. Acknowledged.
9. Release the Qdisc plug for Buffer A.
10. Qdisc Buffer B is now symbolically renamed and becomes the most recent Buffer A.
11. Go back to Step 2
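
The extended procedure above can be sketched as a small Python model. This is only an illustration of the two-buffer rotation, not the QEMU or libnl code: `PlugQueue`, `run_epochs`, and the `ack` callback are hypothetical names, and stopping the VM and copying memory are not modelled.

```python
from collections import deque


class PlugQueue:
    """Toy model of a qdisc with plugs: packets queue behind the newest plug."""

    def __init__(self):
        self.buffers = deque()   # one list of packets per inserted plug
        self.released = []       # packets that reached the outside world

    def insert_plug(self):
        self.buffers.append([])

    def send(self, packet):
        # Guest TX always lands in the newest buffer; nothing bypasses it.
        self.buffers[-1].append(packet)

    def release_oldest(self):
        self.released.extend(self.buffers.popleft())


def run_epochs(epochs, ack=lambda n: True):
    """epochs: one list of packets per checkpoint interval.
    ack(n): hypothetical stand-in for 'checkpoint n was acknowledged'."""
    q = PlugQueue()
    q.insert_plug()                      # step 1: Buffer A
    for n, packets in enumerate(epochs):
        for p in packets:                # the VM runs for N ms and transmits
            q.send(p)
        # steps 2-3: stop the VM, copy dirty memory to staging (not modelled)
        q.insert_plug()                  # step 4: Buffer B
        assert len(q.buffers) <= 2       # never more than two buffers at once
        # step 5: the VM resumes; new packets would now land in Buffer B
        if not ack(n):                   # steps 6-8: transmit, wait for ack
            return q.released            # failure: Buffer A is never released
        q.release_oldest()               # step 9; B becomes the new A (step 10)
    return q.released
```

If the acknowledgement for a checkpoint never arrives, the packets generated during that interval are never seen by the outside world, which is the output-commit property described earlier.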

This implementation *currently* only supports buffering for the network. (Any help on implementing disk support would be greatly appreciated). Due to this lack of disk support, this requires that the VM's root disk or any non-ephemeral disks also be made network-accessible directly from within the VM. Until the aforementioned buffering or mirroring support is available (ideally through drive-mirror), the only "consistent" way to provide full fault tolerance of the VM's non-ephemeral disks is to construct a VM whose root disk is made to boot directly from iSCSI or NFS or similar such that all disk I/O is translated into network I/O.

Buffering is performed with an IFB device attached to the KVM tap device, combined with a netlink Qdisc plug (exactly like the Xen Remus solution).

Failure Recovery

Due to the high-frequency nature of micro-checkpointing, we expect a new MC to be generated many times per second. Even missing just a few MCs easily constitutes a failure. Because of the consistent buffering of device I/O, this is safe because device I/O is not committed to the outside world until the MC has been received at the destination.

Failure is thus assumed under two conditions:

1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This happens very early in the loss of the latest MC, not only because a very large number of bytes is typically being sequenced in a TCP stream but perhaps also because the acknowledgement of the commit message by the destination times out.

2. MC over RDMA: Since InfiniBand does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed.

In both cases, either due to a failed TCP socket connection or a lost RDMA keep-alive group, either the sender or the receiver can be deemed to have failed.

If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.

If the destination is deemed to be lost, we perform the same action as a live migration: resume the sender normally and wait for management software to make a policy decision about whether or not to re-protect the VM, which may involve a third-party to identify a new destination host again to use as a backup for the VM.
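
As a rough illustration of the keep-alive rule and the two recovery actions above, here is a minimal Python sketch. The class and function names are hypothetical, and the threshold of 3 missed messages is an assumption: the text only says "multiple".

```python
class KeepAliveMonitor:
    """Counts consecutive missed keep-alives from the peer."""

    def __init__(self, max_missed=3):    # 3 is an assumed threshold
        self.max_missed = max_missed
        self.missed = 0

    def tick(self, received):
        """Call once per keep-alive period.
        Returns True once the peer is deemed to have failed."""
        self.missed = 0 if received else self.missed + 1
        return self.missed >= self.max_missed


def on_peer_failed(local_role):
    """Action taken by the surviving side, as described in the text."""
    if local_role == "destination":
        # The sender is gone: take over from the last checkpoint.
        return "resume VM from last checkpoint"
    # The destination is gone: behave like a failed live migration.
    return "resume VM normally and wait for management software"
```

A single received keep-alive resets the counter, so only consecutive losses trigger failover in this sketch.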

Optimizations

Memory Management

Managing QEMU memory usage in this implementation is critical to the performance of any micro-checkpointing (MC) implementation.

MCs are typically only a few MB when idle. However, they can easily be very large during heavy workloads. In the *extreme* worst case, QEMU will need twice the amount of main memory that was originally allocated to the virtual machine.

To support this variability during transient periods, a MC consists of a linked list of slabs, each of identical size. A better name would be welcome, as the name was only chosen because it resembles Linux memory allocation. Because MCs occur several times per second (at intervals of 10s of milliseconds), slabs allow MCs to grow and shrink without constantly re-allocating all memory in place during each checkpoint. During steady-state, the 'head' slab is permanently allocated and never goes away, so when the VM is idle, there is no memory allocation at all. This design supports the use of RDMA. Since RDMA requires memory pinning, we must be able to hold on to a slab for a reasonable amount of time to get any real use out of it.

The current strategy is:

1. If the checkpoint size increases, then grow the number of slabs to support it.
2. If the next checkpoint size is smaller than the last one, then that's a "strike".
3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before).
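
The three rules above can be modelled in a few lines of Python. This is a sketch under assumptions, not the QEMU code: the class name is hypothetical, the strike counter is assumed to reset whenever the cache grows or shrinks, and the cache is assumed never to shrink below what the current checkpoint needs.

```python
class SlabCache:
    """Toy model of the grow / strike / halve policy."""

    def __init__(self, slab_size, max_strikes):
        self.slab_size = slab_size
        self.max_strikes = max_strikes   # the 'N' in 'after N strikes'
        self.slabs = 1                   # the permanently allocated head slab
        self.strikes = 0
        self.last_size = 0

    def checkpoint(self, size):
        """Record a checkpoint of `size` bytes; return the slab count."""
        needed = max(1, -(-size // self.slab_size))    # ceiling division
        if needed > self.slabs:
            self.slabs = needed                        # rule 1: grow
            self.strikes = 0                           # assumption: reset
        elif size < self.last_size:
            self.strikes += 1                          # rule 2: a strike
            if self.strikes >= self.max_strikes:
                # rule 3: halve, but keep room for this checkpoint (assumption)
                self.slabs = max(needed, self.slabs // 2)
                self.strikes = 0
        self.last_size = size
        return self.slabs
```

The point of the strike counter is that a single small checkpoint does not immediately free memory that a bursty workload is likely to need again a few milliseconds later.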

As of this writing, the average size of a checkpoint for an idle Linux-based VM is under 5MB.

RDMA Integration

RDMA is instrumental in enabling better MC performance, which is the reason why it was introduced into QEMU first.

RDMA is used for two different reasons:

1. Checkpoint generation (RDMA-based memcpy)
2. Checkpoint transmission

Checkpoint generation must be done while the VM is paused. In the worst case, the checkpoint can be as large as the total amount of memory in use by the VM. In order to resume VM execution as fast as possible, the checkpoint is copied consistently into a local staging area before transmission. A standard memcpy() of such a potentially large amount of memory not only gets no use out of the CPU cache but also potentially clogs up the CPU pipeline, which could otherwise be used by neighbor VMs on the same physical node that could be scheduled for execution. To minimize the effect on neighbor VMs, we use RDMA to perform a "local" memcpy(), bypassing the host processor. On more recent processors, a 'beefy' enough memory bus architecture can move memory just as fast as (sometimes faster than) a pure-software, CPU-only optimized memcpy() from libc. However, on older computers, this feature only gives you the benefit of lower CPU utilization at the expense of MC performance, so for some time, most users with older memory speeds will want to leave this feature disabled (the default).

Checkpoint transmission can potentially also consume very large amounts of both bandwidth and CPU time that could otherwise be used by the VM itself or its neighbors. Once the aforementioned local copy of the checkpoint is saved, this implementation makes use of the same RDMA hardware to perform the transmission exactly the same way that a live migration happens over RDMA (see docs/rdma.txt).

Usage

BEFORE Running

First, compile QEMU with '--enable-mc' and ensure that the corresponding libraries for netlink (libnl3) are available. The netlink 'plug' support from the Qdisc functionality is required in particular, because it allows QEMU to direct the kernel to buffer outbound network packets between checkpoints as described previously. Do not proceed without this support in a production environment, or you risk corrupting the state of your I/O.

$ git clone http://github.com/hinesmr/qemu.git
$ cd qemu
$ git checkout 'mc'
$ ./configure --enable-mc [other options]

Next, start the VM that you want to protect using your standard procedures.

Enable MC like this:

QEMU Monitor Command:

$ migrate_set_capability mc on # disabled by default

Currently, only one network interface is supported, *and* currently you must ensure that the root disk of your VM is booted either directly from iSCSI or NFS, as described previously. This will be rectified with future improvements.

For testing only, you can ignore the aforementioned requirements if you simply want to get an understanding of the performance penalties associated with this feature activated.

The following is currently required until testing is complete. There are some COLO disk replication patches that I am testing, but they don't work yet, so you have to explicitly set this:

QEMU Monitor Command:

$ migrate_set_capability mc-disk-disable on # disk replication activated by default 

Next, you can optionally disable network-buffering for additional test-only execution. This is useful if you want to get a breakdown only of what the cost of checkpointing the memory state is without the cost of checkpointing device state.

QEMU Monitor Command:

$ migrate_set_capability mc-net-disable on # buffering activated by default 

Next, you can optionally enable RDMA 'memcpy' support. This is only valid if you have RDMA support compiled into QEMU and you intend to use the 'rdma' migration URI upon initiating MC as described later.

QEMU Monitor Command:

$ migrate_set_capability mc-rdma-copy on # disabled by default

Additionally, you can tune the checkpoint frequency. By default it is set to checkpoint every 100 milliseconds. You can change that at any time, like this:

QEMU Monitor Command:

$ migrate-set-mc-delay 100 # checkpoint every 100 milliseconds

Finally, if you are using QEMU's support for RDMA migration, you will want to enable RDMA keep-alive support to allow quick detection of failure. If you are using TCP/IP, this is not required:

QEMU Monitor Command:

$ migrate_set_capability rdma-keepalive on # disabled by default

libnl / NETLINK compatibility

Unfortunately, you cannot just install any version of libnl, as we depend on a feature recently introduced into libnl by Xen Remus called "Qdisc Plugs", which performs the network buffering functions of micro-checkpointing in the host Linux kernel.

As of today, the minimum versions you would need, taken from my Ubuntu system, are the following packages (or their equivalents on Red Hat/Fedora/Debian, etc.):

libnl-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-cli-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-cli-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-genl-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-genl-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-nf-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-nf-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-route-3-200_3.2.16-0ubuntu1_amd64.deb
libnl-route-3-dev_3.2.16-0ubuntu1_amd64.deb
libnl-utils_3.2.16-0ubuntu1_amd64.deb

There have also been reports of failure on newer versions, so there may need to be some extra work in case libnl is introducing backwards-incompatible changes.

Running

First, make sure the IFB device kernel module is loaded:

$ modprobe ifb numifbs=100 # (or some large number)

Now, install a Qdisc plug to the tap device using the same naming convention as the tap device created by QEMU (it must be the same, because QEMU needs to interact with the IFB device and the only mechanism we have right now of knowing the name of the IFB devices is to assume that it matches the tap device numbering scheme):

$ ip link set up ifb0 # <= corresponds to tap device 'tap0'
$ tc qdisc add dev tap0 ingress
$ tc filter add dev tap0 parent ffff: proto ip pref 10 u32 match u32 0 0 action mirred egress redirect dev ifb0

(You will need a script to automate the part above until the libvirt patches are more complete).
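
One possible shape for such a script is sketched below in Python. It only reproduces the three commands shown above for a given tap index, relying on the stated convention that 'ifbN' pairs with 'tapN'. The function names are hypothetical, and the default is a dry run that prints the commands instead of executing them (running them for real requires root).

```python
import shlex
import subprocess


def ifb_setup_commands(index):
    """Return the commands from the text for the tapN / ifbN pair."""
    tap, ifb = f"tap{index}", f"ifb{index}"
    return [
        ["ip", "link", "set", "up", ifb],
        ["tc", "qdisc", "add", "dev", tap, "ingress"],
        ["tc", "filter", "add", "dev", tap, "parent", "ffff:", "proto", "ip",
         "pref", "10", "u32", "match", "u32", "0", "0",
         "action", "mirred", "egress", "redirect", "dev", ifb],
    ]


def apply(index, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on the first error."""
    for cmd in ifb_setup_commands(index):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

For example, `apply(0)` prints the three commands for 'tap0' and 'ifb0' exactly as listed above.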

Now that the network buffering connection is ready, MC can be initiated with exactly the same command as standard live migration:

QEMU Monitor Command:

$ migrate -d (tcp|rdma):host:port

Upon failure, the destination VM will detect a loss in network connectivity and automatically revert to the last checkpoint taken and resume execution immediately. There is no need for additional QEMU monitor commands to initiate the recovery process.

Performance

By far, the biggest cost is network throughput. Virtual machines are capable of dirtying memory well in excess of the bandwidth provided by a commodity 1 Gbps network link. If so, the MC process will always lag behind the virtual machine and forward progress will be poor. It is highly recommended to use at least a 10 Gbps link when using MC.
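
To put rough numbers on this, here is a back-of-the-envelope calculation in Python. It assumes 4 KiB guest pages and ignores protocol overhead and compression. The function names are hypothetical, and the dirty-page count in the usage note is an invented example, not a measurement.

```python
PAGE_SIZE = 4096  # bytes; assumes 4 KiB guest pages


def link_budget_bytes(link_gbps, interval_ms):
    """Bytes a link can carry during one checkpoint interval."""
    return link_gbps * 1e9 / 8 * interval_ms / 1000


def keeps_up(dirty_pages_per_interval, link_gbps, interval_ms=100):
    """True if one interval's dirty pages fit in one interval of link time."""
    dirty_bytes = dirty_pages_per_interval * PAGE_SIZE
    return dirty_bytes <= link_budget_bytes(link_gbps, interval_ms)
```

With the default 100 ms interval, a 1 Gbps link can carry 12.5 MB per checkpoint, or about 3,000 pages. A guest that dirties 10,000 pages per interval (about 41 MB) would fall behind on 1 Gbps but fit comfortably within the 125 MB budget of a 10 Gbps link.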

Numbers are still coming in, but without output buffering of network I/O, the performance penalty of a typical 4GB RAM Java-based application server workload using a 10 Gbps link (a good worst case for testing due to Java's constant garbage collection) is on the order of 25%. With network buffering activated, this can be as high as 50%.

Assuming that you have a reasonable 10G (or RDMA) network in place, the majority of the penalty is due to the time it takes to copy the dirty memory into a staging area before transmission of the checkpoint. Any optimizations / proposals to speed this up would be welcome!

The remaining penalty comes from network buffering and is typically due to checkpoints not occurring fast enough. The "round trip" time between an application-level request and the corresponding response should ideally be larger than the time it takes to complete a checkpoint. Otherwise, the response to the application within the VM will appear to be congested, since the VM's network endpoint may not have even received the TX request from the application in the first place.

We believe that this effect is "amplified" by the poor performance of copying the dirty memory to staging: because an application-level RTT cannot be serviced with more frequent checkpoints, network I/O tends to get held in the buffer too long. This causes the guest TCP/IP stack to experience congestion, propagating this artificially created delay all the way up to the application.
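
A simple way to reason about this delay is sketched below, following the buffering procedure described under "I/O buffering": an outbound packet is held from the moment the guest emits it until the checkpoint that closes its interval has been copied, transmitted, and acknowledged. The function names are hypothetical, the model is a deliberate simplification, and the figures in the usage note are made up for illustration.

```python
def added_latency_ms(emit_offset_ms, interval_ms, copy_ms, transmit_ms, ack_ms):
    """Time one outbound packet spends in the buffer before release.

    emit_offset_ms: when the packet was emitted, measured from the start
    of the current checkpoint interval.
    """
    if not 0 <= emit_offset_ms <= interval_ms:
        raise ValueError("emit_offset_ms must lie within the interval")
    wait_for_checkpoint = interval_ms - emit_offset_ms
    return wait_for_checkpoint + copy_ms + transmit_ms + ack_ms


def mean_added_latency_ms(interval_ms, copy_ms, transmit_ms, ack_ms):
    """Average over packets emitted uniformly across the interval."""
    return interval_ms / 2 + copy_ms + transmit_ms + ack_ms
```

For example, with a 100 ms interval, a 10 ms copy, a 20 ms transmission, and a 1 ms acknowledgement, the average packet would be delayed by about 81 ms in this model. This is why a slow copy to staging shows up directly as application-level latency.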

Libvirt Support

NOTE: This is not supported yet. Network buffering and disk replication simply do not exist. It is only for benchmarking.

If you want to contribute patches, you're more than welcome.

This does work if you check out the branch mentioned at the beginning of this page, with one catch:

$ virsh migrate --live --mc --mc-net-disable test qemu+tcp://ftdest/system

You must use the "mc-net-disable" option for now because the packet buffer support required by netlink and the setup of the IFB device have not been written yet. As a result, this option is only for performance testing until someone (probably me) has time to set up the netlink commands inside libvirt properly.

FAQ / Frequently Asked Questions

What happens if a failure occurs in the *middle* of a flush of the network buffer?

Micro-Checkpointing depends *heavily* on the correctness of TCP/IP. Thus, this is not a problem because the network buffer holds packets only for the last *committed* checkpoint (meaning that the last micro checkpoint must have been acknowledged as received successfully by the backup host). After understanding this, it is then important to understand how network buffering is repeated between checkpoints. *ALL* packets go through the buffer - there are no exceptions and no gaps. There is no situation in which newer packets pass through while the buffer is being flushed - that's not how it works. Please refer to the previous section "I/O buffering" for a detailed description of how network buffering works.

Why is this not a problem?

Example: Let's say we have packets "A" and "B" in the buffer.

Packet A is sent successfully and a failure occurs before packet B is transmitted.

Packet A) This is acceptable. The guest checkpoint has already recorded delivery of the packet from the guest's perspective. The network fabric can deliver or not deliver as it sees fit. Thus the buffer simply has the same effect as an additional network switch - it does not alter the effect of fault tolerance as viewed by the external world any more than another faulty hop in the traditional network architecture would by causing congestion in the network. The packet will never get RE-generated because the checkpoint has already been committed at the destination which corresponds to the transmission of that packet from the perspective of the virtual machine. Any FUTURE packets generated while the VM resumes execution are *also* buffered as described previously.

Packet B) This is acceptable. This packet will be lost. This will result in a TCP-level timeout on the peer side of the connection in the case that packet B is an ACK or will result in a timeout on the guest-side of the connection in the case that the packet is a TCP PUSH. Either way, the packet will get re-transmitted either because the data was never acknowledged or never received as soon as the virtual machine resumes execution.

What's different about this implementation?

Several things about this implementation attempt are different from previous implementations:

1. This implementation is 100% compatible with RDMA.

2. Memory management is completely overhauled - malloc()/free() churn is reduced to a minimum.

3. This is not a port of Kemari - it is (yet another) re-write, focusing on performance.

4. Network I/O buffering is outsourced to the host kernel, using netlink code introduced by the Remus/Xen project.

5. We make every attempt to change as little of the existing migration call path as possible.

TODO

1. The main bottleneck is the local memory copy to staging memory, so the priority is to improve its performance. The faster we can copy, the faster we can flush the network buffer.

2. Integrate with disk replication from COLO team.

3. Implement output commit buffering for shared storage.

4. TOO MANY COMMANDS! A conversation with Eric Blake has a nice recommendation which "someone" needs to implement:

Mailing list discussion:

>> We're building up a LOT of migrate- tunable commands. Maybe it's time
>> to think about building a more generic migrate-set-parameter, which
>> takes both the name of the parameter to set and its value, so that a
>> single command serves all parameters, instead of needing a proliferation
>> of commands. Of course, for that to be useful, we also need a way to
>> introspect which parameters can be tuned; whereas with the current
>> approach of one command per parameter (well, 2 for set vs. get) the
>> introspection is based on whether the command exists.
>
> I asked to have that. My suggestion was that
>
> migrate_set_capability auto-throttle on
>
> So we could add it to new variables without extra change.
>
> And I agree that having a way to read them, and ask what values they
> have is a good idea.
>
> Luiz, any good idea about how to do it through QMP?

I'm trying to think of a back-compat method, which exploits the fact that we now have flat unions (something we didn't have when migrate-set-capabilities was first added). Maybe something like:

{ 'type': 'MigrationCapabilityBase',
  'data': { 'capability': 'MigrationCapability' } }
{ 'type': 'MigrationCapabilityBool',
  'data': { 'state': 'bool' } }
{ 'type': 'MigrationCapabilityInt',
  'data': { 'value': 'int' } }
{ 'union': 'MigrationCapabilityStatus',
  'base': 'MigrationCapabilityBase',
  'discriminator': 'capability',
  'data': {
    'xbzrle': 'MigrationCapabilityBool',
    'auto-converge': 'MigrationCapabilityBool',
...
    'mc-delay': 'MigrationCapabilityInt'
  } }

along with a tweak to query-migrate-capabilities for full back-compat:

# @query-migrate-capabilities
# @extended: #optional defaults to false; set to true to see non-boolean
#            capabilities (since 2.1)
{ 'command': 'query-migrate-capabilities',
  'data': { '*extended': 'bool' },
  'returns': ['MigrationCapabilityStatus'] }

Now, observe what happens. If an old client calls { "execute": "query-migrate-capabilities" }, they get a return that lists ONLY the boolean members of the MigrationCapabilityStatus array (good, because if we returned a non-boolean, we would confuse the consumer when they are expecting a 'state' variable that is not present) - what's more, this representation is identical on the wire to the format used in earlier qemu. But new clients can call { "execute": "query-migrate-capabilities", "arguments": { "extended": true } }, and get back:

{ "capabilities": [
   { "capability": "xbzrle", "state": true },
   { "capability": "auto-converge", "state": false },
...
   { "capability": "mc-delay", "value": 100 }
  ] }

Also, once a new client has learned of non-boolean extended capabilities, they can also set them via the existing command:

{ "execute": "migrate-set-capabilities",
  "arguments": [
     { "capability": "xbzrle", "state": false },
     { "capability": "mc-delay", "value": 200 }
  ] }

So, what do you think? My slick type manipulation means that we need zero new commands, just a new option to the query command, and a new flat union type that replaces the current struct type. The existence (but not the type) of non-boolean parameters is already introspectible to a client new enough to request an 'extended' query, and down the road, if we ever gain full QAPI introspection, then a client also would gain the ability to learn the type of any non-boolean parameter as well.

Stability-wise, as long as we never change the type of a capability once first exposed, then if a client plans on using a particular parameter when available, it can already hard-code what type that parameter should have without even needing full QAPI introspection (that is, if libvirt is taught to manipulate mc-delay, libvirt will already know to expect mc-delay as an int, and not any other type, and merely needs to probe if qemu supports the 'mc-delay' extended capability). And of course, this new schema idea can retroactively cover all existing migration tunables, such as migrate_set_downtime, migrate_set_speed, migrate-set-cache-size, and so on.
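
The back-compat behaviour proposed in the discussion above can be modelled with a short Python sketch. This is not QEMU or QMP code: the function names simply mirror the command names for readability, and the capability list is the example data from the discussion.

```python
CAPS = [
    {"capability": "xbzrle", "state": True},
    {"capability": "auto-converge", "state": False},
    {"capability": "mc-delay", "value": 100},
]


def query_migrate_capabilities(caps, extended=False):
    """Old clients (extended=False) only ever see boolean 'state' entries."""
    if extended:
        return list(caps)
    return [c for c in caps if "state" in c]


def migrate_set_capabilities(caps, updates):
    """Apply updates in place; a capability's field (and so its type) is fixed."""
    by_name = {c["capability"]: c for c in caps}
    for update in updates:
        current = by_name.get(update["capability"])
        if current is None:
            raise KeyError(update["capability"])
        if set(update) != set(current):
            raise ValueError("wrong field for " + update["capability"])
        current.update(update)
```

An old client calling the query without arguments never encounters an entry lacking 'state', while a new client passing `extended=True` also sees 'mc-delay' and can set it through the same command.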