Features/MicroCheckpointing: Difference between revisions

From QEMU
No edit summary
Line 5: Line 5:
=== Contact ===
=== Contact ===
* '''Name:''' [[User:Michael Hines|Michael Hines]]
* '''Name:''' [[User:Michael Hines|Michael Hines]]
* ''' Email:''' mrhines@us.ibm.com (and hinesmr@cn.ibm.com, currently on assignment in China)
* ''' Email:''' mrhines@us.ibm.com


Wiki: [http://wiki.qemu.org/Features/MicroCheckpointing http://wiki.qemu.org/Features/MicroCheckpointing]
Wiki: [http://wiki.qemu.org/Features/MicroCheckpointing http://wiki.qemu.org/Features/MicroCheckpointing]
Line 18: Line 18:


Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a
Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a
running virtual machine (VM) with neither runtime assistance from the guest
running virtual machine (VM) with little or no runtime assistance from the guest
kernel nor from the guest application software. Furthermore, Fault Tolerance
kernel or guest application software. Furthermore, Fault Tolerance
is one method of providing high availability to a VM such that, from the
is one method of providing high availability to a VM such that, from the
perspective of the outside world (clients, devices, and neighboring VMs that
perspective of the outside world (clients, devices, and neighboring VMs that
Line 29: Line 29:
the potentially extended lifetime of the VM because of this type of high
the potentially extended lifetime of the VM because of this type of high
availability, such software-level bugs may in fact manifest themselves  
availability, such software-level bugs may in fact manifest themselves  
more often than they ordinarily would, in which case you would need to
more often than they otherwise ordinarily would, in which case you would need to
employ other forms of availability to guard against such software-level faults.
employ other forms of availability to guard against such software-level faults.


This implementation is also fully compatible with RDMA. (See docs/rdma.txt for more details).
This implementation is also fully compatible with RDMA and has undergone special
optimizations to suppor the use of RDMA. (See docs/rdma.txt for more details).


== The Micro-Checkpointing Process ==
== The Micro-Checkpointing Process ==


=== Basic Algorithm ===
=== Basic Algorithm ===
Micro-Checkpointing works against the existing live migration path in QEMU,
Micro-Checkpoints (MC) work against the existing live migration path in QEMU,
and can effectively be understood as a "live migration that never ends".
and can effectively be understood as a "live migration that never ends".
As such, iterations rounds happen at the granularity of 10s of milliseconds
As such, iteration rounds happen at the granularity of 10s of milliseconds
and perform the following steps:
and perform the following steps:


Line 46: Line 47:
  4. Resume the VM immediately so that it can make forward progress.
  4. Resume the VM immediately so that it can make forward progress.
  5. Transmit the checkpoint to the destination.
  5. Transmit the checkpoint to the destination.
  6. Repeat  
  6. Repeat


Upon failure, load the contents of the last MC at the destination back
Upon failure, load the contents of the last MC at the destination back
Line 58: Line 59:
state that was committed by the virtual machine after failure. This is
state that was committed by the virtual machine after failure. This is
possible because a checkpoint may diverge by N milliseconds of time and
possible because a checkpoint may diverge by N milliseconds of time and
commit state while the current checkpoint is being transmitted to the
commit state while the current MC is being transmitted to the destination.  
destination.  


To guard against this problem, first, we must "buffer" the TX output of the
To guard against this problem, first, we must "buffer" the TX output of the
Line 66: Line 66:
at the source until the MC is transmitted. After transmission is complete,  
at the source until the MC is transmitted. After transmission is complete,  
those packets can be released. Similarly, in the case of disk I/O, we must
those packets can be released. Similarly, in the case of disk I/O, we must
ensure that either the contents of the local disk is safely mirrored to a  
ensure that either the contents of the local disk are safely mirrored to a  
remote disk before completing a MC or that the output to a shared disk,  
remote disk before completing a MC or that the output to a shared disk,  
such as iSCSI, is also buffered between checkpoints and then later released
such as iSCSI, is also buffered between checkpoints and then later released
Line 72: Line 72:


For the network in particular, buffering is performed using a series of  
For the network in particular, buffering is performed using a series of  
netlink Qdisc "plugs", introduced by the Xen Remus implementation. All packets
netlink (libnl3) Qdisc "plugs", introduced by the Xen Remus implementation.  
go through netlink in the host kernel - there are no exceptions and no gaps.
All packets go through netlink in the host kernel - there are no exceptions and no gaps.
Even while one buffer is being released (say, after a checkpoint has been
Even while one buffer is being released (say, after a checkpoint has been
saved), another plug will have already been initiated to hold the next round
saved), another plug will have already been initiated to hold the next round
Line 88: Line 88:
  3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
  3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
  4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
  4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
  5. Resume the VM immediately so that it can make forward progress.
  5. Resume the VM immediately so that it can make forward progress (making use of Buffer B).
  6. Transmit the checkpoint to the destination.
  6. Transmit the MC to the destination.
  7. Wait for acknowledgement.
  7. Wait for acknowledgement.
  8. Acknowledged.
  8. Acknowledged.
Line 97: Line 97:


This implementation *currently* only supports buffering for the network.
This implementation *currently* only supports buffering for the network.
This requires that the VM's root disk or any non-ephemeral disks also be  
(Any help on implementing disk support would be greatly appreciated).
Due to this lack of disk support, this requires that the VM's
root disk or any non-ephemeral disks also be
made network-accessible directly from within the VM. Until the aforementioned
made network-accessible directly from within the VM. Until the aforementioned
buffering or mirroring support is available (ideally through drive-mirror),
buffering or mirroring support is available (ideally through drive-mirror),
Line 112: Line 114:


Due to the high-frequency nature of micro-checkpointing, we expect
Due to the high-frequency nature of micro-checkpointing, we expect
a new checkpoint to be generated many times per second. Even missing just
a new MC to be generated many times per second. Even missing just
a few checkpoints easily constitutes a failure. Because of the consistent
a few MCs easily constitutes a failure. Because of the consistent
buffering of device I/O, this is safe because device I/O is not committed
buffering of device I/O, this is safe because device I/O is not committed
to the outside world until the checkpoint has been received at the
to the outside world until the MC has been received at the destination.
destination.


Failure is thus assumed under two conditions:
Failure is thus assumed under two conditions:


1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This happens very early in the loss of the latest checkpoint not only because a very large amount of bytes is typically being sequenced in a TCP stream but perhaps also because of the timeout in acknowledgement of the receipt of a commit message by the destination.
1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This happens very early in the loss of the latest MC not only because a very large amount of bytes is typically being sequenced in a TCP stream but perhaps also because of the timeout in acknowledgement of the receipt of a commit message by the destination.


2. MC over RDMA:  Since Infiniband does not provide any user-level timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to be failed.
2. MC over RDMA:  Since Infiniband does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed.


In both cases, either due to a failed TCP socket connection or lost RDMA
In both cases, either due to a failed TCP socket connection or lost RDMA
keep-alive group, both the sender or the receiver can be deemed to be failed.
keep-alive group, both the sender or the receiver can be deemed to have failed.


If the sender is deemed to be failed, the destination takes over immediately
If the sender is deemed to have failed, the destination takes over immediately
using the contents of the last checkpoint.
using the contents of the last checkpoint.


Line 135: Line 136:
which may involve a third-party to identify a new destination host again to
which may involve a third-party to identify a new destination host again to
use as a backup for the VM.
use as a backup for the VM.


== Optimizations ==
== Optimizations ==
Line 152: Line 152:
  1. If the checkpoint size increases, then grow the number of slabs to support it.
  1. If the checkpoint size increases, then grow the number of slabs to support it.
  2. If the next checkpoint size is smaller than the last one, then that's a "strike".
  2. If the next checkpoint size is smaller than the last one, then that's a "strike".
  3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before.
  3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before).


As of this writing, the average size of an Idle-VM checkpoint is under 5MB.
As of this writing, the average size of a Linux-based Idle-VM checkpoint is under 5MB.


=== RDMA Integration ===
=== RDMA Integration ===
Line 160: Line 160:
RDMA is instrumental in enabling better MC performance, which is the reason
RDMA is instrumental in enabling better MC performance, which is the reason
why it was introduced into QEMU first.
why it was introduced into QEMU first.
RDMA is used for two different reasons:


  1. Checkpoint generation (RDMA-based memcpy):
  1. Checkpoint generation (RDMA-based memcpy):
  2. Checkpoint transmission (for performance and less CPU impact)
  2. Checkpoint transmission


Checkpoint generation (step 2 in the previous section) must be done while
Checkpoint generation must be done while the VM is paused.  
the VM is paused. In the worst case, the size of the checkpoint can be  
In the worst case, the size of the checkpoint can be equal in size  
equal in size to the amount of memory in total use by the VM. In order
to the amount of memory in total use by the VM. In order to resume  
to resume VM execution as fast as possible, the checkpoint is copied
VM execution as fast as possible, the checkpoint is copied consistently
consistently locally into a staging area before transmission. A standard
locally into a staging area before transmission. A standard
memcpy() of potentially such a large amount of memory not only gets
memcpy() of potentially such a large amount of memory not only gets
no use out of the CPU cache but also potentially clogs up the CPU pipeline
no use out of the CPU cache but also potentially clogs up the CPU pipeline
which would otherwise be useful by other neighbor VMs on the same
which would otherwise be useful by other neighbor VMs on the same
physical node that could be scheduled for execution by Linux. To minimize
physical node that could be scheduled for execution. To minimize
the effect on neighbor VMs, we use RDMA to perform a "local" memcpy(),
the effect on neighbor VMs, we use RDMA to perform a "local" memcpy(),
bypassing the host processor.
bypassing the host processor. On more recent processors, a 'beefy'
enough memory bus architecture can move memory just as fast (sometimes
faster) as a pure-software CPU-only optimized memcpy() from libc.
However, on older computers, this feature only gives you the benefit
of lower CPU-utilization at the expense of MC performance, so for
sometime, most users with older memory speeds will want to leave this
feature disabled by default.


Checkpoint transmission can potentially consume very large amounts of
Checkpoint transmission can potentially also consume very large amounts of
both bandwidth as well as CPU utilization that could otherwise by used by
both bandwidth as well as CPU utilization that could otherwise by used by
the VM itself or its neighbors. Once the aforementioned local copy of the
the VM itself or its neighbors. Once the aforementioned local copy of the
checkpoint is saved, this implementation makes use of the same RDMA
checkpoint is saved, this implementation makes use of the same RDMA
hardware to perform the transmission, similar to the way a live migration
hardware to perform the transmission exactly the same way that a live migration
happens over RDMA (see docs/rdma.txt).
happens over RDMA (see docs/rdma.txt).


Line 187: Line 195:


First, compile QEMU with '--enable-mc' and ensure that the corresponding
First, compile QEMU with '--enable-mc' and ensure that the corresponding
libraries for netlink are available. The netlink 'plug' support from the
libraries for netlink (libnl3) are available. The netlink 'plug' support from the
Qdisc functionality is required in particular, because it allows QEMU to
Qdisc functionality is required in particular, because it allows QEMU to
direct the kernel to buffer outbound network packages between checkpoints
direct the kernel to buffer outbound network packages between checkpoints
as described previously.
as described previously. Do not proceed without this support in a production
environment, or you risk corrupting the state of your I/O.


  $ git clone http://github.com/hinesmr/qemu.git
  $ git clone http://github.com/hinesmr/qemu.git

Revision as of 07:06, 18 February 2014

Summary

This is an implementation of Micro Checkpointing for memory and cpu state. Also known as: "Continuous Replication" or "Fault Tolerance" or 100 other different names - choose your poison.

Contact

Wiki: http://wiki.qemu.org/Features/MicroCheckpointing

Github: http://github.com/hinesmr/qemu.git, 'mc' branch

Libvirt Support: http://github.com/hinesmr/libvirt.git, 'mc' branch

Copyright (C) 2014 IBM Michael R. Hines <mrhines@us.ibm.com>

Introduction

Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a running virtual machine (VM) with little or no runtime assistance from the guest kernel or guest application software. Furthermore, Fault Tolerance is one method of providing high availability to a VM such that, from the perspective of the outside world (clients, devices, and neighboring VMs that may be paired with it), the VM and its applications have not lost any runtime state in the event of either a failure of the hypervisor/hardware to allow the VM to make forward progress or a complete loss of power. This mechanism for providing fault tolerance does *not* provide any protection whatsoever against software-level faults in the guest kernel or applications. In fact, due to the potentially extended lifetime of the VM because of this type of high availability, such software-level bugs may in fact manifest themselves more often than they otherwise ordinarily would, in which case you would need to employ other forms of availability to guard against such software-level faults.

This implementation is also fully compatible with RDMA and has undergone special optimizations to suppor the use of RDMA. (See docs/rdma.txt for more details).

The Micro-Checkpointing Process

Basic Algorithm

Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:

1. After N milliseconds, stop the VM.
3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
4. Resume the VM immediately so that it can make forward progress.
5. Transmit the checkpoint to the destination.
6. Repeat

Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.

I/O buffering

Additionally, a MC must include a consistent view of device I/O, particularly the network, a problem commonly referred to as "output commit". This means that the outside world can not be allowed to experience duplicate state that was committed by the virtual machine after failure. This is possible because a checkpoint may diverge by N milliseconds of time and commit state while the current MC is being transmitted to the destination.

To guard against this problem, first, we must "buffer" the TX output of the network (not the input) between MCs until the current MC is safely received by the destination. For example, all outbound network packets must be held at the source until the MC is transmitted. After transmission is complete, those packets can be released. Similarly, in the case of disk I/O, we must ensure that either the contents of the local disk are safely mirrored to a remote disk before completing a MC or that the output to a shared disk, such as iSCSI, is also buffered between checkpoints and then later released in the same way.

For the network in particular, buffering is performed using a series of netlink (libnl3) Qdisc "plugs", introduced by the Xen Remus implementation. All packets go through netlink in the host kernel - there are no exceptions and no gaps. Even while one buffer is being released (say, after a checkpoint has been saved), another plug will have already been initiated to hold the next round of packets simultaneously while the current round of packets are being released. Thus, at any given time, there may be as many as two simultaneous buffers in place.

With this in mind, here is the extended procedure for the micro checkpointing process:

1. Insert a new Qdisc plug (Buffer A).

Repeat Forever:

2. After N milliseconds, stop the VM.
3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
5. Resume the VM immediately so that it can make forward progress (making use of Buffer B).
6. Transmit the MC to the destination.
7. Wait for acknowledgement.
8. Acknowledged.
9. Release the Qdisc plug for Buffer A.
10. Qdisc Buffer B now becomes (symbolically rename) the most recent Buffer A
11. Go back to Step 2

This implementation *currently* only supports buffering for the network. (Any help on implementing disk support would be greatly appreciated). Due to this lack of disk support, this requires that the VM's root disk or any non-ephemeral disks also be made network-accessible directly from within the VM. Until the aforementioned buffering or mirroring support is available (ideally through drive-mirror), the only "consistent" way to provide full fault tolerance of the VM's non-ephemeral disks is to construct a VM whose root disk is made to boot directly from iSCSI or NFS or similar such that all disk I/O is translated into network I/O.

Buffering is performed with the combination of an IFB device attached to the KVM tap device combined with a netlink Qdisc plug (exactly like the Xen remus solution).

Failure Recovery

Due to the high-frequency nature of micro-checkpointing, we expect a new MC to be generated many times per second. Even missing just a few MCs easily constitutes a failure. Because of the consistent buffering of device I/O, this is safe because device I/O is not committed to the outside world until the MC has been received at the destination.

Failure is thus assumed under two conditions:

1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This happens very early in the loss of the latest MC not only because a very large amount of bytes is typically being sequenced in a TCP stream but perhaps also because of the timeout in acknowledgement of the receipt of a commit message by the destination.

2. MC over RDMA: Since Infiniband does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed.

In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed.

If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.

If the destination is deemed to be lost, we perform the same action as a live migration: resume the sender normally and wait for management software to make a policy decision about whether or not to re-protect the VM, which may involve a third-party to identify a new destination host again to use as a backup for the VM.

Optimizations

Memory Management

Managing QEMU memory usage in this implementation is critical to the performance of any micro-checkpointing (MC) implementation.

MCs are typically only a few MB when idle. However, they can easily be very large during heavy workloads. In the *extreme* worst-case, QEMU will need double the amount of main memory than that of what was originally allocated to the virtual machine.

To support this variability during transient periods, a MC consists of a linked list of slabs, each of identical size. A better name would be welcome, as the name was only chosen because it resembles linux memory allocation. Because MCs occur several times per second (a frequency of 10s of milliseconds), slabs allow MCs to grow and shrink without constantly re-allocating all memory in place during each checkpoint. During steady-state, the 'head' slab is permanently allocated and never goes away, so when the VM is idle, there is no memory allocation at all. This design supports the use of RDMA. Since RDMA requires memory pinning, we must be able to hold on to a slab for a reasonable amount of time to get any real use out of it.

Regardless, the current strategy taken will be:

1. If the checkpoint size increases, then grow the number of slabs to support it.
2. If the next checkpoint size is smaller than the last one, then that's a "strike".
3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before).

As of this writing, the average size of a Linux-based Idle-VM checkpoint is under 5MB.

RDMA Integration

RDMA is instrumental in enabling better MC performance, which is the reason why it was introduced into QEMU first.

RDMA is used for two different reasons:

1. Checkpoint generation (RDMA-based memcpy):
2. Checkpoint transmission

Checkpoint generation must be done while the VM is paused. In the worst case, the size of the checkpoint can be equal in size to the amount of memory in total use by the VM. In order to resume VM execution as fast as possible, the checkpoint is copied consistently locally into a staging area before transmission. A standard memcpy() of potentially such a large amount of memory not only gets no use out of the CPU cache but also potentially clogs up the CPU pipeline which would otherwise be useful by other neighbor VMs on the same physical node that could be scheduled for execution. To minimize the effect on neighbor VMs, we use RDMA to perform a "local" memcpy(), bypassing the host processor. On more recent processors, a 'beefy' enough memory bus architecture can move memory just as fast (sometimes faster) as a pure-software CPU-only optimized memcpy() from libc. However, on older computers, this feature only gives you the benefit of lower CPU-utilization at the expense of MC performance, so for sometime, most users with older memory speeds will want to leave this feature disabled by default.

Checkpoint transmission can potentially also consume very large amounts of both bandwidth as well as CPU utilization that could otherwise by used by the VM itself or its neighbors. Once the aforementioned local copy of the checkpoint is saved, this implementation makes use of the same RDMA hardware to perform the transmission exactly the same way that a live migration happens over RDMA (see docs/rdma.txt).

Usage

BEFORE Running

First, compile QEMU with '--enable-mc' and ensure that the corresponding libraries for netlink (libnl3) are available. The netlink 'plug' support from the Qdisc functionality is required in particular, because it allows QEMU to direct the kernel to buffer outbound network packages between checkpoints as described previously. Do not proceed without this support in a production environment, or you risk corrupting the state of your I/O.

$ git clone http://github.com/hinesmr/qemu.git
$ git checkout 'mc'
$ ./configure --enable-mc [other options]

Next, start the VM that you want to protect using your standard procedures.

Enable MC like this:

QEMU Monitor Command:

$ migrate_set_capability x-mc on # disabled by default

Currently, only one network interface is supported, *and* currently you must ensure that the root disk of your VM is booted either directly from iSCSI or NFS, as described previously. This will be rectified with future improvements.

For testing only, you can ignore the aforementioned requirements if you simply want to get an understanding of the performance penalties associated with this feature activated.

Next, you can optionally disable network-buffering for additional test-only execution. This is useful if you want to get a breakdown only what the cost of the checkpointing the memory state is without the cost of checkpointing device state.

QEMU Monitor Command:

$ migrate_set_capability mc-net-disable on # buffering activated by default 

Next, you can optionally enable RDMA 'memcpy' support. This is only valid if you have RDMA support compiled into QEMU and you intend to use the 'rdma' migration URI upon initiating MC as described later.

QEMU Monitor Command:

$ migrate_set_capability mc-rdma-copy on # disabled by default

Next, you can optionally enable the 'bitworkers' feature of QEMU. This is allows QEMU to use all available host CPU cores to parallelize the process of processing the migration dirty bitmap as described previously. For normal live migrations, we disable this by default as migration is typically a short-lived operation.

QEMU Monitor Command:

$ migrate_set_capability bitworkers on # disabled by default

Finally, if you are using QEMU's support for RDMA migration, you will want to enable RDMA keep-alive support to allow quick detection of failure. If you are using TCP/IP, this is not required:

QEMU Monitor Command:

$ migrate_set_capability rdma-keepalive on # disabled by default


Running

First, make sure the IFB device kernel module is loaded

$ modprobe ifb numifbs=100 # (or some large number)

Now, install a Qdisc plug to the tap device using the same naming convention as the tap device created by QEMU:

$ ip link set up ifb0 # <= corresponds to tap device 'tap0'
$ tc qdisc add dev tap0 ingress
$ tc filter add dev tap0 parent ffff: proto ip pref 10 u32 match u32 0 0 action mirred egress redirect dev ifb0

(You will need a script to automate the part above until the libvirt patches are more complete).

Now, that the network buffering connection is ready:

MC can be initiated with exactly the same command as standard live migration:

QEMU Monitor Command:

$ migrate -d (tcp|rdma):host:port

Upon failure, the destination VM will detect a loss in network connectivity and automatically revert to the last checkpoint taken and resume execution immediately. There is no need for additional QEMU monitor commands to initiate the recovery process.

Performance

By far, the biggest cost is network throughput. Virtual machines are capable of dirtying memory well in excess of the bandwidth provided a commodity 1 Gbps network link. If so, the MC process will always lag behind the virtual machine and forward progress will be poor. It is highly recommended to use at least a 10 Gbps link when using MC.

Numbers are still coming in, but without output buffering of network I/O, the performance penalty on a typical 4GB RAM Java-based application server workload using a 10 Gbps link (a good worst case for testing due Java's constant garbage collection) is on the order of 25%. With network buffering activated, this can be as high as 50%.

The majority of the 25% penalty is due to the preparation of the QEMU migration dirty bitmap, which can incur tens of milliseconds of downtime against the guest.

The remaining 25% penalty comes from network buffering is typically due to checkpoints not occurring fast enough since a typical "round trip" time between the request of an application-level transaction and the corresponding response should ideally be larger than the time it takes to complete a checkpoint, otherwise, the response to the application within the VM will appear to be congested since the VM's network endpoint may not have even received the TX request from the application in the first place.

We believe that this effect is "amplified" due to the poor performance in processing the migration bitmap and thus since an application-level RTT cannot be serviced with more frequent checkpoints, network I/O tends to get held in the buffer too long. This has the effect of causing the guest TCP/IP stack to experience congestion, propagating this artificially created delay all the way up to the application.

TODO

1. Eliminate as much of the cost of migration dirty bitmap preparation as possible. Parallelization is really only a stop-gap measure.

2. Implement local disk mirroring by integrating with QEMU's 'drive-mirror' feature in order to full support virtual machines with local storage.

3. Implement output commit buffering for shared storage.

FAQ / Frequently Asked Questions

What happens if a failure occurs in the *middle* of a flush of the network buffer?

Micro-Checkpointing depends *heavily* on the correctness of TCP/IP. Thus, this is not a problem because the network buffer holds packets only for the last *committed* checkpoint (meaning that the last micro checkpoint must have been acknowledged as received successfully by the backup host). After understanding this, it is then important to understand how network buffering is repeated between checkpoints. *ALL* packets go through the buffer - there is no exception or gaps. There is no such situation where while the buffer is being flushed other newer packets are going through - that's not how it works. Please refer to the previous section "I/O buffering" for a detailed description of how network buffering works.

Why is this not a problem?

Example: Let's say we have packets "A" and "B" in the buffer.

Packet A is sent successfully and a failure occurs before packet B is transmitted.

Packet A) This is acceptable. The guest checkpoint has already recorded delivery of the packet from the guest's perspective. The network fabric can deliver or not deliver as it sees fit. Thus the buffer simply has the same effect of an additional network switch - it does not alter the effect of fault tolerance as viewed by the external world any more so than another faulty hop in the traditional network architecture would cause congestion in the network. The packet will never get RE-generated because the checkpoint has already been committed at the destination which corresponds to the transmission of that packet from the perspective of the virtual machine. Any FUTURE packets generated while the VM resumes execution are *also* buffered as described previously.

Packet B) This is acceptable. This packet will be lost. This will result in a TCP-level timeout on the peer side of the connection in the case that packet B is an ACK or will result in a timeout on the guest-side of the connection in the case that the packet is a TCP PUSH. Either way, the packet will get re-transmitted either because the data was never acknowledged or never received as soon as the virtual machine resumes execution.

What's different about this implementation?

Several things about this implementation attempt are different from previous implementations:

1. We are dedicated to see this through the community review process and stay current with the master branch.

2. This implementation is 100% compatible with RDMA.

3. Memory management is completely overhauled - malloc()/free() churn is reduced to a minimum.

4. This is not port of Kemari. Kemari is obsolete and incompatible with the most recent QEMU.

5. Network I/O buffering is outsourced to the host kernel, using netlink code introduced by the Remus/Xen project.

6. We make every attempt to change as little of the existing migration call path as possible.