Features/BlockReplication

Revision as of 08:16, 25 December 2014 by Yang

Summary

The blkcolo block driver enables disk replication for continuous checkpoints. It is designed for COLO, where the Secondary VM is running. It can also be applied to FT/HA scenarios in which the Secondary VM is not running.

Design

Disk replication using blkcolo
----------------------------------------
Copyright Fujitsu, Corp. 2014

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

The blkcolo block driver enables disk replication for continuous checkpoints.
It is designed for COLO, where the Secondary VM is running. It can also be
applied to FT/HA scenarios in which the Secondary VM is not running.

This document gives an overview of blkcolo's design.

== Background ==
High availability solutions such as micro checkpointing and COLO perform
consecutive checkpoints. The VM state of the Primary VM and the Secondary VM
is identical right after a VM checkpoint, but diverges as the VMs execute
until the next checkpoint. To support checkpointing of disk contents, the
modified disk contents in the Secondary VM must be buffered, and are dropped
only at the next checkpoint. To reduce the amount of data that has to cross
the network at checkpoint time, write operations on the Primary disk are
asynchronously forwarded to the Secondary node.

== Disk Buffer ==
The following diagram illustrates the Disk buffer:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^   
                  |                write through    (2) 
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+
    1) Primary write requests are copied and forwarded to the Secondary
       QEMU.
    2) Before a Primary write request is written to the Secondary disk, the
       original sector content is read from the Secondary disk and
       buffered in the Disk buffer. It does not overwrite sector content
       that already exists in the Disk buffer.
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer and
       overwrite any existing sector content in the buffer.
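
The buffering rules in steps 2) to 4) can be modelled with a minimal Python
sketch. This is a toy model, not blkcolo code: the class and method names are
hypothetical, sectors are dictionary keys, and an unwritten sector is treated
as empty. The read path is also an assumption, since the steps above describe
only writes.

```python
class SecondaryDisk:
    """Toy model of the Secondary disk plus its Disk buffer (sector -> bytes)."""

    def __init__(self, sectors):
        self.disk = dict(sectors)   # contents of the Secondary disk
        self.buffer = {}            # Disk buffer, keyed by sector number

    def primary_write(self, sector, data):
        # Step 2: save the original content, but never overwrite a sector
        # that is already present in the Disk buffer.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector, b"")
        # Step 3: write the forwarded Primary data through to the disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes go only to the Disk buffer and replace
        # whatever is buffered for that sector.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption (not stated in the steps): the Secondary VM sees
        # buffered content first and falls back to the disk.
        if sector in self.buffer:
            return self.buffer[sector]
        return self.disk.get(sector, b"")


sd = SecondaryDisk({0: b"old0", 1: b"old1"})
sd.primary_write(0, b"pri0")      # original "old0" is saved in the buffer
sd.secondary_write(1, b"sec1")    # stays in the buffer only
sd.primary_write(1, b"pri1")      # buffer already holds sector 1: keep "sec1"
```

After these three writes the disk in the model holds only Primary data, while
the buffer holds everything the Secondary VM needs to keep its own view.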

== Capture I/O request ==
blkcolo is a new block driver protocol, so all I/O requests can be
captured in the driver interfaces bdrv_co_readv() and bdrv_co_writev().

== Checkpoint & failover ==
blkcolo buffers write requests in the Secondary QEMU. The buffer is
dropped at a checkpoint, or flushed to the Secondary disk on failover.
Three block driver interfaces are added to do this:
a. bdrv_wait_recv_completed()
   This interface may block. It returns when all Primary write
   requests have been forwarded to the Secondary QEMU.
b. bdrv_handle_checkpoint()
   This interface is called after all VM state has been transferred to
   the Secondary QEMU. The Disk buffer is dropped in this interface.
c. bdrv_cancel_checkpoint()
   This interface is called on failover. It flushes the Disk buffer to
   the Secondary disk and stops disk replication.
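
The difference between dropping and flushing the Disk buffer can be shown
with a small Python sketch. It is only an illustration under simplifying
assumptions: the disk and the buffer are plain dictionaries mapping sector
numbers to content, and the two function names are hypothetical stand-ins,
not the QEMU interfaces listed above.

```python
def drop_buffer_at_checkpoint(disk, buffer):
    """Checkpoint: discard the Disk buffer; the disk keeps the Primary's writes."""
    buffer.clear()


def flush_buffer_on_failover(disk, buffer):
    """Failover: copy every buffered sector to the Secondary disk, then empty it."""
    disk.update(buffer)
    buffer.clear()


# State between two checkpoints: the disk holds forwarded Primary writes,
# the buffer holds saved original content and Secondary writes.
disk_a = {0: b"pri0", 1: b"pri1"}
buf_a = {0: b"old0", 1: b"sec1"}
drop_buffer_at_checkpoint(disk_a, buf_a)

disk_b = {0: b"pri0", 1: b"pri1"}
buf_b = {0: b"old0", 1: b"sec1"}
flush_buffer_on_failover(disk_b, buf_b)
```

In the checkpoint case the Secondary disk ends up matching the Primary's
writes. In the failover case it ends up with the Secondary VM's view.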

== Usage ==
On both the Primary and Secondary hosts, invoke QEMU with the following parameter:
    "-drive file=blkcolo:host:port:/path/to/image"
a. host
   Hostname or IP of the Secondary host.
b. port
   The Secondary QEMU will listen on this port, and the Primary QEMU
   will connect to this port.
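
To make the layout of the file name concrete, the following Python sketch
splits it into its fields. The helper and its type are hypothetical and are
not part of QEMU; the address and port in the example are made-up values. It
assumes the host contains no colon, so a bare IPv6 address would not parse.

```python
from typing import NamedTuple


class BlkcoloTarget(NamedTuple):
    host: str   # hostname or IP of the Secondary host
    port: int   # port the Secondary QEMU listens on
    path: str   # path to the disk image


def parse_blkcolo(filename):
    """Split 'blkcolo:host:port:/path/to/image' into its three fields."""
    # At most three splits, so colons inside the image path are preserved.
    scheme, host, port, path = filename.split(":", 3)
    if scheme != "blkcolo":
        raise ValueError("not a blkcolo filename: %r" % filename)
    return BlkcoloTarget(host, int(port), path)


target = parse_blkcolo("blkcolo:192.168.0.2:8889:/path/to/image")
```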