Features/BlockReplication: Difference between revisions

From QEMU

Revision as of 01:07, 2 July 2015

Summary

The blkcolo block driver enables disk replication for continuous checkpoints. It is designed for COLO, where the Secondary VM is running. It can also be applied to FT/HA scenarios where the Secondary VM is not running.

You can get the patches here: https://github.com/wencongyang/qemu-colo/commits/block-replication-v2

Design

Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2015
Copyright (c) 2015 Intel Corporation
Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault-tolerance/High Assurance) scenarios,
where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpoint and COLO perform
consecutive checkpoints. The VM state of the Primary VM and the Secondary VM
is identical right after a VM checkpoint, but diverges as the VM
executes until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the network transfer
effort at checkpoint time, the disk modification operations of the
Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests will be copied and forwarded to Secondary
       QEMU.
    2) Before Primary write requests are written to Secondary disk, the
       original sector content will be read from Secondary disk and
       buffered in the Disk buffer, but it will not overwrite the existing
       sector content (which could come from either "Secondary Write
       Requests" or a previous COW of "Primary Write Requests") in the
       Disk buffer.
    3) Primary write requests will be written to Secondary disk.
    4) Secondary write requests will be buffered in the Disk buffer and
       will overwrite the existing sector content in the buffer.
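
The buffering rules in steps 2) to 4) can be illustrated with a small
Python model. This is only a sketch under stated assumptions: the class
name SecondaryNode and its methods are hypothetical, sectors are modelled
as dictionary keys, unwritten sectors read back as empty bytes, and the
read path (buffer first, then disk) is an assumption that the workflow
above does not spell out. It is not QEMU code.

```python
class SecondaryNode:
    """Toy model of the Secondary disk plus the Disk buffer (hypothetical)."""

    def __init__(self, disk):
        self.disk = dict(disk)  # Secondary disk: sector -> content
        self.buffer = {}        # Disk buffer: sector -> content

    def primary_write(self, sector, data):
        # Step 2: save the original content before it is overwritten, but
        # never replace content that is already buffered for this sector.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector, b"")
        # Step 3: speculative write-through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: only buffered, and it overwrites any buffered content.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the Secondary VM prefers buffered content over the disk.
        return self.buffer.get(sector, self.disk.get(sector, b""))
```

With this model, a Primary write to a sector changes the Secondary disk
while the Secondary VM keeps reading the content it saw before the write.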

== Architecture ==
We are going to implement block replication from basic building
blocks that are already in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||
        /        \        ||
   Primary    2 filter
     disk         ^                                                             virtio-blk
                  |                                                                  ^
                3 NBD  ------->  3 NBD                                               |
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||           drive-backup sync=none

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern for quorum can be extended to
make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (named replication) controls block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should be an empty disk, and its format should
support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also be an empty disk, and
its driver should support bdrv_make_empty() and backing files.

== Failure Handling ==
Six internal errors can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we just report block replication's error to the FT/HA manager (which
decides when to do a new checkpoint and when to do failover).
One internal error can occur when doing failover:
1. Committing the data in active disk/hidden disk to secondary disk failed
We just report this error to the FT/HA manager.
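
The routing of the six runtime errors can be written down as a small
lookup. The sketch below is illustrative only: the enum, its member names
and the report_target() helper are hypothetical and do not exist in QEMU;
only the numbering and the two destinations come from the list above.

```python
from enum import Enum


class ReplicationError(Enum):
    # Values follow the numbering of the runtime error list.
    PRIMARY_DISK_IO = 1
    FORWARD_WRITE_FAILED = 2
    BACKUP_FAILED = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY_FAILED = 6


DISK_LAYER = "disk layer"
FT_HA_MANAGER = "FT/HA manager"


def report_target(error):
    """Return which component a runtime replication error is reported to."""
    if error in (ReplicationError.PRIMARY_DISK_IO,
                 ReplicationError.ACTIVE_DISK_IO):
        return DISK_LAYER       # cases 1 and 5
    return FT_HA_MANAGER        # cases 2, 3, 4 and 6
```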

== New block driver interface ==
We add three block driver interfaces to control block replication:
a. bdrv_start_replication()
   Start block replication, called in the migration/checkpoint thread.
   We must call bdrv_start_replication() in secondary QEMU before
   calling bdrv_start_replication() in primary QEMU. The caller
   must hold the I/O mutex lock if it runs in the migration/checkpoint
   thread.
b. bdrv_do_checkpoint()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it runs in the
   migration/checkpoint thread.
c. bdrv_stop_replication()
   It is called on failover. We will flush the Disk buffer into
   Secondary Disk and stop block replication. The VM should be stopped
   before calling it. The caller must hold the I/O mutex lock if it
   runs in the migration/checkpoint thread.
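
The effect of the three interfaces on the Disk buffer can be sketched in
Python as follows. This is a simplified model, not the QEMU implementation:
the Replication class is hypothetical, its method names merely mirror the
bdrv_* interfaces above, disks are modelled as dictionaries, and neither
the I/O mutex lock nor the primary/secondary call ordering is modelled.

```python
class Replication:
    """Toy model of the secondary side of block replication (hypothetical)."""

    def __init__(self, disk):
        self.disk = dict(disk)  # Secondary disk: sector -> content
        self.buffer = {}        # Disk buffer: sector -> content
        self.running = False

    def _require_running(self):
        if not self.running:
            raise RuntimeError("block replication is not running")

    def start(self):
        # Mirrors bdrv_start_replication().
        if self.running:
            raise RuntimeError("block replication is already running")
        self.running = True

    def do_checkpoint(self):
        # Mirrors bdrv_do_checkpoint(): the Disk buffer is dropped.
        self._require_running()
        self.buffer.clear()

    def stop(self):
        # Mirrors bdrv_stop_replication(): flush the Disk buffer into the
        # Secondary disk, then stop replication.
        self._require_running()
        self.disk.update(self.buffer)
        self.buffer.clear()
        self.running = False
```

In this model a checkpoint discards buffered content, while a failover
makes the buffered content the new on-disk state.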

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,no-connect=on,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw,\
         children.1.file.driver=nbd,\
         children.1.file.host=xxx,\
         children.1.file.port=xxx,\
         children.1.file.export=xxx,\
         children.1.driver=replication,\
         children.1.mode=primary,\
         children.1.ignore-errors=on
  Note:
  1. The NBD client should not be the first child of quorum.
  2. There should be only one NBD client.
  3. host is the secondary physical machine's hostname or IP.
  4. Each disk must have its own export name.
  5. Everything after -drive is a single argument; ignore the line
     breaks and the leading whitespace.
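
Note 5 can be made concrete with a short Python helper that collapses the
multi-line form shown above into the single string passed to -drive. This
is a sketch only: join_drive_argument() is a hypothetical helper, not part
of QEMU, and the xxx placeholders are kept exactly as in the example.

```python
def join_drive_argument(text):
    """Join a backslash-continued option list into one -drive argument."""
    parts = []
    for line in text.splitlines():
        piece = line.strip()              # ignore the leading whitespace
        if piece.endswith("\\"):
            piece = piece[:-1].rstrip()   # drop the line continuation
        if piece:
            parts.append(piece)
    return "".join(parts)


# Raw string so that the trailing backslashes are kept literally.
PRIMARY_DRIVE = r"""
if=xxx,driver=quorum,read-pattern=fifo,no-connect=on,\
       children.0.file.filename=1.raw,\
       children.0.driver=raw,\
       children.1.file.driver=nbd,\
       children.1.file.host=xxx,\
       children.1.file.port=xxx,\
       children.1.file.export=xxx,\
       children.1.driver=replication,\
       children.1.mode=primary,\
       children.1.ignore-errors=on
"""

print(join_drive_argument(PRIMARY_DRIVE))
```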

Secondary:
  -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \
  -drive if=xxx,driver=replication,mode=secondary,export=xxx,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing_reference.drive_id=nbd_target1,\
         file.backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\
         file.backing_reference.hidden-disk.driver=qcow2,\
         file.backing_reference.hidden-disk.allow-write-backing-file=on
  Then run the QMP command:
    nbd-server-start host:port
  Note:
  1. The export name for the same disk must be the same in the primary
     and secondary QEMU command lines.
  2. The QMP command nbd-server-start must be run before running the
     QMP command migrate on primary QEMU.
  3. Don't use nbd-server-start's other options.
  4. Active disk, hidden disk and nbd target's length should be the
     same.
  5. It is better to put active disk and hidden disk in ramdisk.
  6. Everything after each -drive is a single argument; ignore the line
     breaks and the leading whitespace.
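
Note 1 (matching export names) can be checked before starting the two
QEMU instances. The sketch below is an assumption-laden illustration: both
helpers are hypothetical, the parser is naive (it does not handle values
containing escaped commas), and the export name disk0 in the usage note is
made up for the example.

```python
def parse_drive_options(argument):
    """Split an already joined -drive argument into a key -> value dict."""
    options = {}
    for item in argument.split(","):
        key, _, value = item.partition("=")
        options[key] = value
    return options


def export_names_match(primary_argument, secondary_argument):
    """Check that primary and secondary use the same, non-empty export name."""
    primary = parse_drive_options(primary_argument)
    secondary = parse_drive_options(secondary_argument)
    name = primary.get("children.1.file.export")
    return bool(name) and name == secondary.get("export")
```

For example, a primary argument containing children.1.file.export=disk0
matches a secondary argument containing export=disk0, and does not match
one containing export=disk1.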