Features/BlockReplication: Difference between revisions

From QEMU

Revision as of 03:27, 12 February 2015

Summary

The blkcolo block driver enables disk replication for continuous checkpoints. It is designed for COLO, where the Secondary VM is running. It can also be applied to FT/HA scenarios where the Secondary VM is not running.

You can get the patch here: https://github.com/wencongyang/qemu-colo/commits/block-replication-v1

Design

Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2015
Copyright (c) 2015 Intel Corporation
Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO, where the Secondary VM is running. It can also be applied to
FT/HA scenarios where the Secondary VM is not running.

This document gives an overview of the block replication design.

== Background ==
High availability solutions such as micro checkpointing and COLO take
consecutive checkpoints. The VM state of the Primary VM and the Secondary
VM is identical right after a VM checkpoint, but diverges as the VMs
execute until the next checkpoint. To support checkpointing of disk
contents, the modified disk contents in the Secondary VM must be buffered,
and are only dropped at the next checkpoint. To reduce the network transfer
effort at checkpoint time, the disk modification operations of the
Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests are copied and forwarded to the Secondary
       QEMU.
    2) Before a Primary write request is written to the Secondary disk,
       the original sector content is read from the Secondary disk and
       buffered in the Disk buffer, but it does not overwrite existing
       sector content in the Disk buffer.
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer and
       overwrite the existing sector content in the buffer.
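
The four steps above can be illustrated with a small Python model. This is
only a sketch for reasoning about the buffer semantics, not QEMU code: the
class and method names are hypothetical, sectors are modelled as dictionary
entries, and the read path (buffer first, then disk) is an assumption drawn
from the Disk buffer having the Secondary disk as its backing image. The
checkpoint() and failover() methods mirror the drop and flush behaviour
described in the interface section of this document.

```python
ZERO = b"\x00"  # assumed content of a sector that was never written


class SecondaryDiskModel:
    """Toy model of the Secondary side: a disk plus the Disk buffer."""

    def __init__(self, disk):
        self.disk = dict(disk)  # sector number -> content on the Secondary disk
        self.buffer = {}        # Disk buffer: sector number -> content

    def primary_write(self, sector, data):
        # Step 2: save the original content, but never overwrite a sector
        # that is already present in the Disk buffer.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector, ZERO)
        # Step 3: speculative write-through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes go to the buffer and overwrite its content.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the Secondary VM sees buffered content first.
        if sector in self.buffer:
            return self.buffer[sector]
        return self.disk.get(sector, ZERO)

    def checkpoint(self):
        # The Disk buffer is dropped at a checkpoint.
        self.buffer.clear()

    def failover(self):
        # The Disk buffer is flushed into the Secondary disk on failover.
        self.disk.update(self.buffer)
        self.buffer.clear()


model = SecondaryDiskModel({0: b"A", 1: b"B"})
model.primary_write(0, b"P")    # forwarded Primary write (steps 1-3)
model.secondary_write(1, b"S")  # local Secondary write (step 4)
print(model.disk[0], model.secondary_read(0), model.secondary_read(1))
```

In this model the Secondary disk already holds the Primary's data for
sector 0, while the Secondary VM still reads the content it had at the last
checkpoint. Calling checkpoint() makes both sides agree on the Primary's
data; calling failover() instead makes the Secondary's view permanent.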

== Architecture ==
We are going to implement COLO block replication from basic building
blocks that already exist in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||
        /        \        ||
   Primary      2 NBD  ------->  2 NBD
     disk       client    ||     server                  virtio-blk
                          ||        ^                         ^
--------.                 ||        |                         |
Primary |                 ||  Secondary disk <--------- COLO buffer 3
--------'                 ||                   backing

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern for quorum can be extended to
make the primary always read from the local disk instead of going through
NBD.

2) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

3) The disk on the secondary is represented by a custom block device
("COLO buffer"). The disk buffer's backing image is the secondary disk,
and the disk buffer uses bdrv_add_before_write_notifier to implement
copy-on-write, similar to block/backup.c.

== New block driver interface ==
We add three block driver interfaces to control block replication:
a. bdrv_start_replication()
   Starts block replication. It is called in the migration/checkpoint
   thread. bdrv_start_replication() must be called in the Secondary QEMU
   before it is called in the Primary QEMU.
b. bdrv_do_checkpoint()
   This interface is called after all VM state has been transferred to
   the Secondary QEMU. The Disk buffer is dropped in this interface.
c. bdrv_stop_replication()
   It is called on failover. It flushes the Disk buffer into the
   Secondary disk and stops block replication.
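
The ordering rule for the three interfaces can be sketched as a tiny state
model. The Python below is a stand-in for illustration only: the real
interfaces are C functions whose signatures are not given in this document,
and the ReplicationEndpoint class, its methods and the peer argument are all
hypothetical.

```python
class ReplicationEndpoint:
    """Hypothetical model of one side (Primary or Secondary) of replication."""

    def __init__(self, role):
        self.role = role        # "primary" or "secondary"
        self.running = False
        self.checkpoints = 0

    def start(self, peer=None):
        # Models bdrv_start_replication(): the Secondary side must be
        # started before the Primary side.
        if self.role == "primary" and not (peer is not None and peer.running):
            raise RuntimeError("start replication on the Secondary first")
        self.running = True

    def do_checkpoint(self):
        # Models bdrv_do_checkpoint(): only valid while replication runs.
        if not self.running:
            raise RuntimeError("replication is not running")
        self.checkpoints += 1

    def stop(self):
        # Models bdrv_stop_replication(): called on failover.
        self.running = False


secondary = ReplicationEndpoint("secondary")
primary = ReplicationEndpoint("primary")

secondary.start()             # Secondary first
primary.start(peer=secondary) # then Primary
secondary.do_checkpoint()     # after all VM state has been transferred
secondary.stop()              # failover
```

Starting the Primary endpoint before the Secondary one raises an error in
this model, which is the constraint the description of
bdrv_start_replication() states.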

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=first,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw,\
         children.1.file.driver=nbd+colo,\
         children.1.file.host=xxx,\
         children.1.file.port=xxx,\
         children.1.file.export=xxx,\
         children.1.driver=raw
  Note:
  1. The NBD client should not be the first child of quorum.
  2. There should be only one NBD client.
  3. host is the hostname or IP address of the Secondary physical machine.
  4. Each disk must have its own export name.

Secondary:
  -drive if=xxx,driver=blkcolo,export=xxx,\
         backing.file.filename=1.raw,\
         backing.driver=raw
  Then run the QMP command:
    nbd_server_start host:port
  Note:
  1. The export name for the same disk must be the same in the Primary
     and Secondary QEMU command lines.
  2. The QMP command nbd_server_start must be run before running the
     QMP command migrate on the Primary QEMU.
  3. Do not use nbd_server_start's other options.
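
Because the export name has to match on both sides, it can help to generate
the two -drive arguments from one set of values. The helper below is a
minimal sketch that only assembles the option strings shown above; the
function names and the sample values (virtio, 192.0.2.10, 8889, colo-disk0)
are placeholders for illustration and do not come from the patch series.

```python
def primary_drive(interface, image, host, port, export):
    """Build the Primary -drive argument: quorum with a local and an NBD child."""
    return ",".join([
        f"if={interface}",
        "driver=quorum",
        "read-pattern=first",
        # Child 0 is the local disk, so the NBD client is not the first child.
        f"children.0.file.filename={image}",
        "children.0.driver=raw",
        "children.1.file.driver=nbd+colo",
        f"children.1.file.host={host}",
        f"children.1.file.port={port}",
        f"children.1.file.export={export}",
        "children.1.driver=raw",
    ])


def secondary_drive(interface, image, export):
    """Build the Secondary -drive argument: blkcolo on top of the disk image."""
    return ",".join([
        f"if={interface}",
        "driver=blkcolo",
        f"export={export}",
        f"backing.file.filename={image}",
        "backing.driver=raw",
    ])


# Placeholder values; this sketch does not escape commas inside values.
export_name = "colo-disk0"
primary_arg = primary_drive("virtio", "1.raw", "192.0.2.10", 8889, export_name)
secondary_arg = secondary_drive("virtio", "1.raw", export_name)
print("-drive", primary_arg)
print("-drive", secondary_arg)
```

Passing the same export_name to both helpers keeps the Primary and
Secondary command lines consistent for a given disk.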