Features/BlockReplication: Difference between revisions

From QEMU
 
Latest revision as of 06:29, 17 August 2016

Summary

The replication block driver enables disk replication for continuous checkpoints.

You can get the patches here: https://github.com/Pating/qemu/tree/changlox/block-replication-v24

Design

Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault-tolerance/High-availability)
scenarios, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpointing and COLO take
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VM executes
until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the network traffic
during a vmstate checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests will be copied and forwarded to Secondary
       QEMU.
    2) Before Primary write requests are written to Secondary disk, the
       original sector content will be read from Secondary disk and
       buffered in the Disk buffer, but it will not overwrite the existing
       sector content (it could be from either "Secondary Write Requests" or
       previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests will be written to Secondary disk.
    4) Secondary write requests will be buffered in the Disk buffer, and
       they will overwrite the existing sector content in the buffer.
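
The four steps above can be sketched as a toy model in Python. This is only
an illustration, not QEMU code: the class and method names are hypothetical,
sectors are modelled as dictionary entries, and the read path (buffer first,
then Secondary disk) is an assumption drawn from the description of the Disk
buffer.

```python
class ReplicatedDisk:
    """Toy model of the Secondary side: a disk plus a per-sector Disk buffer.

    All names here are hypothetical; this is not QEMU's implementation.
    """

    def __init__(self, sectors):
        self.secondary_disk = dict(sectors)  # sector number -> content
        self.disk_buffer = {}                # sector number -> buffered content

    def primary_write(self, sector, data):
        # Step 1 (copy and forward) is assumed to have delivered the request.
        # Step 2: save the original content, but never overwrite content that
        # is already buffered (from a Secondary write or an earlier COW).
        if sector not in self.disk_buffer:
            self.disk_buffer[sector] = self.secondary_disk.get(sector)
        # Step 3: speculative write-through to the Secondary disk.
        self.secondary_disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes always overwrite the buffered content.
        self.disk_buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption: the Secondary VM sees buffered content first.
        if sector in self.disk_buffer:
            return self.disk_buffer[sector]
        return self.secondary_disk.get(sector)

    def checkpoint(self):
        # The Disk buffer is dropped at a checkpoint.
        self.disk_buffer.clear()


disk = ReplicatedDisk({0: "a", 1: "b"})
disk.primary_write(0, "A")      # old content "a" is buffered first
disk.secondary_write(1, "s")    # Secondary's own write is buffered
disk.primary_write(1, "B")      # does not replace the buffered "s"
print(disk.secondary_read(0), disk.secondary_read(1))
disk.checkpoint()
print(disk.secondary_read(0), disk.secondary_read(1))
```

Until the checkpoint, the Secondary VM keeps seeing its own view of the disk
even though Primary writes have already reached the Secondary disk.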

== Architecture ==
Block replication is built from several basic building blocks that
already exist in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||
        /        \        ||
   Primary    2 filter
     disk         ^                                                             virtio-blk
                  |                                                                  ^
                3 NBD  ------->  3 NBD                                               |
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||           drive-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern (fifo) for quorum can be extended
to make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (named replication) controls block replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and its format
should support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also start as an empty disk,
and its driver should support bdrv_make_empty() and backing files.

6) The drive-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. Therefore, before block
replication starts, the primary disk and secondary disk should contain
the same data.
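
As a rough illustration of how the backing chain on the secondary side fits
together, the sketch below models active-disk, hidden-disk and the secondary
disk as layered dictionaries. It is a simplification under stated
assumptions: the Layer class and the nbd_write() helper are hypothetical, the
backup job is reduced to "copy the old cluster into hidden-disk before it is
overwritten", and make_empty() merely stands in for bdrv_make_empty().

```python
class Layer:
    """Hypothetical copy-on-write layer with an optional backing layer."""

    def __init__(self, name, backing=None):
        self.name = name
        self.backing = backing
        self.clusters = {}  # cluster number -> content held by this layer

    def read(self, n):
        # Walk down the backing chain until a layer holds the cluster.
        layer = self
        while layer is not None:
            if n in layer.clusters:
                return layer.clusters[n]
            layer = layer.backing
        return None

    def make_empty(self):
        # Stand-in for bdrv_make_empty(): drop everything held by the layer.
        self.clusters.clear()


secondary = Layer("secondary-disk")
hidden = Layer("hidden-disk", backing=secondary)   # item 5
active = Layer("active-disk", backing=hidden)      # item 4


def nbd_write(n, data):
    """Write from the primary VM arriving through the NBD server (item 3)."""
    # Item 6: the backup job (sync=none) preserves the old content in
    # hidden-disk before the speculative write-through replaces it.
    if n not in hidden.clusters:
        hidden.clusters[n] = secondary.read(n)
    secondary.clusters[n] = data


secondary.clusters[0] = "old"
nbd_write(0, "new")
print(active.read(0))        # secondary VM still sees the old content
active.clusters[0] = "guest" # write from the secondary VM lands in active-disk
print(active.read(0))
active.make_empty()          # at a checkpoint both buffers are emptied
hidden.make_empty()
print(active.read(0))
```

In this model, active-disk shadows hidden-disk, so a write from the secondary
VM always wins over buffered original content, which matches the Disk buffer
rules in the Workflow section.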

== Failure Handling ==
Seven kinds of internal errors can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we report block replication's error to the FT/HA manager (which
decides when to do a new checkpoint and when to do failover).
In case 7, if active commit failed, we use the replication failover failed
state in the Secondary's write operation (which decides which target to
write to).
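
The mapping from error case to handling can be written down as a small lookup
table. The sketch below is only a restatement of the list above in Python;
the enum and function names are hypothetical and do not correspond to
identifiers in QEMU.

```python
from enum import Enum


class Failure(Enum):
    """The seven error cases, numbered as in the list above."""
    PRIMARY_DISK_IO = 1
    FORWARD_PRIMARY_WRITE = 2
    BACKUP = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY = 6
    FAILOVER = 7


class Handling(Enum):
    REPORT_TO_DISK_LAYER = "report the error to the disk layer"
    REPORT_TO_FT_HA_MANAGER = "report the error to the FT/HA manager"
    FAILOVER_FAILED_STATE = "use the replication failover failed state"


HANDLING = {
    Failure.PRIMARY_DISK_IO: Handling.REPORT_TO_DISK_LAYER,
    Failure.ACTIVE_DISK_IO: Handling.REPORT_TO_DISK_LAYER,
    Failure.FORWARD_PRIMARY_WRITE: Handling.REPORT_TO_FT_HA_MANAGER,
    Failure.BACKUP: Handling.REPORT_TO_FT_HA_MANAGER,
    Failure.SECONDARY_DISK_IO: Handling.REPORT_TO_FT_HA_MANAGER,
    Failure.MAKE_EMPTY: Handling.REPORT_TO_FT_HA_MANAGER,
    Failure.FAILOVER: Handling.FAILOVER_FAILED_STATE,
}


def handling_for(failure):
    """Return how a given failure case is handled."""
    return HANDLING[failure]


for failure in Failure:
    print(failure.value, failure.name, "->", handling_for(failure).value)
```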

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in migration/checkpoint thread.
   We must call replication_start_all() in secondary QEMU before
   calling replication_start_all() in primary QEMU. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
c. replication_get_error_all()
   This interface is called to check whether an error has occurred in
   replication.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into
   Secondary Disk and stop block replication. If this API is used to
   shut down the guest, or for anything other than failover, the VM
   should be stopped before calling it. The caller must hold the I/O
   mutex lock if it is in migration/checkpoint thread.
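
The ordering constraints on these four interfaces can be sketched with a
stand-in object that only records and checks the call order. Everything below
is hypothetical Python, not the C API: the class, its method signatures and
the failover parameter are illustrative, and calling get_error and
do_checkpoint on both sides at each checkpoint is an assumption of this
sketch.

```python
class ReplicationControl:
    """Hypothetical stand-in that records and checks the call order."""

    def __init__(self, side):
        self.side = side
        self.running = False
        self.calls = []

    def _require_running(self):
        if not self.running:
            raise RuntimeError("replication is not running on " + self.side)

    def start_all(self):
        if self.running:
            raise RuntimeError("replication already started on " + self.side)
        self.running = True
        self.calls.append("start")

    def get_error_all(self):
        self._require_running()
        self.calls.append("get_error")
        return None  # this toy model never fails

    def do_checkpoint_all(self):
        self._require_running()
        self.calls.append("do_checkpoint")

    def stop_all(self, failover):
        self._require_running()
        self.running = False
        self.calls.append("stop(failover=%s)" % failover)


def run(checkpoints):
    secondary = ReplicationControl("secondary")
    primary = ReplicationControl("primary")
    # Replication must be started on the secondary before the primary.
    secondary.start_all()
    primary.start_all()
    for _ in range(checkpoints):
        # ... all VM state is transferred to the Secondary QEMU here ...
        for side in (primary, secondary):
            if side.get_error_all() is not None:
                raise RuntimeError("replication error on " + side.side)
            side.do_checkpoint_all()
    # The primary is assumed to have failed: the secondary takes over.
    secondary.stop_all(failover=True)
    return primary.calls, secondary.calls


primary_calls, secondary_calls = run(checkpoints=2)
print(primary_calls)
print(secondary_calls)
```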

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following QMP commands in the primary QEMU:
    { 'execute': 'human-monitor-command',
      'arguments': {
          'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1'
      }
    }
    { 'execute': 'x-blockdev-change',
      'arguments': {
          'parent': 'colo1',
          'node': 'nbd_client1'
      }
    }
  Note:
  1. There should be only one NBD client for each primary disk.
  2. host is the hostname or IP address of the secondary physical machine.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive, and you should ignore the
     leading whitespace.
  5. These QMP commands must be run after the QMP commands in the
     secondary QEMU have been run.
  6. After failover, we need to remove children.1 (the replication driver).
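
When driving the primary from a script, the two QMP commands above can be
generated rather than typed by hand. The following is a minimal sketch,
assuming a hypothetical helper named primary_qmp_commands(); the host and
port values are placeholders, and the default export name, quorum id and node
name are simply the ones used in the example above.

```python
import json


def primary_qmp_commands(secondary_host, nbd_port, export="colo1",
                         quorum_id="colo1", node_name="nbd_client1"):
    """Build the two primary-side QMP commands as Python dictionaries.

    Hypothetical helper: it only formats the commands shown above.
    """
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        "file.driver=nbd,"
        f"file.host={secondary_host},file.port={nbd_port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        # Create the replication node on top of an NBD client.
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        # Attach that node as a new child of the quorum device.
        {"execute": "x-blockdev-change",
         "arguments": {"parent": quorum_id, "node": node_name}},
    ]


# Placeholder address of the secondary host (documentation IP range).
for command in primary_qmp_commands("192.0.2.10", 8889):
    print(json.dumps(command))
```

json.dumps() emits double-quoted JSON, which is the standard form for
messages sent over a QMP socket.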

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=xxx,id=topxxx,driver=replication,mode=secondary,top-id=topxxx,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1

  Then run the following QMP commands in the secondary QEMU:
    { 'execute': 'nbd-server-start',
      'arguments': {
          'addr': {
              'type': 'inet',
              'data': {
                  'host': 'xxx',
                  'port': 'xxx'
              }
          }
      }
    }
    { 'execute': 'nbd-server-add',
      'arguments': {
          'device': 'colo1',
          'writable': true
      }
    }

  Note:
  1. The export name in the secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same on the primary
     and the secondary.
  3. The QMP commands nbd-server-start and nbd-server-add must be run
     before running the QMP command migrate on the primary QEMU.
  4. The active disk, the hidden disk and the NBD target should all have
     the same length.
  5. It is better to put the active disk and the hidden disk on a ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.
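
The secondary-side QMP commands can be generated in the same way. This is a
minimal sketch, assuming a hypothetical helper named
secondary_qmp_commands(); the listen address and port are placeholders, and
the port is passed as a string because it is quoted in the example above.

```python
import json


def secondary_qmp_commands(listen_host, nbd_port, device="colo1"):
    """Build the two secondary-side QMP commands as Python dictionaries.

    Hypothetical helper: it only formats the commands shown above.
    """
    return [
        # Start QEMU's embedded NBD server on the given address.
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": listen_host,
                                         "port": str(nbd_port)}}}},
        # Export the secondary disk (by its id) as a writable NBD export.
        {"execute": "nbd-server-add",
         "arguments": {"device": device, "writable": True}},
    ]


# Placeholder listen address and port.
for command in secondary_qmp_commands("0.0.0.0", 8889):
    print(json.dumps(command))
```

Per note 3, these commands have to be issued before migrate is run on the
primary QEMU, so a driver script would send them first.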

After Failover:
Primary:
  The secondary host is down, so we should run the following QMP commands
  to remove the NBD child from the quorum:
  { 'execute': 'x-blockdev-change',
    'arguments': {
        'parent': 'colo1',
        'child': 'children.1'
    }
  }
  { 'execute': 'human-monitor-command',
    'arguments': {
        'command-line': 'drive_del xxxx'
    }
  }
  Note: there is currently no QMP command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following QMP command:
  { 'execute': 'nbd-server-stop' }

TODO:
1. Continuous block replication
2. Shared disk