Features/BlockReplication: Difference between revisions
(→Design) |
|||
Line 69: | Line 69: | ||
== Architecture == | == Architecture == | ||
We are going to implement COLO block replication from many basic | We are going to implement COLO block replication from many basic | ||
blocks that are already in QEMU. | blocks that are already in QEMU. | ||
virtio-blk || | virtio-blk || | ||
Line 78: | Line 78: | ||
/ \ || | / \ || | ||
Primary 2 NBD -------> 2 NBD | Primary 2 NBD -------> 2 NBD | ||
disk client || server | disk client || server virtio-blk | ||
|| ^ | || ^ ^ | ||
--------. || | | --------. || | | | ||
Primary | || Secondary disk <--------- | Primary | || Secondary disk <--------- hidden-disk 4 <--------- active-disk 3 | ||
--------' || | --------' || | backing ^ backing | ||
|| | | | |||
|| | | | |||
|| '-------------------------' | |||
|| drive-backup sync=none | |||
1) The disk on the primary is represented by a block device with two | 1) The disk on the primary is represented by a block device with two | ||
Line 94: | Line 98: | ||
3) The disk on the secondary is represented by a custom block device | 3) The disk on the secondary is represented by a custom block device | ||
( | (called active-disk). It should be an empty disk, and the format should | ||
be qcow2. | |||
4) The hidden-disk is created automatically. It buffers the original content | |||
that is modified by the primary VM. It should also be an empty disk, and | |||
the dirver supports bdrv_make_empty(). | |||
== New block driver interface == | == New block driver interface == | ||
Line 109: | Line 116: | ||
c. bdrv_stop_replication() | c. bdrv_stop_replication() | ||
It is called when failover. We will flush the Disk buffer into | It is called when failover. We will flush the Disk buffer into | ||
Secondary Disk and stop block replication. | Secondary Disk and stop block replication. The vm should be stopped | ||
before calling it. | |||
== Usage == | == Usage == | ||
Primary: | Primary: | ||
-drive if=xxx,driver=quorum,read-pattern= | -drive if=xxx,driver=quorum,read-pattern=fifo,\ | ||
children.0.file.filename=1.raw,\ | children.0.file.filename=1.raw,\ | ||
children.0.driver=raw,\ | children.0.driver=raw,\ | ||
Line 120: | Line 128: | ||
children.1.file.port=xxx,\ | children.1.file.port=xxx,\ | ||
children.1.file.export=xxx,\ | children.1.file.export=xxx,\ | ||
children.1.driver=raw | children.1.driver=raw,\ | ||
children.1.ignore-errors=on | |||
Note: | Note: | ||
1. NBD Client should not be the first child of quorum. | 1. NBD Client should not be the first child of quorum. | ||
Line 128: | Line 137: | ||
Secondary: | Secondary: | ||
-drive if=xxx,driver= | -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \ | ||
-drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\ | |||
backing_reference.drive_id=nbd_target1,\ | |||
backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\ | |||
backing_reference.hidden-disk.driver=qcow2,\ | |||
backing_reference.hidden-disk.allow-write-backing-file=on | |||
Then run qmp command: | Then run qmp command: | ||
nbd_server_start host:port | nbd_server_start host:port | ||
Line 139: | Line 151: | ||
qmp command migrate on primary QEMU | qmp command migrate on primary QEMU | ||
3. Don't use nbd_server_start's other options | 3. Don't use nbd_server_start's other options | ||
4. Active disk, hidden disk and nbd target's length should be the | |||
same. | |||
5. It is better to put active disk and hidden disk in ramdisk. | |||
</pre> | </pre> |
Revision as of 09:14, 31 March 2015
Summary
The blkcolo block driver enables disk replication for continuous checkpoints. It is designed for COLO that Secondary VM is running. It can also be applied for FT/HA scene that Secondary VM is not running.
You can get the patches here: https://github.com/wencongyang/qemu-colo/commits/block-replication-v2
Design
Block replication ---------------------------------------- Copyright Fujitsu, Corp. 2015 Copyright (c) 2015 Intel Corporation Copyright (c) 2015 HUAWEI TECHNOLOGIES CO.,LTD. This work is licensed under the terms of the GNU GPL, version 2 or later. See the COPYING file in the top-level directory. The block replication is used for continuous checkpoints. It is designed for COLO that Secondary VM is running. It can also be applied for FT/HA scene that Secondary VM is not running. This document gives an overview of block replication's design. == Background == High availability solutions such as micro checkpoint and COLO will do consecutive checkpoint. The VM state of Primary VM and Secondary VM is identical right after a VM checkpoint, but becomes different as the VM executes till the next checkpoint. To support disk contents checkpoint, the modified disk contents in the Secondary VM must be buffered, and are only dropped at next checkpoint time. To reduce the network transportation effort at the time of checkpoint, the disk modification operations of Primary disk are asynchronously forwarded to the Secondary node. == Workflow == The following is the image of block replication workflow: +----------------------+ +------------------------+ |Primary Write Requests| |Secondary Write Requests| +----------------------+ +------------------------+ | | | (4) | V | /-------------\ | Copy and Forward | | |---------(1)----------+ | Disk Buffer | | | | | | (3) \-------------/ | speculative ^ | write through (2) | | | V V | +--------------+ +----------------+ | Primary Disk | | Secondary Disk | +--------------+ +----------------+ 1) Primary write requests will be copied and forwarded to Secondary QEMU. 2) Before Primary write requests are written to Secondary disk, the original sector content will be read from Secondary disk and buffered in the Disk buffer, but it will not overwrite the existing sector content in the Disk buffer. 3) Primary write requests will be written to Secondary disk. 4) Secondary write requests will be bufferd in the Disk buffer and it will overwrite the existing sector content in the buffer. == Architecture == We are going to implement COLO block replication from many basic blocks that are already in QEMU. virtio-blk || ^ || .---------- | || | Secondary 1 Quorum || '---------- / \ || / \ || Primary 2 NBD -------> 2 NBD disk client || server virtio-blk || ^ ^ --------. || | | Primary | || Secondary disk <--------- hidden-disk 4 <--------- active-disk 3 --------' || | backing ^ backing || | | || | | || '-------------------------' || drive-backup sync=none 1) The disk on the primary is represented by a block device with two children, providing replication between a primary disk and the host that runs the secondary VM. The read pattern for quorum can be extended to make the primary always read from the local disk instead of going through NBD. 2) The secondary disk receives writes from the primary VM through QEMU's embedded NBD server (speculative write-through). 3) The disk on the secondary is represented by a custom block device (called active-disk). It should be an empty disk, and the format should be qcow2. 4) The hidden-disk is created automatically. It buffers the original content that is modified by the primary VM. It should also be an empty disk, and the dirver supports bdrv_make_empty(). == New block driver interface == We add three block driver interfaces to control block replication: a. bdrv_start_replication() Start block replication, called in migration/checkpoint thread. We must call bdrv_start_replication() in secondary QEMU before calling bdrv_start_replication() in primary QEMU. b. bdrv_do_checkpoint() This interface is called after all VM state is transfered to Secondary QEMU. The Disk buffer will be dropped in this interface. c. bdrv_stop_replication() It is called when failover. We will flush the Disk buffer into Secondary Disk and stop block replication. The vm should be stopped before calling it. == Usage == Primary: -drive if=xxx,driver=quorum,read-pattern=fifo,\ children.0.file.filename=1.raw,\ children.0.driver=raw,\ children.1.file.driver=nbd+colo,\ children.1.file.host=xxx,\ children.1.file.port=xxx,\ children.1.file.export=xxx,\ children.1.driver=raw,\ children.1.ignore-errors=on Note: 1. NBD Client should not be the first child of quorum. 2. There should be only one NBD Client. 3. host is the secondary physical machine's hostname or IP 4. Each disk must have its own export name. Secondary: -drive if=none,driver=raw,file=1.raw,id=nbd_target1 \ -drive if=xxx,driver=qcow2+colo,file=active_disk.qcow2,export=xxx,\ backing_reference.drive_id=nbd_target1,\ backing_reference.hidden-disk.file.filename=hidden_disk.qcow2,\ backing_reference.hidden-disk.driver=qcow2,\ backing_reference.hidden-disk.allow-write-backing-file=on Then run qmp command: nbd_server_start host:port Note: 1. The export name for the same disk must be the same in primary and secondary QEMU command line 2. The qmp command nbd_server_start must be run before running the qmp command migrate on primary QEMU 3. Don't use nbd_server_start's other options 4. Active disk, hidden disk and nbd target's length should be the same. 5. It is better to put active disk and hidden disk in ramdisk.