Features/BlockReplication
Revision as of 05:52, 25 September 2015
Summary
The blkcolo block driver enables disk replication for continuous checkpoints. It is designed for COLO, where the Secondary VM is running. It can also be applied to FT/HA scenarios, where the Secondary VM is not running.
You can get the patches here: https://github.com/coloft/qemu/tree/wency/block-replication-v10
Design
Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2015
Copyright (c) 2015 Intel Corporation
Copyright (c) 2015 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping) where the Secondary VM is running.
It can also be applied for FT/HA (Fault-tolerance/High Assurance) scenario,
where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpoint and COLO will do
consecutive checkpoints. The VM state of the Primary VM and Secondary VM
is identical right after a VM checkpoint, but becomes different as the VM
executes till the next checkpoint. To support disk contents checkpoint,
the modified disk contents in the Secondary VM must be buffered, and are
only dropped at next checkpoint time. To reduce the network transportation
effort at the time of checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following is the image of block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests will be copied and forwarded to Secondary
       QEMU.
    2) Before Primary write requests are written to Secondary disk, the
       original sector content will be read from Secondary disk and
       buffered in the Disk buffer, but it will not overwrite the existing
       sector content (it could be from either "Secondary Write Requests" or
       previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests will be written to Secondary disk.
    4) Secondary write requests will be buffered in the Disk buffer and it
       will overwrite the existing sector content in the buffer.

== Architecture ==
We are going to implement block replication from many basic
blocks that are already in QEMU.

         virtio-blk       ||
             ^            ||                            .----------
             |            ||                            | Secondary
        1 Quorum          ||                            '----------
         /      \         ||
        /        \        ||
   Primary    2 filter
     disk         ^                                                             virtio-blk
                  |                                                                  ^
                3 NBD  ------->  3 NBD                                               |
                client    ||     server                                          2 filter
                          ||        ^                                                ^
--------.                 ||        |                                                |
Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                 ||        |          backing        ^       backing
                          ||        |                         |
                          ||        |                         |
                          ||        '-------------------------'
                          ||           drive-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
runs the secondary VM. The read pattern for quorum can be extended to
make the primary always read from the local disk instead of going through
NBD.

2) The new block filter (the name is replication) will control the block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and the format
should support bdrv_make_empty() and backing file.

5) The hidden-disk is created automatically. It buffers the original content
that is modified by the primary VM. It should also start as an empty disk,
and the driver should support bdrv_make_empty() and backing file.

6) The drive-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. So before block replication,
the primary disk and secondary disk should contain the same data.

== Failure Handling ==
There are 6 internal errors when block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
In case 1 and 5, we just report the error to the disk layer. In case 2, 3,
4 and 6, we just report block replication's error to FT/HA manager (which
decides when to do a new checkpoint, when to do failover).
There is no internal error when doing failover.

== New block driver interface ==
We add three block driver interfaces to control block replication:
a. bdrv_start_replication()
   Start block replication, called in migration/checkpoint thread.
   We must call bdrv_start_replication() in secondary QEMU before
   calling bdrv_start_replication() in primary QEMU. The caller
   must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
b. bdrv_do_checkpoint()
   This interface is called after all VM state is transferred to
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in migration/checkpoint
   thread.
c. bdrv_stop_replication()
   It is called on failover. We will flush the Disk buffer into
   Secondary Disk and stop block replication. The VM should be stopped
   before calling it if you use this API to shut down the guest, or for
   any purpose other than failover. The caller must hold the I/O mutex
   lock if it is in migration/checkpoint thread.
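
The Disk buffer rules from the Workflow section, together with what the checkpoint and stop interfaces do to the buffer, can be illustrated with a small model. This is only a sketch in Python, not QEMU code: the class `SecondaryModel` and its method names are hypothetical, sectors are modelled as dictionary entries, and a single dictionary plays the role of the Disk buffer (which the architecture splits into active-disk and hidden-disk).

```python
class SecondaryModel:
    """Toy model of the Secondary disk plus the Disk buffer (hypothetical names)."""

    def __init__(self, disk):
        self.disk = dict(disk)   # sector -> content of the Secondary disk
        self.buffer = {}         # sector -> content held in the Disk buffer

    def primary_write(self, sector, data):
        # Workflow step 2: save the original sector content first, but never
        # overwrite content that is already in the buffer.
        if sector not in self.buffer:
            self.buffer[sector] = self.disk.get(sector)
        # Workflow step 3: speculative write-through to the Secondary disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Workflow step 4: Secondary writes go to the buffer and overwrite
        # whatever is already buffered for that sector.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # The Secondary VM sees buffered content first, then the disk.
        if sector in self.buffer:
            return self.buffer[sector]
        return self.disk.get(sector)

    def do_checkpoint(self):
        # Mirrors the description of bdrv_do_checkpoint(): drop the buffer.
        self.buffer.clear()

    def stop_replication(self):
        # Mirrors the description of bdrv_stop_replication() on failover:
        # flush the buffer into the Secondary disk.
        self.disk.update(self.buffer)
        self.buffer.clear()


sec = SecondaryModel({0: "a", 1: "b"})
sec.primary_write(0, "A")      # original "a" is buffered, disk now holds "A"
sec.secondary_write(1, "s")    # Secondary VM modifies sector 1
sec.primary_write(1, "B")      # buffered "s" is kept, disk now holds "B"
view_before = [sec.secondary_read(0), sec.secondary_read(1)]
sec.do_checkpoint()            # both sides are identical again
view_after = [sec.secondary_read(0), sec.secondary_read(1)]
```

In this model the Secondary VM keeps seeing its own state between checkpoints, while the Secondary disk already holds the Primary's writes, so a checkpoint only has to drop the buffer.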
== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run qmp command in primary qemu:
    { 'execute': 'blockdev-add',
      'arguments': {
          'options': {
              'driver': 'replication',
              'mode': 'primary',
              'node-name': 'nbd_client1',
              'file': {
                  'host': 'xxx',
                  'port': 'xxx',
                  'export': 'colo1',
                  'driver': 'nbd'
              }
          }
      }
    }
    { 'execute': 'x-child-add',
      'arguments': {
          'parent': 'colo1',
          'child': 'nbd_client1'
      }
    }
  Note:
  1. There should be only one NBD Client for each primary disk.
  2. host is the secondary physical machine's hostname or IP
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive and you should ignore the
     leading whitespace.
  5. These qmp commands must be run after the qmp commands in
     secondary qemu have been run.

Secondary:
  -drive if=none,driver=raw,file=/dev/null,id=colo1 \
  -drive if=xxx,driver=replication,mode=secondary,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.allow-write-backing-file=on,\
         file.backing.backing.file.filename=1.raw,\
         file.backing.backing.driver=raw,\
         file.backing.backing.allow-write-backing-file=on,\
         file.backing.backing.node-name=secondary-disk1

  Then run qmp command in secondary qemu:
    { 'execute': 'blockdev-remove-medium',
      'arguments': {
          'device': 'colo1'
      }
    }
    { 'execute': 'blockdev-insert-medium',
      'arguments': {
          'device': 'colo1',
          'node-name': 'secondary-disk1'
      }
    }
    { 'execute': 'nbd-server-start',
      'arguments': {
          'addr': {
              'type': 'inet',
              'data': {
                  'host': 'xxx',
                  'port': 'xxx'
              }
          }
      }
    }
    { 'execute': 'nbd-server-add',
      'arguments': {
          'device': 'colo1',
          'writable': true
      }
    }

  Note:
  1. The export name in secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same
  3. The qmp command nbd-server-start and nbd-server-add must be run
     before running the qmp command migrate on primary QEMU
  4. Active disk, hidden disk and nbd target's length should be the
     same.
  5. It is better to put active disk and hidden disk in ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.

After Failover:
Primary:
  The secondary host is down, so we should run the following qmp command
  to remove the nbd child from the quorum:
  { 'execute': 'child-del',
    'arguments': {
        'parent': 'colo1',
        'child': 'nbd_client1'
    }
  }
  Note: there is currently no qmp command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following qmp commands:
  { 'execute': 'nbd-server-stop',
    'arguments': { }
  }
  { 'execute': 'blockdev-remove-medium',
    'arguments': {
        'device': 'colo1'
    }
  }

TODO:
1. Continuous block replication
2. Shared disk
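
When several disks are replicated, the qmp commands from the Usage section can be generated instead of typed by hand. The following is a minimal sketch in Python that only builds and prints the command text; it does not talk to QEMU. The function names, the example host `192.0.2.10` and the port `8889` are assumptions made for illustration, and the command names and arguments are copied from this revision of the document, so they may not match other QEMU versions.

```python
import json


def secondary_qmp_commands(host, port, device="colo1",
                           node_name="secondary-disk1"):
    """Commands to run in secondary qemu, in the order given under Usage."""
    return [
        {"execute": "blockdev-remove-medium",
         "arguments": {"device": device}},
        {"execute": "blockdev-insert-medium",
         "arguments": {"device": device, "node-name": node_name}},
        {"execute": "nbd-server-start",
         "arguments": {"addr": {"type": "inet",
                                "data": {"host": host, "port": port}}}},
        {"execute": "nbd-server-add",
         "arguments": {"device": device, "writable": True}},
    ]


def primary_qmp_commands(host, port, export="colo1", parent="colo1",
                         node_name="nbd_client1"):
    """Commands to run in primary qemu, after the secondary ones."""
    return [
        {"execute": "blockdev-add",
         "arguments": {"options": {
             "driver": "replication",
             "mode": "primary",
             "node-name": node_name,
             "file": {"host": host, "port": port,
                      "export": export, "driver": "nbd"}}}},
        {"execute": "x-child-add",
         "arguments": {"parent": parent, "child": node_name}},
    ]


# Hypothetical address of the secondary host, used for both sides.
secondary_cmds = secondary_qmp_commands("192.0.2.10", "8889")
primary_cmds = primary_qmp_commands("192.0.2.10", "8889")

# Secondary commands are printed first because they must be run first.
for cmd in secondary_cmds + primary_cmds:
    print(json.dumps(cmd))
```

The export name passed to `primary_qmp_commands` has to equal the device id used on the secondary side, which is why both default to `colo1` here.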