Features/BlockReplication
Summary
The blkcolo block driver enables disk replication for continuous checkpoints. It is designed for COLO, where the Secondary VM is running. It can also be used in FT/HA scenarios where the Secondary VM is not running.
Design
Disk replication using blkcolo
----------------------------------------
Copyright Fujitsu, Corp. 2014

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

The blkcolo block driver enables disk replication for continuous checkpoints.
It is designed for COLO, where the Secondary VM is running. It can also be
used in FT/HA scenarios where the Secondary VM is not running.

This document gives an overview of blkcolo's design.

== Background ==
High availability solutions such as micro checkpoint and COLO take
consecutive checkpoints. The VM state of the Primary VM and the Secondary VM
is identical right after a VM checkpoint, but diverges as the VMs execute
until the next checkpoint. To checkpoint disk contents, the disk contents
modified by the Secondary VM must be buffered, and are dropped only at the
next checkpoint. To reduce the amount of data sent over the network at
checkpoint time, write operations on the Primary disk are forwarded
asynchronously to the Secondary node.

== Disk Buffer ==
The following is the image of the Disk buffer:

    +----------------------+            +------------------------+
    |Primary Write Requests|            |Secondary Write Requests|
    +----------------------+            +------------------------+
              |                                       |
              |                                      (4)
              |                                       V
              |                              /-------------\
              | Copy and Forward             |             |
              |---------(1)----------+       | Disk Buffer |
              |                      |       |             |
              |                     (3)      \-------------/
              |                 speculative      ^
              |                write through    (2)
              |                      |           |
              V                      V           |
       +--------------+           +----------------+
       | Primary Disk |           | Secondary Disk |
       +--------------+           +----------------+

1) Primary write requests are copied and forwarded to the Secondary QEMU.
2) Before a Primary write request is written to the Secondary disk, the
   original sector content is read from the Secondary disk and buffered in
   the Disk buffer, but it does not overwrite sector content that already
   exists in the Disk buffer.
3) Primary write requests are written to the Secondary disk.
4) Secondary write requests are buffered in the Disk buffer and overwrite
   any existing sector content in the buffer.

== Capture I/O request ==
blkcolo is a new block driver protocol, so all I/O requests can be captured
in the driver interfaces bdrv_co_readv()/bdrv_co_writev().

== Checkpoint & failover ==
blkcolo buffers write requests in the Secondary QEMU. The buffer is dropped
at a checkpoint, or flushed to the Secondary disk on failover. Three block
driver interfaces are added to do this:
a. bdrv_wait_recv_completed()
   This interface may block. It returns once all Primary write requests have
   been forwarded to the Secondary QEMU.
b. bdrv_handle_checkpoint()
   This interface is called after all VM state has been transferred to the
   Secondary QEMU. The Disk buffer is dropped in this interface.
c. bdrv_cancel_checkpoint()
   This interface is called on failover. It flushes the Disk buffer into the
   Secondary disk and stops disk replication.

== Usage ==
On both the Primary and the Secondary host, invoke QEMU with the following
parameters:
    "-drive file=blkcolo:host:port:/path/to/image"
a. host
   Hostname or IP of the Secondary host.
b. port
   The Secondary QEMU listens on this port, and the Primary QEMU connects to
   this port.
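
The Disk buffer rules (steps 2 to 4 under "Disk Buffer") and the checkpoint and failover behaviour described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not blkcolo's implementation: the class DiskBufferSketch, its method names, the dictionary-based disk, and the 512-byte sector size are all hypothetical. The read path in particular is an assumption, since this document does not describe how Secondary reads are served.

```python
SECTOR_SIZE = 512  # assumed sector size, for illustration only


class DiskBufferSketch:
    """Toy model of the Secondary side: a disk plus the Disk buffer.

    Both are modelled as dicts mapping sector number -> bytes.
    All names here are hypothetical and do not exist in QEMU.
    """

    def __init__(self, secondary_disk):
        self.disk = secondary_disk
        self.buffer = {}

    def _read_disk(self, sector):
        # Sectors never written are treated as zero-filled.
        return self.disk.get(sector, bytes(SECTOR_SIZE))

    def primary_write(self, sector, data):
        # Step 2: save the original content first, but never overwrite
        # content that is already in the Disk buffer.
        if sector not in self.buffer:
            self.buffer[sector] = self._read_disk(sector)
        # Step 3: write the forwarded Primary request to the disk.
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Step 4: Secondary writes go to the buffer only and replace
        # whatever the buffer already holds for that sector.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # Assumption (not stated in this document): reads prefer the
        # buffer, so the Secondary VM keeps seeing its own disk state.
        if sector in self.buffer:
            return self.buffer[sector]
        return self._read_disk(sector)

    def checkpoint(self):
        # Mirrors the described role of bdrv_handle_checkpoint():
        # drop the Disk buffer.
        self.buffer.clear()

    def failover(self):
        # Mirrors the described role of bdrv_cancel_checkpoint():
        # flush the Disk buffer into the Secondary disk.
        self.disk.update(self.buffer)
        self.buffer.clear()
```

In this model, a Primary write changes the Secondary disk immediately, yet the Secondary VM still reads the pre-write content from the buffer until the next checkpoint drops it. On failover the buffered content, including saved originals, is written back, so the Secondary disk matches what the Secondary VM has been seeing.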