Features/LiveBlockMigration


Future qemu is expected to support these features (some already implemented):

Live features

Live block copy

  Ability to copy one or more virtual disks from the source backing file/block 
  device to a new target that is accessible by the host. The copy is 
  supposed to be executed while the VM runs, in a transparent way.

Live snapshots and live snapshot merge

  Live snapshot is already incorporated (by Jes) in qemu (still needs 
  virt-agent work to freeze the guest FS).
  Live snapshot merge is required in order to reduce the overhead 
  caused by the additional snapshots (sometimes over a raw device).
  We'll use live copy to do the live merge.

Image streaming (Copy on read)

  Ability to start guest execution while the parent image resides 
  remotely, with each block access replicated to a local copy (image 
  format snapshot).
  Such functionality can be hooked together with live block migration 
  instead of the 'post copy' method.

Live block migration (pre/post)

  Beyond live block copy we'll sometimes need to move both the storage 
  and the guest. There are two main approaches here:
  - pre copy
    First live copy the image and only then live migrate the VM.
    It is the simpler and safer approach in terms of the management app, but if the 
    purpose of the whole live block migration was to balance the cpu load, it 
    won't be practical to use since copying an image of 100GB will take too long.
  - post copy (streaming / copy on read)
    First live migrate the VM, then online stream its blocks.
    It's a better approach for HA/load balancing but it might make 
    management complex (need to keep the source VM alive, handle failures)
  In addition there are two cases for the storage access:
  1. Shared storage
     Live block copy enables this capability; it seems like a rare case for live block migration.
  2. Non-shared storage
     There are some cases where there is no NFS/SAN storage and live migration is needed.
     It should be similar to VMware's Storage VMotion.
     http://www.vmware.com/files/pdf/VMware-Storage-VMotion-DS-EN.pdf
     http://www.vmware.com/products/storage-vmotion/features.html

Using external dirty block bitmap

  FVD has an option to use an external dirty block bitmap file in 
  addition to the regular mapping/data files.
  We can consider using it for live block migration and live merge too.
  It can also allow additional usage by 3rd party tools to calculate 
  diffs between the snapshots.
  There is a big downside though, since it will make management 
  complicated and there is the risk of the image and its bitmap file 
  getting out of sync. It's a much better choice to have the qemu-img 
  tool be the single interface to the dirty block bitmap data.

Solutions

Non shared storage

  Either use ISCSI (target and initiator), NBD, or a proprietary qemu solution.
  ISCSI can be used externally to the QEMU level; QEMU will access the ISCSI exports as local devices.
  For a COW file chain, we need to create an ISCSI LUN (export) for every COW file in the chain.
  The file name and directory structure needs to be preserved for QEMU (for example, if all the files are in the same directory,
  it needs to appear the same on the destination).
  The images exported over ISCSI should be read-only!

Live block migration

  Use the streaming approach + regular live migration + iscsi:  
  Execute regular live migration and at the end of it, start streaming.
  If there is no shared storage, use the external iscsi and behave as if the image is 
  local. At the end of the streaming operation there will be a new local base image.

Block mirror layer

  Was invented in order to duplicate write IOs to both the source and destination images.
  It prevents the potential race when both qemu and the management app crash at the end of the
  block copy stage and it is unknown whether management should pick the source or the
  destination.

Streaming

  No need for mirror since only the destination changes and is writeable.

Block copy background task

  Can be shared between block copy and streaming

Live snapshot

  It can be seen as a (local) stream that preserves the current COW chain

Use cases

1. Basic streaming, single base master image on source storage, needs to be instantiated on 
   destination storage
    The base image is a single level COW format (file or lvm).
    The base is RO and only the new destination is RW. base' is empty at the beginning.
    The base image content is being copied in the background to base'. At the end of the 
    operation, base' is a standalone image w/o depending on the base image.
    a. Case of a shared storage streaming guest boot
    Before:           src storage: base             dst storage: none
    After             src storage: base             dst storage: base'
    b. Case of no shared storage streaming guest boot
       Everything is the same; we use an external iscsi target on the src host and an external 
       iscsi initiator on the destination host.
       Qemu boots from the destination by using the iscsi access. This is transparent to 
       qemu (except for a command syntax change). Once the streaming is 
       over, we can live drop the usage of iscsi and open the image directly (some sort of 
       null live copy)
    c. Live block migration (using streaming) w/ shared storage.
       Exactly like 1.a. First create the destination image, then we run live migration 
       there w/o data in the new image. Now we stream like the boot scenario.
    d. Live block migration (using streaming) w/o shared storage.
       Like 1.b. + 1.c.
    *** There is complexity in handling multiple block devices belonging to the same VM.
    *** Management will need to track each stream finish event and manage failures accordingly.
2. Basic streaming of raw files/devices
   Here we have an issue - what happens if there is a failure in the middle?
   A regular COW can sustain a failure since the intermediate base' contains dirty 
   block bitmap information. Such an intermediate raw base' image will be broken. We 
   cannot revert back to the original base and start over because new writes were written 
   only to base'.
   Approaches:
   a. Don't support that
   b. Use intermediate COW image and then live copy it into raw (waste time, IO, space)
      One can easily add a new COW over the source and continue from there.
   c. Use external metadata of dirty-block-bitmap even for raw
   Suggestion: at this stage, do either recommendation #a or #b


3. Basic live copy, single base master image on source storage, needs to be copied to the 
   destination storage
   The base image is a single level COW format or a raw file/device.
   The base image content is being copied in the background to base'. At the end of the 
   operation, base' is a standalone image w/o depending on the base image.
   In this case we only take into account a running VM; there is no need to do that for the boot stage. 
   So it is either a VM running locally and about to change its storage, or
   a VM live migration.
   The plan is to use the mirror driver approach. Both src/dst are writable.
    a. Case of a shared storage, a VM changes its block device
    Before:           src storage: base             dst storage: none
    After             src storage: base             dst storage: base'
    This is a plain live copy w/o moving the VM.
    The case w/o shared storage seems not relevant here.
    We might want to move multiple block devices of the VM.
    It is written here for completeness - it shouldn't change anything.
    Still management/events will use the block name/id.
    b. Live block migration (w/o streaming) w/ shared storage.
       Unlike in the streaming case, the order here is reversed:
       Run live copy. When it ends and we're in the mirror state, run live migration.
       When it ends, stop the mirroring and make the VM continue on the destination.
       That's probably a rare use case.
    c. Live block migration (using streaming) w/o shared storage.
       Like 3.b. by using external iscsi


4. COW chains that preserve the full structure
   Before:           src storage: base <- sn1 <- snx       destination storage: none
   After             src storage: base <- sn1 <- snx       destination storage: base' <- sn1' <- snx'
    All of the original snapshot chains should be copied or streamed as-is to the new 
    storage. With copying we can do all of the non-leaf images using standard 'cp' tools.
    If we're to use iscsi, we'll need to create N such connections.
    Probably not a common use case for streaming; we might ignore this and use this scenario only for 
    copying.
5. Like 4. but the chain can collapse. In fact this is a special case of #4
    
    Before:src storage: base <- sn1 <- sn2 .. <- snx  dst storage: none
    After: src storage: base <- sn1 <- sn2 ...<- snx  dst storage: base' <- sn1' .. <- sny'
   
   There is no difference from #4 other than collapsing part of the chain into the dst leaf
6. Live snapshot
   It's here since the interface can be similar. Basically it is similar to live copy but 
   instead of copying, we switch to another COW on top. The only (separate) addition would 
   be to add a verb to ask the guest to flush its file systems.
   Before:           storage: base <- s1 <- sx        
   After             storage: base <- s1 <- sx <- sx+1

Exceptions

1. Hot unplug of the relevant disk
   Prevent that (or cancel the operation).
2. Live migration in the middle of a non-migration action from above
   Shall we allow it? It can work, but at the end of live migration we need to reopen the
   images (NFS mainly), which might add unneeded complexity.
   We'd better prevent that.

Interface

Image streaming API

The stream commands populate an image file by streaming data from its backing file. Once all blocks have been streamed, the dependency on the original backing image is removed. The stream commands can be used to implement post-copy live block migration and rapid deployment.

The block_stream command starts streaming the image file. Streaming is performed in the background while the guest is running.

When streaming completes successfully or with an error, the BLOCK_JOB_COMPLETED event is raised.

The progress of a streaming operation can be polled using query-block-jobs. This returns information regarding how much of the image has been streamed for each active streaming operation.

The block_job_cancel command stops streaming the image file. The image file retains its backing file. A new streaming operation can be started at a later time.

The command synopses are as follows:

block_stream
------------

Copy data from a backing file into a block device.

The block streaming operation is performed in the background until the entire
backing file has been copied.  This command returns immediately once streaming
has started.  The status of ongoing block streaming operations can be checked
with query-block-jobs.  The operation can be stopped before it has completed
using the block_job_cancel command.

If a base file is specified then sectors are not copied from that base file and
its backing chain.  When streaming completes the image file will have the base
file as its backing file.  This can be used to stream a subset of the backing
file chain instead of flattening the entire image.

On successful completion the image file is updated to drop the backing file.

Arguments:

- device: device name (json-string)
- base:   common backing file (json-string, optional)

Errors:

DeviceInUse:    streaming is already active on this device
DeviceNotFound: device name is invalid
NotSupported:   image streaming is not supported by this device

Events:

On completion the BLOCK_JOB_COMPLETED event is raised with the following
fields:

- device:   device name (json-string)
- end:      maximum progress value (json-int)
- position: current progress value (json-int)
- error:    error message (json-string, only on error)

The completion event is raised both on success and on failure.  On
success position is equal to end.  On failure position and end can be
used to indicate at which point the operation failed.
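A management application can tell success from failure by checking for the optional error field. A minimal sketch of such an event handler (the field names follow the synopsis above):

```python
def on_block_job_completed(data):
    """Interpret the data fields of a BLOCK_JOB_COMPLETED event.

    The 'error' field is present only on failure; on success
    'position' equals 'end'.
    """
    if "error" in data:
        return False, "%s failed at %d of %d: %s" % (
            data["device"], data["position"], data["end"], data["error"])
    return True, "%s completed" % data["device"]

print(on_block_job_completed(
    {"device": "virtio0", "end": 10737418240, "position": 10737418240}))
```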

Examples:

-> { "execute": "block_stream", "arguments": { "device": "virtio0" } }
<- { "return":  {} }

block_stream_set_speed
----------------------

Set the maximum speed for a block streaming operation.

This is a per-block device setting and also affects the active image
streaming operation (if any).

Throttling can be disabled by setting the speed to 0.

Arguments:

- device: device name (json-string)
- value:  maximum speed, in bytes per second (json-int)

Example:

-> { "execute": "block_stream_set_speed",
     "arguments": { "device": "virtio0", "value": 1024 } }
<- { "return": {} }

block_job_cancel
----------------

Stop an active block streaming operation.

This command returns once the active block streaming operation has been
stopped.  It is an error to call this command if no operation is in progress.

The image file retains its backing file unless the streaming operation happens
to complete just as it is being cancelled.

A new block streaming operation can be started at a later time to finish
copying all data from the backing file.

Arguments:

- device: device name (json-string)

Errors:

DeviceNotActive: streaming is not active on this device
DeviceInUse:     cancellation already in progress

Examples:

-> { "execute": "block_job_cancel", "arguments": { "device": "virtio0" } }
<- { "return":  {} }

query-block-jobs
----------------

Show progress of ongoing block device operations.

Return a json-array of all block device operations.  If no operation is
active then return an empty array.  Each operation is a json-object with the
following data:

- type:     job type ("stream" for image streaming, json-string)
- device:   device name (json-string)
- end:      maximum progress value (json-int)
- position: current progress value (json-int)

Progress can be observed as position increases and it reaches end when
the operation completes.  Position and end have undefined units but can be
used to calculate a percentage indicating the progress that has been made.

Example:

-> { "execute": "query-block-jobs" }
<- { "return":[
      { "type": "stream", "device": "virtio0",
        "end": 10737418240, "position": 709632}
   ]
 }
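Since position and end have undefined units, a client derives progress as a ratio. A minimal sketch, using the reply shape shown in the example above:

```python
def stream_progress(job):
    """Percentage complete for one query-block-jobs entry.

    'position' and 'end' have undefined units, but their ratio is a
    meaningful progress fraction.
    """
    if job["end"] == 0:
        return 100.0  # degenerate job: nothing to stream
    return 100.0 * job["position"] / job["end"]

job = {"type": "stream", "device": "virtio0",
       "end": 10737418240, "position": 709632}
print("%.4f%%" % stream_progress(job))
```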

How live block copy works

Live block copy does the following:

  1. Create and switch to the destination file: snapshot_blkdev virtio-blk0 destination.$fmt $fmt
  2. Stream the base into the image file: block_stream -a virtio-blk0
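The two steps above can be sketched as QMP messages. This is a sketch only: the 'snapshot-file' and 'format' argument names are assumptions modelled on this document's synopses, not verified against a specific qemu release.

```python
import json

def build_live_block_copy_cmds(device, dest_file, fmt):
    """Build the two-step QMP sequence for live block copy:
    first switch the device to a new destination overlay, then
    stream the backing data into it."""
    return [
        json.dumps({"execute": "snapshot_blkdev",
                    "arguments": {"device": device,
                                  "snapshot-file": dest_file,
                                  "format": fmt}}),
        json.dumps({"execute": "block_stream",
                    "arguments": {"device": device}}),
    ]

cmds = build_live_block_copy_cmds("virtio-blk0", "destination.qcow2", "qcow2")
for cmd in cmds:
    print(cmd)
```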

We can use existing Linux ISCSI target and initiator packages without the need to integrate them into QEMU.

ISCSI for non shared storage

The target and initiator will need to be configured (management) according to the image layout. The firewall will also need to be configured to allow ISCSI (management) - open port 3260.

Single Image with ISCSI
-----------------------
The ISCSI target will use the source image as backing storage.
We can allow read-only or read-write access for the source/destination according to the action,
by using different users with different permissions.
The source will use the image as a local ISCSI initiator.
The destination will use the image as a remote ISCSI initiator.
Base image/file chain with ISCSI
---------------------------------
For access to the different files in the chain, each file will be mapped to a different LUN of the ISCSI target.
LUNs will be enumerated according to the chain order (base=1, ...).
All image layout logic is handled by the format layer (qcow, qed ...); this layer should see the files as local.
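The enumeration rule above can be sketched as a trivial mapping; the chain paths in the demonstration are hypothetical:

```python
def assign_luns(chain):
    """Map each file in a COW chain to an ISCSI LUN number,
    base first (base=1, ...), matching the enumeration above."""
    return {path: lun for lun, path in enumerate(chain, start=1)}

# Hypothetical chain paths for illustration:
print(assign_luns(["/images/base.img", "/images/sn1.img", "/images/sn2.img"]))
```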

Example Single image

ISCSI Target configuration (tgt)
---------------------------------------------------------------
starting tgtd
service tgtd start

create the ISCSI target 
tgtadm --lld iscsi --mode target --op new --tid=1 --targetname iqn.2009-02.com.example:for.all

allow the destination to connect to the address 
tgtadm --lld iscsi --mode target --op bind --tid=1 --initiator-address=<destination address>
  
Add user/password for destination
tgtadm --lld iscsi --mode account --op new --user consumer --password Longsw0rd

map the lun to the image as readonly
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1  --bsoflags=direct --backing-store <image> --params readonly=1

Example ISCSI initiator configuration 
-------------------------------------
config target user/password (can also be done in config file)
iscsiadm -m node --targetname <target name>  --portal <ip:3260> --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm -m node --targetname <target name>  --portal <ip:3260> --op=update --name node.session.auth.username --value=someuser
iscsiadm -m node --targetname <target name>  --portal <ip:3260> --op=update --name node.session.auth.password --value=secret
 
login
iscsiadm -m node --targetname <targetname> --portal <ip:3260> --login 

The image will be added as a new partition (for example /dev/sdb); we can use it as a local image file.

Example single base master image

ISCSI target
--------------
(The target configuration is similar to the single image example)

map lun 1 to the base image as readonly
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1  --bsoflags=direct --backing-store <baseimage> --params readonly=1

map lun 2 to the image
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 2  --bsoflags=direct --backing-store <image> --params readonly=1

initiator
----------
Configuration is the same as in Single image use case.

There will be two new partitions, for example /dev/sdb and /dev/sdc.
The first is the base and the second is the image.

In order to use the image we need to link the base to the original file name (as it was during image creation).
For example, an image:

qemu-img info /dev/sdc
image: /dev/sdc
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: 0
cluster_size: 65536
backing file: /images/base.img (actual path: /images/base.img)

link the base with the same name:
ln -s /dev/sdb /images/base.img

We can use /dev/sdc as a local image.
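A management script could automate the link step above. A minimal sketch; the demonstration uses throwaway temporary paths instead of real /dev nodes and /images paths, which are illustrative:

```python
import os
import tempfile

def link_backing_files(devices, backing_paths):
    """Recreate each image's recorded backing-file path as a symlink
    to the corresponding LUN device (the 'ln -s /dev/sdb
    /images/base.img' step above)."""
    for dev, path in zip(devices, backing_paths):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        os.symlink(dev, path)

# Demonstrate with throwaway files instead of real /dev nodes:
tmp = tempfile.mkdtemp()
dev = os.path.join(tmp, "sdb")
open(dev, "w").close()
target = os.path.join(tmp, "images", "base.img")
link_backing_files([dev], [target])
print(os.path.islink(target))
```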

Common implementation for live block copy and image streaming

Requirements

  • Live block copy:
    • Open both source and destination read-write
    • Mirror writes to source and destination
  • Image streaming:
    • Source must be read-only
  • Be able to do a "partial" copy, i.e. share a backing file and don't copy data that can be read from this backing file
  • Use the COW provided by image formats, so that a crash doesn't force you to restart the whole copy
    • Need a raw-cow driver that uses an in-memory COW bitmap for a raw image, so that we can use raw output images
    • Anything that can do COW is good enough, so the implementation should be independent of the format
  • libvirt already has implemented block_stream QMP commands, so let's keep them unchanged if possible

Building blocks

  1. Copy on read
    • Improves time it takes to complete the copy
    • Allows simpler implementation of copy
  2. Background task doing the copy (by reading the whole image sequentially)
  3. Mirroring of writes to source and destination
    • source is a backing file of destination, i.e. we need to support writable backing files
  4. Wrappers for live block copy and image streaming that enable the above features
  5. Switching the backing file after the copy has completed (separate monitor command like in block copy?). Ends the mirroring if enabled.

Example (live block copy)

This section tries to explain by example what happens when you perform a live block copy. To make things interesting, we have a backing file (base) that is shared between source and destination, and two overlays (sn1 and sn2) that are only present on the source and must be copied.

So initially the backing file chain looks like this:

 base <- sn1 <- sn2

We start by creating the copy as a new image on top of sn2:

 base <- sn1 <- sn2 <- copy

This gets all reads right automatically. For writes we use a mirror mechanism that redirects all writes to both sn2 and copy, so that sn2 and copy read the same at any time (and this mirror mechanism is really the only difference of live block copy from image streaming).

We start a background task that loops over all clusters. For each cluster, there are the following possible cases:

  1. The cluster is already allocated in copy. Nothing to do.
  2. Use something like bdrv_is_allocated() to follow the backing file chain. If the cluster is read from base (which is shared), nothing to do.
    • For image streaming we're already on the destination. If we don't have shared storage and the protocol doesn't provide this bdrv_is_allocated() variant, we can still do something like qemu-img rebase and just compare the read data with data in base in order to decide if we need to copy.
  3. Otherwise copy the cluster into copy

When the copy has completed, the backing file of copy is switched to base (in a qemu-img rebase -u way).
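The per-cluster decision above can be sketched as follows. The allocation callback is a hypothetical stand-in for bdrv_is_allocated(), not a real qemu API, and the toy chain mirrors the base <- sn1 <- sn2 <- copy example:

```python
def copy_clusters(num_clusters, allocated_in, read_cluster, write_cluster):
    """Sketch of the background copy task's per-cluster decision.

    allocated_in(image, cluster) stands in for bdrv_is_allocated();
    it and the other callbacks are assumptions, not real qemu APIs.
    """
    copied = []
    for c in range(num_clusters):
        if allocated_in("copy", c):        # case 1: already in copy
            continue
        # case 2: walk the backing chain; data served from the shared
        # 'base' image does not need to be copied
        source = next(img for img in ("sn2", "sn1", "base")
                      if allocated_in(img, c))
        if source == "base":
            continue
        write_cluster(c, read_cluster(c))  # case 3: copy the cluster
        copied.append(c)
    return copied

# Toy chain: cluster 0 already in copy, 1 only in base, 2 in sn1
alloc = {"copy": {0}, "sn2": set(), "sn1": {2}, "base": {0, 1, 2}}
result = copy_clusters(3, lambda img, c: c in alloc[img],
                       lambda c: b"data", lambda c, d: None)
print(result)
```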

Links