Features/LiveBlockMigration
Future qemu is expected to support these features (some already implemented):
Live features
Live block copy
Ability to copy one or more virtual disks from the source backing file/block device to a new target that is accessible by the host. The copy is supposed to run transparently while the VM is running.
Live snapshots and live snapshot merge
Live snapshot is already incorporated (by Jes) in qemu (still need qemu-agent work to freeze the guest FS).
Live snapshot merge is required in order to reduce the overhead caused by the additional snapshots (sometimes over a raw device). It is currently not implemented for a live running guest. We'll use live copy to do the live merge.
Copy on read (image streaming)
Ability to start guest execution while the parent image resides remotely; each block access is replicated to a local copy (image format snapshot).
Such functionality can be hooked together with live block migration instead of the 'post copy' method.
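The copy-on-read idea can be illustrated with a minimal sketch, assuming fixed-size clusters and plain dictionaries standing in for the remote parent image and the local copy. The class and method names are hypothetical and are not QEMU APIs.

```python
class CopyOnReadImage:
    """Local image that fills itself from a remote parent on first read."""

    def __init__(self, parent_clusters):
        # parent_clusters: dict of cluster index -> bytes (stand-in for the remote image)
        self.parent = parent_clusters
        self.local = {}          # clusters already present in the local copy
        self.remote_reads = 0    # number of reads that had to go to the parent

    def read(self, index):
        if index not in self.local:
            # First access: fetch from the parent and keep a local copy.
            self.remote_reads += 1
            self.local[index] = self.parent[index]
        return self.local[index]

    def write(self, index, data):
        # Guest writes only touch the local copy; the parent stays read-only.
        self.local[index] = data


image = CopyOnReadImage({0: b"boot", 1: b"data"})
image.read(0)
image.read(0)            # second read is served from the local copy
image.write(1, b"new")   # cluster 1 is never fetched from the parent
```

After these calls only one read reached the parent, and the parent image is unchanged.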
Live block migration (pre/post)
Beyond live block copy we'll sometimes need to move both the storage and the guest. There are two main approaches here:
- pre copy: First live copy the image and only then live migrate the VM. It is simple, but if the purpose of the whole live block migration was to balance the CPU load, it won't be practical, since copying a 100GB image will take too long.
- post copy (streaming / copy on read): First live migrate the VM, then live copy its blocks. It's a better approach for HA/load balancing, but it might make management complex (need to keep the source VM alive, what happens on failures?)
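To give a rough feel for why pre copy is slow for load balancing, here is a back-of-the-envelope estimate. The sustained throughput of 100 MiB/s is purely an assumption for illustration, and the image size is taken as 100 GiB.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def copy_seconds(image_bytes, throughput_bytes_per_s):
    """Time needed to copy the whole image once, ignoring re-copied dirty blocks."""
    return image_bytes / throughput_bytes_per_s


# Assumed numbers: 100 GiB image, 100 MiB/s sustained copy rate.
seconds = copy_seconds(100 * GIB, 100 * MIB)
minutes = seconds / 60
```

Under these assumptions the image copy alone takes roughly 17 minutes before the VM migration can even start.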
In addition there are two cases for the storage access:
1. Shared storage: Sometimes it is preferred to change the underlying storage of a running VM without moving the VM around. For example, moving to faster storage, or moving away because the original storage system is out of space. Live block copy enables this capability.
2. No shared storage: There are some cases where there is no NFS/SAN storage and live migration is needed. It should be similar to VMware's Storage VMotion. http://www.vmware.com/files/pdf/VMware-Storage-VMotion-DS-EN.pdf http://www.vmware.com/products/storage-vmotion/features.html
Using external dirty block bitmap
FVD has an option to use external dirty block bitmap file in addition to the regular mapping/data files.
We can consider using it for live block migration and live merge too. It can also allow 3rd party tools to calculate diffs between the snapshots. There is a big downside though, since it will make management complicated and there is a risk of the image and its bitmap file getting out of sync. It's a much better choice to have the qemu-img tool be the single interface to the dirty block bitmap data.
Solutions
Un-ordered list of possibilities:
Either use iSCSI (target and initiator), NBD, or a proprietary qemu solution. iSCSI is in theory the best, but there is a problem dealing with COW images: iSCSI cannot report the COW level where a block exists. This might force us to use a proprietary solution.
An interesting option (by Orit Wasserman) was to use iSCSI for exporting the images outside of the qemu level, so that qemu accesses them as if they were a local device. This can work well with almost no effort. What do we do with chains of COW files? We create up to n such iSCSI connections.
Live block migration
Use the streaming approach + regular live migration + iscsi: Execute regular live migration and at the end of it, start streaming. If there is no shared storage, use the external iscsi and behave as if the image is local. At the end of the streaming operation there will be a new local base image.
Block mirror layer
It was invented in order to duplicate write I/Os to the source and destination images. It prevents the potential race when both qemu and the management crash at the end of the block copy stage and it is unknown whether management should pick the source or the destination.
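A minimal sketch of the mirroring idea, assuming dictionaries of clusters stand in for the two images and that the bulk copy has already finished. `MirroredDisk` is a hypothetical name, not the actual block mirror driver.

```python
class MirroredDisk:
    """Duplicates every guest write to both the source and the destination."""

    def __init__(self, source, destination):
        self.source = source
        self.destination = destination

    def write(self, index, data):
        # Both images receive the write, so either one can be picked after a crash.
        self.source[index] = data
        self.destination[index] = data


src = {0: b"old"}
dst = {0: b"old"}   # assumption: the background copy already brought dst in sync
disk = MirroredDisk(src, dst)
disk.write(0, b"new")
disk.write(5, b"more")
```

Because both images hold the same data while mirroring is active, management can choose either image if it loses track of the operation's state.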
Streaming
No need for mirror since only the destination changes and is writeable.
Block copy background task
Can be shared between block copy and streaming
Live snapshot
Can it be seen as a (local) stream that preserves the current COW chain?
Use cases
#1 Basic streaming, single base master image on source storage, need to be instantiated on destination storage
The base image is a single level COW format (file or lvm). The base is RO and only the new destination is RW. base' is empty at the beginning. The base image content is copied in the background to base'. At the end of the operation, base' is a standalone image that does not depend on the base image.
a. Case of a shared storage streaming guest boot (usually the source)
{{{
Before:
src storage: base
dst storage: none
After:
src storage: base
dst storage: base'
}}}
b. Case of no shared storage streaming guest boot. Everything is the same, except that we use an external iSCSI target on the src host and an external iSCSI initiator on the destination host. Qemu boots from the destination by using the iSCSI access. This is transparent to qemu (except for a cmd syntax change). Once the streaming is over, we can live drop the usage of iSCSI and open the image directly (some sort of null live copy).
c. Live block migration (using streaming) w/ shared storage. Exactly like 1.a. First create the destination image, then run live migration there w/o data in the new image. Then we do the streaming like in the boot scenario.
d. Live block migration (using streaming) w/o shared storage. Like 1.b. + 1.c.
#2 Basic streaming of raw files/devices
Here we have an issue: what happens if there is a failure in the middle? Regular COW can sustain a failure since the intermediate base' contains dirty block information. An intermediate raw base' image would be broken. We cannot revert back to the original base and start over because new writes were written only to base'.
Approaches:
{{{
a. Don't support that
b. Use intermediate COW image and then live copy it into raw (waste time, IO, space). One can easily add new COW over the source and continue from there.
c. Use external metadata of dirty-block-bitmap even for raw
}}}
Suggestion: at this stage, do either recommendation #a or #b
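For reference, approach c could look roughly like the sketch below. It assumes an external bitmap with one flag per cluster that marks which clusters of the raw base' are valid. All names are hypothetical, and a real implementation would also have to make sure the data reaches the disk before the matching bit is persisted.

```python
def stream_to_raw(base, target, valid, limit=None):
    """Copy clusters from base into a raw target, recording progress in `valid`.

    base, target: lists of cluster payloads of equal length
    valid: list of bools (the external bitmap); True means the target cluster
           is usable, either copied from base or written by the guest
    limit: stop after this many copies, to simulate a crash in the middle
    Returns True when the whole image has been streamed.
    """
    done = 0
    for index in range(len(base)):
        if valid[index]:
            continue                 # never overwrite data that is already valid
        if limit is not None and done >= limit:
            return False             # interrupted before finishing
        target[index] = base[index]
        valid[index] = True
        done += 1
    return True


def guest_write(target, valid, index, data):
    """New guest writes go only to the target and mark the cluster valid."""
    target[index] = data
    valid[index] = True


base = [b"a", b"b", b"c", b"d"]
target = [None] * 4
bitmap = [False] * 4

guest_write(target, bitmap, 2, b"X")
finished = stream_to_raw(base, target, bitmap, limit=1)  # "crash" after one cluster
resumed = stream_to_raw(base, target, bitmap)            # restart using the bitmap
```

The second call skips everything the bitmap already marks as valid, so the guest write to cluster 2 survives the restart.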
#3. Basic live copy, single base master image on source storage, need to be copied to the destination storage
The base image is a single level COW format or a raw file/device. The base image content is copied in the background to base'. At the end of the operation, base' is a standalone image that does not depend on the base image. In this case we only take into account a running VM; there is no need to do that for the boot stage. So it is either a VM running locally and about to change its storage, or a VM live migration. Here a mirror driver approach is used. Both src/dst are writable.
a. Case of shared storage where a VM changes its block device
{{{
Before:
src storage: base
dst storage: none
After:
src storage: base
dst storage: base'
}}}
This is a plain live copy w/o moving the VM. The case w/o shared storage seems not relevant here.
We might want to move multiple block devices of the VM. It is written here for completeness - it shouldn't change anything. Still management/events will use the block name/id.
b. Live block migration (w/o streaming) w/ shared storage. Unlike in the streaming case, the order here is reversed: Run live copy. When it ends and we're in the mirror state, run live migration. When it ends, stop the mirroring and make the VM continue on the destination.
c. Live block migration (w/o streaming) w/o shared storage. Like 3.b. but using external iSCSI.
#4. Streaming COW chains that preserve the full structure
{{{
Before: src storage: base <- sn1 <- snx destination storage: none
After src storage: base <- sn1 <- snx destination storage: base' <- sn1' <- snx'
}}}
All of the original snapshot chains should be copied or streamed as is to the new storage. With copying we can do all of the non-leaf images using standard 'cp tools'. If we're to use iSCSI, we'll need to create N such connections.
#5. Like #4 but the chain can collapse. In fact this is like a special case of #4
{{{
Before: src storage: base <- sn1 <- sn2 .. <- snx destination storage: none
After src storage: base <- sn1 <- sn2 ...<- snx destination storage: base' <- sn1' .. <- sny'
}}}
There is no difference from #4 other than collapsing some chain path into the dst leaf
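Collapsing part of a chain into the destination leaf can be sketched as follows, assuming each layer is a dictionary holding only the clusters allocated in that layer. The function names are hypothetical.

```python
def read_cluster(chain, index):
    """Read through a chain ordered base first, leaf last; the top-most layer wins."""
    for layer in reversed(chain):
        if index in layer:
            return layer[index]
    return None  # unallocated in every layer


def collapse(chain, keep):
    """Merge every layer above the first `keep` layers into one new leaf."""
    merged = {}
    for layer in chain[keep:]:   # oldest to newest, so newer data overwrites older
        merged.update(layer)
    return chain[:keep] + [merged]


base = {0: b"b0", 1: b"b1", 2: b"b2"}
sn1 = {1: b"s1"}
sn2 = {2: b"s2"}
snx = {1: b"sx"}

chain = [base, sn1, sn2, snx]
collapsed = collapse(chain, keep=2)   # base' <- sn1' <- sny'
```

The collapsed chain is shorter, but every cluster still reads back the same data as through the original chain.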
#6. Live snapshot
It's here since the interface can be similar. Basically it is similar to live copy but instead of copying, we switch to another COW on top. The only (separate) addition would be to add a verb to ask the guest to flush its file systems.
{{{
Before:
storage: base <- s1 <- sx
After:
storage: base <- s1 <- sx <- sx+1
}}}
Exceptions
1. Hot unplug of the relevant disk
Prevent that. (or cancel the operation)
2. Live migration in the middle of a non-migration action from above
Shall we allow it? It can work, but at the end of live migration we need to reopen the images (NFS mainly), which might add unneeded complexity. We'd better prevent that.
Interface
* Streaming:
{{{ By Stefan:
1. Start a background streaming operation:
(qemu) block_stream -a ide0-hd
2. Check the status of the operation:
(qemu) info block-stream
Streaming device ide0-hd: Completed 512 of 34359738368 bytes
3. The status changes when the operation completes:
(qemu) info block-stream
No active stream
On completion the image file no longer has a backing file dependency. When streaming completes QEMU updates the image file metadata to indicate that no backing file is used.
The QMP interface is similar but provides QMP events to signal streaming completion and failure. Polling to query the streaming status is only used when the management application wishes to refresh progress information.
If guest execution is interrupted by a power failure or QEMU crash, then the image file is still valid but streaming may be incomplete. When QEMU is launched again the block_stream command can be issued to resume streaming. }}}
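A management tool that polls for progress could parse the status line shown above. This is only a sketch: it assumes the output has exactly the textual format of the example, and `parse_stream_status` is a hypothetical helper, not part of QEMU or libvirt.

```python
import re

# Matches the example status line; the exact format is an assumption.
STATUS_RE = re.compile(r"Streaming device (\S+): Completed (\d+) of (\d+) bytes")


def parse_stream_status(text):
    """Return (device, completed, total), or None when no stream is active."""
    match = STATUS_RE.search(text)
    if match is None:
        return None
    device, completed, total = match.groups()
    return device, int(completed), int(total)


active = parse_stream_status(
    "Streaming device ide0-hd: Completed 512 of 34359738368 bytes"
)
idle = parse_stream_status("No active stream")
```

With QMP events available, such polling would only be needed to refresh a progress display.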
Common implementation for live block copy and image streaming
Requirements
- Live block copy:
- Open both source and destination read-write
- Mirror writes to source and destination
- Image streaming:
- Source must be read-only
- Be able to do a "partial" copy, i.e. share a backing file and don't copy data that can be read from this backing file
- Use the COW provided by image formats, so that a crash doesn't force you to restart the whole copy
- Need a raw-cow driver that uses an in-memory COW bitmap for a raw image, so that we can use raw output images
- Anything that can do COW is good enough, so the implementation should be independent of the format
- libvirt already has implemented block_stream QMP commands, so let's keep them unchanged if possible
Building blocks
- Copy on read
- Improves time it takes to complete the copy
- Allows simpler implementation of copy
- Background task doing the copy (by reading the whole image sequentially)
- Mirroring of writes to source and destination
- source is a backing file of destination, i.e. we need to support writable backing files
- Wrappers for live block copy and image streaming that enable the above features
- Switching the backing file after the copy has completed (separate monitor command like in block copy?). Ends the mirroring if enabled.
Example (live block copy)
This section tries to explain by example what happens when you perform a live block copy. To make things interesting, we have a backing file (base) that is shared between source and destination, and two overlays (sn1 and sn2) that are only present on the source and must be copied.
So initially the backing file chain looks like this:
base <- sn1 <- sn2
We start by creating the copy as a new image on top of sn2:
base <- sn1 <- sn2 <- copy
This gets all reads right automatically. For writes we use a mirror mechanism that redirects all writes to both sn2 and copy, so that sn2 and copy read the same at any time (and this mirror mechanism is really the only difference of live block copy from image streaming).
We start a background task that loops over all clusters. For each cluster, there are the following possible cases:
- The cluster is already allocated in copy. Nothing to do.
- Use something like bdrv_is_allocated() to follow the backing file chain. If the cluster is read from base (which is shared), nothing to do.
- For image streaming we're already on the destination. If we don't have shared storage and the protocol doesn't provide this bdrv_is_allocated() variant, we can still do something like qemu-img rebase and just compare the read data with data in base in order to decide if we need to copy.
- Otherwise copy the cluster into copy
When the copy has completed, the backing file of copy is switched to base (in a qemu-img rebase -u way).
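The background task and the final rebase from this example can be sketched as below. The sketch assumes each image is a dictionary holding only its allocated clusters, and that `shared` counts how many bottom layers are also available at the destination. The helper names are hypothetical and only mimic what something like bdrv_is_allocated() would report.

```python
def find_layer(chain, index):
    """Return the position of the top-most layer holding the cluster, or None."""
    for position in range(len(chain) - 1, -1, -1):
        if index in chain[position]:
            return position
    return None


def read_cluster(chain, index):
    position = find_layer(chain, index)
    return None if position is None else chain[position][index]


def background_copy(chain, shared, cluster_count):
    """Loop over all clusters and copy the ones that exist only in unshared layers."""
    copy = chain[-1]
    copied = 0
    for index in range(cluster_count):
        position = find_layer(chain, index)
        if position is None:
            continue                    # unallocated everywhere
        if position == len(chain) - 1:
            continue                    # already allocated in copy
        if position < shared:
            continue                    # read from the shared base
        copy[index] = chain[position][index]
        copied += 1
    return copied


base = {0: b"b0", 1: b"b1", 2: b"b2"}
sn1 = {1: b"s1"}
sn2 = {2: b"s2"}
copy = {3: b"w3"}                       # a guest write that already landed in copy

chain = [base, sn1, sn2, copy]
copied = background_copy(chain, shared=1, cluster_count=4)
rebased = [base, copy]                  # backing file of copy switched to base
```

Only the two clusters that lived in sn1 and sn2 are copied, and reading through base <- copy afterwards returns the same data as reading through the full original chain.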