Live Snapshots

This document describes the current design of live snapshots for QEMU. It is a work in progress, and details may change as the work proceeds.

Overall concept

The idea is to be able to issue a command to QEMU via the monitor or QMP, which causes QEMU to create a new snapshot image with the original image as the backing file, mounted read-only. This will allow the original image file to be backed up.

Rolling back to a previous version requires booting from the previous backing file, at which point the snapshot file becomes invalid. Unfortunately, there is no way to detect that a backing file has been booted, so administrators must take care not to rely on snapshot files being valid after a roll-back.

The snapshot image must be in a format that supports backing files, i.e. QCOW2 or QED; the original image, however, can be of any supported format. For example, it is possible to make a QCOW2 snapshot of a RAW image, or a QED snapshot of a QED image.

Guest Agent

Certain operations in the snapshot process can be improved through support from within the guest. These features will be implemented in the Guest Agent. Please check the Guest Agent page for design and implementation details.

The two main guest agent features of interest to live snapshots are:

File system freeze (fsfreeze/fsthaw): This puts the guest file systems into a consistent state, avoiding the need for fsck next time they are mounted.
Guest application notification: This allows guest applications to register and be notified prior to a snapshot, so that they can flush their data to disk. This is a future feature!

As of this writing (July 25, 2011), communication with the QEMU guest agent is performed via a virtio serial channel. Commands are sent over the channel encoded as QMP commands, and replies are encoded as QMP replies. There are future plans to implement a passthrough mechanism for agent commands issued via QMP, allowing these commands to be accessible via the QMP monitor instead of an external agent socket on the host.
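As a rough illustration of what "encoded as QMP commands" means on the wire, the sketch below serializes the two fsfreeze agent commands used in the command flow of this document as JSON objects. The helper name encode_agent_command and the one-object-per-line framing are assumptions made for this example; the transport itself (the virtio serial channel) is left out.

```python
import json

def encode_agent_command(name, arguments=None):
    """Serialize a guest agent command in QMP style (hypothetical helper)."""
    message = {"execute": name}
    if arguments:
        message["arguments"] = arguments
    # Newline-terminated framing is an assumption of this sketch.
    return json.dumps(message) + "\n"

freeze_request = encode_agent_command("guest-fsfreeze-freeze")
thaw_request = encode_agent_command("guest-fsfreeze-thaw")
print(freeze_request, end="")
print(thaw_request, end="")
```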

Note that guest agent collaboration is also needed for snapshots using other methods, such as snapshots performed on btrfs, LVM, enterprise storage, etc.

Snapshot command flow

The snapshot command flow is as follows. Commands are demonstrated using monitor commands for QEMU and agent commands are marked (agent). See the Guest Agent: Example Usage page for details on the specific command implementation for the guest agent commands.

Run the guest, if not currently running:

(qemu) cont

RECOMMENDED: Call guest agent requesting it to freeze all file systems and flush all I/O requests. Note that this runs on the guest, and as such the guest must currently be running:

(agent) guest-fsfreeze-freeze

Initiate synchronous snapshot of device <blockX> to new device snapshot-file:

(qemu) snapshot_blkdev <blockX> <snapshot-file> <format>

Note: The above will write the COW headers to the snapshot device, and pivot the block device <blockX> to point to the new device, using the original file/device as its backing file. It is important to note that it is QEMU which generates the COW headers in the new snapshot file. During snapshot creation the guest will momentarily be halted by QEMU. Pending I/Os will be flushed to disk, the COW headers will be created in the snapshot file/device, and QEMU will replace the file backing device <blockX> with the new snapshot file. The guest resumes running as the command returns, unless the admin tool explicitly issued the optional stop command beforehand.

This command is repeated for each device that is to be snapshotted.

Call guest agent requesting it to thaw/unfreeze all file systems within the guest (if guest-fsfreeze-freeze was issued above):

(agent) guest-fsfreeze-thaw

At this point, the snapshot for the device is complete, and QEMU has pivoted the guest to the new snapshot file for execution.

To visualize this sequence, below are call sequences showing the order and direction of these commands going to both QEMU and the guest agent:

Minimum set of commands:

Guest         Manager                       QEMU
-------       --------                     -------
  |               |                           |
  |               |                           |
  | <<- freeze ---o                           |
  |               |                           |
  |               o--- snapshot_blkdev  --->> |
  |               |                           |
  | <<- thaw -----o                           |
  |               |                           |
  |               |                           |
  |               |                           |
  =               =                           =
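The sequence above can be sketched from the manager's point of view as a function that produces the ordered list of commands. This is only an illustration: snapshot_sequence is a made-up name, and actually sending the commands to the agent channel or the monitor is outside the sketch.

```python
def snapshot_sequence(devices, use_agent=True, fmt="qcow2"):
    """Return ordered (target, command) pairs for one snapshot run.

    devices maps a block device name to its new snapshot file.
    """
    steps = []
    if use_agent:
        # Recommended: freeze guest file systems before snapshotting.
        steps.append(("agent", "guest-fsfreeze-freeze"))
    for device, snapshot_file in devices.items():
        # One snapshot_blkdev call per device.
        steps.append(("qemu", f"snapshot_blkdev {device} {snapshot_file} {fmt}"))
    if use_agent:
        # Thaw only if a freeze was issued above.
        steps.append(("agent", "guest-fsfreeze-thaw"))
    return steps

for target, command in snapshot_sequence({"virtio0": "/some/place/my-image"}):
    print(f"({target}) {command}")
```

A real manager would presumably also want to issue the thaw when a snapshot command fails, so that the guest is never left frozen.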

HMP command

The HMP (monitor) command is designed to be flexible enough to handle both internal and external snapshots, as well as snapshots to various different snapshot file formats.

snapshot_blkdev device snapshot-file [format]:

Parameter	Description
device	block device to snapshot
snapshot-file	target snapshot file (new image filename)
format	format of snapshot image, valid formats are QCOW2 & QED. If not specified, the image will default to QCOW2.

QMP command

The QMP command matches the behaviour of the human monitor command, except it is named slightly differently to match the fact that the command is synchronous.

blockdev-snapshot-sync device snapshot-file [format]

Parameter (JSON String)	Description
device	block device to snapshot
snapshot-file	target snapshot file (new image filename)
format	format of snapshot image, valid formats are QCOW2 & QED. If not specified, the image will default to QCOW2.

Here is an example of a QMP snapshot command, in JSON format:

{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "virtio0",
                                                      "snapshot-file":
                                                      "/some/place/my-image",
                                                      "format": "qcow2" } }
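For management code that builds this command programmatically, a minimal sketch might look as follows. The function name and the client-side format check are assumptions of this example; QEMU performs its own validation.

```python
import json

# Formats listed as valid in the parameter table above.
SNAPSHOT_FORMATS = ("qcow2", "qed")

def blockdev_snapshot_sync(device, snapshot_file, fmt=None):
    """Build the QMP message for a synchronous snapshot (hypothetical helper)."""
    arguments = {"device": device, "snapshot-file": snapshot_file}
    if fmt is not None:
        if fmt not in SNAPSHOT_FORMATS:
            raise ValueError(f"unsupported snapshot format: {fmt}")
        arguments["format"] = fmt
    # When format is omitted, QEMU defaults to QCOW2.
    return {"execute": "blockdev-snapshot-sync", "arguments": arguments}

command = blockdev_snapshot_sync("virtio0", "/some/place/my-image", "qcow2")
print(json.dumps(command))
```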

Atomic Snapshots of Multiple Devices

With the new transaction-based block commands, it is now possible to take atomic snapshots of multiple devices. For more details on the group snapshot API, please see: Atomic Snapshots of Multiple Devices

Live Snapshot Merge

Creating snapshots through the QEMU live snapshot commands allows incremental guest image files to be created, with each image file containing the differences from its parent backing file.

While these snapshot files are useful for backup and other purposes, there exists a need to manage these snapshot files so that they can be merged (flattened). Without the ability to merge and flatten snapshot images, the snapshot chain will continue to grow as new snapshots are made, which may become difficult to manage, in addition to introducing performance concerns.

In order to flatten the image, there are two approaches: block streaming, and block commit. Both of these operations can be performed 'live', while the guest OS is running. Block streaming takes data from parent image(s), and copies (streams) the data to the active layer. Block commit takes data from child(ren) image(s), and copies this data into the parent.

Block Streaming

Streaming to the Active Layer

The current mode for merging QEMU external snapshots while the emulator is 'live' is via block streaming, which streams sectors located in parent snapshots into the active layer (the endmost 'child'). An optional base file can be specified, so that only sectors between the base and the active layer are streamed to the top. Drawing 1, below, shows an example chain of external QEMU snapshots.

Drawing 1: Example Snapshot Chain

During live block merge, performed with the command 'block-stream', the chain can be fully or partially collapsed upwards, towards the active layer. Drawing 2 illustrates flattening out part of the chain, leaving only the base backing file in place:

Drawing 2: Partial flattening of a snapshot chain

As Drawing 2 illustrates, Snap-1 and Snap-2 have their sectors streamed into Snap-3, while the RootBase sectors are not streamed into Snap-3. This leaves the final snapshot chain, with Snap-3 as the active layer, consisting of just RootBase as the parent and Snap-3 as the child.
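The effect of streaming on the shape of the chain can be modelled with a short sketch. It treats the chain as a plain list ordered from the root backing file to the active layer; the function name stream_chain is invented for this example and the model ignores the actual sector copying.

```python
def stream_chain(chain, base=None):
    """Return the chain that remains after streaming into the active layer.

    chain is ordered from the root backing file to the active layer.
    Images above `base` have their sectors streamed into the active layer;
    without a base, everything is streamed and only the active layer remains.
    """
    active = chain[-1]
    if base is None:
        return [active]
    if base not in chain[:-1]:
        raise ValueError(f"{base} is not a backing image of {active}")
    return chain[: chain.index(base) + 1] + [active]

chain = ["RootBase", "Snap-1", "Snap-2", "Snap-3"]
print(stream_chain(chain, base="RootBase"))  # ['RootBase', 'Snap-3']
print(stream_chain(chain))                   # ['Snap-3']
```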

Assuming a virtio driver, the block-stream command to perform the merge shown in Drawing 2 is as follows:

{ "execute": "block-stream", "arguments": 
   { "device": "virtio0", "base": "RootBase.img" } }

It is worth noting that the QEMU block_stream command does not delete any external snapshot images from storage, so it is the responsibility of the user (or management software) to clean up unwanted or unneeded snapshots. However, these 'unneeded' snapshots are still valid snapshots, and could be used if desired.

Design of Block Streaming to Active Layer

The QMP/QAPI command for block-stream is handled within blockdev.c, and the handler is responsible for finding the specified image file for the active layer, and then initiating the streaming process.

Once the streaming process is initiated, a block job is created. This block job is implemented as a coroutine, which operates by means of cooperative multitasking with other coroutines and threads. The block job is responsible for copying sectors that are located between the 'base' image and the active layer, up into the active layer.

The block job, as specified in the block-stream command, takes an optional parameter for speed. This may be used to throttle the streaming process, and the block job will cooperatively yield according to the speed parameter. Note, however, that in the absence of speed throttling, cooperative yields still occur in the block job.
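The cooperative behaviour described above can be imitated with a Python generator. This is a toy model, not QEMU's coroutine or rate-limiting code: the chunk size, the 512-byte sector size, and the delay formula are all assumptions of this sketch.

```python
def stream_job(sectors, speed=0, sector_size=512, chunk=4):
    """Copy sectors in chunks, yielding control after every chunk.

    Each yield hands back the delay (in seconds) the job asks to sleep.
    A delay of 0.0 is a plain cooperative yield with no throttling.
    """
    copied = []
    for start in range(0, len(sectors), chunk):
        batch = sectors[start:start + chunk]
        copied.extend(batch)  # stand-in for the real sector copy
        # speed is in B/s; 0 means unthrottled, but the job still yields.
        yield len(batch) * sector_size / speed if speed > 0 else 0.0
    return copied

def run_job(job):
    """Drive a job to completion, recording the delays it requested."""
    delays = []
    while True:
        try:
            delays.append(next(job))
        except StopIteration as done:
            return done.value, delays

copied, delays = run_job(stream_job(list(range(10)), speed=2048))
print(copied, delays)
```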

Block Stream API

The current block streaming API is:

{ 'command': 'block-stream', 'data': { 'device': 'str', '*base': 'str',
                                       '*speed': 'int' } }

block-stream

Parameter (JSON String)	Description
device	block device to stream
base	base image, only sectors above this image are streamed. optional
speed	speed throttling, in B/s, of the stream operation. optional

Streaming to an Intermediate Layer [proposal]

Similar to streaming to the active layer, it is possible to stream to an intermediate layer. Intermediate layers are opened read-only by QEMU, so when streaming to an intermediate layer, QEMU must ensure that the layer is changed from read-only to read-write for the streaming, and back to read-only once streaming has completed.

The command is the same, except that the 'device' argument now also allows a node name.

Block Commit [proposal, preliminary]

There are cases where it is preferable to commit data from the child into the parent image. For instance, if there is no desire to keep any of the intermediate images, it may be more efficient to commit the child (or children) into a parent node. Often the parent node is the larger image, and as such it would be less I/O intensive to commit the child into the parent, as shown in Drawing 3:

Drawing 3: Example Snapshot Commit

Limitations of Block Commit

While live commit can be used while the guest is live, to write data into a base or intermediate image from the active layer, there are important consequences to keep in mind. Since the image commit chain is a directed acyclic graph (DAG), each image may have other children. Such children will be unknown to QEMU, and will also be invalidated by a commit. Drawing 4 illustrates this issue.

Drawing 4: Live Commit, Invalidating Leaf Images

Both Drawing 4 and Drawing 5 show a snapshot chain along the Snap-B branch being flattened into two images – an active image, with a backing file RootBase.

In Drawing 4, the sectors residing in Snap-B-1 and Snap-B-2 are written into their parent image, Snap-1. As Snap-A-1 and its descendants are dependent on the original state of Snap-1, all of Snap-A-1 and its descendants are now invalid. In addition, Snap-B-1 is invalid, and Snap-B-2 will become invalid on the first write to the freshly committed Snap-1. Snap-1 now becomes the active layer, and is functionally equivalent to the original Snap-B-2, without the intermediate images.

In contrast, Drawing 5 shows a similar operation, except instead of sectors being copied back into Snap-1, sectors are copied into Snap-B-2 from Snap-1 and Snap-B-1. This leaves the Snap-A branch still valid, and Snap-B-1 valid as well.

Drawing 5: Block Streaming, Keeping Valid Leaf Images

Live commit can prove useful in scenarios in which the operator (or management software) is completely aware of (or apathetic towards) all snapshots downstream from the committed layer. It may, in certain scenarios, be a much faster operation to flatten an image chain by committing to a parent image, rather than streaming into an active layer.
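To make the invalidation rule concrete, the sketch below models the images as a tree and lists every image that depends on the old contents of the commit target. The tree loosely follows the description of Drawing 4; Snap-A-2 is a hypothetical descendant added only for illustration, and the function names are invented for this example.

```python
def descendants(children, image):
    """Collect every image that has `image` somewhere in its backing chain."""
    found = []
    stack = list(children.get(image, []))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found

def invalidated_by_commit(children, base):
    """Images that relied on the pre-commit contents of `base`.

    This includes the images that were committed: as noted above, the
    former top image only becomes invalid on the first write to `base`.
    """
    return sorted(descendants(children, base))

# Parent -> children; Snap-A-2 is an assumed extra image for the example.
children = {
    "RootBase": ["Snap-1"],
    "Snap-1": ["Snap-A-1", "Snap-B-1"],
    "Snap-A-1": ["Snap-A-2"],
    "Snap-B-1": ["Snap-B-2"],
}
print(invalidated_by_commit(children, "Snap-1"))
```

Streaming into Snap-B-2 instead modifies only the active layer, so in this model nothing is invalidated.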

Design of Block Commit

A lot of the basic design of block commit is similar to block-stream. However, there are a couple of important differences:

1) The destination (base) image is currently opened in the chain read-only, and must be reopened with read-write access modes.
2) If the active layer is being committed into the 'base' image, convergence becomes an issue. As this is a live operation, the guest may still be writing to the active layer image, so while sectors are copied into the base image, new sectors are constantly created (i.e., new dirty sectors). This may require special handling, similar to the block-mirror code.

The difference in item 1) is handled by the introduction of a new bdrv_reopen() function (proposed by Supriya from IBM), which allows an image to be reopened in read-write mode.

The difference in item 2) is handled by treating 'old' data (data that was dirty when the live commit began) differently from 'new' data (data that has become dirty from new guest activity). The 'old' data would be subject to the speed throttling parameter from the proposed block-commit command, while the 'new' data would not; it would be treated like an active mirror, with data being committed to the 'base' image at a non-throttled rate.

Once the QMP/QAPI command is received, prior to creating the block-commit block job, the handler:

Looks up the base image
Looks up the top image (the active layer for the device if not specified)
Converts the base image from r/o to r/w

The block-commit job is started only if all of the above steps succeed.

Inside the block-commit coroutine, sectors are copied according to the speed throttling parameter, from the layers between the base image and the top image. This data is copied according to where in the image chain it is allocated.

For new data in the top image, an approach similar to active mirroring is used. Sectors are marked as dirty, and new writes update the dirty bitmap. The sectors marked in the dirty bitmap are copied over, but without speed throttling, in order to encourage convergence.
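The split between throttled 'old' data and unthrottled 'new' data can be illustrated with a toy simulation. It is not the proposed implementation: time is divided into abstract rounds, the throttle is expressed as a number of old sectors per round, and the function name is made up for this example.

```python
def simulate_commit(old_dirty, guest_writes, old_per_round):
    """Simulate a live commit; return (committed sectors, rounds used).

    old_dirty      sectors that were dirty when the commit began
    guest_writes   list of sets: sectors the guest dirties in each round
    old_per_round  throttle for old data, in sectors per round
    """
    pending_old = sorted(old_dirty)
    committed = set()
    rounds = 0
    while pending_old or rounds < len(guest_writes):
        new = guest_writes[rounds] if rounds < len(guest_writes) else set()
        # New data is mirrored to the base at once, without throttling.
        committed |= new
        # An old sector rewritten by the guest has just been mirrored.
        pending_old = [s for s in pending_old if s not in new]
        # Old data is copied at the throttled rate.
        batch, pending_old = pending_old[:old_per_round], pending_old[old_per_round:]
        committed |= set(batch)
        rounds += 1
    return committed, rounds

committed, rounds = simulate_commit({0, 1, 2, 3}, [{10}, {2, 11}], old_per_round=1)
print(sorted(committed), rounds)
```

Because new writes never queue behind the throttle in this model, the job finishes once the old data has been copied and the guest writes have been mirrored.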

Once both the existing data and the new data are committed into the specified base image, the live image chain is manipulated with bdrv_swap() so that intermediate images are bypassed.

If the top image was the active layer, then the base image will remain r/w and become the new active layer upon the successful job completion. Otherwise, the base image will be reopened as r/o, as it was prior to the block-commit operation.
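The handler steps and the completion rule above can be sketched with a small model of the backing chain. Everything here is hypothetical scaffolding (the Image class, prepare_commit, finish_commit, and the check that the base lies below the top); it mirrors the described order of operations and does not reflect QEMU's internal data structures.

```python
class Image:
    def __init__(self, name, backing=None, read_only=True):
        self.name, self.backing, self.read_only = name, backing, read_only

def find_image(top, name):
    """Walk the backing chain downwards from `top` looking for `name`."""
    image = top
    while image is not None:
        if image.name == name:
            return image
        image = image.backing
    return None

def prepare_commit(active, base_name, top_name=None):
    """Look up base and top, then reopen base r/w; raise if any step fails."""
    base = find_image(active, base_name)
    if base is None:
        raise LookupError(f"base image {base_name!r} not found")
    top = active if top_name is None else find_image(active, top_name)
    if top is None:
        raise LookupError(f"top image {top_name!r} not found")
    if find_image(top.backing, base_name) is not base:
        raise ValueError("base must be below top in the chain")
    base.read_only = False  # stand-in for the r/o to r/w reopen
    return base, top

def finish_commit(active, base, top):
    """Bypass the committed images and return the new active layer."""
    if top is active:
        return base  # base stays r/w and becomes the active layer
    overlay = active
    while overlay.backing is not top:
        overlay = overlay.backing
    overlay.backing = base  # intermediate images are now bypassed
    base.read_only = True   # back to r/o, as before the operation
    return active

root = Image("RootBase")
snap1 = Image("Snap-1", root)
snap2 = Image("Snap-2", snap1, read_only=False)
base, top = prepare_commit(snap2, "RootBase")
new_active = finish_commit(snap2, base, top)
print(new_active.name, new_active.read_only)
```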

Drawing 6: Commit From The Active Layer

Note that if management software such as libvirt is used, it may be possible for it to monitor the state of convergence, and optionally pause the guest if desired.

Block Commit API

The proposed API is shown below:

{ 'command': 'block-commit', 'data': { 'device': 'str', '*base': 'str',
                                       '*top': 'str', '*speed': 'int' } }

block-commit [proposal]

Parameter (JSON String)	Description
device	block device to commit
base	base image, the image into which data is copied
top	top image, only sectors below this image are copied, into base image. optional - if not specified, the top image is the active layer
speed	speed throttling, in B/s, of the commit operation. optional

Future features

Internal snapshots to images which support internal snapshots (QCOW2 & QED) are not expected to be supported initially.

There have been requests and suggestions for a number of alternative and enhanced interfaces for accessing live snapshots as follows:

internal snapshots

Making the snapshot-file argument of the monitor and QMP commands optional could serve as a request to make the snapshot internally instead of to an external file. However, without live block migration of an internal snapshot, there is no way to make a backup of an internal snapshot while still leaving the VM running, so this feature is not planned at present. For now, the snapshot-file argument is required, and only external snapshots are implemented.

fd passed as target for snapshot file/device

To get around problems with SELinux, in particular in conjunction with images based on NFS, there is a wish to be able to pass an already open file descriptor using the getfd interface.

However, this poses a number of problems when creating the COW headers for the new image file, as the COW header needs to know the file name of the disk image it is pointing to. On Linux this can be obtained through /proc/self/fd/<X>, but this is not available on all other operating systems.
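The Linux lookup mentioned above amounts to reading a symbolic link. The sketch below shows the idea in Python and returns None wherever the /proc interface is missing; the function name is an assumption of this example, and it is not how QEMU itself would resolve the name.

```python
import os

def filename_for_fd(fd):
    """Best-effort lookup of the path behind an open file descriptor.

    Relies on /proc/self/fd/<X>, so it returns None on systems that do
    not provide it (or when the descriptor is not open).
    """
    try:
        return os.readlink(f"/proc/self/fd/{fd}")
    except OSError:
        return None

fd = os.open(os.devnull, os.O_RDONLY)
try:
    print(filename_for_fd(fd))
finally:
    os.close(fd)
```

The None branch is exactly the portability problem described above: a caller on such a system has no file name to put into the COW header.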

An alternative solution would be to extend the getfd interface to take an optional file name. However, this would be a hack and would open the door to errors, as it would allow a broken/hostile guest/QEMU process to create an image which points to the wrong place, but which wouldn't be discovered until the image was actually booted.

Allowing the controlling application to create the COW headers in the new image is not an acceptable solution. It is race prone as the image is not following the backing file which is still in use, and would also cause problems for COW formats where the new COW headers include state as of when they are created.

Separating into multiple commands

There are suggestions for splitting the snapshot process into multiple monitor/QMP commands to allow for asynchronous command processing. The process would be split as follows, using human monitor style commands as example:

(agent) guest-agent-fsfreeze

Call guest agent requesting it to freeze all file systems and flush all I/O requests.

(qemu) freeze-io <blockX>

Instruct QEMU to freeze all I/O processing for block device <blockX>

(qemu) getfd <fd> snapshotfd

Provide file descriptor <fd> and assign it the logical name snapshotfd

(qemu) snapshot-blkdev-async <blockX> fd:snapshotfd <format>

Initiate asynchronous snapshot of device <blockX> to recently provided file descriptor snapshotfd. This will write the COW headers to the snapshot device, and pivot the block device <blockX> to point to the new device, using the original file/device as its backing file. It is important to note that it is QEMU which will generate the COW headers in the new snapshot file; externally creating these will not be allowed!

On completion, a notification will be returned to the caller; this requires QAPI to be in place for proper async QMP command support.

(qemu) thaw-io <blockX>

Un-freeze I/O processing for device <blockX>

(agent) guest-agent-fsthaw

Call guest agent requesting it to thaw/unfreeze all file systems within the guest.

(qemu) snapshot-blkdev-status <blockX>

Query the current snapshot status of <blockX>. In addition some form of notification of completion will be required.

Note that the caller can loop the process of commands freeze-io, getfd, snapshot-blkdev-async, and thaw-io to snapshot multiple block devices in one guest.
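The loop over several devices could be sketched as below. The command names are the proposed ones from this section and do not exist in QEMU as written here; the function name and the (device, fd) input format are likewise assumptions of this example.

```python
def async_snapshot_commands(devices, fmt="qcow2"):
    """Build the proposed command sequence for several block devices.

    devices is a list of (block device, file descriptor number) pairs.
    """
    commands = [("agent", "guest-agent-fsfreeze")]
    for block, fd in devices:
        # The per-device part of the process is repeated for each device.
        commands += [
            ("qemu", f"freeze-io {block}"),
            ("qemu", f"getfd {fd} snapshotfd"),
            ("qemu", f"snapshot-blkdev-async {block} fd:snapshotfd {fmt}"),
            ("qemu", f"thaw-io {block}"),
        ]
    commands.append(("agent", "guest-agent-fsthaw"))
    return commands

for target, command in async_snapshot_commands([("virtio0", 7), ("virtio1", 8)]):
    print(f"({target}) {command}")
```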

Live merge

See http://wiki.qemu.org/Features/LiveBlockMigration

Other proposed qemu features that solve similar or related problems