Features/Snapshots: Difference between revisions
(47 intermediate revisions by 5 users not shown) | |||
Line 7: | Line 7: | ||
Roll-back to a previous version requires one to boot from the previous backing file, at which point the snapshot file becomes invalid. Unfortunately there is no way to detect that a backing file has been booted, making it important for administrators to take care to not rely on snapshot files being valid after a roll-back. | Roll-back to a previous version requires one to boot from the previous backing file, at which point the snapshot file becomes invalid. Unfortunately there is no way to detect that a backing file has been booted, making it important for administrators to take care to not rely on snapshot files being valid after a roll-back. | ||
The snapshot image will have to be in a format which support backing files, ie QCOW2 | The snapshot image will have to be in a format which support backing files, ie QCOW2 and QED, however the original image can be of any supported format. Ie. it is possible to make a QCOW2 snapshot of a RAW image, or a QED snapshot of a QED image. | ||
==Guest Agent== | |||
Certain operations in the snapshot process can be improved through support from within the guest. These features will be implemented in the [http://wiki.qemu.org/Features/QAPI/GuestAgent Guest Agent]. See the [http://wiki.qemu.org/Features/QAPI/GuestAgent Guest Agent] page for design and implementation details. | |||
The two main guest agent features of interest to live snapshots are: | |||
# File system freeze (fsfreeze/fsthaw): This puts the guest file systems into a consistent state, avoiding the need for fsck next time they are mounted. | |||
# | # Guest application notification: This allows guest applications to register and be notified prior to a snapshot, in order for them to allow flushing their data to disk. This is a future feature! | ||
# | |||
As of this writing (July 25, 2011), communication with the QEMU guest agent is performed via a virtio serial channel. Commands are sent over the channel encoded as QMP commands, and replies are encoded as QMP replies. There are future plans to implement a passthrough mechanism for agent commands issued via QMP, allowing these commands to be accessible via the QMP monitor instead of an external agent socket on the host. | |||
Note that guest agent collaboration is also needed for snapshots using other methods, such as snapshots performed on btrfs, LVM, enterprise storage, etc. | |||
==Snapshot command flow== | |||
The snapshot command flow is as follows. Commands are demonstrated using monitor commands for QEMU and agent commands are marked (agent). See the [http://wiki.qemu.org/Features/QAPI/GuestAgent#Example_usage Guest Agent: Example Usage] page for details on the specific command implementation for the guest agent commands. | |||
*''Run the guest, if not currently running:'' | |||
(qemu) cont | |||
*'''''RECOMMENDED:''' Call guest agent requesting it to freeze all file systems and flush all I/O requests. Note that this runs on the guest, and as such the guest must currently be running:'' | |||
(agent) guest-fsfreeze-freeze | |||
*''Initiate synchronous snapshot of device '''<blockX>''' to new device '''snapshot-file''':'' | |||
(qemu) snapshot_blkdev <blockX> <snapshot-file> <format> | |||
'''Note:''' | |||
The above will write the COW headers to the snapshot device, and pivot the block device '''<blockX>''' to point to the new device, using the original file/device as its backing file. It is important to note that it is QEMU which will generate the COW headers in the new snapshot file. | |||
During snapshot creation the guest will momentarily be halted by QEMU. Pending I/Os will be flushed to disk, the COW headers will be created in the snapshot file/device, and QEMU will replace the file backing device '''<blockX>''' with the new snapshot file. On completion of the command, the guest will resume running as the command returns, unless the admin tool explicitly issued the optional stop command beforehand. | |||
This command is repeated for each device that is to be snapshotted. | |||
*''Call guest agent requesting it to thaw/unfreeze all file systems within the guest (if <tt>guest-fsfreeze-freeze</tt> was issued above):'' | |||
(agent) guest-fsfreeze-thaw | |||
At this point, the snapshot for the device is complete, and QEMU has pivoted the guest to the new snapshot file for execution. | |||
To visualize this sequence, below are call sequences showing the order and direction of these commands going to both QEMU and the guest agent: | |||
''Minimum set of commands:'' | |||
  Guest          Manager                     QEMU
 -------        --------                    -------
    |               |                          |
    |               |                          |
    | <<- freeze ---o                          |
    |               |                          |
    |               o--- snapshot_blkdev --->> |
    |               |                          |
    | <<- thaw -----o                          |
    |               |                          |
    |               |                          |
    |               |                          |
    =               =                          =
==HMP command== | |||
The HMP (monitor) command is designed to be flexible enough to handle both internal and external snapshots, as well as snapshots to various different snapshot file formats. | |||
'''snapshot_blkdev ''device snapshot-file [format]:''''' | |||
{|border=1 cellpadding="5" cellspacing="0" | |||
|- | |- | ||
!Parameter | |||
!Description | |||
|- | |- | ||
| '''format''' || format of snapshot image, valid formats are QCOW2 & QED | | style="background:#efefef;"| '''device''' || width=500; | block device to snapshot | ||
|- | |||
| style="background:#efefef;"| '''snapshot-file''' || width=500; | target snapshot file (new image filename) | |||
|- | |||
| style="background:#efefef;"| '''format''' || width=500; | format of snapshot image, valid formats are QCOW2 & QED. If not specified, the image will default to QCOW2. | |||
|} | |} | ||
Line 40: | Line 88: | ||
The QMP command matches the behaviour of the human monitor command, except it is named slightly differently to match the fact that the command is synchronous. | The QMP command matches the behaviour of the human monitor command, except it is named slightly differently to match the fact that the command is synchronous. | ||
'''blockdev-snapshot-sync device snapshot-file [format]''' | '''blockdev-snapshot-sync ''device snapshot-file [format]''''' | ||
{|border=1 cellpadding="5" cellspacing="0" | |||
|- | |||
!Parameter (JSON String) | |||
!Description | |||
|- | |||
| style="background:#efefef;"| '''device''' || width=500; | block device to snapshot | |||
|- | |||
| style="background:#efefef;"| '''snapshot-file''' || width=500; | target snapshot file (new image filename) | |||
|- | |||
| style="background:#efefef;"| '''format''' || width=500; | format of snapshot image, valid formats are QCOW2 & QED. If not specified, the image will default to QCOW2. | |||
|} | |||
Here is an example of a QMP snapshot command, in JSON format: | |||
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "virtio0", | |||
"snapshot-file": | |||
"/some/place/my-image", | |||
"format": "qcow2" } } | |||
==Atomic Snapshots of Multiple Devices== | |||
With the new transaction-based block commands, it is now possible to take atomic snapshots of multiple devices. For more details on the group snapshot API, please see: [http://wiki.qemu.org/Features/SnapshotsMultipleDevices Atomic Snapshots of Multiple Devices] | |||
=Live Snapshot Merge= | |||
Creating snapshots through the QEMU live snapshot commands allows incremental guest image files to be created, with each image file containing the differences from its parent backing file. | |||
While these snapshot files are useful for backup and other purposes, there exists a need to manage these snapshot files so that they can be merged (flattened). Without the ability to merge and flatten snapshot images, the snapshot chain will continue to grow as new snapshots are made, which may become difficult to manage, in addition to introducing performance concerns. | |||
In order to flatten the image, there are two approaches: block streaming, and block commit. Both of these operations can be performed 'live', while the guest OS is running. Block streaming takes data from parent image(s), and copies (streams) the data to the active layer. Block commit takes data from child(ren) image(s), and copies this data into the parent. | |||
==Block Streaming== | |||
===Streaming to the Active Layer=== | |||
The current mode for merging QEMU external snapshots while the emulator is 'live' is via block streaming, which streams sectors located in parent snapshots into the active layer (the endmost 'child'). An optional base file can be specified, so that only sectors between the base and the active layer are streamed to the top. Drawing 1, below, shows an example chain of external QEMU snapshots. | |||
[[Image:Snapshot-chain-example-1.png|thumb|center|600px||''Drawing 1: Example Snapshot Chain'']] | |||
<br style="clear: both" /> | |||
During live block merge, performed with the command 'block-stream', the chain can be fully or partially collapsed upwards, towards the active layer. Drawing 2 illustrates flattening out part of the chain, leaving only the base backing file in place: | |||
[[Image:Forward-merge-example-1.png|thumb|center|600px|''Drawing 2: Partial flattening of a snapshot chain'']] | |||
As Drawing 2 illustrates, Snap-1 and Snap-2 have their sectors streamed into Snap-3, while the RootBase sectors are not streamed into Snap-3. This leaves the final snapshot chain, with Snap-3 as the active layer, consisting of just RootBase as the parent and Snap-3 as the child. | |||
<br style="clear: both" /> | |||
Assuming a virtio driver, the block-stream command to perform the merge shown in Drawing 2 is as follows: | |||
{ "execute": "block-stream", "arguments": | |||
{ "device": "virtio0", "base": "RootBase.img" } } | |||
It is worth noting that the QEMU block_stream command does not delete any external snapshot images from storage, so it is the responsibility of the user (or management software) to clean up unwanted or unneeded snapshots. However, these 'unneeded' snapshots are still valid snapshots, and could be used if desired. | |||
=== Design of Block Streaming to Active Layer === | |||
The QMP/QAPI command for block-stream is handled within blockdev.c, and the handler is responsible for finding the specified image file for the active layer, and then initiating the streaming process. | |||
Once the streaming process is initiated, a block job is created. This block job is implemented as a coroutine, which operates by means of cooperative multitasking with other coroutines and threads. The block job is responsible for copying sectors that are located between the 'base' image and the active layer, up into the active layer. | |||
The block job, as specified in the block-stream command, takes an optional parameter for speed. This may be used to throttle the streaming process, and the block job will cooperatively yield according to the speed parameter. Note, however, that in the absence of speed throttling, cooperative yields still occur in the block job. | |||
=== Block Stream API === | |||
The current block streaming API is: | |||
{ 'command': 'block-stream', 'data': { 'device': 'str', '*base': 'str', | |||
'*speed': 'int' } } | |||
'''block-stream''' | |||
{|border=1 | {|border=1 cellpadding="5" cellspacing="0" | ||
| '''device''' || device name to snapshot ( | |- | ||
!Parameter (JSON String) | |||
!Description | |||
|- | |||
| style="background:#efefef;"| '''device''' || width=500; | block device to stream | |||
|- | |||
| style="background:#efefef;"| '''base''' || width=500; | base image, only sectors above this image are streamed. ''optional'' | |||
|- | |||
| style="background:#efefef;"| '''speed''' || width=500; | speed throttling, in B/s, of the stream operation. ''optional'' | |||
|} | |||
==== Streaming to an Intermediate Layer [proposal] ==== | |||
Similar to streaming to the active layer, it is possible to stream to an intermediate layer. Intermediate layers are opened read-only by QEMU, so when streaming to an intermediate layer, QEMU must ensure that the layer is changed from read-only to read-write for the streaming, and back to read-only once streaming has completed. | |||
The command is the same, except that the 'device' argument now also allows a node name. | |||
== Block Commit [proposal, '''''preliminary'''''] == | |||
There are reasons why it may be desirable to commit data from the child up into the parent image. For instance, if there is no desire to keep any of the intermediate images, it may be more efficient to commit the child (or children) into a parent node. Often, the parent node may be the larger image, and as such it would be less I/O intensive to commit the child into the parent, as shown in Drawing 3: | |||
[[Image:Block-commit-example-1.png|thumb|center|600px|''Drawing 3: Example Snapshot Commit'']] | |||
=== Limitations of Block Commit === | |||
While live commit can be used while the guest is live, to write data into a base or intermediate image from the active layer, there are important consequences to keep in mind. Since the image commit chain is a directed acyclic graph (DAG), each image may have other children. Such children will be unknown to QEMU, and will also be invalidated by a commit. Drawing 4 illustrates this issue. | |||
[[Image:Commit-merge-example-2.png|thumb|center|600px|''Drawing 4: Live Commit, Invalidating Leaf Images'']] | |||
Both Drawing 4 and Drawing 5 show a snapshot chain along the Snap-B branch being flattened into two images – an active image, with a backing file RootBase. | |||
In Drawing 4, the sectors residing in Snap-B-1 and Snap-B-2 are written into their parent image, Snap-1. As Snap-A-1 and its descendants are dependent on the original state of Snap-1, all of Snap-A-1 and its descendants are now invalid. In addition, Snap-B-1 is invalid, and Snap-B-2 will become invalid on the first write to the freshly committed Snap-1. Snap-1 now becomes the active layer, and is functionally equivalent to the original Snap-B-2, without the intermediate images. | |||
In contrast, Drawing 5 shows a similar operation, except instead of sectors being copied back into Snap-1, sectors are copied into Snap-B-2 from Snap-1 and Snap-B-1. This leaves the Snap-A branch still valid, and Snap-B-1 valid as well. | |||
[[Image:Forward-merge-example-2.png|thumb|center|600px|''Drawing 5: Block Streaming, Keeping Valid Leaf Images'']] | |||
Live commit can prove useful in scenarios in which the operator (or management software) is completely aware of (or apathetic towards) all snapshots downstream from the committed layer. It may, in certain scenarios, be a much faster operation to flatten an image chain by committing to a parent image, rather than streaming into an active layer. | |||
=== Design of Block Commit === | |||
A lot of the basic design of block commit is similar to block-stream. However, there are a couple of important differences: | |||
# The destination (base) image is currently opened in the chain read-only, and must be reopened with read-write access modes. | |||
# If the active layer is being committed into the 'base' image, convergence becomes an issue. As this is a live operation, the guest may still be writing to the active layer image, so while sectors are copied into the base image, new sectors are constantly created (i.e., new dirty sectors). This may require special handling, similar to the block-mirror code. | |||
The difference in item 1) is handled by the introduction of a new bdrv_reopen() command (proposed by Supriya from IBM), that allows an image to be reopened in read-write mode. | |||
The difference in item 2) is handled by treating 'old' data (data that was dirty when the live commit began) differently from 'new' data (data that has become dirty from new guest activity). The 'old' data would be subject to the speed throttling parameter from the proposed block-commit command, while the 'new' data would not: it would be treated like an active mirror, with data being committed to the 'base' image at a non-throttled rate. | |||
Once the QMP/QAPI command is received, prior to creating the block-commit block job, the handler: | |||
# Looks up the base image | |||
# Looks up the top image (the active layer for the device if not specified) | |||
# Converts the base image from r/o to r/w | |||
The block-commit job is started only if all of the above steps succeed. | |||
Inside the block-commit coroutine, sectors are copied according to the speed throttling parameter, from the layers between the base image and the top image. This data is copied according to where in the image chain it is allocated. | |||
For new data in the top image, an approach similar to active mirroring is used. Sectors are marked as dirty, and new writes update the dirty bitmap sector. The dirty bitmap sectors are copied over, but without speed throttling, in order to encourage convergence. | |||
Once both existing data and new data are committed into the specified base image, the live image chain is manipulated with bdrv_swap() so that intermediate images are bypassed. | |||
If the top image was the active layer, then the base image will remain r/w and become the new active layer upon the successful job completion. Otherwise, the base image will be reopened as r/o, as it was prior to the block-commit operation. | |||
[[Image:Commit-active-layer-example.png|thumb|center|600px|''Drawing 6: Commit From The Active Layer'']] | |||
Note that if management software such as libvirt is used, it may be possible for it to monitor the state of convergence, and optionally pause the guest if desired. | |||
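The split between throttled 'old' data and unthrottled 'new' data described in this section can be illustrated with a toy model. The following is only a sketch under stated assumptions, not QEMU's implementation: the function name `commit_active`, the round-based timing, and the per-round budget are all hypothetical and exist only for illustration.

```python
def commit_active(old_dirty, guest_writes, budget_per_round):
    """Toy model of committing the active layer; returns the rounds needed.

    old_dirty        -- sectors that were dirty when the job started
    guest_writes     -- list of sector sets, one per round, written by the
                        guest while the job runs
    budget_per_round -- throttled number of 'old' sectors copied per round
    """
    old = set(old_dirty)
    writes = iter(guest_writes)
    rounds = 0
    while old:
        rounds += 1
        new = next(writes, set())            # guest keeps dirtying sectors
        for sector in sorted(old)[:budget_per_round]:
            old.discard(sector)              # 'old' data: throttled copy
        old -= new                           # rewritten sectors count as 'new'
        # 'new' data is copied in the same round, without throttling,
        # so it never accumulates and the job can converge.
    return rounds


print(commit_active({0, 1, 2, 3}, [{1}, {9}], budget_per_round=2))
```

Because the 'new' sectors are never queued behind the throttle, the number of rounds depends only on the amount of 'old' data and the budget.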
=== Block Commit API === | |||
The proposed API is shown below: | |||
{ 'command': 'block-commit', 'data': { 'device': 'str', '*base': 'str', | |||
'*top': 'str', '*speed': 'int' } } | |||
'''block-commit [proposal]''' | |||
{|border=1 cellpadding="5" cellspacing="0" | |||
|- | |||
!Parameter (JSON String) | |||
!Description | |||
|- | |||
| style="background:#efefef;"| '''device''' || width=500; | block device to stream | |||
|- | |||
| style="background:#efefef;"| '''base''' || width=500; | base image, the image into which data is copied | |||
|- | |- | ||
| ''' | | style="background:#efefef;"| '''top''' || width=500; | top image, only sectors below this image are copied, into base image. ''optional - if not specified, the top image is the active layer'' | ||
|- | |- | ||
| ''' | | style="background:#efefef;"| '''speed''' || width=500; | speed throttling, in B/s, of the stream operation. ''optional'' | ||
|} | |} | ||
=Future features= | |||
Internal snapshots to images which support internal snapshots (QCOW2 & QED) are not expected to be supported initially. | Internal snapshots to images which support internal snapshots (QCOW2 & QED) are not expected to be supported initially. | ||
Line 63: | Line 263: | ||
However, this poses a number of problems. When creating the COW headers for the new image file, as the COW header needs to know the file name of the disk image it is pointing to. On Linux this can be obtained through '''/proc/self/fd/<X>''' but this is not available on all other operating systems. | However, this poses a number of problems. When creating the COW headers for the new image file, as the COW header needs to know the file name of the disk image it is pointing to. On Linux this can be obtained through '''/proc/self/fd/<X>''' but this is not available on all other operating systems. | ||
An alternative solution would be to extend the '''getfd''' interface to take an optional file name. However this | An alternative solution would be to extend the '''getfd''' interface to take an optional file name. However this would be a hack and open up for errors, as it would allow a broken/hostile guest/QEMU process to create an image which points to the wrong place, but which wouldn't be discovered until the time where the image was actually booted. | ||
Allowing the controlling application to create the COW headers in the new image is not an acceptable solution. It is race prone and would cause problems for COW formats where the new COW headers include state as of when they are created. | Allowing the controlling application to create the COW headers in the new image is not an acceptable solution. It is race prone as the image is not following the backing file which is still in use, and would also cause problems for COW formats where the new COW headers include state as of when they are created. | ||
===Separating into multiple commands=== | ===Separating into multiple commands=== | ||
There are suggestions for splitting the snapshot process into multiple monitor/QMP commands. The process would be split as follows, using human monitor style commands as example: | There are suggestions for splitting the snapshot process into multiple monitor/QMP commands to allow for asynchronous command processing. The process would be split as follows, using human monitor style commands as example: | ||
( | (agent) guest-agent-fsfreeze | ||
Call guest agent requesting it to freeze all file systems and flush all I/O requests. | Call guest agent requesting it to freeze all file systems and flush all I/O requests. | ||
Line 84: | Line 284: | ||
(qemu) snapshot-blkdev-async <blockX> fd:snapshotfd <format> | (qemu) snapshot-blkdev-async <blockX> fd:snapshotfd <format> | ||
Initiate asynchronous snapshot of device '''<blockX>''' to | Initiate asynchronous snapshot of device '''<blockX>''' to recently provided file descriptor '''snapshotfd'''. This will write the COW headers to the snapshot device, and pivot the block device '''<blockX>''' to point to the new device, using the original file/device as it's backing file. It is important to note that it is QEMU which will generate the COW headers in the new snapshot file, externally creating these will not be allowed! | ||
On completion a completion notification will be returned to the caller, hence this will require '''QAPI''' in place for proper async QMP command support. | On completion a completion notification will be returned to the caller, hence this will require '''QAPI''' in place for proper async QMP command support. | ||
Line 92: | Line 292: | ||
Un-freeze I/O processing for device '''<blockX>''' | Un-freeze I/O processing for device '''<blockX>''' | ||
( | (agent) guest-agent-fsthaw | ||
Call guest agent requesting it to thaw/unfreeze all file systems within the guest. | Call guest agent requesting it to thaw/unfreeze all file systems within the guest. | ||
Line 103: | Line 303: | ||
===Live merge=== | ===Live merge=== | ||
See | See http://wiki.qemu.org/Features/LiveBlockMigration | ||
== Other proposed qemu features that solve similar or related problems == | == Other proposed qemu features that solve similar or related problems == | ||
[[Category:Completed feature pages]] | |||
Latest revision as of 14:58, 11 October 2016
Live Snapshots
This document describes the current design of live snapshots for QEMU. It is a work in progress and may change as the work continues.
Overall concept
The idea is to be able to issue a command to QEMU via the monitor or QMP, which causes QEMU to create a new snapshot image with the original image as the backing file, mounted read-only. This will allow the original image file to be backed up.
Roll-back to a previous version requires one to boot from the previous backing file, at which point the snapshot file becomes invalid. Unfortunately there is no way to detect that a backing file has been booted, making it important for administrators to take care to not rely on snapshot files being valid after a roll-back.
The snapshot image must be in a format that supports backing files, i.e. QCOW2 or QED; however, the original image can be of any supported format. For example, it is possible to make a QCOW2 snapshot of a RAW image, or a QED snapshot of a QED image.
Guest Agent
Certain operations in the snapshot process can be improved through support from within the guest. These features will be implemented in the Guest Agent. See the Guest Agent page for design and implementation details.
The two main guest agent features of interest to live snapshots are:
- File system freeze (fsfreeze/fsthaw): This puts the guest file systems into a consistent state, avoiding the need for fsck next time they are mounted.
- Guest application notification: This allows guest applications to register and be notified prior to a snapshot, so that they can flush their data to disk. This is a future feature!
As of this writing (July 25, 2011), communication with the QEMU guest agent is performed via a virtio serial channel. Commands are sent over the channel encoded as QMP commands, and replies are encoded as QMP replies. There are future plans to implement a passthrough mechanism for agent commands issued via QMP, allowing these commands to be accessible via the QMP monitor instead of an external agent socket on the host.
Note that guest agent collaboration is also needed for snapshots using other methods, such as snapshots performed on btrfs, LVM, enterprise storage, etc.
Snapshot command flow
The snapshot command flow is as follows. Commands are demonstrated using monitor commands for QEMU and agent commands are marked (agent). See the Guest Agent: Example Usage page for details on the specific command implementation for the guest agent commands.
- Run the guest, if not currently running:
(qemu) cont
- RECOMMENDED: Call guest agent requesting it to freeze all file systems and flush all I/O requests. Note that this runs on the guest, and as such the guest must currently be running:
(agent) guest-fsfreeze-freeze
- Initiate synchronous snapshot of device <blockX> to new device snapshot-file:
(qemu) snapshot_blkdev <blockX> <snapshot-file> <format>
Note: The above will write the COW headers to the snapshot device, and pivot the block device <blockX> to point to the new device, using the original file/device as its backing file. It is important to note that it is QEMU which will generate the COW headers in the new snapshot file.
During snapshot creation the guest will momentarily be halted by QEMU. Pending I/Os will be flushed to disk, the COW headers will be created in the snapshot file/device, and QEMU will replace the file backing device <blockX> with the new snapshot file. On completion of the command, the guest will resume running as the command returns, unless the admin tool explicitly issued the optional stop command beforehand.
This command is repeated for each device that is to be snapshotted.
- Call guest agent requesting it to thaw/unfreeze all file systems within the guest (if guest-fsfreeze-freeze was issued above):
(agent) guest-fsfreeze-thaw
At this point, the snapshot for the device is complete, and QEMU has pivoted the guest to the new snapshot file for execution.
To visualize this sequence, below are call sequences showing the order and direction of these commands going to both QEMU and the guest agent:
Minimum set of commands:
  Guest          Manager                     QEMU
 -------        --------                    -------
    |               |                          |
    |               |                          |
    | <<- freeze ---o                          |
    |               |                          |
    |               o--- snapshot_blkdev --->> |
    |               |                          |
    | <<- thaw -----o                          |
    |               |                          |
    |               |                          |
    |               |                          |
    =               =                          =
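The same sequence can be written down as an ordered list of JSON commands. This is a minimal sketch, assuming the manager talks to two separate channels (the guest agent and the QMP monitor); the helper name `snapshot_flow` and the device-to-file mapping are hypothetical, while the command names are the ones used on this page.

```python
import json


def snapshot_flow(devices, fmt="qcow2", freeze=True):
    """Return the ordered (channel, command) steps for a live snapshot.

    `devices` maps a block device name to its new snapshot file.
    """
    steps = []
    if freeze:
        # Recommended: quiesce guest file systems first (guest agent channel).
        steps.append(("agent", {"execute": "guest-fsfreeze-freeze"}))
    for device, snapshot_file in devices.items():
        # One synchronous snapshot command per device (QMP monitor channel).
        steps.append(("qmp", {
            "execute": "blockdev-snapshot-sync",
            "arguments": {"device": device,
                          "snapshot-file": snapshot_file,
                          "format": fmt},
        }))
    if freeze:
        # Thaw only if the freeze was issued above.
        steps.append(("agent", {"execute": "guest-fsfreeze-thaw"}))
    return steps


for channel, command in snapshot_flow({"virtio0": "/some/place/my-image"}):
    print(channel, json.dumps(command))
```

Keeping the thaw step conditional on the freeze step mirrors the rule that guest-fsfreeze-thaw is only called if guest-fsfreeze-freeze was issued.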
HMP command
The HMP (monitor) command is designed to be flexible enough to handle both internal and external snapshots, as well as snapshots to various different snapshot file formats.
snapshot_blkdev device snapshot-file [format]:
Parameter | Description |
---|---|
device | block device to snapshot |
snapshot-file | target snapshot file (new image filename) |
format | format of snapshot image, valid formats are QCOW2 & QED. If not specified, the image will default to QCOW2. |
QMP command
The QMP command matches the behaviour of the human monitor command, except it is named slightly differently to match the fact that the command is synchronous.
blockdev-snapshot-sync device snapshot-file [format]
Parameter (JSON String) | Description |
---|---|
device | block device to snapshot |
snapshot-file | target snapshot file (new image filename) |
format | format of snapshot image, valid formats are QCOW2 & QED. If not specified, the image will default to QCOW2. |
Here is an example of a QMP snapshot command, in JSON format:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "virtio0", "snapshot-file": "/some/place/my-image", "format": "qcow2" } }
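For readers who want to issue this command from a script, the sketch below builds the JSON shown above and sends it over a QMP Unix socket. It is illustrative only: the function names and the socket path are assumptions, and the client ignores error handling beyond skipping asynchronous events. It relies on the QMP convention that the server sends a greeting first and that `qmp_capabilities` must be executed before any other command.

```python
import json
import socket


def build_snapshot_command(device, snapshot_file, fmt=None):
    """Build the blockdev-snapshot-sync command as a JSON-ready dict."""
    arguments = {"device": device, "snapshot-file": snapshot_file}
    if fmt is not None:
        arguments["format"] = fmt        # when omitted, QEMU defaults to qcow2
    return {"execute": "blockdev-snapshot-sync", "arguments": arguments}


def run_qmp(socket_path, command):
    """Send one command over a QMP Unix socket and return the reply dict.

    `socket_path` is a hypothetical path chosen when starting QEMU.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())            # greeting banner
        for cmd in ({"execute": "qmp_capabilities"}, command):
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            while True:
                reply = json.loads(stream.readline())
                if "return" in reply or "error" in reply:
                    break                        # skip asynchronous events
    return reply


print(json.dumps(build_snapshot_command("virtio0", "/some/place/my-image", "qcow2")))
```

A caller would then run something like `run_qmp("/tmp/qmp.sock", build_snapshot_command("virtio0", "/some/place/my-image"))`, where the socket path is whatever was configured for the QEMU instance.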
Atomic Snapshots of Multiple Devices
With the new transaction-based block commands, it is now possible to take atomic snapshots of multiple devices. For more details on the group snapshot API, please see: Atomic Snapshots of Multiple Devices
Live Snapshot Merge
Creating snapshots through the QEMU live snapshot commands allows incremental guest image files to be created, with each image file containing the differences from its parent backing file.
While these snapshot files are useful for backup and other purposes, there exists a need to manage these snapshot files so that they can be merged (flattened). Without the ability to merge and flatten snapshot images, the snapshot chain will continue to grow as new snapshots are made, which may become difficult to manage, in addition to introducing performance concerns.
In order to flatten the image, there are two approaches: block streaming, and block commit. Both of these operations can be performed 'live', while the guest OS is running. Block streaming takes data from parent image(s), and copies (streams) the data to the active layer. Block commit takes data from child(ren) image(s), and copies this data into the parent.
Block Streaming
Streaming to the Active Layer
The current mode for merging QEMU external snapshots while the emulator is 'live' is via block streaming, which streams sectors located in parent snapshots into the active layer (the endmost 'child'). An optional base file can be specified, so that only sectors between the base and the active layer are streamed to the top. Drawing 1, below, shows an example chain of external QEMU snapshots.
During live block merge, performed with the command 'block-stream', the chain can be fully or partially collapsed upwards, towards the active layer. Drawing 2 illustrates flattening out part of the chain, leaving only the base backing file in place:
As Drawing 2 illustrates, Snap-1 and Snap-2 have their sectors streamed into Snap-3, while the RootBase sectors are not streamed into Snap-3. This leaves the final snapshot chain, with Snap-3 as the active layer, consisting of just RootBase as the parent and Snap-3 as the child.
Assuming a virtio driver, the block-stream command to perform the merge shown in Drawing 2 is as follows:
{ "execute": "block-stream", "arguments": { "device": "virtio0", "base": "RootBase.img" } }
It is worth noting that the QEMU block_stream command does not delete any external snapshot images from storage, so it is the responsibility of the user (or management software) to clean up unwanted or unneeded snapshots. However, these 'unneeded' snapshots are still valid snapshots, and could be used if desired.
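The effect of streaming on a chain can be modelled in a few lines. The following is a toy simulation, not QEMU code: images are represented as hypothetical (name, sectors) pairs, where `sectors` maps a sector number to the data allocated in that image, and the function name `block_stream` is chosen only to echo the command.

```python
def block_stream(chain, base=None):
    """Simulate block-stream on a backing chain and return the new chain.

    `chain` is a list of (name, sectors) pairs ordered from the root
    backing file to the active layer. Images that drop out of the chain
    are not deleted; they simply stop being referenced.
    """
    names = [name for name, _ in chain]
    start = names.index(base) + 1 if base is not None else 0
    active_name, active = chain[-1]
    merged = dict(active)
    # Walk from the layer just below the active one down to just above
    # the base: the nearest layer that allocates a sector wins.
    for _, sectors in reversed(chain[start:-1]):
        for number, data in sectors.items():
            merged.setdefault(number, data)
    return chain[:start] + [(active_name, merged)]


chain = [
    ("RootBase", {0: "r0", 1: "r1"}),
    ("Snap-1", {1: "a1"}),
    ("Snap-2", {2: "b2"}),
    ("Snap-3", {1: "c1"}),
]
print(block_stream(chain, base="RootBase"))
```

With `base="RootBase"` the result is a two-image chain, RootBase and Snap-3, matching Drawing 2; with no base, every sector ends up in Snap-3 and the chain has a single image.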
Design of Block Streaming to Active Layer
The QMP/QAPI command for block-stream is handled within blockdev.c, and the handler is responsible for finding the specified image file for the active layer, and then initiating the streaming process.
Once the streaming process is initiated, a block job is created. This block job is implemented as a coroutine, which operates by means of cooperative multitasking with other coroutines and threads. The block job is responsible for copying sectors that are located between the 'base' image and the active layer, up into the active layer.
The block job, as specified in the block-stream command, takes an optional parameter for speed. This may be used to throttle the streaming process, and the block job will cooperatively yield according to the speed parameter. Note, however, that in the absence of speed throttling, cooperative yields still occur in the block job.
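The interaction between throttling and cooperative yielding can be sketched with a Python generator standing in for the coroutine. This is an analogy under stated assumptions, not QEMU's rate limiter: the name `stream_job`, the chunk size, and the simple `bytes / speed` delay formula are all hypothetical.

```python
def stream_job(total_bytes, chunk_bytes, speed=None):
    """Yield (bytes_copied, delay_seconds) after each chunk is copied.

    `speed` is in bytes per second; None means unthrottled.
    """
    copied = 0
    while copied < total_bytes:
        step = min(chunk_bytes, total_bytes - copied)
        copied += step                      # stand-in for copying sectors
        delay = step / speed if speed else 0.0
        yield copied, delay                 # yield even when delay is 0


for copied, delay in stream_job(total_bytes=1024, chunk_bytes=512, speed=256):
    print(copied, delay)
```

The job hands control back after every chunk. The speed parameter only changes how long the caller should wait before resuming it, which is why an unthrottled job still yields.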
Block Stream API
The current block streaming API is:
{ 'command': 'block-stream', 'data': { 'device': 'str', '*base': 'str', '*speed': 'int' } }
block-stream
Parameter (JSON String) | Description |
---|---|
device | block device to stream |
base | base image, only sectors above this image are streamed. optional |
speed | speed throttling, in B/s, of the stream operation. optional |
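As an illustration of how the schema above can be read, the following Python sketch checks a set of arguments against it, treating a leading '*' as the marker for an optional member. The helper validate_arguments and the TYPES table are hypothetical and written for this page only; they are not QEMU's own validation code.

```python
BLOCK_STREAM_SCHEMA = {"device": "str", "*base": "str", "*speed": "int"}
TYPES = {"str": str, "int": int}


def validate_arguments(schema, arguments):
    """Return a list of problems; an empty list means the arguments fit."""
    problems = []
    known = {}
    for key, type_name in schema.items():
        optional = key.startswith("*")   # '*name' marks an optional member
        name = key.lstrip("*")
        known[name] = TYPES[type_name]
        if not optional and name not in arguments:
            problems.append(f"missing required argument '{name}'")
    for name, value in arguments.items():
        if name not in known:
            problems.append(f"unknown argument '{name}'")
        elif isinstance(value, bool) or not isinstance(value, known[name]):
            # bool is rejected explicitly because it is a subclass of int
            problems.append(
                f"argument '{name}' must be of type {known[name].__name__}")
    return problems


print(validate_arguments(BLOCK_STREAM_SCHEMA,
                         {"device": "virtio0", "base": "RootBase.img"}))
print(validate_arguments(BLOCK_STREAM_SCHEMA, {"base": "RootBase.img"}))
```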
Streaming to an Intermediate Layer [proposal]
Similar to streaming to the active layer, it is possible to stream to an intermediate layer. When streaming to an intermediate layer, QEMU must ensure that the intermediate layer is changed from read-only to read-write for the streaming, and back to read-only once streaming has completed. Intermediate layers are opened read-only by QEMU.
The command is the same, except that the 'device' argument now also allows a node name.
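The read-only to read-write transition, and the guarantee that the layer returns to read-only afterwards, can be sketched with a Python context manager. The Image class and the writable helper are hypothetical stand-ins for illustration; the actual reopen mechanism inside QEMU is not shown here.

```python
from contextlib import contextmanager


class Image:
    """Minimal stand-in for an image in the backing chain."""

    def __init__(self, name, read_only=True):
        self.name = name
        self.read_only = read_only


@contextmanager
def writable(image):
    """Make an image read-write for the duration of a streaming job."""
    was_read_only = image.read_only
    image.read_only = False
    try:
        yield image
    finally:
        # Restore the original mode even if the job fails part-way.
        image.read_only = was_read_only


snap2 = Image("Snap-2")
with writable(snap2) as target:
    print(target.name, "read-only during streaming:", target.read_only)
print(snap2.name, "read-only after streaming:", snap2.read_only)
```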
Block Commit [proposal, preliminary]
There are reasons why it may be desirable to commit data from the child up into the parent image. For instance, if there is no desire to keep any of the intermediate images, it may be more efficient to commit the child (or children) into a parent node. Often, the parent node may be the larger image, and as such it would be less I/O intensive to commit the child into the parent, as shown in Drawing 3:
Limitations of Block Commit
While live commit can be used while the guest is live, to write data into a base or intermediate image from the active layer, there are important consequences to keep in mind. Since the image commit chain is a directed acyclic graph (DAG), each image may have other children. Such children will be unknown to QEMU, and will also be invalidated by a commit. Drawing 4 illustrates this issue.
Both Drawing 4 and Drawing 5 show a snapshot chain along the Snap-B branch being flattened into two images – an active image, with a backing file RootBase.
In Drawing 4, the sectors residing in Snap-B-1 and Snap-B-2 are written into their parent image, Snap-1. As Snap-A-1 and its descendants are dependent on the original state of Snap-1, all of Snap-A-1 and its descendants are now invalid. In addition, Snap-B-1 is invalid, and Snap-B-2 will become invalid on the first write to the freshly committed Snap-1. Snap-1 now becomes the active layer, and is functionally equivalent to the original Snap-B-2, without the intermediate images.
In contrast, Drawing 5 shows a similar operation, except instead of sectors being copied back into Snap-1, sectors are copied into Snap-B-2 from Snap-1 and Snap-B-1. This leaves the Snap-A branch still valid, and Snap-B-1 valid as well.
Live commit can prove useful in scenarios in which the operator (or management software) is completely aware of (or apathetic towards) all snapshots downstream from the committed layer. It may, in certain scenarios, be a much faster operation to flatten an image chain by committing to a parent image, rather than streaming into an active layer.
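Management software that does track the full image graph could compute which images a commit would invalidate before issuing it. The following Python sketch models the rule described for Drawing 4. The parents mapping, the function name commit_invalidates and the image Snap-A-2 (standing in for "descendants of Snap-A-1") are assumptions made for the example.

```python
def commit_invalidates(parents, top, base):
    """Model which images a commit of top into base invalidates.

    parents maps each image to its backing file (None for the root).
    Returns (invalid_now, invalid_on_first_write).
    """

    def descends_from_base(img):
        node = parents.get(img)
        while node is not None:
            if node == base:
                return True
            node = parents.get(node)
        return False

    if not descends_from_base(top):
        raise ValueError("base is not a backing image of top")
    # Every other image that depends on base sees its backing data change.
    invalid_now = sorted(
        img for img in parents if img != top and descends_from_base(img))
    # top stays consistent until the committed base receives its first write.
    return invalid_now, [top]


parents = {
    "RootBase": None,
    "Snap-1": "RootBase",
    "Snap-A-1": "Snap-1",
    "Snap-A-2": "Snap-A-1",
    "Snap-B-1": "Snap-1",
    "Snap-B-2": "Snap-B-1",
}
print(commit_invalidates(parents, top="Snap-B-2", base="Snap-1"))
```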
Design of Block Commit
A lot of the basic design of block commit is similar to block-stream. However, there are a couple of important differences:
- The destination (base) image is currently opened in the chain read-only, and must be reopened with read-write access modes.
- If the active layer is being committed into the 'base' image, convergence becomes an issue. As this is a live operation, the guest may still be writing to the active layer image, so while sectors are copied into the base image, new sectors are constantly created (i.e., new dirty sectors). This may require special handling, similar to the block-mirror code.
The difference in item 1) is handled by the introduction of a new bdrv_reopen() function (proposed by Supriya from IBM), that allows an image to be reopened in read-write mode.
The difference in item 2) is handled by treating 'old' data (data that was dirty when the live commit began) differently from 'new' data (data that has become dirty from new guest activity). The 'old' data would be subject to the speed throttling parameter from the proposed block-commit command, while the 'new' data would not – it would be treated like an active mirror, with data being committed to the 'base' image at a non-throttled rate.
Once the QMP/QAPI command is received, prior to creating the block-commit block job, the handler:
- Looks up the base image
- Looks up the top image (the active layer for the device if not specified)
- Converts the base image from r/o to r/w
The block-commit job is started only if all of the above steps complete successfully.
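The three handler steps can be sketched as a single all-or-nothing preparation function. This is a simplified Python model, not the blockdev.c handler: the chain list, the read_only dictionary, the CommitError class and the requirement that base is always given are assumptions made for the example.

```python
class CommitError(Exception):
    """Raised when the commit job must not be started."""


def prepare_commit(chain, read_only, base, top=None):
    """Run the pre-job checks; chain is ordered from root to active layer.

    read_only maps image names to their current access mode. It is only
    modified once every lookup has succeeded.
    """
    # Step 1: look up the base image.
    if base not in chain:
        raise CommitError(f"base image '{base}' not found")
    # Step 2: look up the top image, defaulting to the active layer.
    if top is None:
        top = chain[-1]
    elif top not in chain:
        raise CommitError(f"top image '{top}' not found")
    if chain.index(base) >= chain.index(top):
        raise CommitError("base must be a backing image of top")
    # Step 3: convert the base image from r/o to r/w.
    read_only[base] = False
    return base, top


chain = ["RootBase", "Snap-1", "Snap-2"]
modes = {"RootBase": True, "Snap-1": True, "Snap-2": False}
print(prepare_commit(chain, modes, base="RootBase"))
print(modes)
```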
Inside the block-commit coroutine, sectors are copied according to the speed throttling parameter, from the layers between the base image and the top image. This data is copied according to where in the image chain it is allocated.
For new data in the top image, an approach similar to active mirroring is used. Sectors are marked as dirty, and new writes update the dirty bitmap sector. The dirty bitmap sectors are copied over, but without speed throttling, in order to encourage convergence.
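The different treatment of existing and new data can be illustrated with a small simulation. It is a deliberately simplified Python model and makes several assumptions: time advances in discrete ticks, guest writes arrive at the start of a tick, and the name simulate_commit and its parameters are invented for this example.

```python
def simulate_commit(old_sectors, guest_writes, old_per_tick):
    """Model a commit job tick by tick.

    old_sectors: sectors already allocated when the job starts.
    guest_writes: one collection of newly written sectors per tick.
    old_per_tick: throttle applied to the existing data only.
    Returns a list of (old_copied, new_copied) pairs, one per tick.
    """
    if old_per_tick <= 0:
        raise ValueError("old_per_tick must be positive")
    pending_old = sorted(set(old_sectors))
    writes = iter(guest_writes)
    log = []
    while pending_old:
        # New guest writes mark sectors dirty.
        dirty = set(next(writes, ()))
        # Existing data is copied at the throttled rate.
        batch = pending_old[:old_per_tick]
        pending_old = pending_old[old_per_tick:]
        # Dirty sectors are all copied in the same tick, without throttling.
        log.append((len(batch), len(dirty)))
    return log


print(simulate_commit(range(6), [{10, 11}, {12}], old_per_tick=2))
```

In this model the job always finishes once the existing data has been copied, because new writes never accumulate. A real guest can dirty sectors faster than they are copied, which is why convergence needs attention.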
Once both the existing data and the new data are committed into the specified base image, the live image chain is manipulated with bdrv_swap() so that intermediate images are bypassed.
If the top image was the active layer, then the base image will remain r/w and become the new active layer upon the successful job completion. Otherwise, the base image will be reopened as r/o, as it was prior to the block-commit operation.
Note that if management software such as libvirt is used, it may be possible for it to monitor the state of convergence, and optionally pause the guest if desired.
Block Commit API
The proposed API is shown below:
{ 'command': 'block-commit', 'data': { 'device': 'str', '*base': 'str', '*top': 'str', '*speed': 'int' } }
block-commit [proposal]
Parameter (JSON String) | Description |
---|---|
device | block device to commit |
base | base image, the image into which data is copied |
top | top image, only sectors below this image are copied, into base image. optional - if not specified, the top image is the active layer |
speed | speed throttling, in B/s, of the commit operation. optional |
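A management application could assemble the proposed command as follows. This Python sketch only builds the JSON text; the helper name block_commit_command and the Snap-1.img file name are assumptions, and since the API is a proposal the final argument names may differ.

```python
import json


def block_commit_command(device, base=None, top=None, speed=None):
    """Build the JSON text of a block-commit request (proposed API)."""
    arguments = {"device": device}
    # Optional members are only included when the caller supplies them.
    for name, value in (("base", base), ("top", top), ("speed", speed)):
        if value is not None:
            arguments[name] = value
    return json.dumps({"execute": "block-commit", "arguments": arguments})


# Commit everything above Snap-1.img, up to the active layer, into Snap-1.img.
print(block_commit_command("virtio0", base="Snap-1.img"))
```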
Future features
Internal snapshots to images which support internal snapshots (QCOW2 & QED) are not expected to be supported initially.
There have been requests and suggestions for a number of alternative and enhanced interfaces for accessing live snapshots as follows:
internal snapshots
If the snapshot-file argument of the monitor and QMP command were made optional, omitting it could be used as a request to make the snapshot internally instead of to an external file. However, without live block migration of an internal snapshot, there is no way to make a backup of an internal snapshot while still leaving the VM running, so this feature is not planned at present. For now, the snapshot-file argument is required, and only external snapshots are implemented.
fd passed as target for snapshot file/device
To get around problems with selinux, in particular in conjunction with images based on NFS, there is a wish to be able to pass an already open file descriptor using the getfd interface.
However, this poses a number of problems. When creating the COW headers for the new image file, QEMU needs to know the file name of the disk image that the new image points to. On Linux this can be obtained through /proc/self/fd/<X>, but this is not available on all other operating systems.
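The Linux-specific lookup amounts to reading a symbolic link. The Python sketch below shows the idea and why it is not portable: on systems without /proc the lookup simply fails. The function name backing_name_from_fd is hypothetical, and QEMU itself would perform the equivalent lookup in C.

```python
import os


def backing_name_from_fd(fd):
    """Return the path an open descriptor refers to, or None if unknown."""
    try:
        # On Linux, /proc/self/fd/<X> is a symlink to the opened file.
        return os.readlink(f"/proc/self/fd/{fd}")
    except OSError:
        # No /proc (other operating systems) or an invalid descriptor.
        return None


fd = os.open(os.devnull, os.O_RDONLY)
try:
    print(backing_name_from_fd(fd))
finally:
    os.close(fd)
```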
An alternative solution would be to extend the getfd interface to take an optional file name. However, this would be a hack and would open the door to errors, as it would allow a broken or hostile guest or QEMU process to create an image which points to the wrong place, and this would not be discovered until the image was actually booted.
Allowing the controlling application to create the COW headers in the new image is not an acceptable solution. It is race prone as the image is not following the backing file which is still in use, and would also cause problems for COW formats where the new COW headers include state as of when they are created.
Separating into multiple commands
There are suggestions for splitting the snapshot process into multiple monitor/QMP commands to allow for asynchronous command processing. The process would be split as follows, using human monitor style commands as example:
(agent) guest-agent-fsfreeze
Call guest agent requesting it to freeze all file systems and flush all I/O requests.
(qemu) freeze-io <blockX>
Instruct QEMU to freeze all I/O processing for block device <blockX>
(qemu) getfd <fd> snapshotfd
Provide file descriptor <fd> and assign it the logical name snapshotfd
(qemu) snapshot-blkdev-async <blockX> fd:snapshotfd <format>
Initiate asynchronous snapshot of device <blockX> to recently provided file descriptor snapshotfd. This will write the COW headers to the snapshot device, and pivot the block device <blockX> to point to the new device, using the original file/device as its backing file. It is important to note that it is QEMU which will generate the COW headers in the new snapshot file; externally creating these will not be allowed!
On completion a completion notification will be returned to the caller, hence this will require QAPI in place for proper async QMP command support.
(qemu) thaw-io <blockX>
Un-freeze I/O processing for device <blockX>
(agent) guest-agent-fsthaw
Call guest agent requesting it to thaw/unfreeze all file systems within the guest.
(qemu) snapshot-blkdev-status <blockX>
Query the current snapshot status of <blockX>. In addition some form of notification of completion will be required.
Note that the caller can loop the process of commands freeze-io, getfd, snapshot-blkdev-async, and thaw-io to snapshot multiple block devices in one guest.
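The ordering of the proposed commands for several devices can be written down as a simple plan. The Python sketch below only generates the command strings in order; it does not talk to QEMU or the guest agent. The commands are the proposed ones listed above, while the function name snapshot_plan and the per-device descriptor names (snapshotfd-<device>) are assumptions made for the example.

```python
def snapshot_plan(devices, image_format):
    """Return the ordered (channel, command) pairs for the proposed flow.

    devices maps each block device name to the file descriptor number
    that will receive its snapshot.
    """
    # Freeze the guest file systems once, before touching any device.
    plan = [("agent", "guest-agent-fsfreeze")]
    for device, fd in devices.items():
        fd_name = f"snapshotfd-{device}"   # hypothetical per-device name
        plan += [
            ("qemu", f"freeze-io {device}"),
            ("qemu", f"getfd {fd} {fd_name}"),
            ("qemu", f"snapshot-blkdev-async {device} fd:{fd_name} {image_format}"),
            ("qemu", f"thaw-io {device}"),
        ]
    # Thaw the guest once all devices have been handled.
    plan.append(("agent", "guest-agent-fsthaw"))
    return plan


for channel, command in snapshot_plan({"virtio0": 5, "virtio1": 6}, "qcow2"):
    print(f"({channel}) {command}")
```

A caller would follow the plan with snapshot-blkdev-status queries, or wait for the completion notifications, for each device.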
Live merge
See http://wiki.qemu.org/Features/LiveBlockMigration