Features/SnapshotsMultipleDevices

From QEMU

Revision as of 11:54, 13 March 2012

Atomic Snapshots of Multiple Devices

The snapshot_blkdev/blockdev-snapshot-sync command in QEMU 1.0 performs snapshots one device at a time, even if a guest has multiple devices. This is troublesome if a snapshot fails: QEMU reverts the failing device to its original backing store, but the guest is still left in an overall inconsistent state, with some devices snapshotted and some not.

For instance, assume a guest has three devices: virtio0, virtio1, and virtio2. If the snapshots of virtio0 and virtio1 succeed but the snapshot of virtio2 fails, the guest ends up in an inconsistent state: virtio2 is reverted to its previous backing store, while virtio0 and virtio1 have already completed their snapshots.

The only workaround is to stop the machine completely while the snapshots are performed. Ideally, though, there would be a mechanism to take a snapshot of all devices as one atomic unit, so that the snapshot succeeds only if it succeeds for every device.

QEMU 1.1 implements a "transaction" QMP command that operates on multiple block devices atomically. The transaction command receives one or more "transactionable" QMP commands and their arguments; for now, the only transactionable command is blockdev-snapshot-sync. Execution of the commands is split into two phases, a prepare phase and a commit/rollback phase. Should any command fail during the prepare phase, the transaction immediately rolls back every prepare phase that has already completed. If all commands prepare successfully, they are committed; the commit phase cannot fail, so atomicity is achieved.
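
The prepare/commit/rollback structure can be modeled with a short Python sketch. This illustrates the control flow only; it is not QEMU's actual C implementation, and the action objects are hypothetical:

```python
# Simplified model of the transaction command's two-phase execution.
# Each action is an object exposing prepare(), commit() and rollback();
# this mirrors the control flow described above, nothing more.

def run_transaction(actions):
    """Run all actions atomically: commit only if every prepare succeeds.
    Returns True on commit, False after a rollback."""
    prepared = []
    for action in actions:
        try:
            action.prepare()        # may fail, e.g. on image creation error
        except Exception:
            # Roll back every action whose prepare phase already completed.
            for done in reversed(prepared):
                done.rollback()
            return False
        prepared.append(action)
    # All prepares succeeded; the commit phase cannot fail, so the
    # transaction as a whole is atomic.
    for action in prepared:
        action.commit()
    return True
```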

The transaction command is implemented using QAPI unions (discriminated records). Given the schema for a transactionable command, such as the following:

{ 'command': 'blockdev-snapshot-sync',
  'data': { 'device': 'str', 'snapshot-file': 'str', '*format': 'str' } }

a corresponding type is created and added to a union:

{ 'type': 'BlockdevSnapshot',
  'data': { 'device': 'str', 'snapshot-file': 'str', '*format': 'str' } }

{ 'union': 'BlockdevAction',
  'data': { 'blockdev-snapshot-sync': 'BlockdevSnapshot', /* ... */ } }

The transaction command then takes an array of actions:

{ 'command': 'transaction',
  'data': { 'actions': [ 'BlockdevAction' ] } }

Here is a sample execution of the command to snapshot two disks:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'hd0-snap.qcow2' } },
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio1', 'snapshot-file': 'hd1-snap.qcow2' } } ] } }
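
For illustration, a client could assemble such a message programmatically. The helper below is our own sketch, but the message layout follows the wire format shown above:

```python
import json

def snapshot_transaction(snapshots):
    """Build the QMP 'transaction' message for a list of
    (device, snapshot_file) pairs.  Hypothetical helper; the
    dictionary layout matches the sample execution above."""
    return {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {"type": "blockdev-snapshot-sync",
                 "data": {"device": dev, "snapshot-file": path}}
                for dev, path in snapshots
            ]
        },
    }

msg = snapshot_transaction([("virtio0", "hd0-snap.qcow2"),
                            ("virtio1", "hd1-snap.qcow2")])
# The serialized form is what a client would write to the QMP socket.
wire = json.dumps(msg)
```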

Application to live block copy

Another feature that is new in QEMU 1.1 is live block device streaming. This feature lets the guest retrieve data from a backing file while it is running; it enables quick provisioning of new virtual machines from shared remote storage, and lets the guest transition incrementally to fast local storage.

Streaming a block device to another location is also useful if management needs to migrate a guest's storage, for example in case of impending disk failures. However, in this context block streaming's fundamental deficiency is that the copy operation is performed while the virtual machine is already using the new storage; it is not possible to abort it and fall back to the old storage.

Luckily, storage migration is a simple extension of streaming. The block layer needs to be instructed to mirror writes to both the old and the new storage while streaming is in effect. Then, management can switch to the new storage at an arbitrary point after streaming is completed.

Unlike snapshotting, neither the start of block streaming nor the "release" of old storage needs to be done atomically across multiple devices. However, if the old storage has to be snapshotted at the time mirroring is started, then these two operations have to be done atomically.

Leaving aside for a moment the release operation, there are two possible implementation choices for an atomic snapshot+mirror operation. One is to specify both the snapshot destination and the mirror target, as in the following hypothetical QAPI schema:

{ 'command': 'drive-mirror',
  'data': { 'device': 'str', 'target': 'str', '*target-format': 'str',
            '*snapshot-file': 'str', '*snapshot-format': 'str' } }

This interface is simple to implement, but it has two disadvantages. First, the interface is complicated: libvirt and oVirt currently need the above snapshot+mirror process because they want to copy storage outside QEMU, yet the additional arguments are there for everyone, even for users who could let block device streaming do the copy. Second, the implementation must ensure a complete rollback of the snapshot operation in case mirroring fails. This is relatively complex to do; in fact, up to QEMU 1.0 blockdev-snapshot-sync could not even correctly roll back a single snapshot.

The latter requirement suggests plugging the drive-mirror command into the transaction command. The snapshot and mirror operations can simply be placed in the same transaction, which guarantees their atomicity. The schema then becomes:

{ 'command': 'drive-mirror',
  'data': { 'device': 'str', 'target': 'str', '*format': 'str' } }

{ 'type': 'BlockdevMirror',
  'data': { 'device': 'str', 'target': 'str', '*format': 'str' } }

{ 'union': 'BlockdevAction',
  'data': { 'blockdev-snapshot-sync': 'BlockdevSnapshot',
            'drive-mirror': 'BlockdevMirror' } }

and a sample execution of the command is as follows:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'ide0-hd0', 'snapshot-file': 'base.qcow2' } },
    { 'type': 'drive-mirror', 'data' :
      { 'device': 'ide0-hd0', 'target': 'mirror.qcow2' } } ] } }

Switching the device to the new storage at the end of the copy operation is handled with another QMP command, drive-reopen. This command is not transactionable, so it is not included in BlockdevAction:

{ 'command': 'drive-reopen',
  'data': { 'device': 'str', 'new-image-file': 'str', '*format': 'str' } }
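
Putting the pieces together, the storage-migration flow described above can be sketched as a sequence of QMP commands. The helper and its waiting step are our own illustration; the command names and argument keys come from the schemas in this section, and a real client would send each command over the QMP socket and wait for streaming to finish before issuing drive-reopen:

```python
def storage_migration_commands(device, src_snap, dest_img):
    """Yield the QMP commands for the storage-migration flow:
    an atomic snapshot+mirror transaction, then (after block
    streaming into dest_img completes) a drive-reopen.
    Illustrative sketch only."""
    yield {"execute": "transaction", "arguments": {"actions": [
        {"type": "blockdev-snapshot-sync",
         "data": {"device": device, "snapshot-file": src_snap}},
        {"type": "drive-mirror",
         "data": {"device": device, "target": dest_img}}]}}
    # ... management waits here until streaming into dest_img finishes ...
    yield {"execute": "drive-reopen", "arguments":
           {"device": device, "new-image-file": dest_img}}
```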

Image creation modes

In addition to the definitions above, QEMU 1.1 introduces a mode argument to the blockdev-snapshot-sync and drive-mirror commands. The argument applies both to standalone commands and to transactions. Its type is the NewImageMode enum:

{ 'enum': 'NewImageMode',
  'data': [ 'existing', 'absolute-paths', 'no-backing-file' ] }

The argument controls how QEMU creates the new image file:

  • existing directs QEMU to look for an existing image. The image must be on disk and should have the same contents as the disk that is currently attached to the virtual machine.
  • absolute-paths directs QEMU to create an image whose backing file is the current image. The current image is identified by an absolute path in the new image.
  • no-backing-file directs QEMU to create an image with no backing file at all. This is useful when the mirror target is a raw file, for example.
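
As a sketch, a client-side helper might validate the mode against the enum before building a standalone command. The helper and its validation are our own illustration; the argument names follow the schemas above, and we assume the optional mode argument is spelled 'mode' on the wire:

```python
# The three NewImageMode values from the enum above.
NEW_IMAGE_MODES = ("existing", "absolute-paths", "no-backing-file")

def blockdev_snapshot_sync(device, snapshot_file, fmt=None, mode=None):
    """Build a standalone blockdev-snapshot-sync command with the
    optional format and mode arguments.  Hypothetical helper."""
    if mode is not None and mode not in NEW_IMAGE_MODES:
        raise ValueError("unknown NewImageMode: %s" % mode)
    data = {"device": device, "snapshot-file": snapshot_file}
    if fmt is not None:
        data["format"] = fmt
    if mode is not None:
        data["mode"] = mode
    return {"execute": "blockdev-snapshot-sync", "arguments": data}
```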

In the future, it is planned to have another mode, relative-paths. It will also create an image whose backing file is the current image, but the current image will be identified by a relative path in the new image.

Image creation occurs in the prepare phase and uses the mode argument; however, the new backing file chain is composed in the commit phase with no regard to the mode. This matters when the same disk is included twice in a transaction, as in the following example:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'hd0-snap0.qcow2' } },
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'hd0-snap1.qcow2' } } ] } }

Assuming virtio0 is associated with hd0-base.qcow2, the backing file chain at the end of the transaction will be hd0-base.qcow2 <- hd0-snap0.qcow2 <- hd0-snap1.qcow2. However, the backing-file header of the hd0-snap1.qcow2 image file will point to hd0-base.qcow2, because the image was created during the prepare phase, when hd0-base.qcow2 was still the current image. This is useful when doing a combined snapshot+mirror operation:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'src/hd0-snap.qcow2' } },
    { 'type': 'drive-mirror', 'data' :
      { 'device': 'virtio0', 'target': 'dest/hd0-snap.qcow2' } } ] } }

Here, assume the backing storage is shared/hd0-base.qcow2. Mirroring will write to both src/hd0-snap.qcow2 and dest/hd0-snap.qcow2 as expected, and dest/hd0-snap.qcow2 will point to the original storage. As soon as block streaming completes, management can switch the device to dest/hd0-snap.qcow2. src/hd0-snap.qcow2 is then no longer part of the backing file chain and can be deleted.
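
The split between prepare-phase image creation and commit-phase chain composition can be summarized with a small Python model. It shows how the image headers and the runtime chain diverge when one device is snapshotted twice in a single transaction; it is an illustration of the semantics described above, not QEMU code:

```python
def transaction_snapshots(current_image, snapshot_files):
    """Model of backing-chain composition when one device receives
    several snapshots in a single transaction.  All images are created
    in the prepare phase, so each new file's backing-file header points
    at the image that was current *before* the transaction; the runtime
    chain is only rewired at commit."""
    headers = {}
    for snap in snapshot_files:
        headers[snap] = current_image          # prepare: header -> old image
    chain = [current_image] + list(snapshot_files)  # commit: rewired chain
    return headers, chain

headers, chain = transaction_snapshots(
    "hd0-base.qcow2", ["hd0-snap0.qcow2", "hd0-snap1.qcow2"])
```

Both snapshot files' headers name hd0-base.qcow2, while the chain after commit runs base <- snap0 <- snap1, matching the virtio0 example above.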