Features/SnapshotsMultipleDevices: Difference between revisions

From QEMU
Latest revision as of 12:14, 12 October 2016

Atomic Snapshots of Multiple Devices

The snapshot_blkdev/blockdev-snapshot-sync command in QEMU 1.0 performs snapshots one device at a time, even if a guest has multiple devices. This is troublesome when a snapshot fails: qemu reverts the failing device to its original backing store, but the guest as a whole is left in an inconsistent state, with some devices snapshotted and some not.

For instance, let us assume there are three devices in a guest: virtio0, virtio1, and virtio2. If we successfully perform a snapshot on virtio0 and virtio1, yet virtio2 fails, we could be in an inconsistent state. While we will have reverted virtio2 back to the previous backing store, virtio0 and virtio1 will have already successfully gone through the snapshot.

The only solution here is to stop the machine completely while the snapshots are performed. Ideally, though, there would be a mechanism to snapshot all devices as one atomic unit, so that the snapshot succeeds only if it succeeds on every device.

QEMU 1.1 implements a "transaction" QMP command that operates on multiple block devices atomically. The transaction command receives one or more "transactionable" QMP commands and their arguments; the only transactionable command for now is blockdev-snapshot-sync. Execution of the commands is then split into two phases, a prepare phase and a commit/rollback phase. Should any command fail the prepare phase, the transaction immediately proceeds to roll back the completed prepare phases. If all commands are prepared successfully they are committed; the commit phase cannot fail, so that atomicity is achieved.
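The prepare and commit/rollback phases described above can be sketched in a few lines of Python. This is only an illustrative model, not QEMU's implementation: the Action type, its three callbacks, and run_transaction are hypothetical names introduced here.

```python
from typing import Callable, List, NamedTuple


class Action(NamedTuple):
    """One transactionable action, split into three callbacks (hypothetical)."""
    name: str
    prepare: Callable[[], None]   # may raise if the action cannot be prepared
    commit: Callable[[], None]    # assumed never to fail
    rollback: Callable[[], None]  # undoes a completed prepare


def run_transaction(actions: List[Action]) -> bool:
    """Return True if every action was committed, False if all were rolled back."""
    prepared: List[Action] = []
    for action in actions:
        try:
            action.prepare()
        except Exception:
            # Undo the prepares that already completed, most recent first.
            for done in reversed(prepared):
                done.rollback()
            return False
        prepared.append(action)
    # Every prepare succeeded, so commit everything.
    for action in prepared:
        action.commit()
    return True


log: List[str] = []


def make_action(name: str, fail: bool = False) -> Action:
    """Build an action that records what happens to it in the log."""
    def prepare() -> None:
        if fail:
            raise RuntimeError(f"cannot create snapshot for {name}")
        log.append(f"prepare {name}")
    return Action(name, prepare,
                  lambda: log.append(f"commit {name}"),
                  lambda: log.append(f"rollback {name}"))


ok = run_transaction([make_action("virtio0"), make_action("virtio1"),
                      make_action("virtio2", fail=True)])
print(ok, log)
```

In this model, the failure on virtio2 leaves no device committed, which is the property the three-device example above lacks in QEMU 1.0.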

The transaction command is implemented using QAPI unions (discriminated records). Given the schema for a transactionable command, such as the following:

{ 'command': 'blockdev-snapshot-sync',
  'data': { 'device': 'str', 'snapshot-file': 'str', '*format': 'str' } }

a corresponding type is created and added to a union:

{ 'type': 'BlockdevSnapshot',
  'data': { 'device': 'str', 'snapshot-file': 'str', '*format': 'str' } }

{ 'union': 'BlockdevAction',
  'data': { 'blockdev-snapshot-sync': 'BlockdevSnapshot', /* ... */ } }

The transaction command then takes an array of actions:

{ 'command': 'transaction',
  'data': { 'actions': [ 'BlockdevAction' ] } }

Here is a sample execution of the command to snapshot two disks:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'hd0-snap.qcow2' } },
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio1', 'snapshot-file': 'hd1-snap.qcow2' } } ] } }
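A management tool would normally build this command programmatically. The following is a minimal sketch using only the Python standard library; the helper names snapshot_action and transaction are made up for this example, while the field names come from the schema above.

```python
import json
from typing import Dict, List, Optional


def snapshot_action(device: str, snapshot_file: str,
                    fmt: Optional[str] = None) -> Dict:
    """Build one 'blockdev-snapshot-sync' member of the BlockdevAction union."""
    data = {"device": device, "snapshot-file": snapshot_file}
    if fmt is not None:
        data["format"] = fmt  # '*format' is optional in the schema
    return {"type": "blockdev-snapshot-sync", "data": data}


def transaction(actions: List[Dict]) -> str:
    """Serialize a 'transaction' command carrying the given actions."""
    return json.dumps({"execute": "transaction",
                       "arguments": {"actions": actions}})


command = transaction([
    snapshot_action("virtio0", "hd0-snap.qcow2"),
    snapshot_action("virtio1", "hd1-snap.qcow2"),
])
print(command)
```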

Status: merged into master

Application to live block copy

Another feature that is new in QEMU 1.1 is live block device streaming. This feature lets a guest retrieve data from a backing file while the guest is running; it enables quick provisioning of new virtual machines using shared remote storage, and lets the guest transition incrementally to fast local storage.

Streaming a block device to another location is also useful if management needs to migrate a guest's storage, for example in case of impending disk failures. However, in this context block streaming's fundamental deficiency is that the copy operation is performed while the virtual machine is already using the new storage; it is not possible to abort it and fall back to the old storage.

Luckily, storage migration is a simple extension of streaming. The block layer needs to be instructed to mirror writes to both the old and the new storage while streaming is in effect. Then, management can switch to the new storage at an arbitrary point after streaming is completed.

Unlike snapshotting, neither the start of block streaming nor the "release" of old storage needs to be done atomically across multiple devices. However, if the old storage has to be snapshotted at the time mirroring is started, then these two operations have to be done atomically.

Leaving aside for a moment the release operation, there are two possible implementation choices for an atomic snapshot+mirror operation. One is to specify both the snapshot destination and the mirror target, as in the following hypothetical QAPI schema:

{ 'command': 'drive-mirror',
  'data': { 'device': 'str', 'target': 'str', '*target-format': 'str',
            '*snapshot-file': 'str', '*snapshot-format': 'str' } }

This interface is simple to implement but it has two disadvantages. First, the interface is complicated. libvirt and oVirt right now need to do the above snapshot+mirror process because they want to copy storage outside QEMU; however, the additional arguments are there for everyone, even for people who can use block device streaming to do the copy. Second, the implementation must ensure a complete rollback of the snapshot operation in case mirroring fails. This is relatively complex to do; in fact, up to QEMU 1.0 blockdev-snapshot-sync could not even roll back a single snapshot correctly.

The latter requirement suggests plugging the drive-mirror command into the transaction command. The snapshot and mirror operations can then simply be placed in the same transaction, which guarantees their atomicity. The schema then becomes:

{ 'command': 'drive-mirror',
  'data': { 'device': 'str', 'target': 'str', '*format': 'str', 'full': 'bool' } }
{ 'type': 'BlockdevMirror',
  'data': { 'device': 'str', 'target': 'str', '*format': 'str', 'full': 'bool' } }
{ 'union': 'BlockdevAction',
  'data': { 'blockdev-snapshot-sync': 'BlockdevSnapshot',
            'drive-mirror': 'BlockdevMirror' } }

and a sample execution of the command is as follows:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'ide0-hd0', 'snapshot-file': 'base.qcow2' } },
    { 'type': 'drive-mirror', 'data' :
      { 'device': 'ide0-hd0', 'target': 'mirror.qcow2', 'full': false } } ] } }

Switching the device to the new storage at the end of the copy operation is handled with another QMP command, drive-reopen. This command is not transactionable, so it is not included in BlockdevAction:

{ 'command': 'drive-reopen',
  'data': { 'device': 'str', 'new-image-file': 'str', '*format': 'str' } }

Status: available at git://github.com/bonzini/qemu.git branch blkmirror

Image creation modes

Compared to the above definitions, QEMU 1.1 also introduces an optional mode argument to the blockdev-snapshot-sync and drive-mirror commands. The argument applies both to standalone commands and to transactions. Its type is the NewImageMode enum:

{ 'enum': 'NewImageMode',
  'data': [ 'existing', 'absolute-paths' ] }

The drive-mirror command also gains a mandatory full argument, to request whether the new image is shallow (shares the same backing file as the original) or full (no backing file). These arguments control how QEMU creates the new image file:

  • 'mode':'existing' for 'blockdev-snapshot-sync', and 'mode':'existing','full':'false' for 'drive-mirror', directs QEMU to look for an existing image. The image must be on disk and should have the same contents as the disk that is currently attached to the virtual machine. If 'format' is not provided, the image is probed for its type.
  • 'mode':'existing','full':'true' for 'drive-mirror' directs QEMU to look for an existing image, which must be on disk but have no contents. If 'format' is not provided, the image is probed for its type.
  • 'mode':'absolute-paths' for 'blockdev-snapshot-sync', and 'mode':'absolute-paths','full':'false' for 'drive-mirror', directs QEMU to create an image whose backing file is an absolute path to the current image. If no 'format' is provided, the new file will share the same format as the source.
  • 'mode':'absolute-paths','full':'true' for 'drive-mirror' directs QEMU to create an image with no backing file at all. This is useful when the mirror target is a raw file, for example. If no 'format' is provided, the new file will share the same format as the source.
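The four combinations listed above can be summarized as a small lookup function. This is only a restatement of the list in Python, not QEMU code; the NewImage type and new_image_behaviour function are hypothetical names, and the expectation strings paraphrase the bullets.

```python
from typing import NamedTuple, Optional


class NewImage(NamedTuple):
    created_by_qemu: bool  # False means QEMU opens a file that already exists
    expectation: str       # what the list above says about the image


def new_image_behaviour(command: str, mode: str,
                        full: Optional[bool] = None) -> NewImage:
    if command == "blockdev-snapshot-sync":
        if full is not None:
            raise ValueError("'full' only applies to drive-mirror")
        shallow = True  # snapshots behave like the 'full': false rows
    elif command == "drive-mirror":
        if full is None:
            raise ValueError("'full' is mandatory for drive-mirror")
        shallow = not full
    else:
        raise ValueError(f"unknown command: {command}")

    if mode == "existing":
        if shallow:
            return NewImage(False, "same contents as the currently attached disk")
        return NewImage(False, "on disk but with no contents")
    if mode == "absolute-paths":
        if shallow:
            return NewImage(True, "backing file is an absolute path to the current image")
        return NewImage(True, "no backing file at all")
    raise ValueError(f"unknown mode: {mode}")


print(new_image_behaviour("drive-mirror", "absolute-paths", full=True))
```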

In the future, it is planned to have another mode, relative-paths. It will also create an image whose backing file is the current image, but the current image will be identified by a relative path in the new image.

Image creation occurs in the prepare phase and uses the mode argument; however, the new backing file chain is composed in the commit phase with no regard to the mode. This matters when the same disk is included twice in a transaction, as in the following example:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'hd0-snap0.qcow2' } },
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'hd0-snap1.qcow2' } } ] } }

Assuming virtio0 is associated with hd0-base.qcow2, the backing file chain at the end of the transaction will be hd0-base.qcow2 <- hd0-snap0.qcow2 <- hd0-snap1.qcow2. However, the hd0-snap1.qcow2 image file will point to hd0-base.qcow2, because both images are created in the prepare phase, while hd0-base.qcow2 is still the active image. This is useful when doing a combined snapshot+mirror operation:

{ "execute": "transaction", "arguments":
  {'actions': [
    { 'type': 'blockdev-snapshot-sync', 'data' :
      { 'device': 'virtio0', 'snapshot-file': 'src/hd0-snap.qcow2' } },
    { 'type': 'drive-mirror', 'data' :
      { 'device': 'virtio0', 'target': 'dest/hd0-snap.qcow2', 'full': false } } ] } }

Here, assume the backing storage is shared/hd0-base.qcow2. Mirroring will write to src/hd0-snap.qcow2 and dest/hd0-snap.qcow2 as expected, and dest/hd0-snap.qcow2 will point to the original storage. As soon as block streaming completes, management can switch the device to dest/hd0-snap.qcow2. src/hd0-snap.qcow2 is not part of the backing file chain anymore, and can be deleted.
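The difference between what is written into the image files at prepare time and the chain composed at commit time can be illustrated with a deliberately simplified model of the earlier double-snapshot example. The function name and the list-based representation of the chain are assumptions made for this sketch; they do not reflect QEMU's data structures.

```python
from typing import Dict, List, Tuple


def run_snapshot_transaction(active_chain: List[str],
                             new_images: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Model one transaction that snapshots the same disk several times.

    active_chain lists the images in use, base first and active image last.
    Returns (backing file recorded in each new image, chain after commit).
    """
    # Prepare phase: every new image is created while the device still uses
    # the old chain, so each one records the same backing file.
    top_before = active_chain[-1]
    headers = {image: top_before for image in new_images}
    # Commit phase: the new images are stacked in action order, with no
    # regard to what was recorded in the image files.
    runtime_chain = active_chain + new_images
    return headers, runtime_chain


headers, chain = run_snapshot_transaction(
    ["hd0-base.qcow2"], ["hd0-snap0.qcow2", "hd0-snap1.qcow2"])
print(" <- ".join(chain))
print(headers)
```

The printed chain has hd0-snap1.qcow2 on top of hd0-snap0.qcow2, while the recorded backing file of hd0-snap1.qcow2 is still hd0-base.qcow2, matching the description above.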

Status: 'mode' for 'blockdev-snapshot-sync' is merged into master, 'drive-mirror' is still pending

Libvirt interaction

Without atomic snapshots, libvirt does the best it can: virDomainSnapshotCreateXML will pause the guest with stop, then perform one snapshot at a time with blockdev-snapshot-sync, then resume the guest with cont. If a second snapshot fails, then libvirt will update the domain XML to reflect which snapshots succeeded, but this puts the burden on the management application to then check virDomainGetXMLDesc after failure to see what changes were actually made.

With the addition of these patches, libvirt will now probe whether atomic snapshots exist by checking the existence of the transaction monitor command. Libvirt will assume that if the command is present, then 'blockdev-snapshot-sync' and 'drive-mirror' actions within transaction exist (true for qemu 1.1); if we later add new actions to the discriminated union, then we will also need to add a QMP command for introspecting what additional commands are supported before libvirt can try to use those additional commands in a transaction. Libvirt will add a new flag, VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC, which will force the operation to fail if transaction is not available; omitting the flag implies that the non-atomic blockdev-snapshot-sync will be used as fallback.

With atomic snapshots, libvirt will form a single transaction command for virDomainSnapshotCreateXML with no additional effort needed from the user, and with no need for stop and cont.
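The probing and fallback logic can be sketched as follows. The text does not say how libvirt detects the command; this example assumes the reply of QMP's query-commands (a list of objects with a "name" member) is available as a Python dictionary. The function names and the strategy strings are hypothetical.

```python
from typing import Dict, List


def has_command(query_commands_reply: Dict, name: str) -> bool:
    """Check a query-commands reply for a command with the given name."""
    commands: List[Dict] = query_commands_reply.get("return", [])
    return any(entry.get("name") == name for entry in commands)


def choose_snapshot_strategy(query_commands_reply: Dict,
                             require_atomic: bool) -> str:
    """Pick the snapshot strategy; require_atomic models the ATOMIC flag."""
    if has_command(query_commands_reply, "transaction"):
        return "transaction"
    if require_atomic:
        raise RuntimeError("atomic snapshot requested but 'transaction' is unavailable")
    # Non-atomic fallback: pause, snapshot each disk in turn, resume.
    return "stop, blockdev-snapshot-sync per disk, cont"


old_qemu = {"return": [{"name": "blockdev-snapshot-sync"},
                       {"name": "stop"}, {"name": "cont"}]}
new_qemu = {"return": old_qemu["return"] + [{"name": "transaction"}]}

print(choose_snapshot_strategy(new_qemu, require_atomic=True))
print(choose_snapshot_strategy(old_qemu, require_atomic=False))
```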

The second part of the implementation is support for pre-copy storage migration. In QEMU, pre-copy storage migration has two phases:

  • add a streaming mirror to the existing image (not a snapshot+mirror);
  • reopen the image to the new mirror ("pivot" the mirror).

Copy of the images below the topmost source image is still done outside QEMU and perhaps even outside libvirt.

Pre-copy storage migration is implemented as an extension to the existing block job support. The current API is virDomainBlockRebase, which starts a streaming (aka post-copy storage migration) job:

int virDomainBlockRebase (virDomainPtr dom, 
                          const char *disk,  const char *base, 
                          unsigned long bandwidth, unsigned int flags)

In this API, base is the absolute path of one of the backing images further up the chain; streaming takes data from the backing file chain up to that image, and copies it to the top image.

An optimal mirror API would take an additional path (let's call it dest) which the mirror will target. The API would look like:

int virDomainBlockCopy (virDomainPtr dom, 
                          const char *disk, const char *base, const char *dest, const char *format,
                          unsigned long bandwidth, unsigned int flags)

However, we can for now add new flags to virDomainBlockRebase that treat "base" as the destination (VIR_DOMAIN_BLOCK_REBASE_COPY), and choose whether to limit the copy to the topmost image (VIR_DOMAIN_BLOCK_REBASE_SHALLOW, maps to 'full':'false') as well as whether to reuse an existing file (VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT, maps to 'mode':'existing'). We can also add another flag to indicate whether the destination image must be treated as raw instead of probed (VIR_DOMAIN_BLOCK_REBASE_COPY_RAW, maps to 'format':'raw').
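The flag mapping described above could be expressed roughly as follows. This is a sketch, not libvirt code: the RebaseFlags enum mirrors the flag names from the text, but its bit values are placeholders chosen for the example and are not libvirt's real constants.

```python
import enum
from typing import Dict


class RebaseFlags(enum.IntFlag):
    # Placeholder bit values, NOT libvirt's real constants.
    SHALLOW = 1    # VIR_DOMAIN_BLOCK_REBASE_SHALLOW
    REUSE_EXT = 2  # VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT
    COPY_RAW = 4   # VIR_DOMAIN_BLOCK_REBASE_COPY_RAW
    COPY = 8       # VIR_DOMAIN_BLOCK_REBASE_COPY


def drive_mirror_arguments(disk: str, base: str, flags: RebaseFlags) -> Dict:
    """Translate a copy-style virDomainBlockRebase call to drive-mirror arguments."""
    if not flags & RebaseFlags.COPY:
        raise ValueError("without COPY, 'base' names a backing image, not a destination")
    args: Dict = {
        "device": disk,
        "target": base,  # with COPY, 'base' is treated as the destination
        "full": not (flags & RebaseFlags.SHALLOW),
    }
    if flags & RebaseFlags.REUSE_EXT:
        args["mode"] = "existing"
    if flags & RebaseFlags.COPY_RAW:
        args["format"] = "raw"
    return args


print(drive_mirror_arguments(
    "virtio0", "/dest/hd0-snap.qcow2",
    RebaseFlags.COPY | RebaseFlags.SHALLOW | RebaseFlags.REUSE_EXT))
```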

virDomainBlockRebase/virDomainBlockCopy will perform the first step above:

  • adding a streaming mirror to the existing image is done with the drive-mirror command (either directly, or wrapped within a transaction)

For now, qemu does not give any event indication that the streaming is complete, but libvirt can poll the 'query-block-jobs' command to see when streaming has completed and mirroring is now active. The libvirt block job should continue until explicitly aborted by the user, because QEMU is still actively mirroring the VM's writes to both the source and the target. This is very important, because pivoting reopens the whole backing file chain on the destination storage, and must not be attempted until all the base images have been copied successfully to the target.

Once the copy of the base images terminates, the libvirt client can start polling for the end of the streaming phase. This is signaled by cur == end in the virDomainBlockJobInfo struct that is returned by virDomainGetBlockJobInfo. When both conditions hold (base images copied and cur == end), the client can perform the pivoting operation. Pivoting is done with a new flag to virDomainBlockJobAbort, VIR_BLOCK_JOB_ABORT_PIVOT. The flag:

  • is only allowed for a mirrored streaming job;
  • causes the function to fail unless QEMU has already reported streaming complete;
  • triggers a drive-reopen command that completes storage migration.

In case the VM aborts, the mirror can be safely thrown away and the process restarted.
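The client-side decision of when to pivot can be sketched as a polling loop. This model keeps only the cur and end fields of virDomainBlockJobInfo and replaces the libvirt calls with plain Python callables, so BlockJobInfo, ready_to_pivot and wait_for_pivot are hypothetical stand-ins rather than libvirt bindings.

```python
from typing import Callable, NamedTuple


class BlockJobInfo(NamedTuple):
    """Only the two fields of virDomainBlockJobInfo that matter here."""
    cur: int
    end: int


def ready_to_pivot(info: BlockJobInfo, base_images_copied: bool) -> bool:
    """Both conditions must hold: base images copied and cur == end."""
    return base_images_copied and info.cur == info.end


def wait_for_pivot(get_info: Callable[[], BlockJobInfo],
                   base_images_copied: Callable[[], bool],
                   max_polls: int) -> bool:
    """Poll up to max_polls times; a real client would sleep between polls."""
    for _ in range(max_polls):
        if ready_to_pivot(get_info(), base_images_copied()):
            return True
    return False


# Simulated job progress standing in for repeated block job queries.
samples = iter([BlockJobInfo(10, 100), BlockJobInfo(60, 100),
                BlockJobInfo(100, 100)])
result = wait_for_pivot(lambda: next(samples), lambda: True, max_polls=3)
print(result)
```

Only once such a check succeeds would the client request the pivot; until then the job keeps mirroring writes to both source and target.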


Credits

  • Marcelo Tosatti implemented mirroring
  • Jeff Cody implemented atomic snapshots
  • Federico Simoncelli provided the first implementation of the mirroring commands
  • Eric Blake and Kevin Wolf participated in the discussions and suggested using discriminated unions
  • Paolo Bonzini provided a transaction-capable implementation of mirroring
  • Eric and Paolo adjusted the 'drive-mirror' command to directly provide streaming semantics as a second form of block job

(in chronological order)