Features/BlockJob

Error handling

query-block-jobs: BlockJobInfo gets two new fields, paused and io-status. The job-specific iostatus is completely separate from the block device iostatus.

block-stream

I would still like to add on_error to the existing block-stream command, if only to ease unit testing. Concerns about the stability of the API can be handled by adding introspection (exporting the schema), which is not hard to do. The new option is an enum with the following possible values:

'report': The behavior is the same as in 1.1. An I/O error will complete the job immediately with an error code.
'ignore': An I/O error during a read or a write will be ignored. For streaming, the job will complete with an error and the backing file will be left in place. For mirroring, the sector will be marked as dirty again and re-examined later.
'stop': The job will be paused, and the job iostatus (which can be examined with query-block-jobs) is updated.
'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.

In all cases, even for 'report', the I/O error is reported as a QMP event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.
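The policy selection above can be sketched as a small decision function. This is only an illustration, not QEMU code: the names `OnError`, `resolve_action` and `handle_io_error` are hypothetical, and the event payload fields (`device`, `operation`, `action`) are an assumption based on the statement that BLOCK_JOB_ERROR carries the same arguments as BLOCK_IO_ERROR.

```python
import errno
from enum import Enum


class OnError(Enum):
    """The four proposed on_error values (hypothetical Python mirror of the enum)."""
    REPORT = "report"
    IGNORE = "ignore"
    STOP = "stop"
    ENOSPC = "enospc"


def resolve_action(policy, err):
    """Return 'report', 'ignore' or 'stop' for an I/O error with errno `err`."""
    if policy is OnError.ENOSPC:
        # 'enospc' behaves as 'stop' for ENOSPC and as 'report' otherwise.
        return "stop" if err == errno.ENOSPC else "report"
    return policy.value


def handle_io_error(policy, device, operation, err, emit_event):
    """Resolve the action and always emit an event, even for 'report'."""
    action = resolve_action(policy, err)
    emit_event({
        "event": "BLOCK_JOB_ERROR",
        # Assumed payload shape, modelled on BLOCK_IO_ERROR.
        "data": {"device": device, "operation": operation, "action": action},
    })
    return action
```

For example, `handle_io_error(OnError.ENOSPC, "ide0-hd0", "write", errno.ENOSPC, print)` would resolve to 'stop', while the same call with `errno.EIO` would resolve to 'report'.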

After cancelling a job, the job implementation MAY choose to treat stop and enospc values as report, i.e. complete the job immediately with an error code, as long as block_job_is_cancelled(job) returns true when the completion callback is called.

Open problem: There could be unrecoverable errors in which the job will always fail as if rerror/werror were set to report (example: error while switching backing files). Does it make sense to fire an event before the point in time where such errors can happen?

block-job-pause

A new QMP command. Takes a block device (drive) and pauses the active background block operation on that device. This command returns immediately after marking the active background block operation for pausing. It is an error to call this command if no operation is in progress. The operation will pause as soon as possible, except that it will not pause if the job is being cancelled. No event is emitted when the operation is actually paused. Cancelling a paused job automatically resumes it.

block-job-resume

A new QMP command. Takes a block device (drive) and resumes a paused background block operation on that device. This command returns immediately after resuming the paused background block operation. It is an error to call this command if no operation is in progress.

A successful block-job-resume operation also resets the iostatus on the job that is passed.

Rationale: block-job-resume is required to restart a job that had on_error behavior set to 'stop' or 'enospc'. Adding block-job-pause makes it simpler to test the new feature.
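The pause/resume rules can be summarised as a toy state model. This is a sketch under assumptions, not the QEMU implementation: the class and method names are hypothetical, the iostatus strings 'ok' and 'nospace' are assumed to follow the existing block device iostatus values, and pausing takes effect immediately here whereas the real job would pause "as soon as possible".

```python
class JobError(Exception):
    """Raised when a command is issued while no operation is in progress."""


class BlockJobModel:
    """Toy model of one background block operation on a device."""

    def __init__(self):
        self.active = True       # an operation is in progress
        self.paused = False
        self.cancelled = False
        self.iostatus = "ok"     # job-specific, separate from the device iostatus

    def _require_active(self):
        if not self.active:
            raise JobError("no operation in progress")

    def pause(self):
        self._require_active()
        # A job that is being cancelled will not pause.
        if not self.cancelled:
            self.paused = True

    def resume(self):
        self._require_active()
        self.paused = False
        self.iostatus = "ok"     # a successful resume resets the job iostatus

    def cancel(self):
        self._require_active()
        self.cancelled = True
        self.paused = False      # cancelling a paused job resumes it

    def stop_on_error(self, iostatus):
        """What on_error='stop' (or 'enospc' on ENOSPC) does to the job."""
        self.paused = True
        self.iostatus = iostatus

    def complete(self):
        self.active = False
```

A job stopped with `stop_on_error("nospace")` stays paused until `resume()` is called, which is why block-job-resume is required for the 'stop' and 'enospc' policies.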

Mirroring commands

query-block-jobs

The returned JSON object will grow an additional member, "target". The target field is a dictionary with two fields, "info" and "stats" (resembling the output of query-block and query-blockstat but for the mirroring target). Member "device" of the BlockInfo structure will be made optional.

Rationale: this allows libvirt to observe the high watermark of qcow2 mirroring targets.

If present, the target has its own iostatus. It is set when the job is paused due to an error on the target (together with sending a BLOCK_JOB_ERROR event). block-job-resume resets it.

drive-mirror

Activates mirroring to a second block device (optionally creating the image on that second block device). Compared to earlier versions, the "full" argument is replaced by an enum option "sync" with three values:

'top': copies data in the topmost image to the destination
'full': copies data from all images to the destination
'dirty': copies clusters that are marked in the dirty bitmap to the destination (see below)
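The difference between the three sync values can be illustrated with a toy image chain. This is a minimal sketch, assuming each image is represented only by the set of cluster indices it has allocated, with the topmost image last; the function name `clusters_to_copy` is hypothetical.

```python
def clusters_to_copy(chain, sync, dirty=frozenset()):
    """Return the sorted cluster indices that a mirror job would copy.

    chain: list of sets of allocated cluster indices, backing file first,
           topmost image last (toy representation, not a QEMU structure).
    sync:  'top', 'full' or 'dirty'.
    dirty: cluster indices marked in the dirty bitmap (used by 'dirty').
    """
    if sync == "top":
        # Only data present in the topmost image.
        return sorted(chain[-1])
    if sync == "full":
        # Data from every image in the chain.
        return sorted(set().union(*chain))
    if sync == "dirty":
        # Only clusters marked in the dirty bitmap.
        return sorted(dirty)
    raise ValueError("unknown sync mode: %r" % (sync,))
```

With a backing file holding clusters {0, 1, 2} and a top image holding {1, 3}, 'top' selects clusters 1 and 3, 'full' selects 0 through 3, and 'dirty' selects whatever the bitmap marks.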

block-job-complete: forces completion of mirroring and switches the device to the target; it is not related to the rest of the proposal. It synchronously opens backing files if needed and asynchronously completes the job.

MIRROR_STATE_CHANGE: new event, triggered every time block-job-complete becomes available or unavailable. Contains the device name (like device: 'ide0-hd0') and the state (synced: true/false).
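On the management side, the event could be used to gate block-job-complete. The following is a hedged sketch: the `MirrorTracker` class is hypothetical, the event is assumed to arrive in the usual QMP shape with its fields under "data", and the "device" argument name for block-job-complete is an assumption.

```python
class MirrorTracker:
    """Tracks, per device, whether block-job-complete is currently available."""

    def __init__(self):
        self.synced = {}

    def on_event(self, event):
        """Feed every decoded QMP event here; unrelated events are ignored."""
        if event.get("event") != "MIRROR_STATE_CHANGE":
            return
        data = event["data"]
        self.synced[data["device"]] = bool(data["synced"])

    def complete_command(self, device):
        """Return a block-job-complete command, or None while not synced."""
        if not self.synced.get(device, False):
            return None
        # Assumed argument name; the proposal only says the command
        # operates on the mirrored device.
        return {"execute": "block-job-complete",
                "arguments": {"device": device}}
```

The tracker returns None until an event with synced: true has been seen for the device, and again after a later event reports synced: false.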

Persistent dirty bitmap

A persistent dirty bitmap can be used by management for two reasons: 1) when mirroring is used for continuous replication of storage, to record I/O operations that happened while the replication server is not connected or unavailable; 2) when mirroring is used for storage migration, to check after a management crash whether the VM must be restarted with the source or the destination.

The dirty bitmap is synchronized on every bdrv_flush (or on every I/O operation if the disk operates in writethrough or directsync mode).

The persistent dirty bitmap is created by management, but QEMU also needs a dirty bitmap for drive-mirror. The two uses interact as follows:

if management has not set up a persistent dirty bitmap, QEMU will use a simple non-persistent bitmap.
if management has set up a persistent dirty bitmap and later calls blockdev-dirty-disable, QEMU will delay the disabling until drive mirroring also terminates.

QMP commands

The dirty bitmap is managed by these QMP commands:

blockdev-dirty-enable: takes a file name used for the dirty bitmap, and an optional granularity. Setting the granularity will not be supported in the initial version.
query-block-dirty: returns statistics about the dirty bitmap: currently the granularity, the number of bits that are set, and whether QEMU is consuming the dirty bitmap (i.e. drive-mirror is active).
blockdev-dirty-disable: disable the dirty bitmap.

The dirty bitmap can also be specified on the command-line with -drive.

Usage

The dirty bitmap can be used as follows for storage migration. To start migration:

1. blockdev-dirty-enable ide0-hd0 /var/lib/libvirt/dirty/diskname
2. management notes existence of dirty bitmap for /mnt/src/diskname.img in its private data
3. drive-mirror ide0-hd0 /mnt/dest/diskname.img
4. management notes /mnt/dest/diskname.img as the mirroring target in its private data

At this point, mirroring has taken a reference to the dirty bitmap.

To end migration:

5. blockdev-dirty-disable ide0-hd0
6. block-job-complete ide0-hd0

The dirty bitmap remains enabled until the BLOCK_JOB_COMPLETED event is sent.

7. When management receives the BLOCK_JOB_COMPLETED event, it notes switch to /mnt/dest/diskname.img (without dirty bitmap nor mirroring target) in its private data.

If management crashes between (6) and (7), it can examine the dirty bitmap on disk. If it is all-zeros, management can restart the virtual machine with /mnt/dest/diskname.img. If it has even a single nonzero bit, management can restart the virtual machine with /mnt/src/diskname.img and the persistent dirty bitmap enabled, and later issue a drive-mirror command again (with sync='dirty') to restart from step 4.
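The recovery decision can be sketched as follows. This is an illustration only: the on-disk format of the bitmap is not specified here, so the sketch assumes the file contents are raw bitmap bytes with no header, and the function name `choose_restart_image` is hypothetical.

```python
from typing import Tuple


def choose_restart_image(bitmap: bytes, source: str, dest: str) -> Tuple[str, bool]:
    """Decide which image to restart the VM with after a management crash.

    Returns (image_path, keep_dirty_bitmap). Assumes `bitmap` holds the raw
    bits of the persistent dirty bitmap, without any header (an assumption).
    """
    if any(bitmap):
        # At least one bit is set: the destination may be stale, so restart
        # from the source with the bitmap enabled and mirror again later
        # with sync='dirty'.
        return source, True
    # All-zeros: the destination is complete and can be used directly.
    return dest, False
```

Management would call this with the bytes read from /var/lib/libvirt/dirty/diskname and the two image paths recorded in its private data.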

Internal workings

In addition to the persistent dirty bitmap, QEMU keeps an in-flight bitmap. The in-flight bitmap does not need to be persistent.

Bitmap handling when doing I/O on the source

after writing to the source:
- clear bit in the volatile in-flight bitmap
- set bit in the persistent dirty bitmap
after flushing the source:
- msync the persistent bitmap to disk

Bitmap handling in the drive-mirror coroutine

before reading from the source:
- set bit in the volatile in-flight bitmap
periodically:
- flush the target
- clear bits in the persistent dirty bitmap that are set in the in-flight bitmap
- clear the volatile in-flight bitmap

(TODO: write a formal model of the above and check it with spin)
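Short of a formal model, the two bitmap-handling lists above can be exercised with a small simulation. This is a toy sketch, not QEMU code: bitmaps are modelled as Python sets of cluster indices, `dirty_on_disk` stands in for the msync'ed copy of the persistent bitmap, and all names are hypothetical.

```python
class MirrorBitmaps:
    """Toy model of the persistent dirty bitmap and the in-flight bitmap."""

    def __init__(self):
        self.dirty = set()          # persistent dirty bitmap (in-memory view)
        self.dirty_on_disk = set()  # what the last msync wrote out
        self.in_flight = set()      # volatile in-flight bitmap

    # I/O on the source
    def after_source_write(self, cluster):
        self.in_flight.discard(cluster)  # the copy in progress is now stale
        self.dirty.add(cluster)

    def after_source_flush(self):
        self.dirty_on_disk = set(self.dirty)  # stands in for msync

    # drive-mirror coroutine
    def before_source_read(self, cluster):
        self.in_flight.add(cluster)

    def periodic(self, flush_target):
        flush_target()
        # Only clusters copied and not rewritten since are known to be clean.
        self.dirty -= self.in_flight
        self.in_flight.clear()
```

In this model, a guest write that lands between the mirror's read and the next periodic step removes the cluster from the in-flight set, so the cluster stays dirty and is copied again on a later pass.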