Features/Qcow3

Current status

The proposal is fully implemented. This page is mostly of historical interest.

The proposed image format change was introduced in QEMU 1.1 and became the default for image creation with QEMU 1.7. When creating images, use the -o compat=[0.10|1.1] option to create an image in the non-default format version. qemu-img info displays the compatibility level of an image. A compatibility level of 1.1 means that the new format is in use. qemu-img amend can be used to upgrade or downgrade existing images.
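
For example, an image can be created at a specific compatibility level, inspected, and upgraded in place (illustrative invocations; see the qemu-img documentation for the full option syntax):

    qemu-img create -f qcow2 -o compat=0.10 disk.qcow2 8G
    qemu-img info disk.qcow2          # reports the compat level in the format specific information
    qemu-img amend -f qcow2 -o compat=1.1 disk.qcow2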

What is QCOW3?

The QCOW image format includes a version number so that new features which change the format in an incompatible way can be introduced; older qemu versions then refuse to open such images rather than misinterpreting them.

When the version number was bumped from QCOW1 to QCOW2, the format changed radically: QCOW1 and QCOW2 have two completely different driver implementations, and the two versions are exposed to the user as two separate image formats ("qcow" and "qcow2").

The proposal for QCOW3 is different: it increases the version number in order to introduce some incompatible features, but it is strictly an extension of QCOW2 and keeps the fundamental structure unchanged, so that a single codebase is enough to work with both QCOW2 and QCOW3 images.

Internally, QEMU will have a single driver for both QCOW2 and QCOW3 images, so one option is to keep exposing the format to users as "qcow2", with an additional flag that can be set at image creation time and selects which version number to use (and therefore which features are available).

Requirements

The key requirements for QCOW3 are:

  1. Near-raw performance, competitive with QED.
  2. Fully QCOW2 backwards-compatible feature set.
  3. Extensibility to tackle current and future storage virtualization challenges.

Near-raw performance, competitive with QED

Performance analysis has shown that as of QEMU 0.13, QCOW2 performance is significantly worse than using raw files. Different approaches, including QED's simplified metadata and fully asynchronous implementation, have proven that a modern image format can achieve near-raw performance. Metadata caching and batched updates can also improve performance but require image format changes to be effective in all cases.

Improving performance is the key motivation for a QCOW3 image format.

Fully QCOW2 backwards-compatible feature set

The QCOW2 format offers encryption, compression, and internal snapshotting features not supported by other formats. Unlike other formats, QCOW2 allows images to be stored directly on block devices instead of using a file system. These features must be preserved in order to provide backwards compatibility for existing deployments.

Furthermore, it should be easy to upgrade from QCOW2 to QCOW3 so that existing users can do so without lengthy downtime or storage administration overheads.

Extensibility to tackle current and future storage virtualization challenges

Several new features in the storage area, including discard support, external snapshots, and image streaming, are being developed and integrated. These and other future features must fit into the format gracefully. A feature bit mechanism can be used to provide forwards, backwards, and incompatible feature negotiation support.

QCOW2 allows introducing incompatible new features only by increasing the version number, and this is what we'll do for QCOW3. Together with the version bump, a more flexible feature bit mechanism will be introduced that can be used for future changes.

Roadmap

  1. Introduce qcow2 test suite to exercise qcow2-specific routines
    1. Calls directly into qcow2 routines to exercise internal features.
    2. Will guard us against introducing regressions.
  2. Make QCOW2 asynchronous
    1. Option: Callbacks
      1. Well-understood and proven performance at the cost of some maintainability/complexity.
    2. Option: Coroutines
      1. Riskier new approach that requires only limited code changes.
      2. Questions about performance not yet addressed.
  3. Introduce QCOW3 format and feature negotiation.
  4. Implement a safe delayed metadata update mode relying on a consistency check.
    1. Eventually consistent refcounts reduce performance overhead of metadata updates.
    2. Ability to rebuild refcount table by scanning L1/L2 tables.
    3. Option: Use allocation offset header field (on block devices)
  5. Reduce consistency check times
    1. Clear the dirty flag periodically and after guest flushes to reduce the dirty time window and the risk of having to do a consistency check after a crash.
  6. Metadata caching tweaks (cache size, replacement algorithm)
  7. Scalability and parallel requests
    1. In theory parallel requests may just work at this point but their performance may not scale for multithreaded workloads.
  8. Copy-on-read for image streaming (see the sketch after this list)
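
As a rough illustration of the copy-on-read idea in item 8 (hypothetical types and helper names, not the real QEMU block layer): a read that has to be satisfied from the backing file is also written into the active image, so each piece of data only needs to be streamed once.

    /* Sketch only: hypothetical helpers, not QEMU code. */
    static int read_with_copy_on_read(Image *img, uint64_t offset,
                                      void *buf, size_t len)
    {
        int ret;

        if (cluster_is_allocated(img, offset)) {
            /* Data already lives in the active image. */
            return image_read(img, offset, buf, len);
        }

        /* Fall back to the backing file... */
        ret = backing_read(img->backing, offset, buf, len);
        if (ret < 0) {
            return ret;
        }

        /* ...and populate the active image so the next read is local and
         * the backing file can eventually be dropped. */
        return image_write(img, offset, buf, len);
    }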

Estimated development times are based on the assumption that both Stefan and Kevin can focus on QCOW3 related work. If this doesn't hold true, delays are to be expected.

Make QCOW2 asynchronous using coroutines

The current QCOW2 implementation performs synchronous metadata accesses. This can temporarily stop the guest from running, results in poor performance, and introduces timing jitter.

Coroutines can be used to make QCOW2 asynchronous without invasive code changes. An emphasis will need to be placed on profiling and optimizing to ensure that coroutines do not introduce a significant new overhead.
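
As a self-contained toy illustration of the control flow (this is not QEMU code and not QEMU's actual coroutine API): the function doing the metadata lookup is written in the usual sequential style, but instead of blocking on I/O it switches back to the main context and is resumed once the (here simulated) I/O has completed.

    /* Toy example only: demonstrates the coroutine idea with POSIX ucontext;
     * it is not QEMU code. */
    #include <stdio.h>
    #include <string.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;
    static char io_buffer[64];
    static int io_pending;

    /* Yield back to the main loop until the pending I/O is marked complete. */
    static void co_wait_for_io(void)
    {
        io_pending = 1;
        while (io_pending) {
            swapcontext(&co_ctx, &main_ctx);
        }
    }

    /* Coroutine body: reads like straight-line synchronous metadata code. */
    static void lookup_metadata(void)
    {
        printf("coroutine: issuing metadata read\n");
        co_wait_for_io();               /* suspends instead of blocking */
        printf("coroutine: got metadata: %s\n", io_buffer);
    }

    int main(void)
    {
        static char stack[64 * 1024];

        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = stack;
        co_ctx.uc_stack.ss_size = sizeof(stack);
        co_ctx.uc_link = &main_ctx;
        makecontext(&co_ctx, lookup_metadata, 0);

        swapcontext(&main_ctx, &co_ctx);   /* enter the coroutine */

        /* The rest of QEMU would keep running here while the I/O is in
         * flight; simulate the completion and resume the coroutine. */
        printf("main: doing other work while I/O is in flight\n");
        strcpy(io_buffer, "L2 table entry");
        io_pending = 0;
        swapcontext(&main_ctx, &co_ctx);   /* resume the coroutine */

        return 0;
    }

In the real implementation the switching is handled by QEMU's coroutine infrastructure on top of the block layer's asynchronous I/O, which is why only limited changes to the qcow2 code itself are expected.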

Estimated development time: 4 weeks

Introduce QCOW3 format and feature negotiation

The QCOW header version number must be bumped to 3 in order to support incompatible file format changes. At the same time a feature bit mechanism should be introduced to make future file format changes easier.
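
One plausible shape for the feature bit mechanism is three 64-bit masks with different compatibility semantics, sketched below. The field names follow the scheme that was eventually adopted for the version 3 header, but the snippet is only an illustration, not a normative layout description.

    #include <stdint.h>

    /* Illustrative only; in the real header these fields are stored
     * big-endian on disk. */
    typedef struct FeatureBits {
        uint64_t incompatible_features; /* unknown bit set -> refuse to open  */
        uint64_t compatible_features;   /* unknown bit set -> safe to ignore  */
        uint64_t autoclear_features;    /* unknown bits must be cleared by a
                                           writer that does not know them     */
    } FeatureBits;

    /* Example open-time check: a reader only knows the bits listed in
     * KNOWN_INCOMPATIBLE and must reject anything newer. */
    #define KNOWN_INCOMPATIBLE 0x1ULL   /* e.g. a "dirty" bit */

    static int check_features(const FeatureBits *f)
    {
        if (f->incompatible_features & ~KNOWN_INCOMPATIBLE) {
            return -1;  /* image uses a feature this reader cannot handle */
        }
        return 0;
    }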

Estimated development time: 1 week

Implement a safe delayed metadata update mode relying on a consistency check

For writeback cache modes, qcow2 already batches writes to L2 tables and refcount blocks, requiring fewer writes and fewer flushes than QED and therefore providing better potential performance.

For writethrough modes, any batching would be incorrect, so we still need an improvement here. We'll add an optional QED-like mode for this:

Refcount updates are batched even in writethrough modes, and no ordering is enforced between L2 table and refcount block updates. This makes the overhead of writing out refcount blocks negligible, because it only happens when a refcount block fills up: with 64k clusters, a refcount block holds 32768 16-bit entries that each cover one 64k cluster, i.e. one refcount block write per 2 GB of allocated data.

This will improve performance, but it also means that the refcounts on disk may not be accurate. To keep this safe, introduce a dirty bit and a consistency check that rebuilds the refcount structures by scanning the L1/L2 tables on startup if the dirty bit was set.
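
The open path could then look roughly like this (a sketch with hypothetical types and helper names, not the actual qcow2 driver code); the dirty bit is set before the first write that may leave the refcounts stale, and cleared again once they have been brought up to date (on clean close, and periodically/after guest flushes as noted in the roadmap).

    /* Sketch only: hypothetical types and helpers, not the real qcow2 driver. */
    #include <stdint.h>
    #include <stdbool.h>

    #define INCOMPAT_DIRTY  0x1ULL          /* refcounts may be out of date */

    typedef struct Image Image;             /* opaque image state */

    bool image_test_incompatible_bit(Image *img, uint64_t bit);
    int  image_clear_incompatible_bit(Image *img, uint64_t bit);  /* flushed to disk */
    int  rebuild_refcounts_from_l2(Image *img);  /* scan L1/L2, recount clusters */

    /* On open: if the image was not cleanly closed, the refcounts may be
     * stale, so rebuild them from the L1/L2 tables before allowing writes. */
    int image_open_check(Image *img)
    {
        if (image_test_incompatible_bit(img, INCOMPAT_DIRTY)) {
            int ret = rebuild_refcounts_from_l2(img);
            if (ret < 0) {
                return ret;
            }
            return image_clear_incompatible_bit(img, INCOMPAT_DIRTY);
        }
        return 0;
    }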

Estimated development time: 1 month

Asynchronous QCOW2 codebase sizing

This sizing considers implementing the QCOW2 image format in a disciplined
manner with tests.  The end product is an asynchronous QCOW2 implementation
with tests.  It does not include the current Qcow2Cache optimization.

I have sized this for someone who is comfortable with the QEMU block layer and
can dedicate their time to implementing and testing.  If there is no previous
familiarity with implementing QEMU block drivers and the QCOW2 file format,
then a ramp-up of perhaps 20 days would make for a more conservative sizing.

Asynchronous QCOW2 implementation [51 days + 20 days ramp-up]
Including time for upstream integration: 5 person-months

Milestone 1 - Read-only images [7 days]
1. Block driver skeleton with QCOW2 image format probing [1 day]
2. Read-only L1/L2 table cache [3 days]
2.1 L1/L2 table read function and data structure
2.2 Read cache for L1/L2 tables
2.3 Tests for L1/L2 cache including parallel reads
3. Block position to file offset lookup [1 day]
4. Read [2 days]

Milestone 2 - Read-write images [15 days]
1. Refcount cache [5 days]
1.1 Refcount table read/write functions and data structure
1.2 Cache for refcount blocks
1.3 Extend tests including parallel reads and writes
2. Cluster allocation [5 days]
3. Read-write L1/L2 table cache [3 days]
3.1 L1/L2 table write function
3.2 Make L1/L2 table cache writeable
3.3 Extend tests including parallel writes
4. Write [2 days]
4.1 In-place write
4.2 Allocating write

Milestone 3 - Image creation and sizing [8 days]
1. Consistency check [3 days]
2. Create [2 days]
3. Ensure that qemu-iotests run [2 days]
4. Truncate [2 days]
4.1 Enable growing the image size

Milestone 4 - Backing files [8 days]
1. Open and create backing file support [1 day]
2. Read [1 day]
3. Write [2 days]
4. Change backing file [1 day]
5. Ensure that qemu-iotests run [2 days]

Milestone 5 - Internal snapshots [10 days]
1. Snapshot list and goto [3 days]
2. Snapshot create and delete [4 days]
3. Vmstate load and save [3 days]

Milestone 6 - Feature parity with upstream [3 days]
1. Discard [3 days]