Features/QED: Difference between revisions

From QEMU

Revision as of 11:54, 22 October 2010

Specification

The file format looks like this:

+----------+----------+----------+-----+
| cluster0 | cluster1 | cluster2 | ... |
+----------+----------+----------+-----+

The first cluster begins with the header. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a data cluster, an L2, or an L1 table. L1 and L2 tables are composed of one or more contiguous clusters.

Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.

All fields are little-endian.

Header

Header {
    uint32_t magic;               /* QED\0 */

    uint32_t cluster_size;        /* in bytes */
    uint32_t table_size;          /* for L1 and L2 tables, in clusters */
    uint32_t header_size;         /* in clusters */

    uint64_t features;            /* format feature bits */
    uint64_t compat_features;     /* compat feature bits */
    uint64_t autoclear_features;  /* self-resetting feature bits */
    uint64_t l1_table_offset;     /* in bytes */
    uint64_t image_size;          /* total logical image size, in bytes */

    /* if (features & QED_F_BACKING_FILE) */
    uint32_t backing_filename_offset; /* in bytes from start of header */
    uint32_t backing_filename_size;   /* in bytes */
}

Field descriptions:

  • cluster_size must be a power of 2 in range [2^12, 2^26].
  • table_size must be a power of 2 in range [1, 16].
  • header_size is the number of clusters used by the header and any additional information stored before regular clusters.
  • features, compat_features, and autoclear_features are file format extension bitmaps. They work as follows:
    • An image with unknown features bits enabled must not be opened. File format changes that are not backwards-compatible must use features bits.
    • An image with unknown compat_features bits enabled can be opened safely. The unknown features are simply ignored and represent backwards-compatible changes to the file format.
    • An image with unknown autoclear_features bits enabled can be opened safely after clearing the unknown bits. This allows for backwards-compatible changes to the file format which degrade gracefully and can be re-enabled again by a new program later.
  • l1_table_offset is the offset of the first byte of the L1 table in the image file and must be a multiple of cluster_size.
  • image_size is the block device size seen by the guest and must be a multiple of 512 bytes.
  • backing_filename is a string in (byte offset, byte size) form. It is not NUL-terminated and has no alignment constraints.
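
The field constraints above can be checked mechanically when an image is opened. The following is a minimal sketch, not the QEMU implementation; the function names and the choice to return a list of messages are assumptions made for illustration.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for zero and negative values."""
    return n > 0 and (n & (n - 1)) == 0


def validate_header_fields(cluster_size, table_size, l1_table_offset, image_size):
    """Return a list of problems; an empty list means the fields look valid.

    Hypothetical helper: only the constraints listed in the field
    descriptions are checked here.
    """
    problems = []
    if not (is_power_of_two(cluster_size) and 2**12 <= cluster_size <= 2**26):
        problems.append("cluster_size must be a power of 2 in [2^12, 2^26]")
    if not (is_power_of_two(table_size) and 1 <= table_size <= 16):
        problems.append("table_size must be a power of 2 in [1, 16]")
    if cluster_size > 0 and l1_table_offset % cluster_size != 0:
        problems.append("l1_table_offset must be a multiple of cluster_size")
    if image_size % 512 != 0:
        problems.append("image_size must be a multiple of 512 bytes")
    return problems
```

For example, validate_header_fields(65536, 4, 65536, 1 << 30) returns an empty list.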

Feature bits:

  • QED_F_BACKING_FILE = 0x01. The image uses a backing file. The backing filename string is given in the backing_filename_{offset,size} fields and may be an absolute path or relative to the image file.
  • QED_F_NEED_CHECK = 0x02. The image needs a consistency check before use.
  • QED_F_BACKING_FORMAT_NO_PROBE = 0x04. The backing file is a raw disk image and no file format autodetection should be attempted. This should be used to ensure that raw backing images are never detected as an image format if they happen to contain magic constants.

There are currently no defined compat_features or autoclear_features bits.

Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
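
The three bitmaps differ only in how unknown bits are treated on open. A minimal sketch of that decision follows; the KNOWN_* sets and the function name are assumptions for illustration, and the bit values are the ones listed above.

```python
QED_F_BACKING_FILE = 0x01
QED_F_NEED_CHECK = 0x02
QED_F_BACKING_FORMAT_NO_PROBE = 0x04

# Bits this hypothetical reader understands.
KNOWN_FEATURES = (QED_F_BACKING_FILE | QED_F_NEED_CHECK
                  | QED_F_BACKING_FORMAT_NO_PROBE)
KNOWN_AUTOCLEAR = 0  # no autoclear_features bits are currently defined


def check_feature_bits(features, compat_features, autoclear_features):
    """Apply the open-time rules and return the autoclear value to keep.

    Raises ValueError if an unknown incompatible feature bit is set.
    """
    unknown = features & ~KNOWN_FEATURES
    if unknown:
        raise ValueError("unknown features bits: %#x" % unknown)
    # Unknown compat_features bits are simply ignored.
    # Unknown autoclear_features bits are cleared before the image is used.
    return autoclear_features & KNOWN_AUTOCLEAR
```

A caller would write the returned value back to the header before modifying the image.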

Tables

Tables provide the translation from logical offsets in the block device to cluster offsets in the file.

#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
 
Table {
    uint64_t offsets[TABLE_NOFFSETS];
}

The tables are organized as follows:

                   +----------+
                   | L1 table |
                   +----------+
              ,------'  |  '------.
         +----------+   |    +----------+
         | L2 table |  ...   | L2 table |
         +----------+        +----------+
     ,------'  |  '------.
+----------+   |    +----------+
|   Data   |  ...   |   Data   |
+----------+        +----------+

A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.

The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:

header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
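
As a worked example of the two formulas above, the sketch below computes the number of offsets per table and the resulting size limit. The function names are illustrative only.

```python
def table_noffsets(cluster_size, table_size):
    """Number of 64-bit offsets in one table (TABLE_NOFFSETS)."""
    return table_size * cluster_size // 8  # sizeof(uint64_t) == 8


def max_image_size(cluster_size, table_size):
    """Largest logical image size addressable through the L1 table, in bytes."""
    n = table_noffsets(cluster_size, table_size)
    return n * n * cluster_size


# cluster_size = 64 KB and table_size = 4, as in the example above
assert table_noffsets(64 * 1024, 4) == 32768
assert max_image_size(64 * 1024, 4) == 2**46  # 64 TB
```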

All offsets in L1 and L2 tables are cluster-aligned. The low log2(cluster_size) bits of each offset are therefore reserved and must be zero. These bits may be used in future format extensions to store per-offset information.

The following offsets have special meanings:

L2 table offsets

  • 0 - unallocated. The L2 table is not yet allocated.

Data cluster offsets

  • 0 - unallocated. The data cluster is not yet allocated.

Unallocated L2 tables and data clusters

Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
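
The read rule for unallocated areas can be sketched as follows. The backing file is modelled as an in-memory bytes object (or None when there is no backing file), which is an assumption made to keep the example self-contained.

```python
def read_unallocated(backing, offset, length):
    """Return `length` bytes for an unallocated area at logical `offset`."""
    if backing is None:
        return bytes(length)  # no backing file: zeroes
    data = backing[offset:offset + length]
    # Anything beyond the end of the backing file reads as zeroes.
    return data + bytes(length - len(data))
```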

Writes to an unallocated area cause a new data cluster to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing image (or zeroes if no backing image) and the data being written.
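
How a newly allocated data cluster is populated can be sketched like this. The helper name and the in-memory bytes model of the backing image are assumptions for illustration.

```python
def populate_new_cluster(backing, cluster_start, cluster_size, write_offset, data):
    """Build the contents of a newly allocated data cluster.

    cluster_start is the logical offset of the cluster in the image;
    write_offset is where `data` lands inside the cluster.
    """
    if write_offset < 0 or write_offset + len(data) > cluster_size:
        raise ValueError("write does not fit inside one cluster")
    if backing is None:
        base = bytes(cluster_size)
    else:
        chunk = backing[cluster_start:cluster_start + cluster_size]
        base = chunk + bytes(cluster_size - len(chunk))  # zero-pad past EOF
    buf = bytearray(base)
    buf[write_offset:write_offset + len(data)] = data  # overlay the new data
    return bytes(buf)
```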

Logical offset translation

Logical offsets are translated into cluster offsets as follows:

 table_bits table_bits    cluster_bits
 <--------> <--------> <--------------->
+----------+----------+-----------------+
| L1 index | L2 index |     byte offset |
+----------+----------+-----------------+

      Structure of a logical offset

offset_mask = ~(cluster_size - 1) # mask for the image file byte offset

def logical_to_cluster_offset(l1_index, l2_index, byte_offset):
  l2_offset = l1_table[l1_index]
  l2_table = load_table(l2_offset)
  cluster_offset = l2_table[l2_index] & offset_mask
  return cluster_offset + byte_offset
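
The pseudocode above takes the L1 index, L2 index, and byte offset as already separated. A minimal sketch of splitting a logical offset into those three parts, following the bit layout in the diagram, might look like this; the function name is an assumption.

```python
def split_logical_offset(pos, cluster_size, table_size):
    """Split logical offset `pos` into (l1_index, l2_index, byte_offset)."""
    noffsets = table_size * cluster_size // 8        # TABLE_NOFFSETS
    cluster_bits = cluster_size.bit_length() - 1     # log2(cluster_size)
    table_bits = noffsets.bit_length() - 1           # log2(TABLE_NOFFSETS)
    byte_offset = pos & (cluster_size - 1)
    l2_index = (pos >> cluster_bits) & (noffsets - 1)
    l1_index = pos >> (cluster_bits + table_bits)
    return l1_index, l2_index, byte_offset
```

Both cluster_size and TABLE_NOFFSETS are powers of two, so the shifts and masks are exact.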

Consistency checking

This section is informational and is included to clarify the use of the QED_F_NEED_CHECK features bit.

The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.

The consistency check verifies the following invariants:

  1. Each cluster is referenced once and only once. It is an inconsistency to have a cluster referenced more than once by L1 or L2 tables. A cluster has been leaked if it has no references.
  2. Offsets must be within the image file size and must be cluster_size aligned.
  3. Table offsets must be at least table_size * cluster_size bytes from the end of the image file so that there is space for the entire table.

The consistency check process starts from l1_table_offset and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.
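
The scan can be sketched over an in-memory model of the image. This is an illustration under stated assumptions rather than the QEMU checker: tables are given as Python lists, l2_tables maps an L2 table offset to its entries, header clusters are not counted as leaks, and the L1 table's own clusters count as referenced.

```python
def check_image(file_size, cluster_size, table_size, header_size,
                l1_table_offset, l1_table, l2_tables):
    """Return (errors, leaks) for the three invariants listed above."""
    refs = {}    # cluster offset -> number of references
    errors = []

    def claim(offset, nclusters, what):
        # Invariant 2: offsets are cluster-aligned and inside the file.
        if offset % cluster_size != 0:
            errors.append("%s offset %#x is not cluster-aligned" % (what, offset))
            return False
        # Invariant 3 for tables: the whole table must fit in the file.
        if offset + nclusters * cluster_size > file_size:
            errors.append("%s at %#x extends past end of file" % (what, offset))
            return False
        for i in range(nclusters):
            c = offset + i * cluster_size
            refs[c] = refs.get(c, 0) + 1
        return True

    claim(l1_table_offset, table_size, "L1 table")
    for l2_offset in l1_table:
        if l2_offset == 0:
            continue                      # unallocated L2 table
        if not claim(l2_offset, table_size, "L2 table"):
            continue
        for data_offset in l2_tables.get(l2_offset, []):
            if data_offset != 0:          # 0 means unallocated data cluster
                claim(data_offset, 1, "data cluster")

    # Invariant 1: no cluster may be referenced more than once.
    errors += ["cluster %#x referenced %d times" % (c, n)
               for c, n in sorted(refs.items()) if n > 1]
    first_regular = header_size * cluster_size
    leaks = [c for c in range(first_regular, file_size - cluster_size + 1,
                              cluster_size) if c not in refs]
    return errors, leaks
```

If errors is empty, the image passes even when leaks is non-empty, matching the rule that leaks alone do not prevent clearing QED_F_NEED_CHECK.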

Operations

Read

  1. If L2 table is not present in L1, read from backing image or zero fill if no backing image.
  2. If data cluster is not present in L2, read from backing image or zero fill if no backing image.
  3. Otherwise read data from cluster.
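
The three read cases can be sketched over a dictionary model of an image. The model (keys "l1", "l2", "data", "backing") and the function name are assumptions for illustration; the read is assumed not to cross a cluster boundary.

```python
def read_within_cluster(image, l1_index, l2_index, byte_offset, length, logical_pos):
    """Read `length` bytes that lie inside a single cluster."""
    def from_backing():
        backing = image.get("backing")
        if backing is None:
            return bytes(length)                  # zero fill
        chunk = backing[logical_pos:logical_pos + length]
        return chunk + bytes(length - len(chunk))

    l2_offset = image["l1"][l1_index]
    if l2_offset == 0:
        return from_backing()                     # case 1: no L2 table
    data_offset = image["l2"][l2_offset][l2_index]
    if data_offset == 0:
        return from_backing()                     # case 2: no data cluster
    cluster = image["data"][data_offset]
    return cluster[byte_offset:byte_offset + length]  # case 3: allocated
```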

Write

  1. If L2 table is not present in L1, allocate a new data cluster and a new L2 table. Perform L2 and L1 link after writing data.
  2. If data cluster is not present in L2, allocate a new data cluster. Perform L2 link after writing data.
  3. Otherwise overwrite data cluster.

The L2 link should be made after the data is in place on storage. However, when no ordering is enforced the worst case scenario is an L2 link to an unwritten cluster.

The L1 link must be made after the L2 cluster is in place on storage. If the order is reversed then the L1 table may point to a bogus L2 table. (Is this a problem since clusters are allocated at the end of the file?)
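
The ordering constraints can be made explicit by listing the steps of a write. The sketch below returns a plan rather than performing I/O; the step names are illustrative, and the "flush" steps are an assumption about how "in place on storage" would be enforced.

```python
def plan_write(l2_present, data_present):
    """Return the ordered steps for a write, given what is already allocated."""
    if l2_present and data_present:
        return ["overwrite data cluster"]
    steps = ["allocate data cluster"]
    if not l2_present:
        steps.append("allocate L2 table")
    # Data must be on storage before anything points at it.
    steps += ["write data cluster", "flush"]
    steps.append("write L2 entry for data cluster")    # L2 link
    if not l2_present:
        # The L2 table must be on storage before L1 points at it.
        steps += ["flush", "write L1 entry for L2 table"]  # L1 link
    return steps
```

With both the L2 table and the data cluster missing, the plan ends with the L1 link, so a crash at any earlier point leaves the L1 table pointing only at fully written L2 tables.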

Grow

  1. If TABLE_NOFFSETS * TABLE_NOFFSETS * cluster_size < new_image_size, fail -EOVERFLOW. The L1 table is not big enough.
  2. Write new image_size header field.
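
A minimal sketch of the grow operation, using the size bound from the Tables section (TABLE_NOFFSETS * TABLE_NOFFSETS * cluster_size). The header is modelled as a dictionary and the function name is an assumption.

```python
import errno


def grow(header, new_image_size):
    """Update header["image_size"]; return 0 or a negative errno value."""
    noffsets = header["table_size"] * header["cluster_size"] // 8
    if noffsets * noffsets * header["cluster_size"] < new_image_size:
        return -errno.EOVERFLOW   # the L1 table cannot address this size
    header["image_size"] = new_image_size
    return 0
```

With cluster_size = 4096 and table_size = 1, each table holds 512 offsets, so the limit is 512 * 512 * 4096 bytes = 1 GB.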

Data integrity

Write

Writes that complete before a flush must be stable when the flush completes.

If storage is interrupted (e.g. power outage) then writes in progress may be lost, stable, or partially completed. The storage must not be otherwise corrupted or inaccessible after it is restarted.

Future Features