Contents |
The file format looks like this:
+----------+----------+----------+-----+ | cluster0 | cluster1 | cluster2 | ... | +----------+----------+----------+-----+
The first cluster begins with the header. The header contains information about where regular clusters start; this allows the header to be extensible and store extra information about the image file. A regular cluster may be a data cluster, an L2, or an L1 table. L1 and L2 tables are composed of one or more contiguous clusters.
Normally the file size will be a multiple of the cluster size. If the file size is not a multiple, extra information after the last cluster may not be preserved if data is written. Legitimate extra information should use space between the header and the first regular cluster.
All fields are little-endian.
Header {
uint32_t magic; /* QED\0 */
uint32_t cluster_size; /* in bytes */
uint32_t table_size; /* for L1 and L2 tables, in clusters */
uint32_t header_size; /* in clusters */
uint64_t features; /* format feature bits */
uint64_t compat_features; /* compat feature bits */
uint64_t autoclear_features; /* self-resetting feature bits */
uint64_t l1_table_offset; /* in bytes */
uint64_t image_size; /* total logical image size, in bytes */
/* if (features & QED_F_BACKING_FILE) */
uint32_t backing_filename_offset; /* in bytes from start of header */
uint32_t backing_filename_size; /* in bytes */
}
Field descriptions:
Feature bits:
There are currently no defined compat_features or autoclear_features bits.
Fields predicated on a feature bit are only used when that feature is set. The fields always take up header space, regardless of whether or not the feature bit is set.
Tables provide the translation from logical offsets in the block device to cluster offsets in the file.
#define TABLE_NOFFSETS (table_size * cluster_size / sizeof(uint64_t))
Table {
uint64_t offsets[TABLE_NOFFSETS];
}
The tables are organized as follows:
+----------+
| L1 table |
+----------+
,------' | '------.
+----------+ | +----------+
| L2 table | ... | L2 table |
+----------+ +----------+
,------' | '------.
+----------+ | +----------+
| Data | ... | Data |
+----------+ +----------+
A table is made up of one or more contiguous clusters. The table_size header field determines table size for an image file. For example, cluster_size=64 KB and table_size=4 results in 256 KB tables.
The logical image size must be less than or equal to the maximum possible size of clusters rooted by the L1 table:
header.image_size <= TABLE_NOFFSETS * TABLE_NOFFSETS * header.cluster_size
All offsets in L1 and L2 tables are cluster-aligned. The least significant bits up to cluster_size are reserved and must be zero. This may be used in future format extensions to store per-offset information.
The following offsets have special meanings:
Reads to an unallocated area of the image file access the backing file. If there is no backing file, then zeroes are produced. The backing file may be smaller than the image file and reads of unallocated areas beyond the end of the backing file produce zeroes.
Writes to an unallocated area cause a new data clusters to be allocated, and a new L2 table if that is also unallocated. The new data cluster is populated with data from the backing image (or zeroes if no backing image) and the data being written.
Logical offsets are translated into cluster offsets as follows:
table_bits table_bits cluster_bits
<--------> <--------> <--------------->
+----------+----------+-----------------+
| L1 index | L2 index | byte offset |
+----------+----------+-----------------+
Structure of a logical offset
offset_mask = ~(cluster_size - 1) # mask for the image file byte offset def logical_to_cluster_offset(l1_index, l2_index, byte_offset): l2_offset = l1_table[l1_index] l2_table = load_table(l2_offset) cluster_offset = l2_table[l2_index] & offset_mask return cluster_offset + byte_offset
This section is informational and included to provide background on the use of the QED_F_NEED_CHECK features bit.
The QED_F_NEED_CHECK bit is used to mark an image as dirty before starting an operation that could leave the image in an inconsistent state if interrupted by a crash or power failure. A dirty image must be checked on open because its metadata may not be consistent.
Consistency check includes the following invariants:
The consistency check process starts by from l1_table_offset and scans all L2 tables. After the check completes with no other errors besides leaks, the QED_F_NEED_CHECK bit can be cleared and the image can be accessed.
The L2 link should be made after the data is in place on storage. However, when no ordering is enforced the worst case scenario is an L2 link to an unwritten cluster.
The L1 link must be made after the L2 cluster is in place on storage. If the order is reversed then the L1 table may point to a bogus L2 table. (Is this a problem since clusters are allocated at the end of the file?)
Writes that complete before a flush must be stable when the flush completes.
If storage is interrupted (e.g. power outage) then writes in progress may be lost, stable, or partially completed. The storage must not be otherwise corrupted or inaccessible after it is restarted.