Features/FVD/Design

From QEMU
Revision as of 14:45, 11 October 2016 by Paolo Bonzini (talk | contribs)

On-disk Metadata

FVD is quite simple. It uses three on-disk metadata structures.

  • A bitmap to implement copy-on-write.
  • A one-level lookup table to implement storage allocation.
  • A journal to commit changes of the bitmap and the lookup table.

A bit in the bitmap tracks the state of a block: the bit is 0 if the block resides in the base image, and 1 if it resides in the FVD image. The default block size is 64KB, the same as the cluster size of QCOW2. To represent the state of a 1TB base image, FVD needs only a 2MB bitmap, which can easily be cached in memory. The bitmap also supports copy-on-read and adaptive prefetching.
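The sizing arithmetic above can be checked with a short sketch (illustrative only, not FVD source code; the function name is hypothetical):

```python
BLOCK_SIZE = 64 * 1024      # default block size: one bit tracks one 64KB block
BASE_IMAGE_SIZE = 1 << 40   # 1TB base image

def bitmap_bytes(image_size, block_size=BLOCK_SIZE):
    """One bit per block, rounded up to whole bytes."""
    blocks = (image_size + block_size - 1) // block_size
    return (blocks + 7) // 8

# 1TB / 64KB = 16Mi blocks -> 16Mi bits -> 2MB bitmap
print(bitmap_bytes(BASE_IMAGE_SIZE))  # 2097152 bytes = 2MB
```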

One entry in the lookup table maps the virtual disk address of a chunk to the offset in the FVD image where the chunk is stored. The default chunk size is 1MB, the same as in VirtualBox VDI (VMware VMDK and Microsoft VHD use a 2MB chunk size). For a 1TB virtual disk, the lookup table is only 4MB. Because the table is so small, there is no need for a two-level lookup table like the one in QCOW2.
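A minimal sketch of the table sizing and the address translation it implies (illustrative, not FVD source; the 4-byte entry size is an assumption consistent with the 4MB figure for a 1TB disk):

```python
CHUNK_SIZE = 1 << 20        # 1MB default chunk
ENTRY_SIZE = 4              # assumed bytes per table entry (2^20 entries * 4B = 4MB)

def table_bytes(disk_size, chunk_size=CHUNK_SIZE):
    """Size of a one-level lookup table with one entry per chunk."""
    chunks = (disk_size + chunk_size - 1) // chunk_size
    return chunks * ENTRY_SIZE

def lookup(table, vaddr):
    """Translate a virtual disk address to an offset in the FVD image."""
    idx = vaddr // CHUNK_SIZE       # which chunk the address falls in
    base = table[idx]               # image offset where that chunk is stored
    return base + (vaddr % CHUNK_SIZE)

print(table_bytes(1 << 40))  # 4194304 bytes = 4MB for a 1TB disk
```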

When the bitmap and/or the lookup table need to be modified, the changes are written to the journal rather than applied to the on-disk bitmap or table directly. When the journal fills up (which happens infrequently), the entire bitmap and the entire lookup table are flushed to disk, and the journal is recycled for reuse. The flush is quick because both structures are small.
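The commit-and-recycle protocol might be sketched as follows (a hypothetical simplification for illustration; the class and the `write_to_disk` stand-in are not part of FVD):

```python
def write_to_disk(obj):
    """Stand-in for an actual disk write (illustration only)."""
    pass

class Journal:
    """Toy model of the journal: buffer metadata changes, flush when full."""

    def __init__(self, capacity, bitmap, table):
        self.capacity = capacity
        self.records = []
        self.bitmap = bitmap    # in-memory copy of the bitmap
        self.table = table      # in-memory copy of the lookup table

    def commit(self, record):
        """Append one metadata change; recycle the journal when it fills."""
        if len(self.records) == self.capacity:
            self.flush()
        self.records.append(record)

    def flush(self):
        # Write the entire bitmap and lookup table to disk, then reuse the
        # journal space. Cheap, because both structures are small.
        write_to_disk(self.bitmap)
        write_to_disk(self.table)
        self.records.clear()
```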

Rationale

Separating the implementations of copy-on-write and storage allocation has several advantages. First, the lookup table can be optionally disabled so that FVD gets the most efficient RAW-image-like data layout. Second, it makes the metadata smaller and easier to cache. The bitmap is small because of its efficient representation, and the lookup table is small because of the large chunk size. Third, this separation enables optimizations that reduce metadata update overhead (see the paper).

The journal provides several benefits. First, updating both the bitmap and the lookup table requires only a single write to the journal. Second, K concurrent updates to any portions of the bitmap or the lookup table are converted to K sequential writes in the journal, which can be merged into a single write by the host Linux kernel. Third, it increases concurrency by avoiding locking of the bitmap and the lookup table. For example, updating one bit in the bitmap requires writing a whole 512-byte sector of the on-disk bitmap. That one bitmap sector covers a total of 512*8*64KB = 256MB of data, so no two writes to the same bitmap sector could proceed concurrently. The journal solves this problem and eliminates the unnecessary locking.
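The coverage arithmetic in the example checks out as follows (a quick verification, not FVD code):

```python
# One 512-byte bitmap sector holds 512 * 8 = 4096 bits, and each bit
# tracks one 64KB block, so one sector covers 256MB of virtual disk.
SECTOR_BYTES = 512
BLOCK_SIZE = 64 * 1024

coverage = SECTOR_BYTES * 8 * BLOCK_SIZE
print(coverage // (1024 * 1024))  # 256 (MB covered per bitmap sector)
```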

Important Features

FVD can do storage allocation, copy-on-write, copy-on-read, and adaptive prefetching, with the following two important features.

  • FVD’s functions are orthogonal, i.e., each function can be enabled or disabled individually without affecting the others. The purpose is to support diverse use cases. For example, when the bitmap is enabled and the lookup table is disabled, FVD provides a copy-on-write image that performs no storage allocation. This allows using the most appropriate host file system to do storage allocation.
  • An FVD image can be stored on either a host file system or a logical volume. When stored on a logical volume, FVD’s lookup table can manage incrementally added storage without a host file system. For example, when FVD serves a 100GB virtual disk, it initially gets 5GB of storage space from the logical volume manager (LVM) and uses it to host many 1MB chunks. When the first 5GB is used up, FVD gets another 5GB from the LVM, and so forth. Unlike QCOW2 and more like a file system, FVD does not always have to allocate a new chunk immediately after where the previous chunk was allocated. Instead, it may spread used chunks out across the storage space in order to achieve certain benefits, e.g., mimicking a raw-image-like data layout.
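The incremental-growth pattern from the second bullet can be sketched as follows (a hypothetical model; the class, its next-fit placement policy, and the implicit LVM call are illustrative assumptions, since FVD may place chunks differently to mimic a raw-image layout):

```python
EXTENT_MB = 5 * 1024    # ask the LVM for 5GB at a time, as in the example
CHUNK_MB = 1            # carve the extent into 1MB chunks

class ChunkAllocator:
    """Toy model: grow backing storage in LVM extents, hand out 1MB chunks."""

    def __init__(self):
        self.capacity_mb = 0    # storage obtained from the LVM so far
        self.used_mb = 0        # storage already handed out as chunks

    def alloc_chunk(self):
        """Return the MB offset of a newly allocated chunk."""
        if self.used_mb == self.capacity_mb:
            # Stand-in for requesting another 5GB extent from the LVM.
            self.capacity_mb += EXTENT_MB
        offset = self.used_mb   # next-fit placement for simplicity; FVD may
        self.used_mb += CHUNK_MB  # instead spread chunks out across the space
        return offset
```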