FVD was designed for both high performance and flexibility. As a next-generation image format, FVD surpasses existing image formats in two simple but fundamental ways.
- FVD separates the functions of copy-on-write and storage allocation, so that storage allocation can be performed by whichever component is most appropriate for a given scenario, be it a host file system, a host logical volume manager, FVD itself, or even another image format. This separation is the key to achieving high performance.
- FVD treats image mobility as a first-class citizen and provides copy-on-read and adaptive prefetching of the base image, in addition to the widely used copy-on-write technique.
Storage allocation is one of the most important issues in storage and file systems. When a VM writes to a data block, the lower layers decide where to store that block. Traditionally, storage allocation is done twice: first by an image format (e.g., QCOW2), and then by a host file system. This is problematic in several respects:
- Most importantly, regardless of the underlying platform, an image format insists on getting in the way and doing storage allocation in its own naïve, one-size-fits-all manner. Given the diversity of the platforms supported by QEMU, this is likely to perform poorly in many cases. Storage systems have different characteristics (solid-state drive/flash, DAS, NAS, SAN, etc.), and host file systems (GFS, NTFS, FFS, LFS, ext2/ext3/ext4, reiserFS, Reiser4, XFS, JFS, VMFS, ZFS, etc.) provide many different features and are optimized for different objectives (flash wear leveling, seek distance, reliability, etc.). An image format should piggyback on the success of the diverse solutions developed by the storage and file systems community through 40 years of hard work, rather than reinventing a naïve, one-size-fits-all wheel to redo storage allocation.
- Interference between an image format and a host file system may prevent either of them from working well, even if each is optimal by itself. For example, what happens when the image format and the host file system both perform online defragmentation simultaneously? If the image format’s storage allocation algorithm really works well, the image should be stored on a logical volume with no host file system in between. If the host file system’s algorithm works well, the image format’s algorithm should be disabled. Running both at the same time only lets them confuse each other.
- Doing storage allocation twice roughly doubles the overhead: on-disk metadata is updated twice, fragmentation is introduced twice, and metadata is cached twice.
These are fundamental problems in existing image formats, but the solution in FVD is surprisingly simple. FVD’s design makes all of the following configurations possible: 1) perform storage allocation only in a host file system; 2) perform storage allocation only in FVD (directly on a logical volume, without a host file system); 3) perform storage allocation twice, as existing image formats do; or 4) let FVD perform copy-on-write, copy-on-read, and adaptive prefetching while delegating storage allocation to any other QEMU image format (assuming that format is better at it). This flexibility lets FVD adapt to new use cases, even if an unanticipated storage technology becomes mainstream in the future (flash, nano, or whatever).
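The separation described above can be pictured as one copy-on-write layer with a pluggable allocator. The following is a minimal sketch in Python, not FVD's actual implementation. The names `CowLayer`, `IdentityAllocator`, and `TableAllocator` are hypothetical, the in-memory dictionary stands in for a file or logical volume, and writes are assumed to cover whole blocks.

```python
class IdentityAllocator:
    """Hypothetical allocator for configuration 1.

    A virtual block is stored at the same block number in the data store,
    so the layer below (e.g., a host file system) decides physical placement.
    """

    def locate(self, vblock):
        return vblock

    def allocate(self, vblock):
        return vblock


class TableAllocator:
    """Hypothetical allocator for configurations 2 and 3.

    The image layer allocates storage itself: blocks are placed in
    first-write order and tracked in a lookup table.
    """

    def __init__(self):
        self.table = {}  # virtual block number -> physical block number

    def locate(self, vblock):
        return self.table.get(vblock)

    def allocate(self, vblock):
        if vblock not in self.table:
            self.table[vblock] = len(self.table)
        return self.table[vblock]


class CowLayer:
    """Copy-on-write logic that does not care who allocates storage."""

    def __init__(self, base, allocator):
        self.base = base            # read-only base image, one entry per block
        self.allocator = allocator  # any object with locate() and allocate()
        self.store = {}             # physical block -> data (assumed backing store)
        self.written = set()        # virtual blocks that the VM has overwritten

    def write(self, vblock, data):
        # The allocator, not the copy-on-write logic, picks the location.
        self.store[self.allocator.allocate(vblock)] = data
        self.written.add(vblock)

    def read(self, vblock):
        if vblock in self.written:
            return self.store[self.allocator.locate(vblock)]
        return self.base[vblock]    # unmodified data still comes from the base


base = [b"a", b"b", b"c", b"d"]
image = CowLayer(base, TableAllocator())
image.write(3, b"Z")
print(image.read(3), image.read(0))
```

Swapping `TableAllocator` for `IdentityAllocator` changes where data lands without touching `CowLayer`. Configuration 4 would correspond to backing `store` with another image format's driver, which this sketch does not attempt to model.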
Another motivation for FVD is to enhance image mobility, which is not well supported by existing image formats. Consider the following use cases.
- In a cloud, VMs are created from read-only image templates that are stored on network-attached storage (NAS) and shared by all hosts. A VM’s image file, however, may be stored on direct-attached storage (DAS, i.e., local disk) because of its low cost. If the image uses the RAW format, creating a new VM is slow because the image template (gigabytes of data) must be copied from NAS to DAS. On the other hand, if the image uses a copy-on-write format (e.g., QCOW2), it may cause congestion on the network and NAS, because every VM may repeatedly read unmodified data from the image template on NAS. The solution in FVD is to do copy-on-read and adaptive prefetching in addition to copy-on-write. Copy-on-read saves on DAS a copy of data retrieved from NAS for later reuse. Adaptive prefetching uses resource idle time to proactively copy from NAS to DAS the rest of the image that has not yet been accessed by the VM. Prefetching is conservative: if FVD detects contention on any resource (DAS, NAS, or the network), it pauses prefetching temporarily and resumes later when the congestion disappears.
- Today, migrating a VM from one host’s DAS to another host’s DAS takes a pre-copy approach, i.e., the VM’s disk data must be copied from the source host to the target host in its entirety before the VM can start to run on the target host. Pre-copy may take a long time because disk images are large. With FVD, a VM can be migrated instantly, without first transferring its disk image. While the VM runs uninterrupted on the target host, FVD uses copy-on-read and adaptive prefetching to gradually copy the image from the source host to the target host, without user-perceived downtime.
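Both use cases rely on the same two mechanisms, which can be sketched as follows. This is an illustrative Python model under stated assumptions, not FVD's code: `CopyOnReadImage` and its methods are hypothetical names, Python containers stand in for NAS (or the source host) and DAS, and the `contended` flag is supplied by the caller because the text above does not say how FVD detects contention.

```python
class CopyOnReadImage:
    """Toy model of copy-on-read plus pausable prefetching."""

    def __init__(self, remote_blocks):
        self.remote = remote_blocks  # stands in for the template on NAS
        self.local = {}              # stands in for the VM's image on DAS
        self.remote_reads = 0        # number of accesses that went to NAS

    def _fetch(self, block):
        self.remote_reads += 1
        return self.remote[block]

    def read(self, block):
        # Copy-on-read: the first read pulls the block from NAS and keeps
        # a local copy, so later reads never touch the network.
        if block not in self.local:
            self.local[block] = self._fetch(block)
        return self.local[block]

    def write(self, block, data):
        # Copy-on-write, simplified to whole-block writes (an assumption).
        self.local[block] = data

    def prefetch_step(self, contended):
        """Copy one block that is not yet local, unless resources are busy.

        Returns True if a block was copied, False if prefetching paused
        or there is nothing left to copy.
        """
        if contended:
            return False
        for block in range(len(self.remote)):
            if block not in self.local:
                self.local[block] = self._fetch(block)
                return True
        return False

    def fully_local(self):
        return len(self.local) == len(self.remote)


image = CopyOnReadImage([b"a", b"b", b"c"])
image.read(1)                        # fetched from NAS, cached on DAS
image.read(1)                        # served locally
image.prefetch_step(contended=True)  # paused: nothing is copied
while image.prefetch_step(contended=False):
    pass                             # idle time: copy the remaining blocks
print(image.fully_local(), image.remote_reads)
```

Once `fully_local()` is true, the image no longer depends on the remote copy. In the migration use case, `remote` would represent the image on the source host rather than a template on NAS.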