ToDo/Block/Qcow2PerformanceRoadmap

Known problems and possible solutions

Fully allocated image

Should be able to perform similar to raw because there is very little handling of metadata. Additional I/O only if an L2 table must be read from the disk.

Should we increase the L2 table cache size to make it happen less often? (Currently 16 * 512 MB, QED uses more)

Known problems:

Synchronous read of L2 tables; should be made async
- General thought on making things async: Coroutines? What happened to that proposal?
We may want to have online defragmentation eventually

Growing stand-alone image

Stand-alone images (i.e. images without a backing file) aren't that interesting because you would use raw for them anyway if you needed optimal performance. We need to be "good enough" here.

However, all of the problems that arise from dealing with metadata apply for the really interesting third case, so optimizing them is an important step on the way.

Known problems:

Needs a bdrv_flush between refcount table and L2 table write
Synchronous metadata updates
Both to be solved by block-queue
- Batches writes and makes the async, can greatly reduce number of bdrv_flush calls
- Except for cache=writethrough, but this is secondary
- Should we make cache=off the default caching mode in qemu? writethrough seems to be a bit too much anyway irrespective of the image format.
Synchronous refcount table reads
- How frequent are cache misses?
- Making this one async is much harder than L2 table reads. We can make it a goal for mid-term, but short term we should make it hurt less if it's a problem in practice.
  - It's probably not, because (without internal snapshots or compression) we never free clusters, so we fill it sequentially and only load a new one when the old one is full - and that one we don't even read, but write, so block-queue will help
Things like refcount table growth are completely synchronous.
- Not a real problem, because it happens approximately never.

Growing image with backing file

This is the really interesting scenario where you need an image format that provides some features. For qcow2, it's mostly the same as above.

See stand-alone, plus:

Needs an bdrv_flush between COW and writing to the L2 table
- qcow2 has already one after refcount table write, so no additional overhead
Synchronous COW
- Should be fairly easy to make async

Benchmark results

The following FFSB benchmark results model QCOW2 performance:

QCOW2 (2010/9/22) is the upstream QCOW2 code including Kevin's work to avoid flushes when possible.
QED with L2 Flush is QED with a flush on L2 update, the best QCOW2 can get (without some batching) since a flush is required after refcount update.
QED is the QED v1 patchset, a fully asynchronous block driver implementation.

All results are throughput in MB/s.

FFSB Scenario	Threads	QCOW2 (2010/9/22)	QED with L2 Flush	QED
Large File Creates (Block Size=256KB)	1	95.1	101.0	121.5
	8	110.0	95.2	125.0
	16	89.4	96.8	134.5
Sequential Reads (Block Size=256KB)	1	68.2	154.0	149.0
	8	262.0	849.0	817.5
	16	276.0	908.0	922.0
Large File Creates (Block Size=8KB)	1	11.8	13.9	16.3
	8	18.4	29.7	41.0
	16	17.7	30.2	43.3
Sequential Reads (Block Size=8KB)	1	20.8	26.3	22.2
	8	13.8	136.0	131.5
	16	13.8	160.0	160.0
Random Reads (Block Size=8KB)	1	3.8	3.5	3.6
	8	13.8	26.0	25.7
	16	13.7	48.8	48.4
Random Writes (Block Size=8KB)	1	22.6	22.5	22.6
	8	14.0	113.1	112.7
	16	14.0	125.2	126.7
Mail Server (Block Size=8KB)	1	7.8	8.8	8.3
	8	10.8	45.8	46.4
	16	11.8	72.0	73.2
Mixed I/O (70% Reads, 30% Writes, Block Size=8KB)	1	10.2	9.5	8.8
	8	24.8	63.5	62.9
	16	28.6	114.1	111.9

O_DIRECT benchmark execution. Image file on ext4 on LVM on 8 LUNs.