ToDo/Block: Difference between revisions

Line 27: Line 27:
* blockdev-add probably shouldn't be able to reference a node that has blockers
* blockdev-add probably shouldn't be able to reference a node that has blockers


== Basic infrastructure for blockdev-add [Kevin, Markus] ==
== -blockdev world ==
* Allow libvirt to override backing file options (filename, format, cache mode)
** Convert bdrv_open() flags to QDict options (or handle in drive_init)
** Implemented: filename, format, driver-specific options
** Missing: Cache mode, AIO mode, discard mode, copy on read (basically everything that is contained in bs->open_flags)
*** These options are inherited; need to clean up and fix inheritance of QDict options
** Investigating possible QemuOpts misuse killing drive_del [Markus]
* Convert remaining drivers to make use of "QDict options" argument
* bdrv_reopen() needs to use "QDict options" instead of only flags


== blockdev-add + blockdev-del QMP interface ==
=== Basic infrastructure for blockdev-add [Kevin, Markus] ===
* Convert remaining drivers to make use of "QDict options" argument: iscsi, sheepdog, rbd
 
=== blockdev-add + blockdev-del QMP interface ===
* By default, return an error for blockdev-del if reference count > 1
* By default, return an error for blockdev-del if reference count > 1
* But have a force option that closes the image file, even if it breaks the remaining users (e.g. uncooperative guest that doesn't release its  PCI device)
* But have a force option that closes the image file, even if it breaks the remaining users (e.g. uncooperative guest that doesn't release its  PCI device)
* Note: backends created with blockdev-add are currently indestructible: they aren't deleted on frontend unplug (commit 2d246f0), and can't be deleted with drive_del (commit 48f364d)
* Note: backends created with blockdev-add are currently indestructible: they aren't deleted on frontend unplug (commit 2d246f0), and can't be deleted with drive_del (commit 48f364d)


== Split BlockBackend from BlockDriverState [Max, Markus] ==
=== Split BlockBackend from BlockDriverState [Max, Markus] ===
* Make block driver private embedded in BlockDriverState instead of opaque pointer
* Make block driver private embedded in BlockDriverState instead of opaque pointer
* To be moved to BlockFilters later (stay in BDS for now; BlockFilters implemented as BlockDriver):
* To be moved to BlockFilters later (stay in BDS for now; BlockFilters implemented as BlockDriver):
Line 48: Line 43:
** copy_on_read
** copy_on_read


== BlockFilter and dynamic reconfiguration of the BDS graph ==
=== BlockFilter and dynamic reconfiguration of the BDS graph ===
* Add/remove (e.g. filter) BDSes at runtime
* Add/remove (e.g. filter) BDSes at runtime
* Ability to implement light-weight block drivers that play together with snapshots (e.g. block debug, active-mirroring, copy-on-read, I/O throttling, etc)
* Ability to implement light-weight block drivers that play together with snapshots (e.g. block debug, active-mirroring, copy-on-read, I/O throttling, etc)
Line 58: Line 53:
* Be careful to never add cycles to the graph!
* Be careful to never add cycles to the graph!


== New design for better encryption support [Markus] ==
=== Dynamic graph reconfiguration (e.g. adding block filters, taking snapshots, etc.) ===
* Existing QCOW/QCOW2 encryption is cryptographically unsound (see commit 136cd19)
* Where does the new node get inserted and how to specify how it is linked up with the existing nodes?
* Its integration into QEMU has serious defects, and usability is comically bad (see commit a1f688f)
** On a given "arrow" between two nodes (only works with 1 child, 1 parent)
* Deprecated in 2.3 (commit a1f688f), intend to rip out in 2.4
** On a given set of arrows (possibly more complex than what is really needed?)
* How does removing a node work with more than one child of the deleted node?
* Keep using the existing QMP command for I/O throttling for now, until we understand the general problem reasonably well
* Action:
** Figure out the general problem
** Split I/O throttling off into own BDS [Benoît]
*** Requires some care with snapshots etc.
 
=== BDS graph rules and manipulating arbitrary nodes ===
* Arbitrary nodes
** Accept node-name where we now have other means to id BDS
*** drive-mirror of arbitrary node [Berto]
*** block-stream of arbitrary node [Berto]
** Action:
*** Add base-nodename argument to block-stream command [Jeff]
*** Allow node names in the device argument of the block-stream command [Berto]
**** If command can modify part of a backing chain, need to add option to update the parent's backing filename field on disk! [Jeff]
**** Add optional backing-filename argument (since libvirt may use fd passing and QEMU's filename is useless) [Jeff]
***** Done: block-commit, block-stream, change-backing-file
***** Might need more
*** Deprecate filename references in QMP commands in favour of node names (e.g. streaming base) [Jeff?]
 
=== Add blockdev-* equivalents for drive_* commands ===
* Can't specify blockdev-add options for other QMP commands creating new BDSes (e.g. block jobs, live snapshots)
* Can't specify image creation options for QMP commands creating new image files
* Solve this by adding blockdev-* commands that work only on existing BDSes (referred to by node-name), so that users can use qemu-img and blockdev-add to set the right options
 
== New design for better encryption support [Dan] ==
* Deprecated in 2.3 (commit a1f688f), disabled in the system emulator in 2.7 (commit 8c0dcbc4a)
* Dan Berrange intends to work on a replacement, starting in a few months https://www.berrange.com/posts/2015/03/17/qemu-qcow2-built-in-encryption-just-say-no-deprecated-now-to-be-deleted-soon/
* Dan Berrange intends to work on a replacement, starting in a few months https://www.berrange.com/posts/2015/03/17/qemu-qcow2-built-in-encryption-just-say-no-deprecated-now-to-be-deleted-soon/
** LUKS format
** LUKS format driver is merged
** stackable (may require BlockFilter)
** qcow2 integration is still missing


== Block jobs ==
== Block jobs ==
* Live streaming of intermediate layers (using block-stream) [Berto]
** libvirt is going to expose the functionality, may need introspection to detect it, though
* Main loop prints "main-loop: WARNING: I/O thread spun for 1000 iterations" when block job is running.
* Main loop prints "main-loop: WARNING: I/O thread spun for 1000 iterations" when block job is running.
** We have "block_job_sleep_ns(..., 0)" in block job coroutines, but that doesn't really yield the BQL to VCPU as desired.
** We have "block_job_sleep_ns(..., 0)" in block job coroutines, but that doesn't really yield the BQL to VCPU as desired.


== Remove bs->job field and allow multiple jobs on a BDS [mreitz] ==
=== Live streaming of intermediate layers (using block-stream) [Berto] ===
** QEMU part is merged
** libvirt is going to expose the functionality, may need introspection to detect it, though
 
=== Active mirroring: just like mirroring, but live, on the fly, skip the bitmap [Kevin] ===
* similar to drive-backup
* security (virus-scan or some sort of inspection)
* should be implemented as a block filter
 
=== Remove bs->job field and allow multiple jobs on a BDS [John/Max] ===
* allows more than one blockjob at a block device at a time
* allows more than one blockjob at a block device at a time
* infrastructure, refactoring work
* infrastructure, refactoring work
* Careful not to break QMP API, need wrapper
* Careful not to break QMP API, need wrapper
* Allow non-block jobs (long-running operations outside block layer)
* Allow non-block jobs (long-running operations outside block layer)
* Alternatively, move to BB for now


== Test Infrastructure ==
=== Image fleecing [jsnow] ===
* Image creation with existing BlockDriverState as backing file (BlockDriverState ref count)
* Patches: http://lists.gnu.org/archive/html/qemu-devel/2013-11/msg03692.html
** still not merged
* Writable backing file
 
=== Incremental backup [jsnow] ===
* Backup applications need a dirty block bitmap so they can read only blocks that changed
* Two approaches discussed:
** Dirty bitmap file
** Write changed data through NBD:
*** See also: http://lists.gnu.org/archive/html/qemu-devel/2013-11/msg03035.html
** Merkle tree - hash tree allows efficient syncing of image files between hosts
 
=== Block job I/O throttling ===
* Reuse Benoît's throttling implementation
* Handle large buffer sizes used by block jobs
** block jobs like to work in bulk for efficiency but throttling doesn't like big, bursty requests (note from Benoît: I think the solution would be to make the clock used for the throttling computation coarser)
 
== Test ==
=== Test Infrastructure ===
* Desires?
* Desires?
* support for testing AIO requests
* support for testing AIO requests
Line 86: Line 135:
** related to SCSI req tags
** related to SCSI req tags


== Tests for -drive discard= ==
=== Tests for -drive discard= ===
* Currently the discard feature is not well-tested in qemu-iotests
* Currently the discard feature is not well-tested in qemu-iotests


== Block device model tests ==
=== Block device model tests ===
* AHCI got decent coverage
* AHCI got decent coverage
* rest basic to nonexistent
* rest basic to nonexistent


== Active mirroring: just like mirroring, but live, on the fly, skip the bitmap ==
=== iotests.py - Python module for writing qemu-iotests [stefan] ===
* similar to drive-backup
* Extract qemu.py generic QEMU interaction code
* security (virus-scan or some sort of inspection)
* Document qemu.py and iotests.py so it meets standard Python module conventions
* should be implemented as a block filter
* Port live migration qemu-iotest to Python to see if it's preferable to the shell version
 
=== Broken or unreliable qemu-iotests ===
* 136 on tmpfs is not working or unreliable


== qcow2 ==
== qcow2 ==
* Cluster allocation performance: [kevin]
* Cluster allocation performance: [Kevin/John]
** Delayed COW  
** Delayed COW  
** Use a single request to write both guest data and COW padding
** Use a single request to write both guest data and COW padding
Line 111: Line 163:
* subclusters
* subclusters
** allocate larger chunks, COW smaller ones, for perf
** allocate larger chunks, COW smaller ones, for perf
* Implement refcount_bits for smaller metadata [Max]
** 2.3 can create/use such images; amend is missing yet
* Use finer grained locking in the Qcow2Cache so that random I/O loads can load/update multiple L2 tables at once instead of serialising everything
* Use finer grained locking in the Qcow2Cache so that random I/O loads can load/update multiple L2 tables at once instead of serialising everything
* Shrink support in qcow2_truncate()
* Shrink support in qcow2_truncate()


== Image fleecing [jsnow] ==
=== Header extension for qcow2 generation id ===
* Image creation with existing BlockDriverState as backing file (BlockDriverState ref count)
Desirable to add two header extensions:
* Patches: http://lists.gnu.org/archive/html/qemu-devel/2013-11/msg03692.html
# Double Generation id (one for metadata, one for guest visible content), coupled with an auto-clear feature bit, for use in backing images
** still not merged
# Expected backing file generations, for use in overlay images
* Writable backing file
Usage:
* Any program that opens a qcow2 file read-write with a generation id header must increment the appropriate generation ids before making that sort of change to that file. The id does not have to be incremented for every change to the file, only for the first time a change is made since the file was opened for writing.
* A generation id is valid only if the auto-clear bit is still set (thus, if an older qemu opens a backing image, it is required to leave the unrecognized generation id header alone, but also required to clear the unknown auto-clear bit, making it obvious that the generation id header may no longer be accurate and a new generation id is needed once new qemu again handles the file).
* Any program that opens a qcow2 file that has expected backing generation header should default to verifying that the backing file has that generation id.  If the backing file id is not correct, then the access should fail unless the user supplies an extra flag to acknowledge the risk/update the expected id.
* Should internal snapshots track id? Open question.
** If snapshot includes generation id, then you can roll back to that id as part of reverting to a snapshot. But the id must then be something like a UUID, as mere linear incrementing causes branching collisions (take snapshot at id 2, then create id 3, then roll back to 2, then create a new id 3, but the two "id 3" states are not the same, which breaks any overlay depending on id 3).
** If snapshot does not include generation id, then the mere act of taking or reverting to a snapshot increments the id, and overlays must use a forced open to accept the new id, even if guest-visible contents are unchanged.


== Incremental backup [jsnow] ==
* Backup applications need a dirty block bitmap so they can read only blocks that changed
* Two approaches discussed:
** Dirty bitmap file
** Write changed data through NBD:
*** See also: http://lists.gnu.org/archive/html/qemu-devel/2013-11/msg03035.html
** Merkle tree - hash tree allows efficient syncing of image files between hosts


== virtio data plane ==
== virtio data plane ==
Line 153: Line 202:
== Make qemu-img use QMP command implementations internally (e.g. use mirroring for qemu-img convert) [Max] ==
== Make qemu-img use QMP command implementations internally (e.g. use mirroring for qemu-img convert) [Max] ==
* Ensures that live operations provide the same functionality as we have offline
* Ensures that live operations provide the same functionality as we have offline
== Block job I/O throttling ==
* Reuse Benoît's throttling implementation
* Handle large buffer sizes used by block jobs
** block jobs like to work in bulk for efficiency but throttling doesn't like big, bursty requests (note from Benoît: I think the solution would be to make the clock used for the throttling computation coarser)


== I/O accounting (for query-blockstats) ==
== I/O accounting (for query-blockstats) ==
Line 175: Line 219:
** Averaging module under review
** Averaging module under review
** More to come
** More to come
== Add guard page to bottom of coroutine stacks in order to detect stack overflows ==


== Performance improvements ==
== Performance improvements ==
Line 194: Line 236:


== Trace guest block I/O, replay with qemu-io ==
== Trace guest block I/O, replay with qemu-io ==
== iotests.py - Python module for writing qemu-iotests [stefan] ==
* Extract qemu.py generic QEMU interaction code
* Document qemu.py and iotests.py so it meets standard Python module conventions
* Port live migration qemu-iotest to Python to see if it's preferable to the shell version
== Dynamic graph reconfiguration (e.g. adding block filters, taking snapshots, etc.) ==
* Where does the new node get inserted and how to specify how it is linked up with the existing nodes?
** On a given "arrow" between two nodes (only works with 1 child, 1 parent)
** On a given set of arrows (possibly more complex than what is really needed?)
* How does removing a node work with more than one child of the deleted node?
* Keep using the existing QMP command for I/O throttling for now, until we understand the general problem reasonably well
* Action:
** Figure out the general problem
** Split I/O throttling off into own BDS [Benoît]
*** Requires some care with snapshots etc.
== Proper specification for blockdev-add [Kevin, Max] ==
* What does -drive add?
* Filename parsing and protocol detection
* Format probing
* Desugaring -drive
* What does BDRV_O_PROTOCOL mean?
** Disable format probing
** Parse protocol name from filename (but not from options QDict)
*** Put filename then into options QDict
** Set bs->growable
** Disable adding of a bs->file layer
** Ignore BDRV_O_SNAPSHOT
** Which callers need which of these properties?
* Action:
** Convert network block drivers to QDict options (keep legacy filename parsing for compatibility)
** Add network block drivers to blockdev-add
** Translate bdrv_open() arguments into options qdict, if appropriate [Kevin]
*** Translate legacy "filename" to qdict
** Specify bdrv_open() behavior (especially magic) [Kevin]
== BDS graph rules and manipulating arbitrary nodes ==
* A proper design: iterate children, safely manipulate graph
** Action:
*** Get rid of bdrv_swap() and update child/parent pointers instead (depends on BlockBackend) [Kevin]
*** Add parents list to BlockDriverState (could be realloc array or just a function interface that operates on ->file/->backing_hd) [nice to have]
* Arbitrary nodes
** Accept node-name where we now have other means to id BDS
*** drive-mirror of arbitrary node [Berto]
*** block-stream of arbitrary node [Berto]
** Action:
*** Add base-nodename argument to block-stream command [Jeff]
*** Allow node names in the device argument of the block-stream command [Berto]
**** If command can modify part of a backing chain, need to add option to update the parent's backing filename field on disk! [Jeff]
**** Add optional backing-filename argument (since libvirt may use fd passing and QEMU's filename is useless) [Jeff]
***** Done: block-commit, block-stream, change-backing-file
***** Might need more
*** Deprecate filename references in QMP commands in favour of node names (e.g. streaming base) [Jeff?]
== Add blockdev-* equivalents for drive_* commands ==
* Can't specify blockdev-add options for other QMP commands creating new BDSes (e.g. block jobs, live snapshots)
* Can't specify image creation options for QMP commands creating new image files
* Solve this by adding blockdev-* commands that work only on existing BDSes (referred to by node-name), so that users can use qemu-img and blockdev-add to set the right options


== Dataplane ==
== Dataplane ==
Line 280: Line 263:
** what if blocks are shared with users that are likely to use them again?
** what if blocks are shared with users that are likely to use them again?
* should it advise kernel src is read sequentially?
* should it advise kernel src is read sequentially?
== Header extension for qcow2 generation id ==
Desirable to add two header extensions:
# Double Generation id (one for metadata, one for guest visible content), coupled with an auto-clear feature bit, for use in backing images
# Expected backing file generations, for use in overlay images
Usage:
* Any program that opens a qcow2 file read-write with a generation id header must increment the appropriate generation ids before making that sort of change to that file.  The id does not have to be incremented for every change to the file, only for the first time a change is made since the file was opened for writing.
* A generation id is valid only if the auto-clear bit is still set (thus, if an older qemu opens a backing image, it is required to leave the unrecognized generation id header alone, but also required to clear the unknown auto-clear bit, making it obvious that the generation id header may no longer be accurate and a new generation id is needed once new qemu again handles the file).
* Any program that opens a qcow2 file that has expected backing generation header should default to verifying that the backing file has that generation id.  If the backing file id is not correct, then the access should fail unless the user supplies an extra flag to acknowledge the risk/update the expected id.
* Should internal snapshots track id? Open question.
** If snapshot includes generation id, then you can roll back to that id as part of reverting to a snapshot. But the id must then be something like a UUID, as mere linear incrementing causes branching collisions (take snapshot at id 2, then create id 3, then roll back to 2, then create a new id 3, but the two "id 3" states are not the same, which breaks any overlay depending on id 3).
** If snapshot does not include generation id, then the mere act of taking or reverting to a snapshot increments the id, and overlays must use a forced open to accept the new id, even if guest-visible contents are unchanged.
== Broken or unreliable qemu-iotests ==
* 136 on tmpfs is not working or unreliable


== Dependency graph ==
== Dependency graph ==

Revision as of 16:25, 12 December 2016

This page contains block layer and storage features that have been proposed. These features may not be in active development and questions about them should be addressed to the QEMU mailing list at qemu-devel@nongnu.org.

op blockers [Jeff]

  • Mutual exclusion of operations/background jobs
    • Streaming in two different parts of the backing chain - allowed? (Benoît thought not, but does anything break?)
    • Does streaming only require that streamed images stay read-only (i.e. the backing chain segment on which the operation is performed)?
    • Live commit in the opposite direction at the same time?
    • Action:
      • Draw up matrix of operations (mirror, stream, resize, etc)
      • Make op blocker mechanism use matrix as data instead of code (define an array)
      • Enforce that new QMP/QAPI commands and block jobs add themselves to the matrix
  • node-name allows starting operations in the middle of the chain; we need to protect against incompatible concurrent operations
    • In fact, we even used paths before node-name (e.g. for live commit), so this has existed for a while
  • bs->backing_blocker already forbids almost everything on backing files
    • Except live commit, which needs to be forbidden only when another job runs on the same chain
  • Plan for 2.1 was to block all nodes recursively
    • bdrv_swap() during block job completion turns out to be nasty, especially for live commit of active layer:
      • Need to clean up blockers on the removed subchain
      • Which blockers should the newly swapped in node have?
  • Alternative plan for 2.1:
    • Keep checking blockers on the requested node (for bs->backing_blocker to be effective)
    • But also check in the active layer because this is where block jobs do their blocking
      • bottommost node might work as well
        • As Kevin pointed out on IRC, in the current code blockers exist on backing files that don't exist on the active layer
  • Long term (2.2+): Block categories of operations
  • blockdev-add probably shouldn't be able to reference a node that has blockers
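
The "matrix as data instead of code" action item above could be sketched roughly as follows. This is only an illustration in Python, not QEMU's C implementation: the operation names and every entry in the conflict table are assumptions made for the example, not actual block layer policy.

```python
from enum import Enum, auto


class Op(Enum):
    # Hypothetical operation categories; the real list would come from the
    # matrix that the action item asks for.
    MIRROR = auto()
    STREAM = auto()
    COMMIT = auto()
    RESIZE = auto()


# Illustrative conflict table (an assumption, not QEMU policy). Each entry is
# an unordered pair of operations that must not run on the same node at the
# same time. A one-element set means "conflicts with itself".
CONFLICTS = {
    frozenset({Op.STREAM}),
    frozenset({Op.COMMIT}),
    frozenset({Op.STREAM, Op.COMMIT}),
    frozenset({Op.RESIZE, Op.MIRROR}),
    frozenset({Op.RESIZE, Op.STREAM}),
    frozenset({Op.RESIZE, Op.COMMIT}),
}


def is_blocked(requested, running):
    """Return True if `requested` conflicts with any operation in `running`."""
    return any(frozenset({requested, op}) in CONFLICTS for op in running)
```

With the table held as data, adding a new operation means adding rows to `CONFLICTS` rather than touching the checking code, which is what would make it practical to enforce that new commands register themselves.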

-blockdev world

Basic infrastructure for blockdev-add [Kevin, Markus]

  • Convert remaining drivers to make use of "QDict options" argument: iscsi, sheepdog, rbd

blockdev-add + blockdev-del QMP interface

  • By default, return an error for blockdev-del if reference count > 1
  • But have a force option that closes the image file, even if it breaks the remaining users (e.g. uncooperative guest that doesn't release its PCI device)
  • Note: backends created with blockdev-add are currently indestructible: they aren't deleted on frontend unplug (commit 2d246f0), and can't be deleted with drive_del (commit 48f364d)
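
The proposed blockdev-del semantics above (error by default when the reference count is greater than 1, with a force option) can be sketched as below. This is a toy Python model under stated assumptions: the `Node` class, the `force` parameter name, and the exception type are hypothetical and do not describe the real QMP command.

```python
class BlockdevDelError(Exception):
    """Raised when a node is still in use and force was not requested."""


class Node:
    # Hypothetical stand-in for a block backend; only tracks what the
    # sketch needs.
    def __init__(self, name, refcount=1):
        self.name = name
        self.refcount = refcount
        self.closed = False


def blockdev_del(nodes, name, force=False):
    """Delete `name` from the `nodes` dict.

    Fails if other users still hold references, unless `force` is set, in
    which case the image is closed even though remaining users may break.
    """
    node = nodes[name]
    if node.refcount > 1 and not force:
        raise BlockdevDelError(
            f"{name} has {node.refcount} references; use force to close anyway"
        )
    node.closed = True
    del nodes[name]
    return node
```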

Split BlockBackend from BlockDriverState [Max, Markus]

  • Make block driver private embedded in BlockDriverState instead of opaque pointer
  • To be moved to BlockFilters later (stay in BDS for now; BlockFilters implemented as BlockDriver):
    • bps_limits
    • copy_on_read

BlockFilter and dynamic reconfiguration of the BDS graph

  • Add/remove (e.g. filter) BDSes at runtime
  • Ability to implement light-weight block drivers that play together with snapshots (e.g. block debug, active-mirroring, copy-on-read, I/O throttling, etc)
    • Converting current I/O throttling code to a block filter should be simple, mostly a mechanical task.
  • Requires BlockBackend split
    • Keep filters on top even after taking snapshots
  • filters implement ops normally, and call out to their child BDS explicitly, no before- or after-ops-magic
  • Benoît's customer may want I/O throttling in arbitrary places in the graph
  • Be careful to never add cycles to the graph!
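
One way to honour the "never add cycles" rule is to check reachability before linking two nodes. The sketch below is a generic graph check in Python, assuming the graph is held as a dict mapping each node name to its list of children; that representation is an assumption for illustration, not how QEMU stores the BDS graph.

```python
def would_create_cycle(children, parent, child):
    """Return True if adding the edge parent -> child would create a cycle.

    `children` maps a node name to the list of its child node names. The new
    edge closes a cycle exactly when `parent` is already reachable from
    `child` (including the case parent == child).
    """
    stack = [child]
    seen = set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(children.get(node, []))
    return False
```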

Dynamic graph reconfiguration (e.g. adding block filters, taking snapshots, etc.)

  • Where does the new node get inserted and how to specify how it is linked up with the existing nodes?
    • On a given "arrow" between two nodes (only works with 1 child, 1 parent)
    • On a given set of arrows (possibly more complex than what is really needed?)
  • How does removing a node work with more than one child of the deleted node?
  • Keep using the existing QMP command for I/O throttling for now, until we understand the general problem reasonably well
  • Action:
    • Figure out the general problem
    • Split I/O throttling off into own BDS [Benoît]
      • Requires some care with snapshots etc.
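
The first option above, inserting a new node on a given "arrow" between one parent and one child, might look like this minimal sketch. The dict-of-lists graph and the function name are hypothetical and chosen for illustration only. It does not address the open question of removing a node with several children.

```python
def insert_on_arrow(children, parent, child, new_node):
    """Replace the edge parent -> child with parent -> new_node -> child.

    `children` maps a node name to the list of its child node names and is
    modified in place. Raises ValueError if the arrow does not exist or if
    `new_node` is already part of the graph.
    """
    if new_node in children:
        raise ValueError(f"{new_node} is already in the graph")
    edges = children[parent]
    idx = edges.index(child)  # ValueError if there is no such arrow
    edges[idx] = new_node
    children[new_node] = [child]
```

For example, starting from `{"virtio": ["qcow2"], "qcow2": ["file"], "file": []}`, calling `insert_on_arrow(graph, "virtio", "qcow2", "throttle")` leaves `virtio` pointing at `throttle`, which in turn points at `qcow2`.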

BDS graph rules and manipulating arbitrary nodes

  • Arbitrary nodes
    • Accept node-name where we now have other means to id BDS
      • drive-mirror of arbitrary node [Berto]
      • block-stream of arbitrary node [Berto]
    • Action:
      • Add base-nodename argument to block-stream command [Jeff]
      • Allow node names in the device argument of the block-stream command [Berto]
        • If command can modify part of a backing chain, need to add option to update the parent's backing filename field on disk! [Jeff]
        • Add optional backing-filename argument (since libvirt may use fd passing and QEMU's filename is useless) [Jeff]
          • Done: block-commit, block-stream, change-backing-file
          • Might need more
      • Deprecate filename references in QMP commands in favour of node names (e.g. streaming base) [Jeff?]

Add blockdev-* equivalents for drive_* commands

  • Can't specify blockdev-add options for other QMP commands creating new BDSes (e.g. block jobs, live snapshots)
  • Can't specify image creation options for QMP commands creating new image files
  • Solve this by adding blockdev-* commands that work only on existing BDSes (referred to by node-name), so that users can use qemu-img and blockdev-add to set the right options

New design for better encryption support [Dan]

Block jobs

  • Main loop prints "main-loop: WARNING: I/O thread spun for 1000 iterations" when block job is running.
    • We have "block_job_sleep_ns(..., 0)" in block job coroutines, but that doesn't really yield the BQL to VCPU as desired.

Live streaming of intermediate layers (using block-stream) [Berto]

  • QEMU part is merged
  • libvirt is going to expose the functionality, may need introspection to detect it, though

Active mirroring: just like mirroring, but live, on the fly, skip the bitmap [Kevin]

  • similar to drive-backup
  • security (virus-scan or some sort of inspection)
  • should be implemented as a block filter

Remove bs->job field and allow multiple jobs on a BDS [John/Max]

  • allows more than one blockjob at a block device at a time
  • infrastructure, refactoring work
  • Careful not to break QMP API, need wrapper
  • Allow non-block jobs (long-running operations outside block layer)

Image fleecing [jsnow]

Incremental backup [jsnow]

Block job I/O throttling

  • Reuse Benoît's throttling implementation
  • Handle large buffer sizes used by block jobs
    • block jobs like to work in bulk for efficiency but throttling doesn't like big, bursty requests (note from Benoît: I think the solution would be to make the clock used for the throttling computation coarser)

Test

Test Infrastructure

  • Desires?
  • support for testing AIO requests
    • No design yet, but we need some way to label I/O requests in blkdebug
    • right now we sleep, which is stupid
    • related to SCSI req tags

Tests for -drive discard=

  • Currently the discard feature is not well-tested in qemu-iotests

Block device model tests

  • AHCI got decent coverage
  • rest basic to nonexistent

iotests.py - Python module for writing qemu-iotests [stefan]

  • Extract qemu.py generic QEMU interaction code
  • Document qemu.py and iotests.py so it meets standard Python module conventions
  • Port live migration qemu-iotest to Python to see if it's preferable to the shell version

Broken or unreliable qemu-iotests

  • 136 on tmpfs is not working or unreliable

qcow2

  • Cluster allocation performance: [Kevin/John]
    • Delayed COW
    • Use a single request to write both guest data and COW padding
    • Journalling (should help a lot with internal COW, and possibly with delayed COW)
  • Run-time image file preallocation (fallocate 128 MB or whatever at the end of the image file to avoid host file system fragmentation; like Parallels series "write/create for Parallels images with reasonable performance" in v3)
  • qcow2 backing file validation (parent modification invalidates children) [jeff]
    • similar to vmdk and vhdx
  • qcow2 internal snapshot read-only BlockDriverState
    • Allows accessing snapshots while guest accesses disk image
    • Tricky, insufficient prio
  • subclusters
    • allocate larger chunks, COW smaller ones, for perf
  • Use finer grained locking in the Qcow2Cache so that random I/O loads can load/update multiple L2 tables at once instead of serialising everything
  • Shrink support in qcow2_truncate()

Header extension for qcow2 generation id

Desirable to add two header extensions:

  1. Double Generation id (one for metadata, one for guest visible content), coupled with an auto-clear feature bit, for use in backing images
  2. Expected backing file generations, for use in overlay images

Usage:

  • Any program that opens a qcow2 file read-write with a generation id header must increment the appropriate generation ids before making that sort of change to that file. The id does not have to be incremented for every change to the file, only for the first time a change is made since the file was opened for writing.
  • A generation id is valid only if the auto-clear bit is still set (thus, if an older qemu opens a backing image, it is required to leave the unrecognized generation id header alone, but also required to clear the unknown auto-clear bit, making it obvious that the generation id header may no longer be accurate and a new generation id is needed once new qemu again handles the file).
  • Any program that opens a qcow2 file that has expected backing generation header should default to verifying that the backing file has that generation id. If the backing file id is not correct, then the access should fail unless the user supplies an extra flag to acknowledge the risk/update the expected id.
  • Should internal snapshots track id? Open question.
    • If snapshot includes generation id, then you can roll back to that id as part of reverting to a snapshot. But the id must then be something like a UUID, as mere linear incrementing causes branching collisions (take snapshot at id 2, then create id 3, then roll back to 2, then create a new id 3, but the two "id 3" states are not the same, which breaks any overlay depending on id 3).
    • If snapshot does not include generation id, then the mere act of taking or reverting to a snapshot increments the id, and overlays must use a forced open to accept the new id, even if guest-visible contents are unchanged.
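
The branching collision described in the snapshot discussion above can be reproduced with a small simulation. The sketch below is plain Python and does not touch any qcow2 structures; the helper names are hypothetical. It replays the sequence from the text (reach id 2, take a snapshot, modify, revert, modify again) with either a linear counter or a UUID as the id scheme.

```python
import uuid


def linear_next(current):
    """Linear scheme: the new id is the previous id plus one."""
    return 1 if current is None else current + 1


def uuid_next(current):
    """UUID scheme: every modification gets a fresh random id."""
    return uuid.uuid4()


def replay(next_id):
    """Replay snapshot, modify, revert, modify and return both branch ids."""
    gen = next_id(None)        # id 1
    gen = next_id(gen)         # id 2
    snapshot = gen             # the snapshot records the generation id
    first_branch = next_id(gen)
    gen = snapshot             # revert to the snapshot
    second_branch = next_id(gen)
    return first_branch, second_branch
```

With `linear_next` both branches end up with the same id although their contents differ, so an overlay that expects that id cannot tell them apart. With `uuid_next` the two ids differ.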


virtio data plane

  • fine-grained critical sections
    • Requirement for multiqueue and virtio-scsi MMIO thread safety
    • posted on the list by Paolo for 2.6
    • Fine grained BDS lock: change aio_context_acquire/release to bdrv_acquire/release. BDS is acquired by tracked_request_begin, released by tracked_request_end, and released/re-acquired around bs->file operations.
  • thread-safe dirty bitmap for migration
    • atomic access done, missing: hotplug support
    • allows removal of vring.c
  • multiqueue
    • Single BlockDriverState, multiple independent threads accessing in parallel
    • Allows us to extend Linux multiqueue block layer up into guest
    • For maximum SMP scalability and performance with high IOPS SSDs

multiqueue plan:

  1. introduce bdrv_lock/bdrv_unlock that just calls aio_context_acquire/release
  2. convert all aio_context_acquire/release to bdrv_lock/bdrv_unlock
  3. lock/unlock in bdrv_* implementation rather than in the callers (it's a recursive mutex)
  4. introduce QemuMutex for dataplane
  5. unlock/lock in drivers around calls to bs->file->ref
  6. separate QemuMutex for each virtqueue
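
Steps 1 to 3 of the plan can be sketched as follows. This is a minimal Python model, not QEMU code: the classes and functions only borrow the names used in the plan, `threading.RLock` stands in for the recursive mutex, and `bdrv_write` is a hypothetical stand-in for any `bdrv_*` function.

```python
import threading


class AioContext:
    """Stand-in for the real AioContext; only the recursive lock is modelled."""

    def __init__(self):
        self._lock = threading.RLock()

    def acquire(self):
        self._lock.acquire()

    def release(self):
        self._lock.release()


class BlockDriverState:
    """Stand-in BDS with an in-memory buffer instead of an image file."""

    def __init__(self, ctx, size=16):
        self.ctx = ctx
        self.data = bytearray(size)


# Step 1: bdrv_lock/bdrv_unlock initially just forward to the AioContext.
def bdrv_lock(bs):
    bs.ctx.acquire()


def bdrv_unlock(bs):
    bs.ctx.release()


# Step 3: the implementation takes the lock itself. Because the mutex is
# recursive, callers that still hold it (step 2) do not deadlock.
def bdrv_write(bs, offset, buf):
    bdrv_lock(bs)
    try:
        bs.data[offset:offset + len(buf)] = buf
    finally:
        bdrv_unlock(bs)
```

Once every caller goes through `bdrv_lock`/`bdrv_unlock`, the implementation behind those two functions can later be swapped for a finer-grained mutex (steps 4 to 6) without touching the callers again.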

Make qemu-img use QMP command implementations internally (e.g. use mirroring for qemu-img convert) [Max]

  • Ensures that live operations provide the same functionality as we have offline

I/O accounting (for query-blockstats)

  • Driving it was made the device model's responsibility (commit a597e79 in 2011)
    • Most of them still don't
    • The ones that do are inconsistent
    • Consequently, query-blockstats is better treated as a wild guess, not data
  • Need to take a step back
    • Benoît got us use cases, discussed on list
      • measuring in the device model is good for billing
      • some metrics are missing
      • It would be good to collect the same data everywhere in the BDS graph for telemetry purposes (seeking hidden costs)
      • having a dedicated JSON socket to output accounting data would be good
      • so we can keep analysis out of qemu
  • Working on revamping the I/O accounting infrastructure (Benoît)
    • Preliminary patches merged
    • Averaging module under review
    • More to come

Performance improvements

  • old ideas: coroutine bypass (Paolo), coroutines everywhere (Kevin). Scrapped.
  • using driver=file improves performance a bit. Anything we can do to make this happen by default?
  • avoid big allocations for VirtQueueElement. Old patches from Ming Lei used a special-purpose pool allocator; Paolo posted new patches that use regular malloc with smaller allocations (up to a few hundred bytes)
  • Move linux-aio to AioContext. We already have a thread pool, might as well add linux-aio there.

virtio-blk discard support [Peter Lieven]

  • spec, guest driver, device model

virtio-blk gets lots of small requests from the guest

  • But we don't know why
  • Possibly a guest driver issue
  • multiwrite was introduced to mitigate this long ago
  • Need to perform more benchmarks to see to what extent it exists today
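
The general idea behind merging small requests can be sketched as below. This is an illustration of coalescing contiguous writes under simplifying assumptions, not QEMU's multiwrite implementation: `merge_adjacent_writes` and its `max_bytes` parameter are hypothetical, requests are plain `(offset, data)` tuples, and overlapping requests are left unmerged.

```python
def merge_adjacent_writes(requests, max_bytes=None):
    """Coalesce writes that are exactly contiguous on disk.

    requests:  iterable of (offset, data) tuples, data being bytes
    max_bytes: optional upper bound on the size of a merged request
    """
    merged = []
    for offset, data in sorted(requests, key=lambda r: r[0]):
        if merged:
            last_off, last_data = merged[-1]
            contiguous = last_off + len(last_data) == offset
            fits = max_bytes is None or len(last_data) + len(data) <= max_bytes
            if contiguous and fits:
                # Extend the previous request instead of issuing a new one.
                merged[-1] = (last_off, last_data + data)
                continue
        merged.append((offset, data))
    return merged
```

Counting requests before and after such a merge on a recorded guest workload is one way to benchmark how much of the small-request problem still exists.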

Trace guest block I/O, replay with qemu-io

Dataplane

  • AioContext assertions to prevent callbacks in wrong event loop [Stefan]

Adding QMP to qemu-nbd

  • wanted so that offline operations work the same way as online ones
  • Patches are on the list, need rebase [Benoit]

Adding TLS to NBD

  • NBD now supports encryption, and qemu exposes that support as of 2.6 [Daniel Berrange].

Adding Sparse File handling to NBD

  • NBD handles holes inefficiently (passing zeroes over the wire for both reads and writes, no way to query where holes are). Proposals have been made on the NBD list on how to add support, with qemu serving as one of the proof-of-concept implementations, target qemu 2.7 [Eric Blake]
    • add WRITE_ZEROES for writing, extension is fairly stable
    • add STRUCTURED_READ for reading, extension is proposed but harder to implement
    • add BLOCK_INFO for querying, extension still under discussion on NBD mailing list
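
To show why passing zeroes over the wire is wasteful, the sketch below splits a buffer into data extents and all-zero extents. It is a hypothetical helper, not part of NBD or qemu: `split_extents` and the 512-byte granularity are assumptions for illustration. A sender with a write-zeroes extension could transmit only the data extents and describe the zero extents by offset and length.

```python
def split_extents(buf, block=512):
    """Return a list of (offset, length, is_zero) runs covering buf.

    Neighbouring blocks of the same kind are joined into one extent.
    """
    extents = []
    for off in range(0, len(buf), block):
        chunk = buf[off:off + block]
        is_zero = chunk.count(0) == len(chunk)
        if extents and extents[-1][2] == is_zero:
            prev_off, prev_len, _ = extents[-1]
            extents[-1] = (prev_off, prev_len + len(chunk), is_zero)
        else:
            extents.append((off, len(chunk), is_zero))
    return extents


def payload_bytes(extents):
    """Bytes that still have to be sent as data if zero extents are elided."""
    return sum(length for _, length, is_zero in extents if not is_zero)
```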

Export QEMU volumes as ISCSI or FCOE

  • Andy Grover is working on implementing a preliminary tcmu-runner plugin using the block layer
  • a QMP socket would be needed to make this usable in a cloud context

Avoid qemu-img convert host cache pollution

  • converting an LVM snapshot may fill up the host cache uselessly
  • still want to use readahead
  • convert should advise kernel to drop cached src blocks
    • what if blocks are shared with users that are likely to use them again?
  • should it advise kernel src is read sequentially?
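
The two kinds of advice mentioned above map onto `posix_fadvise`. The following is a minimal sketch of a sequential copy that hints the kernel, assuming a Unix host. `copy_dropping_cache` is a hypothetical helper and not how qemu-img convert works; it also does not address the open question about blocks shared with other users.

```python
import os

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary choice for the sketch


def copy_dropping_cache(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst, advising the kernel about the source access pattern.

    Returns the number of bytes copied.
    """
    have_fadvise = hasattr(os, "posix_fadvise")  # not available on every platform
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fd = src.fileno()
        if have_fadvise:
            # Length 0 means "to the end of the file"; keeps readahead useful.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        offset = 0
        while True:
            buf = src.read(chunk)
            if not buf:
                break
            dst.write(buf)
            if have_fadvise:
                # The range just copied will not be read again by this process.
                os.posix_fadvise(fd, offset, len(buf), os.POSIX_FADV_DONTNEED)
            offset += len(buf)
    return offset
```

`POSIX_FADV_DONTNEED` is only advice, and it affects the page cache for every user of the file, which is exactly the concern raised about shared blocks.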

Dependency graph

(paste on http://yuml.me/diagram/scruffy/class/draw)

[Image fleecing]
[Incremental backup]
[qcow2 improvements]
[Make qemu-img use QMP command implementations internally]
[Recursive Op Blocker]->[Category Op Blocker]
[Category Op Blocker]->[intermediate layers live streaming]
[Category Op Blocker]->[Basic infrastructure for blockdev-add]
[Category Op Blocker]->[drive-mirror (block-mirror) of arbitrary node]
[Category Op Blocker]->[Jeff Cody's block-commit of arbitrary node]
[Proper specification for blockdev-add]->[Basic infrastructure for blockdev-add]
[Basic infrastructure for blockdev-add]->[blockdev-add + blockdev-del QMP interface]
[Image formats]
[virtio-blk data plane]->[Multiqueue block layer]
[virtio-scsi data plane]
[Split BlockBackend from BlockDriverState]->[Get rid of bdrv_swap]
[Split BlockBackend from BlockDriverState]->[BlockFilter and dynamic reconfiguration of the BDS graph]
[BlockFilter and dynamic reconfiguration of the BDS graph]->[New design for better encryption support]
[BlockFilter and dynamic reconfiguration of the BDS graph]->[Active mirroring]
[BlockFilter and dynamic reconfiguration of the BDS graph]->[Throttle as a filter ?]
[Remove bs->job field and allow multiple jobs on a BDS]
[iotests.py - Python module for writing qemu-iotests]->[Test Infrastructure]
[Tests for -drive discard]
[Adding TLS to NBD]
[AHCI emulation]
[Block job I/O throttling]
[I/O accounting]
[Add guard page to bottom of coroutine stack]
[I/O throttling groups]
[QMP added to qemu-nbd] -> [Export QEMU volumes as ISCSI or FCOE with QMP]

(result: http://yuml.me/c2e9edb8)

Old