Features/AllocationFailures

Latest revision as of 13:35, 22 October 2018

Introduction

In most situations memory allocations never fail in QEMU; we only find out later that there is not enough memory, when the kernel OOM killer delivers a signal and kills the process. However, with suitable ulimit configuration it is possible to make an allocation fail. This page documents the behaviour we would like when an allocation fails.
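
As a rough illustration of how a resource limit turns a would-be OOM kill into an ordinary allocation failure, the sketch below caps the process address space with setrlimit(RLIMIT_AS), which is roughly what `ulimit -v` does from a shell, and then attempts a large malloc. The function name demo_alloc_fails_under_limit is made up for this example and is not part of QEMU; the behaviour shown assumes a Linux-like system that enforces RLIMIT_AS.

```c
#include <stdlib.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not a QEMU function): temporarily cap the address
 * space at 'limit' bytes and try to malloc 'request' bytes.
 * Returns 1 if the allocation failed, 0 if it succeeded, and -1 if the
 * limit could not be read or changed.
 */
int demo_alloc_fails_under_limit(size_t limit, size_t request)
{
    struct rlimit old_lim, new_lim;

    if (getrlimit(RLIMIT_AS, &old_lim) != 0) {
        return -1;
    }

    new_lim = old_lim;
    new_lim.rlim_cur = limit;
    /* The soft limit may not exceed the hard limit. */
    if (old_lim.rlim_max != RLIM_INFINITY &&
        new_lim.rlim_cur > old_lim.rlim_max) {
        new_lim.rlim_cur = old_lim.rlim_max;
    }
    if (setrlimit(RLIMIT_AS, &new_lim) != 0) {
        return -1;
    }

    /* volatile stops the compiler from optimising the malloc/free pair away */
    void *volatile p = malloc(request);
    int failed = (p == NULL);
    free(p);

    /* Restore the original soft limit so later allocations are unaffected. */
    setrlimit(RLIMIT_AS, &old_lim);
    return failed;
}
```

Calling it with a limit well below the request, for example a 256 MiB limit and a 1 GiB request, is expected to make malloc return NULL instead of letting the process run into the OOM killer later.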

Gracefully handling allocation failures

Allocation failure during QEMU startup is the least problematic; the user hasn't lost anything at this point if QEMU exits.

Allocation failure on a running QEMU is much more important - if a VM is happily running, a crashing QEMU causes downtime for the user and/or potential data loss.

Ideally all allocation paths would be checked and fail cleanly; however, it's acknowledged that QEMU's structure makes it difficult to check every allocation, so we provide the following pragmatic rules.

  1. Allocation failures should never cause bad pointers
    Allocations should be checked, and the choice is whether to attempt
    recovery or to exit; the code should never end up writing to a NULL
    pointer, for example.
  2. When exiting due to an allocation error, use an error exit, not abort
    'abort' indicates an internal error, for which the host may capture a core dump and submit an error report;
    running out of memory is generally not an internal error and so should just exit.
    That's tricky with some libraries (glib in particular), but where possible try
    to make them give an error exit.
  3. Small allocation checks can exit
    Small allocations are less likely to fail than large allocations, and checking every one is a pain,
    so pragmatically don't worry about them and allow exits on allocation failure.
    However, if you are making hundreds of tiny allocations (for a list, say) and the total size is large,
    then it is worth thinking about the total size.
    'Small' is arbitrary, and is arbitrarily chosen to be a typical 4k page size.
  4. Allocations during startup can exit
    If the VM hasn't started running then it's OK to exit due to an allocation failure.
    Take care with allocations that can happen during either startup or hot plug,
    since the same code can also run once the VM is running.
  5. Large allocations that fail after startup should not cause a failure
    All large allocations that can happen once the VM is running should be checked, and
    a failure should not cause the VM to fail.
  6. Failures triggered by a monitor command should return an error on the monitor
    A simple example of this is hotplugging more RAM or a device.
  7. Clean up
    When a large allocation fails, make sure to clean up so the VM isn't left stuck
    with a half-allocated device etc.
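
The rules above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not QEMU code: DemoDevice, demo_alloc, demo_fail_after, demo_device_hotplug and demo_small_alloc are all hypothetical names, and plain malloc stands in for whatever allocator the real code uses. In a glib-based code base the analogous choice is roughly between g_malloc(), which terminates the program on failure, and g_try_malloc(), which returns NULL so the caller can recover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical device owning two large buffers; not a real QEMU type. */
typedef struct DemoDevice {
    void *ram;
    void *vram;
} DemoDevice;

/* Fault injection for the example: -1 never fails, N fails after N successes. */
int demo_fail_after = -1;

/* Checked allocator for large requests: returns NULL on failure. */
void *demo_alloc(size_t size)
{
    if (demo_fail_after == 0) {
        return NULL;
    }
    if (demo_fail_after > 0) {
        demo_fail_after--;
    }
    return malloc(size);
}

/*
 * Rules 1, 5, 6 and 7: a large allocation on a running VM is checked,
 * partial state is cleaned up, and the error goes back to the caller
 * (for instance a monitor command) instead of killing the VM.
 * Returns 0 on success, -1 on failure with a message in errbuf.
 */
int demo_device_hotplug(DemoDevice *dev, size_t ram_size, size_t vram_size,
                        char *errbuf, size_t errlen)
{
    /* A NULL device is a programming error, so an assert/abort fits here. */
    assert(dev != NULL);

    void *ram = demo_alloc(ram_size);
    if (!ram) {
        snprintf(errbuf, errlen, "cannot allocate %zu bytes of RAM", ram_size);
        return -1;
    }

    void *vram = demo_alloc(vram_size);
    if (!vram) {
        free(ram);  /* do not leave a half-allocated device behind */
        snprintf(errbuf, errlen, "cannot allocate %zu bytes of VRAM", vram_size);
        return -1;
    }

    /* Only publish the buffers once every allocation has succeeded. */
    dev->ram = ram;
    dev->vram = vram;
    return 0;
}

/*
 * Rules 2 and 3: a small allocation may simply exit on failure, but with
 * an error exit status rather than abort(), so no core dump is produced.
 */
void *demo_small_alloc(size_t size)
{
    void *p = malloc(size);
    if (!p) {
        fprintf(stderr, "out of memory allocating %zu bytes\n", size);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

Setting demo_fail_after to 1 before calling demo_device_hotplug makes the second allocation fail, which exercises the clean-up path: the call returns -1, fills in the error message and leaves the device untouched.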