Features/Migration/Troubleshooting

You're probably looking at this page because migration has just failed; sorry about that, but hopefully this page will give you some idea of how to figure out why and, importantly, what to include in any bug report.


Basics

QEMU's migration is like lifting the state out of one machine and dumping it into another - the (emulated) hardware of the two machines has to match pretty much exactly for this to succeed.

Note that QEMU supports migrating forward between QEMU versions but in general not backwards, although some distros support this on their packaged QEMU versions.

Machine types

A QEMU machine type (the parameter to -M or --machine) is a definition of the basic shape of the emulated machine; the closest analogy is the model of motherboard in a system. Migration requires the same machine type on the source and destination machines. Architectures tend to have a variety of machine types (e.g. on x86 there are the 'pc' and 'q35' families) that correspond to different generations of system. In addition, some architectures version their machine types, e.g. pc-i440fx-2.5, pc-i440fx-2.6. Newer QEMUs normally keep (most of) the older machine types around so that you can migrate. So, for example, a 2.6 release of QEMU should be able to migrate from a 2.5 release using the pc-i440fx-2.5 or pc-i440fx-2.4 machine types. Note that this is not heavily tested!

Note that some machine types are aliases; on x86 the 'pc' and 'q35' machine types are aliases for whatever the latest version is in that release of QEMU. Migrating between two different QEMU versions that were both started with machine type 'pc' therefore often won't work - use the full versioned machine type instead.
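As a rough pre-flight check, the alias problem can be spotted by looking at the machine type on the command line. The following is a minimal sketch, not anything QEMU ships: the function names are hypothetical, the alias set only contains the two x86 aliases named on this page, and it assumes the machine type is passed via -M, -machine or --machine.

```python
import shlex

# Aliases named on this page (assumption: illustrative, not exhaustive;
# other architectures have their own aliases).
X86_ALIASES = {"pc", "q35"}


def machine_type(cmdline):
    """Return the machine type given to -M/-machine/--machine, or None."""
    args = shlex.split(cmdline)
    for flag, value in zip(args, args[1:]):
        if flag in ("-M", "-machine", "--machine"):
            for part in value.split(","):
                if part.startswith("type="):
                    return part[len("type="):]
                if "=" not in part:
                    return part          # bare name, e.g. "pc-i440fx-2.5"
    return None


def is_alias(cmdline):
    """True if the command line uses one of the known alias machine types."""
    return machine_type(cmdline) in X86_ALIASES
```

If is_alias() returns True for either side, it's worth restarting the VM with the full versioned name before trying to migrate between different QEMU versions.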

ROMs

The ROM images used on the two hosts should be approximately (within a page size) the same size; if the ROMs do not match in size the migration is normally refused; care should be taken when packaging or upgrading BIOS, net boot roms etc to ensure this constraint is met.
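When packaging, a quick comparison of ROM file sizes on the two hosts can catch this early. The sketch below is only one reading of "within a page size": it rounds each file up to whole pages and compares the page counts. The 4 KiB page size and the function names are assumptions for illustration, and QEMU's own check may differ in detail.

```python
import os

PAGE_SIZE = 4096  # assumption: 4 KiB pages; adjust for the target in use


def pages(size, page_size=PAGE_SIZE):
    """Number of whole pages needed to hold `size` bytes."""
    return (size + page_size - 1) // page_size


def rom_sizes_compatible(path_a, path_b, page_size=PAGE_SIZE):
    """True if both ROM files round up to the same number of pages."""
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)
    return pages(size_a, page_size) == pages(size_b, page_size)
```

The two paths would be the same ROM file as installed on the source and on the destination (copied to one place for comparison).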

Devices

The devices on the source and destination VMs must be identical - although any host devices they depend on can be different; for example you can't migrate between a VM with an IDE controller and another that replaced it with a SATA controller; but you can migrate between a VM with an IDE controller connected to a local file and another VM with an IDE controller backed by an iSCSI LUN.

Ordering and addressing of devices

When adding a device using -device on the qemu command line, it's normally added to the next available slot on the bus unless an address is specified. It's best to specify the address explicitly to avoid the source and destination ending up with different allocations; e.g. use -device pci-ohci,addr=5 -device usb-mouse,port=1
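To find devices that rely on automatic allocation, something like the following can be run over a command line. It's a minimal sketch with a hypothetical function name; it only looks for the addr= and port= keys used in the example above, so treat its output as a hint rather than a verdict (some devices are placed in other ways).

```python
import shlex

ADDRESS_KEYS = {"addr", "port"}  # the two keys used in the example above


def devices_without_address(cmdline):
    """Return the -device arguments that specify neither addr= nor port=."""
    args = shlex.split(cmdline)
    missing = []
    for flag, value in zip(args, args[1:]):
        if flag != "-device":
            continue
        # Skip the driver name, then collect the property names that follow.
        keys = {part.split("=", 1)[0] for part in value.split(",")[1:]}
        if not keys & ADDRESS_KEYS:
            missing.append(value)
    return missing
```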

Hotplugged devices

Hotplugging can't normally be performed during a migration; however, it's fine to hot plug/unplug a device before migration starts, as long as care is taken to ensure that the state of the destination VM is identical to the current state of the source VM prior to the start of migration. Particular care should be taken to specify the address/port of devices when hotplugging has been used, since the automatic allocation on the command line of the destination won't necessarily reflect the history of hot plug/unplug events on the source.

Host devices

Host PCI devices that are passed through to the guest normally block migration. There are various attempts to fix this for special cases of network cards, but none of them are complete yet.

Block storage

To do: cache=none, all the different ways to migrate block

Reporting bugs

If you report a migration bug please make sure that you:

* Include the full QEMU versions you're using (including the full package version if you're using a distro's build).
* Include the full QEMU command line on both the source and destination (feel free to remove identifying paths/passwords/IPs etc).
* Include the QEMU log output from both the source and destination.
* Describe the networking between the two hosts (e.g. TCP over 10Gb Ethernet).
* List any migration parameters or capabilities you set or changed.
* Say whether you hot plugged anything.
* Say whether the failure is repeatable or occasional.
* Describe how it fails: an error? A hang? See below for additional details.

Finding logs

If you're using any system that uses libvirt, then libvirt normally captures the logs from the VM. With a system libvirt they're normally in /var/log/libvirt/qemu/guestname.log. If you're running on a desktop with a user libvirt session then try ~/.cache/libvirt/qemu/log (although that setup probably doesn't migrate). If you're using OpenStack it can be a little tricky to figure out which instance name corresponds to the VM you're migrating.
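For the system libvirt case, grabbing the end of the log for a bug report can be scripted. This is a small sketch: the directory is the one given above, while the function names and the default of 50 lines are arbitrary choices for illustration.

```python
from pathlib import Path

SYSTEM_LOG_DIR = Path("/var/log/libvirt/qemu")  # system libvirt, as above


def guest_log_path(guest_name, log_dir=SYSTEM_LOG_DIR):
    """Path of the log file libvirt keeps for the named guest."""
    return log_dir / f"{guest_name}.log"


def tail(path, lines=50):
    """Return the last `lines` lines of a log file as a list of strings."""
    with open(path, errors="replace") as f:
        return f.read().splitlines()[-lines:]
```

Run it on both the source and the destination, since the useful error is often only on one side.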

Types of failure

Migrations fail in lots of different ways; when reporting the bug make sure to indicate the type of failure and the additional details mentioned below.

Migrations that never finish

A migration that doesn't finish is not necessarily a bug - it might be that your VM is changing memory too quickly for the migration to stuff the data down the network connection. If the VM is still running happily on the source, 'info migrate' on the source shows it as 'active' and you still see a large amount of network bandwidth being used to transfer data, then this is probably what's happening. You can try using postcopy migration, auto-converge or increasing the permitted downtime to cope with big VMs that are rapidly changing memory.

If it doesn't finish but the source has stopped, or the source is still running but 'info migrate' doesn't show 'active', or even if it shows 'active' but there's very little network bandwidth in use, then report a bug. Remember to include the output of 'info migrate' from the source, taken a couple of times a few seconds apart.
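Comparing two 'info migrate' samples can be done by eye, but a rough sketch of the idea is below. It assumes the output is made of "key: value" lines including "Migration status" and "transferred ram"; the exact field names can vary between QEMU versions, so check them against your own output. The function names are hypothetical and the result is only a heuristic.

```python
import re


def parse_info_migrate(text):
    """Turn 'key: value' lines into a dict; '123' or '123 unit' become ints."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        number = re.fullmatch(r"(\d+)(?: \w+)?", value)
        fields[key.strip()] = int(number.group(1)) if number else value
    return fields


def looks_stuck(first, second):
    """Compare two samples taken a few seconds apart (heuristic only)."""
    a, b = parse_info_migrate(first), parse_info_migrate(second)
    if b.get("Migration status") != "active":
        return True                      # not 'active': worth reporting
    # No new data transferred between the samples suggests a stall.
    return b.get("transferred ram", 0) <= a.get("transferred ram", 0)
```

A False result doesn't mean everything is fine; it only means data is still flowing, which is the "VM is dirtying memory too fast" case described above.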

Migrations that fail with a guest hang or crash

These are the worst case and are pretty hard to debug. If the only failure is in the guest then it's best to start by checking for any logs inside the guest after restarting it, e.g. anything that would indicate a particular device failure. Running a memory test in the guest during a migration (assuming your host is OK!) is a good way to check that the migration code isn't doing something really bad.

When reporting this, provide details of the guest you're running; also check the QEMU logs on the source and destination for warnings.

Migrations that fail with an error

When a migration fails, check the logs on both sides for errors; sometimes it's tricky to figure out which side caused the problem.
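A simple way to go through both logs is to search for the messages described in the rest of this page. The sketch below does just that; the message fragments come from the section headings on this page, while the hints are one-line summaries and the function name is hypothetical.

```python
# Message fragments taken from the headings on this page, with a short hint.
KNOWN_ERRORS = [
    ("terminating on signal", "check libvirtd's own logs"),
    ("State blocked by non-migratable device", "a device blocks migration"),
    ("error while loading state for instance", "device load failed"),
    ("Unknown savevm section or instance", "device missing on destination"),
    ("Unknown ramblock", "RAM/ROM block missing on destination"),
    ("Length mismatch", "RAM/ROM block size differs"),
    ("load of migration failed", "see the causes listed for this error"),
]


def scan_log(lines):
    """Yield (line_number, hint, line) for each line matching a known error."""
    for number, line in enumerate(lines, start=1):
        for fragment, hint in KNOWN_ERRORS:
            if fragment in line:
                yield number, hint, line.rstrip("\n")
                break
```

Pass it an open log file (or any list of lines) from each side; the earliest hit is usually the most interesting one.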

Names in qemu's migration errors

The names in qemu's migration errors correspond to internal object names; they fall into 3 categories:

  • Simple names like 'vmmouse'
  • Fixed but structured names, e.g. '/rom@etc/acpi/tables/2'
  • Names with PCI, USB or SCSI bus IDs in, e.g. '0000:02.0/qxl.vram'

qemu: terminating on signal .. from pid ....

That's typically signal 15 and the pid typically corresponds to libvirtd; if there are no other qemu errors on either side, then it's best to go and check libvirtd's own logs to see why it's upset.

State blocked by non-migratable device '.....'

Devices can block migration either because the code hasn't been written/tested for their migration or because a particular feature is hard to migrate. Examples include:

  • Older QEMU versions couldn't migrate AHCI/SATA
  • x86 cpus can't migrate with the 'invtsc' feature flag enabled.
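The 'invtsc' case can be checked from the command line before attempting a migration. This is a minimal sketch that assumes the flag was enabled with either '+invtsc' or 'invtsc=on' in the -cpu argument; other spellings, or a CPU model that implies the flag, are not detected. The function name is hypothetical.

```python
import shlex


def cpu_has_invtsc(cmdline):
    """True if the -cpu argument enables invtsc ('+invtsc' or 'invtsc=on')."""
    args = shlex.split(cmdline)
    for flag, value in zip(args, args[1:]):
        if flag == "-cpu":
            features = value.split(",")[1:]      # skip the CPU model name
            return any(f in ("+invtsc", "invtsc=on") for f in features)
    return False
```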

error while loading state for instance 0x... of device '....'

This tells you exactly which device has failed; if you're lucky there might be some errors preceding it telling you what went wrong. While most of these cases are bugs, other causes can include IO problems on the backing device on the destination or a missing subsection definition.

Unknown savevm section or instance '...'

In this case the source has sent migration data for a device that can't be found on the destination. There are two common causes of this:

  • A mismatch in the qemu command line/machine type causing the destination not to have the device at all.
  • A mismatch in the order of devices on a bus, e.g. in a case where I hadn't specified the port number for a usb-kbd and had the order different, I got an Unknown savevm section or instance '0000:00:04.0/1/usb-kbd' because it was actually /0/usb-kbd.
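Both causes show up when the -device options of the two command lines are compared in order. The following sketch does a position-by-position comparison; the function names are hypothetical, and note that some differences it reports may be harmless (properties that only describe the host side can legitimately differ).

```python
import shlex
from itertools import zip_longest


def device_args(cmdline):
    """Return the values of all -device options, in command-line order."""
    args = shlex.split(cmdline)
    return [v for flag, v in zip(args, args[1:]) if flag == "-device"]


def device_mismatches(source_cmdline, dest_cmdline):
    """Return (position, source_device, dest_device) where the lists differ."""
    pairs = zip_longest(device_args(source_cmdline), device_args(dest_cmdline))
    return [(i, s, d) for i, (s, d) in enumerate(pairs) if s != d]
```

A device present on only one side appears with None on the other, which corresponds to the first cause above; swapped entries correspond to the ordering cause.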

Unknown ramblock "..." cannot accept migration

Similar to the unknown savevm section above, in this case we're missing a block of RAM or ROM; again this is normally down to a command line mismatch.

Length mismatch: ....: .... in != .....

A block of RAM or ROM is a different size on the source and destination. While QEMU can cope in some specific cases, in general it can't (because it wouldn't have anywhere to put the excess data in the guest's address space). If this is a ROM, the problem is normally down to the source and destination having different versions of the associated ROM installed; check the BIOS and iPXE packages that provide them. Packagers are advised to pad ROMs to nice convenient power-of-2 boundaries with plenty of space for growth to avoid this problem.

Another common cause is a different VRAM size setting for the emulated graphics device on the source and destination.
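For the ROM padding advice above, the target size is easy to compute. This sketch returns the smallest power of two that fits the image; the function names are hypothetical, and since the advice is to leave plenty of room for growth, a packager may well want to pick the next boundary up from what this returns.

```python
def padded_rom_size(size):
    """Smallest power of two that is >= size (size must be positive)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return 1 << (size - 1).bit_length()


def padding_needed(size):
    """Number of padding bytes to append to reach padded_rom_size(size)."""
    return padded_rom_size(size) - size
```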

load of migration failed: Input/output error

This is normally seen on the destination and there are a few failure cases.

  • A network failure during the migration - the destination can't receive the data
  • Something kills/cancels the migration on the source - e.g. migrate_cancel on the source or the source is killed before migration is complete.
  • An actual IO error generated by one of the devices as it's loaded - e.g. networking/disc etc
  • A migration failure on the source. In this case check 'info migrate' on the source and it should say 'failed'. One way to start debugging this is to do a migrate to /dev/null to see if the problem can be isolated to the source; e.g. migrate "exec:cat > /dev/null" - if that still shows migrate failed in info migrate then the problem is purely on the source.

QEMU aborts/seg faults/crashes

Any segfault by either the source or destination QEMU is a bug - please report it. Aborts are normally also bugs except in specific cases (e.g. corrupt image files); again, if the error isn't obvious, report it.