Features/RDMALiveMigration

== Summary ==
Live migration using RDMA instead of TCP.


== Contact ==                                                                     
* '''Name:''' [[User:Michael Hines|Michael Hines]]
* '''Email:''' mrhines@us.ibm.com


== Description ==
Uses the standard OFED software stack, which supports both RoCE and Infiniband.


== Design ==


# In order to provide maximum cross-device compatibility, we use the '''librdmacm''' library, which abstracts out the RDMA capabilities of each individual type of RDMA device, including Infiniband, iWARP, and RoCE. This patch has been tested on both RoCE and Infiniband devices from Mellanox.
# A new file named "migration-rdma.c" contains the core code required to perform librdmacm connection establishment and the transfer of actual RDMA contents (an illustrative sketch follows this list).
# Files "arch_init.c" and "savevm.c" have been modified to transfer the VM's memory in the standard live migration path using RMDA memory instead of using TCP.
# Files "arch_init.c" and "savevm.c" have been modified to transfer the VM's memory in the standard live migration path using RMDA memory instead of using TCP.
# Currently, the XBZRLE capability and the detection of zero pages (dup_page()) significantly slow down the empirical throughput observed when RDMA is activated, so the code path skips these capabilities when RDMA is enabled. Hopefully, we can stop doing this in the future and come up with a way to preserve these capabilities while still using RDMA.
# All of the original logic for device migration and protocol synchronization is unchanged - that still happens over TCP, alongside the RDMA transfers, as it normally does.
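
The following is a minimal, illustrative sketch - '''not''' the actual migration-rdma.c code - of the two building blocks described above: establishing the client-side (source) connection through librdmacm, and pushing one page of an already-registered RAM block to the destination with a one-sided RDMA write. The helper names (expect_event(), rdma_client_connect(), rdma_push_page()) and the remote_addr/rkey parameters are hypothetical; in practice the remote address and key come from the destination during connection setup, and error handling/cleanup is far more thorough than the early returns shown here.

<pre>
/* Illustrative sketch only -- not the migration-rdma.c implementation. */
#include <stddef.h>
#include <stdint.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>

/* Block until the next connection-manager event arrives and check that
 * it is the one we expect. */
static int expect_event(struct rdma_event_channel *ec,
                        enum rdma_cm_event_type expected)
{
    struct rdma_cm_event *ev;
    int ok;

    if (rdma_get_cm_event(ec, &ev)) {
        return -1;
    }
    ok = (ev->event == expected) ? 0 : -1;
    rdma_ack_cm_event(ev);
    return ok;
}

/* Resolve the "rdmahost" IP address to an RDMA device, create a
 * reliable-connection queue pair on it, and connect to the destination. */
static int rdma_client_connect(const char *host, const char *port,
                               struct rdma_event_channel *ec,
                               struct rdma_cm_id **out_id,
                               struct ibv_pd **out_pd)
{
    struct rdma_cm_id *id;
    struct addrinfo *res;
    struct ibv_pd *pd;
    struct ibv_cq *cq;
    struct ibv_qp_init_attr attr = {
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct rdma_conn_param param = { .retry_count = 7 };

    if (rdma_create_id(ec, &id, NULL, RDMA_PS_TCP) ||
        getaddrinfo(host, port, NULL, &res)) {
        return -1;
    }
    if (rdma_resolve_addr(id, NULL, res->ai_addr, 2000) ||
        expect_event(ec, RDMA_CM_EVENT_ADDR_RESOLVED) ||
        rdma_resolve_route(id, 2000) ||
        expect_event(ec, RDMA_CM_EVENT_ROUTE_RESOLVED)) {
        return -1;
    }
    freeaddrinfo(res);

    pd = ibv_alloc_pd(id->verbs);
    cq = ibv_create_cq(id->verbs, 64, NULL, NULL, 0);
    attr.send_cq = attr.recv_cq = cq;
    if (!pd || !cq || rdma_create_qp(id, pd, &attr)) {
        return -1;
    }
    if (rdma_connect(id, &param) ||
        expect_event(ec, RDMA_CM_EVENT_ESTABLISHED)) {
        return -1;
    }
    *out_id = id;
    *out_pd = pd;
    return 0;
}

/* Push one page of a RAM block that was registered earlier with
 * ibv_reg_mr(pd, block, len, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE).
 * A one-sided RDMA write needs no CPU involvement on the destination. */
static int rdma_push_page(struct rdma_cm_id *id, struct ibv_mr *mr,
                          size_t offset, size_t page_size,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)mr->addr + offset,
        .length = page_size,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma    = { .remote_addr = remote_addr + offset, .rkey = rkey },
    }, *bad;

    return ibv_post_send(id->qp, &wr, &bad) ? -1 : 0;
}
</pre>

Completions for signaled writes would then be reaped from the completion queue with ibv_poll_cq() before the work-request slots are reused.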


== Usage ==


Compiling:
$ ./configure --enable-rdma --target-list=x86_64-softmmu
$ make
 
Command-line on the Source machine AND Destination:
$ x86_64-softmmu/qemu-system-x86_64 ....... -rdmaport 3456 -rdmahost x.x.x.x
 
# "rdmport" is whatever you want
# "rdmahost" should be the destination IP address assigned to the remote interface with RDMA capabilities
# Both parameters '''should be identical''' on both machines
## "rdmahost" option should match on the destination because both sides use the same IP address to discover which RDMA interface
 
Optionally, you can use the QEMU monitor or libvirt to enable RDMA later.
This is more flexible; with libvirt, it looks like this:
 
$ virsh qemu-monitor-command --hmp --cmd "migrate_set_rdma_host xx.xx.xx.xx"
$ virsh qemu-monitor-command --hmp --cmd "migrate_set_rdma_port 3456"
 
Then, verify RDMA is activated before starting migration:
$ virsh qemu-monitor-command --hmp --cmd "info migrate_capabilities"
capabilities: xbzrle: off rdma: on
 
Also, make sure to increase the maximum throughput allowed:
$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
 
Finally, perform the actual migration:
$ virsh migrate domain tcp:xx.xx.xx.xx:TCPPORT  # do NOT use the RDMA port, use the TCP port
 
# Note the difference in syntax here: We're using the TCP port, not the RDMA port.
# All the control and device migration logic still happens over TCP, but the memory pages and the RDMA connection setup go over the RDMA port.
 
If you're doing this on the command-line, you will then want to resume the VM:
$ virsh qemu-monitor-command --hmp --cmd "continue"
 
== Performance ==


[[File:Perf.png]]


== TODO ==
* Figure out how to properly cap RDMA throughput without copying data through a new type of QEMUFile abstraction or artificially slowing down RDMA throughput because of control logic.
* Integrate with XOR-based run-length encoding (if possible)
* Stop skipping the zero-pages check (see the sketch after this list)
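
As noted in the Design section, the zero-pages check that the RDMA path currently skips is, at its core, a per-page scan for non-zero bytes; a page that scans as all-zero can be reported with a short control message instead of being transferred in full. The sketch below is illustrative only - it is not the actual dup_page() code in arch_init.c - but it shows the kind of CPU-bound scan whose cost currently hurts RDMA throughput:

<pre>
/* Illustrative only -- not the actual dup_page() code from arch_init.c. */
#include <stdbool.h>
#include <stddef.h>

static bool page_is_zero(const void *page, size_t page_size)
{
    /* Walk the page in long-sized chunks; page_size is assumed to be a
     * multiple of sizeof(long), which holds for guest RAM pages. */
    const unsigned long *p = page;
    size_t i;

    for (i = 0; i < page_size / sizeof(*p); i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}
</pre>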


== Links ==                                                                     
* [http://www.canturkisci.com/ETC/papers/IBMJRD2011/preprint.pdf Original RDMA Live Migration Paper]
