Features/RDMALiveMigration

From QEMU

== Summary ==
Live migration using RDMA instead of TCP.

== Contact ==

== Description ==
Uses the standard OFED software stack, which supports both RoCE and Infiniband.


== Usage ==
Compiling:

$ ./configure --enable-rdma --target-list=x86_64-softmmu
$ make

Command-line on the Source machine AND Destination:


$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device


# "rdmport" is whatever you want
Finally, perform the actual migration:
# "rdmahost" should be the destination IP address assigned to the remote interface with RDMA capabilities
# Both parameters '''should be identical''' on both machines
## "rdmahost" option should match on the destination because both sides use the same IP address to discover which RDMA interface


$ virsh migrate domain rdma:xx.xx.xx.xx:port




== Performance ==


[[File:Perf.png]]


== Protocol Design ==


# In order to provide maximum cross-device compatibility, we use the '''librdmacm''' library, which abstracts out the RDMA capabilities of each individual type of RDMA device: Infiniband, iWARP, and RoCE. This patch has been tested on both RoCE and Infiniband devices from Mellanox.
# A new file named "migration-rdma.c" contains the core code required to perform librdmacm connection establishment and the transfer of the actual RDMA contents.
# The files "arch_init.c" and "savevm.c" have been modified to transfer the VM's memory in the standard live migration path using RDMA instead of TCP.
# Currently, the XBZRLE capability and the detection of zero pages (dup_page()) significantly slow down the empirical throughput observed when RDMA is activated, so the code path skips these capabilities when RDMA is enabled. Hopefully, we can stop doing this in the future and come up with a way to preserve these capabilities while still using RDMA.
 
We use two kinds of RDMA messages:                                                                                   


# RDMA WRITE messages (to the receiver)
# RDMA SEND messages (for non-live state, such as devices and CPU); a sketch contrasting the two follows below
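
The distinction matters when posting work requests. Below is a hedged sketch (the helper names are made up, not from the patch) contrasting the two: an RDMA WRITE targets a remote address/rkey and completes without involving the destination CPU, while a SEND delivers a message that the peer must have posted a receive for.

<pre>
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Place 'len' bytes from a registered local buffer directly into remote memory. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *local, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;   /* where the bytes land */
    wr.wr.rdma.rkey        = rkey;          /* remote registration key */
    return ibv_post_send(qp, &wr, &bad);
}

/* Deliver 'len' bytes as a message; the peer must already have a receive posted. */
static int post_send_msg(struct ibv_qp *qp, struct ibv_mr *mr,
                         void *local, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad;

    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;      /* ask for a completion entry */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    return ibv_post_send(qp, &wr, &bad);
}
</pre>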


First, migration-rdma.c does the initial connection establishment
using the URI 'rdma:host:port' on the QMP command line.                                                             
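
The details live in migration-rdma.c; as a rough, generic illustration of what client-side librdmacm connection establishment involves (the function name, timeouts, and retry counts below are illustrative assumptions, not the patch's actual code):

<pre>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
#include <rdma/rdma_cma.h>

/* Illustrative only: connect to "rdma:host:port" the librdmacm way. */
static struct rdma_cm_id *rdma_client_connect(const char *host, const char *port)
{
    struct rdma_event_channel *ec;
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;
    struct rdma_conn_param cp;
    struct addrinfo *res;

    if (getaddrinfo(host, port, NULL, &res)) {
        return NULL;
    }
    ec = rdma_create_event_channel();
    if (!ec || rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) {
        freeaddrinfo(res);
        return NULL;
    }

    /* Resolve the destination address, then a route to it (2s timeouts). */
    if (rdma_resolve_addr(id, NULL, res->ai_addr, 2000)) {
        goto fail;
    }
    rdma_get_cm_event(ec, &ev);   /* expect RDMA_CM_EVENT_ADDR_RESOLVED */
    rdma_ack_cm_event(ev);

    if (rdma_resolve_route(id, 2000)) {
        goto fail;
    }
    rdma_get_cm_event(ec, &ev);   /* expect RDMA_CM_EVENT_ROUTE_RESOLVED */
    rdma_ack_cm_event(ev);

    /* A real caller allocates a PD/CQ and calls rdma_create_qp() here. */
    memset(&cp, 0, sizeof(cp));
    cp.retry_count = 7;
    cp.rnr_retry_count = 7;
    if (rdma_connect(id, &cp)) {
        goto fail;
    }
    rdma_get_cm_event(ec, &ev);   /* expect RDMA_CM_EVENT_ESTABLISHED */
    rdma_ack_cm_event(ev);

    freeaddrinfo(res);
    return id;

fail:
    freeaddrinfo(res);
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    return NULL;
}
</pre>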
                                                                                                                     
Second, the normal live migration process kicks in for 'pc.ram'.


During the iterative phase of the migration, only RDMA WRITE messages
are used. The pages being written are grouped into "chunks", which get
pinned by the hardware in 64-page increments. Each chunk is acknowledged
in the queue pair's completion queue (not the individual pages).
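
A minimal sketch of that chunking idea, assuming an already-connected queue pair; the helper name, the busy-polling, and registering each chunk on the fly are assumptions for illustration, not the patch's actual code:

<pre>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <infiniband/verbs.h>

#define PAGES_PER_CHUNK 64

static int write_one_chunk(struct ibv_pd *pd, struct ibv_qp *qp,
                           struct ibv_cq *cq, void *chunk_start,
                           uint64_t remote_addr, uint32_t rkey)
{
    size_t chunk_bytes = PAGES_PER_CHUNK * sysconf(_SC_PAGESIZE);
    struct ibv_mr *mr;
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad;
    struct ibv_wc wc;

    /* Pinning happens here: the whole 64-page chunk is registered at once. */
    mr = ibv_reg_mr(pd, chunk_start, chunk_bytes, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        return -1;
    }

    sge.addr   = (uintptr_t)chunk_start;
    sge.length = (uint32_t)chunk_bytes;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* one completion per chunk */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad)) {
        ibv_dereg_mr(mr);
        return -1;
    }

    /* Busy-poll the completion queue until the chunk (not each page) is acked. */
    while (ibv_poll_cq(cq, 1, &wc) == 0) {
        ;
    }
    ibv_dereg_mr(mr);
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
</pre>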


During iteration of RAM, there are no messages sent, just RDMA writes.                                               


During the last iteration, once the devices and CPU are ready to be
sent, we begin to use RDMA SEND messages.


Due to the asynchronous nature of RDMA, the receiver of the migration                                                
must post Receive work requests in the queue *before* a SEND work request                                           
can be posted.  


To achieve this, both sides perform an initial 'barrier' synchronization.
Before the barrier, each side already has a receive work request posted;
both sides then exchange barrier messages and block on the completion
queue, waiting for each other, so that each peer knows the other is
alive and ready to send the rest of the live migration state
(qemu_send/recv_barrier()). At this point, the use of QEMUFile between
both sides for communication proceeds as normal.
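
A sketch of what such a barrier can look like at the verbs level (the helper name, wr_id values, one-byte payloads, and the single shared completion queue are assumptions; qemu_send/recv_barrier() itself is not shown here):

<pre>
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

enum { WRID_BARRIER_RECV = 1, WRID_BARRIER_SEND = 2 };

static int barrier(struct ibv_qp *qp, struct ibv_cq *cq,
                   struct ibv_mr *mr, uint8_t *buf /* >= 2 bytes, registered */)
{
    struct ibv_sge rsge = { .addr = (uintptr_t)buf,     .length = 1, .lkey = mr->lkey };
    struct ibv_sge ssge = { .addr = (uintptr_t)buf + 1, .length = 1, .lkey = mr->lkey };
    struct ibv_recv_wr rwr, *rbad;
    struct ibv_send_wr swr, *sbad;
    struct ibv_wc wc;
    int seen = 0;

    /* 1. The receive MUST be posted before the peer's SEND can arrive. */
    memset(&rwr, 0, sizeof(rwr));
    rwr.wr_id   = WRID_BARRIER_RECV;
    rwr.sg_list = &rsge;
    rwr.num_sge = 1;
    if (ibv_post_recv(qp, &rwr, &rbad)) {
        return -1;
    }

    /* 2. Announce ourselves with a one-byte SEND. */
    memset(&swr, 0, sizeof(swr));
    swr.wr_id      = WRID_BARRIER_SEND;
    swr.opcode     = IBV_WR_SEND;
    swr.send_flags = IBV_SEND_SIGNALED;
    swr.sg_list    = &ssge;
    swr.num_sge    = 1;
    if (ibv_post_send(qp, &swr, &sbad)) {
        return -1;
    }

    /* 3. Block until both our SEND and the peer's barrier SEND have completed. */
    while (seen != 3) {
        if (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS) {
                return -1;
            }
            seen |= (wc.wr_id == WRID_BARRIER_RECV) ? 1 : 2;
        }
    }
    return 0;
}
</pre>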


The difference between TCP and SEND comes in migration-rdma.c: since
we cannot simply dump the bytes into a socket, a SEND message must
instead be preceded by one side instructing the other side *exactly*
how many bytes the SEND message will contain.
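
One way to picture that framing, using a hypothetical fixed-size header that is *not* the patch's actual wire format:

<pre>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>

struct send_header {
    uint32_t len;          /* network byte order: bytes in the next SEND */
};

/* 'hdr_mr'/'data_mr' cover 'hdr'/'data'; the QP is already connected. */
static int send_framed(struct ibv_qp *qp,
                       struct send_header *hdr, struct ibv_mr *hdr_mr,
                       void *data, uint32_t len, struct ibv_mr *data_mr)
{
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)hdr,  .length = sizeof(*hdr), .lkey = hdr_mr->lkey },
        { .addr = (uintptr_t)data, .length = len,          .lkey = data_mr->lkey },
    };
    struct ibv_send_wr wr[2], *bad;
    int i;

    hdr->len = htonl(len);

    memset(wr, 0, sizeof(wr));
    for (i = 0; i < 2; i++) {
        wr[i].opcode     = IBV_WR_SEND;
        wr[i].send_flags = IBV_SEND_SIGNALED;
        wr[i].sg_list    = &sge[i];
        wr[i].num_sge    = 1;
    }
    wr[0].next = &wr[1];   /* header SEND first, payload SEND second */

    /* The receiver must already have two receives posted: one sized for
     * the header, one at least 'len' bytes for the payload. */
    return ibv_post_send(qp, &wr[0], &bad);
}
</pre>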


Each time a SEND is received, the receiver buffers the message and
doles out the bytes from the SEND to the qemu_loadvm_state() function
until all the bytes from the buffered SEND message have been exhausted.
 
Before the SEND is exhausted, the receiver sends an 'ack' SEND back                                                 
to the sender to let the savevm_state_* functions know that they                                                     
can resume and start generating more SEND messages.                                                                 
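
A sketch of the receiver-side bookkeeping described above; the structure, callback, and low-water mark are illustrative assumptions rather than the actual implementation:

<pre>
#include <stddef.h>
#include <string.h>

#define RECV_LOW_WATER 4096      /* assumption: when to ack the current SEND */

struct recv_buffer {
    unsigned char data[64 * 1024];   /* copy of the last incoming SEND */
    size_t len;                      /* how many bytes that SEND announced */
    size_t pos;                      /* how many bytes were already handed out */
    int acked;                       /* ack already pushed for this SEND? */
    void (*send_ack)(void *opaque);  /* posts the 'ack' SEND back to the source */
    void *opaque;
};

/* Hand out up to 'want' bytes to the load path (think qemu_loadvm_state()). */
static size_t recv_buffer_get(struct recv_buffer *rb, void *dst, size_t want)
{
    size_t avail = rb->len - rb->pos;
    size_t n = want < avail ? want : avail;

    memcpy(dst, rb->data + rb->pos, n);
    rb->pos += n;

    /* Ack before the SEND is fully exhausted so the sender's savevm_state_*
     * functions can resume and start generating the next SEND. */
    if (!rb->acked && rb->len - rb->pos <= RECV_LOW_WATER) {
        rb->acked = 1;
        rb->send_ack(rb->opaque);
    }
    return n;
}
</pre>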


This ping-pong of SEND messages happens until the live migration completes.


== TODO ==
* Figure out how to properly cap RDMA throughput without copying data through a new type of QEMUFile abstraction and without artificially slowing down RDMA throughput because of control logic.
* Integrate with XOR-based run-length encoding (if possible)
* Stop skipping the zero-pages check


== Links ==
* [http://www.canturkisci.com/ETC/papers/IBMJRD2011/preprint.pdf Original RDMA Live Migration Paper]
