Features/RDMALiveMigration: Difference between revisions

Revision as of 02:40, 9 April 2013

== Summary ==

Live migration using RDMA instead of TCP.

== Contact ==

== Description ==

Uses the standard OFED software stack, which supports both RoCE and Infiniband.

== Usage ==

Compiling:

$ ./configure --enable-rdma --target-list=x86_64-softmmu
$ make

== Running ==

First, decide if you want dynamic page registration on the server-side. This always happens on the primary-VM side, but is optional on the server. Doing this allows you to support overcommit (such as cgroups or ballooning) with a smaller footprint on the server-side without having to register the entire VM memory footprint.

NOTE: This significantly slows down performance (about 30% slower).

$ virsh qemu-monitor-command --hmp --cmd "migrate_set_capability chunk_register_destination on" # disabled by default                                

Next, if you decided *not* to use chunked registration on the server, it is recommended to also disable zero page detection. While this is not strictly necessary, zero page detection also significantly slows down performance on higher-throughput links (by about 50%), such as 40 gbps infiniband cards:

$ virsh qemu-monitor-command --hmp --cmd "migrate_set_capability check_for_zero off" # always enabled by default                                     

Next, set the migration speed to match your hardware's capabilities:

$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device                                        

Finally, perform the actual migration:

$ virsh migrate domain rdma:xx.xx.xx.xx:port                 

== Performance ==

[[File:Perf.png]]

== Protocol Design ==

Migration with RDMA is separated into two parts:

1. The transmission of the pages using RDMA
2. Everything else (a control channel is introduced)

"Everything else" is transmitted using a formal protocol now, consisting of infiniband SEND / RECV messages.

An infiniband SEND message is the standard ibverbs message used by applications of infiniband hardware. The only difference between a SEND message and an RDMA message is that SEND messages cause completion notifications to be posted to the completion queue (CQ) on the infiniband receiver side, whereas RDMA messages (used for pc.ram) do not (so that they behave like an actual DMA).
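
To make the distinction concrete, here is a minimal ibverbs sketch (not the actual migration-rdma.c code; the queue pair, memory region and rkey variables are assumed to exist) showing that the two message types differ only in the work-request opcode and in whether the receiver sees a completion:

<pre>
/* Hedged sketch: how a SEND differs from an RDMA WRITE at the ibverbs
 * level.  Variable names (qp, mr, remote_addr, rkey) are illustrative,
 * not taken from migration-rdma.c. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_send_or_write(struct ibv_qp *qp, struct ibv_mr *mr,
                              void *buf, size_t len,
                              uint64_t remote_addr, uint32_t rkey,
                              int use_rdma_write)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;          /* completion on *our* CQ */

    if (use_rdma_write) {
        /* RDMA WRITE: bytes land directly in pre-registered remote memory;
         * the receiver's CQ sees nothing, much like a DMA. */
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
    } else {
        /* SEND: consumes a receive work request on the other side and
         * posts a completion notification to the receiver's CQ. */
        wr.opcode = IBV_WR_SEND;
    }
    return ibv_post_send(qp, &wr, &bad_wr);
}
</pre>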

Messages in infiniband require two things:

1. registration of the memory that will be transmitted
2. (SEND/RECV only) work requests to be posted on both sides of the network before the actual transmission can occur.

RDMA messages are much easier to deal with. Once the memory on the receiver side is registered and pinned, we're basically done. All that is required is for the sender side to start dumping bytes onto the link.

SEND messages require more coordination because the receiver must have reserved space (using a receive work request) on the receive queue (RQ) before QEMUFileRDMA can start using them to carry all the bytes as a transport for migration of device state.
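
Both requirements map onto two small ibverbs calls. The sketch below is illustrative only, assuming a protection domain and queue pair have already been created:

<pre>
/* Hedged sketch of the two prerequisites: registering (and thereby
 * pinning) the memory, and posting a receive work request so that an
 * incoming SEND has somewhere to land.  Names are illustrative. */
#include <infiniband/verbs.h>
#include <stdint.h>

static struct ibv_mr *register_buffer(struct ibv_pd *pd, void *buf, size_t len)
{
    /* 1. Registration pins the pages and yields the lkey/rkey handles. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}

static int post_receive(struct ibv_qp *qp, struct ibv_mr *mr,
                        void *buf, size_t len)
{
    /* 2. (SEND/RECV only) a receive work request must already be sitting
     * on the receive queue before the peer posts its SEND. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr = NULL;

    return ibv_post_recv(qp, &wr, &bad_wr);
}
</pre>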

To begin the migration, the initial connection setup is as follows (migration-rdma.c):

1. Receiver and Sender are started (command line or libvirt).
2. Both sides post two RQ work requests
3. Receiver does listen()
4. Sender does connect()
5. Receiver accept()
6. Check versioning and capabilities (described later)
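
In librdmacm terms, that handshake looks roughly like the sketch below. This is a simplified illustration, not the real migration-rdma.c code: the protection domain, completion queues, queue pair, the two RQ work requests and the rdma_cm event loop are all omitted, and the function names are assumptions:

<pre>
/* Hedged sketch of the connection setup with librdmacm. */
#include <rdma/rdma_cma.h>
#include <stdint.h>
#include <string.h>

/* Receiver side: bind and listen; the sender's connect request will
 * arrive later as an RDMA_CM_EVENT_CONNECT_REQUEST event. */
static int receiver_listen(struct rdma_cm_id **listen_id, struct sockaddr *addr)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();

    if (!ec || rdma_create_id(ec, listen_id, NULL, RDMA_PS_TCP)) {
        return -1;
    }
    if (rdma_bind_addr(*listen_id, addr)) {
        return -1;
    }
    return rdma_listen(*listen_id, 1);
}

/* Sender side: connect, carrying the version/capability bytes in the
 * private_data area (see the Versioning section below). */
static int sender_connect(struct rdma_cm_id *id, const void *caps, size_t caps_len)
{
    struct rdma_conn_param param;

    memset(&param, 0, sizeof(param));
    param.private_data     = caps;
    param.private_data_len = (uint8_t)caps_len;
    param.retry_count      = 7;

    /* rdma_resolve_addr()/rdma_resolve_route() must have completed on
     * this id before connecting. */
    return rdma_connect(id, &param);
}
</pre>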

At this point, we define a control channel on top of SEND messages which is described by a formal protocol. Each SEND message has a header portion and a data portion (but together are transmitted as a single SEND message).

Header:

   * Length  (of the data portion)
   * Type    (what command to perform, described below)
   * Version (protocol version validated before send/recv occurs)

The 'type' field has 7 different command values:

   1. None
   2. Ready             (control-channel is available)
   3. QEMU File         (for sending non-live device state)
   4. RAM Blocks        (used right after connection setup)
   5. Register request  (dynamic chunk registration)
   6. Register result   ('rkey' to be used by sender)
   7. Register finished (registration for current iteration finished)
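
Expressed as C declarations, the header and command values could be pictured as follows; the identifiers and field widths are assumptions for illustration, not the actual migration-rdma.c definitions:

<pre>
/* Hedged sketch of the control-channel wire format described above. */
#include <stdint.h>

typedef struct {
    uint32_t len;       /* length of the data portion                  */
    uint32_t type;      /* one of the command values below             */
    uint32_t version;   /* protocol version, checked before send/recv  */
} RDMAControlHeaderSketch;

enum {
    RDMA_CONTROL_NONE = 0,
    RDMA_CONTROL_READY,              /* control-channel is available         */
    RDMA_CONTROL_QEMU_FILE,          /* non-live device state                */
    RDMA_CONTROL_RAM_BLOCKS,         /* used right after connection setup    */
    RDMA_CONTROL_REGISTER_REQUEST,   /* dynamic chunk registration           */
    RDMA_CONTROL_REGISTER_RESULT,    /* carries the 'rkey' back to the sender */
    RDMA_CONTROL_REGISTER_FINISHED,  /* registration for this iteration done */
};
</pre>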

After connection setup is completed, we have two protocol-level functions, responsible for communicating control-channel commands using the above list of values:

Logically:

qemu_rdma_exchange_recv(header, expected command type)

1. We transmit a READY command to let the sender know that we are *ready* to receive some data bytes on the control channel.
2. Before attempting to receive the expected command, we post another RQ work request to replace the one we just used up.
3. Block on a CQ event channel and wait for the SEND to arrive.
4. When the send arrives, librdmacm will unblock us.
5. Verify that the command-type and version received match the ones we expected.

qemu_rdma_exchange_send(header, data, optional response header & data):

1. Block on the CQ event channel waiting for a READY command from the receiver to tell us that the receiver is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command (that we have not yet transmitted), post an RQ work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will unblock us and we immediately post a RQ work request to replace the one we just used up.
4. Now, we can actually post the work request to SEND the requested command type of the header we were asked for.
5. Optionally, if we are expecting a response (as before), we block again and wait for that response using the additional work request we previously posted. (This is used to carry 'Register result' commands (#6) back to the sender, which hold the rkey needed to perform RDMA.)

All of the remaining command types (not including 'Ready') described above use the aforementioned two functions to do the hard work:
1. After connection setup, RAMBlock information is exchanged using this protocol before the actual migration begins.
2. During runtime, once a 'chunk' becomes full of pages ready to be sent with RDMA, the registration commands are used to ask the other side to register the memory for this chunk and respond with the result (rkey) of the registration.
3. The QEMUFile interfaces (described below) also call these functions when transmitting non-live state, such as devices, or to send their own protocol information during the migration process.
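
Put together, the two calls pair up roughly as in the following simplified sketch. The static helpers stand in for the ibverbs post/poll calls shown earlier; they are placeholders, not real migration-rdma.c functions:

<pre>
/* Hedged sketch of the control-channel ping-pong. */
#include <stddef.h>

enum { SKETCH_READY = 2 };   /* the 'Ready' command value */

static void post_recv_wr(void) { /* ibv_post_recv(...) */ }
static void post_send_wr(int type, const void *data, size_t len)
{
    /* ibv_post_send(...) of a SEND carrying header + data */
    (void)type; (void)data; (void)len;
}
static void block_on_cq(void) { /* wait on the CQ event channel */ }
static int check_header(int expected_type) { (void)expected_type; return 0; }

/* Receiver: the logic of qemu_rdma_exchange_recv() */
static int exchange_recv_sketch(int expected_type)
{
    post_send_wr(SKETCH_READY, NULL, 0);  /* 1. announce we are ready       */
    post_recv_wr();                       /* 2. replace the used RQ entry   */
    block_on_cq();                        /* 3-4. wait for the SEND         */
    return check_header(expected_type);   /* 5. verify type and version     */
}

/* Sender: the logic of qemu_rdma_exchange_send() */
static int exchange_send_sketch(int type, const void *data, size_t len,
                                int expect_response)
{
    if (expect_response) {
        post_recv_wr();                   /* 2. room for the response       */
    }
    block_on_cq();                        /* 1/3. wait for READY            */
    post_recv_wr();                       /* 3. replace the used RQ entry   */
    post_send_wr(type, data, len);        /* 4. transmit the command        */
    if (expect_response) {
        block_on_cq();                    /* 5. e.g. a 'Register result'    */
    }
    return 0;
}
</pre>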

== Versioning ==

librdmacm provides the user with a 'private data' area to be exchanged at connection-setup time before any infiniband traffic is generated.

This is a convenient place to check for protocol versioning because the user does not need to register memory to transmit a few bytes of version information.

This is also a convenient place to negotiate capabilities (like dynamic page registration).

If the version is invalid, we throw an error.

If the version is new, we only negotiate the capabilities that the requested version is able to perform and ignore the rest.
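
As a rough illustration, the private-data blob could be nothing more than a small structure checked on the receiving side of the connection request; the layout and the capability bit below are assumptions, not the actual wire format:

<pre>
/* Hedged sketch: checking version and capabilities from the private_data
 * area of the incoming connection request. */
#include <stdint.h>
#include <string.h>
#include <rdma/rdma_cma.h>

#define SKETCH_MIN_VERSION       1
#define CAP_DYNAMIC_REGISTRATION (1u << 0)

typedef struct {
    uint32_t version;        /* rejected if invalid                    */
    uint32_t capabilities;   /* bits the peer would like to negotiate  */
} RDMACapSketch;

static int check_peer_caps(struct rdma_cm_event *event, uint32_t *negotiated)
{
    RDMACapSketch peer;

    if (event->param.conn.private_data_len < sizeof(peer)) {
        return -1;                      /* no usable blob: treat as invalid */
    }
    memcpy(&peer, event->param.conn.private_data, sizeof(peer));

    if (peer.version < SKETCH_MIN_VERSION) {
        return -1;                      /* invalid version: throw an error  */
    }
    /* A newer peer may advertise bits we do not understand: negotiate only
     * the capabilities we can actually perform and ignore the rest. */
    *negotiated = peer.capabilities & CAP_DYNAMIC_REGISTRATION;
    return 0;
}
</pre>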

== QEMUFileRDMA Interface ==

QEMUFileRDMA introduces a couple of new functions:

1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)

These two functions are very short and simply use the protocol described above to deliver bytes without changing the upper-level users of QEMUFile that depend on a bytestream abstraction.

Finally, how do we hand off the actual bytes to get_buffer()?

Again, because we're trying to "fake" a bytestream abstraction using an analogy not unlike individual UDP frames, we have to hold on to the bytes received from the control channel's SEND messages in memory.

Each time we receive a complete "QEMU File" control-channel message, the bytes from SEND are copied into a small local holding area.

Then, we return the number of bytes requested by get_buffer() and leave the remaining bytes in the holding area until get_buffer() comes around for another pass.

If the buffer is empty, then we follow the same steps listed above and issue another "QEMU File" protocol command, asking for a new SEND message to re-fill the buffer.
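
Sketched in C, the holding-area logic looks roughly like the following; the structure, field names and refill helper are illustrative, not the real QEMUFileRDMA fields:

<pre>
/* Hedged sketch of the holding-area logic behind qemu_rdma_get_buffer(). */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    uint8_t data[32 * 1024];   /* bytes from the last "QEMU File" SEND   */
    size_t  len;               /* how many of them are valid             */
    size_t  pos;               /* how many have already been handed out  */
} HoldingAreaSketch;

static void refill_from_control_channel(HoldingAreaSketch *h)
{
    /* In the real code: issue another "QEMU File" control command, wait
     * for the next SEND and copy its bytes into h->data (resetting
     * h->len and h->pos).  Left empty here. */
    (void)h;
}

static size_t get_buffer_sketch(HoldingAreaSketch *h, uint8_t *buf, size_t size)
{
    size_t avail = h->len - h->pos;

    if (avail == 0) {
        /* Holding area exhausted: ask the peer for another SEND. */
        refill_from_control_channel(h);
        avail = h->len - h->pos;
    }
    if (size > avail) {
        size = avail;          /* hand out what we have                  */
    }
    memcpy(buf, h->data + h->pos, size);
    h->pos += size;
    return size;               /* leftover bytes wait for the next pass  */
}
</pre>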

== Migration of pc.ram ==

At the beginning of the migration (migration-rdma.c), the sender and the receiver populate the list of RAMBlocks to be registered with each other into a structure. Then, using the aforementioned protocol, they exchange a description of these blocks with each other, to be used later during the iteration of main memory. This description includes a list of all the RAMBlocks, their offsets and lengths, and, if dynamic page registration was disabled on the server-side, the pre-registered RDMA keys for those blocks.
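
The exchanged description can be pictured as an array of entries like the one below (field names and widths are assumptions; the real layout is defined in migration-rdma.c):

<pre>
/* Hedged sketch of one entry in the exchanged RAMBlock description. */
#include <stdint.h>

typedef struct {
    uint64_t offset;       /* offset of the RAMBlock                        */
    uint64_t length;       /* length of the RAMBlock                        */
    uint32_t remote_rkey;  /* only filled in when dynamic page registration
                              is disabled on the server-side and the whole
                              block was pre-registered                      */
} RDMARemoteBlockSketch;
</pre>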

Main memory is not migrated with the aforementioned protocol, but is instead migrated with normal RDMA Write operations.

Pages are migrated in "chunks" (about 1 Megabyte right now). Chunk size is not dynamic, but it could be in a future implementation. There's nothing to indicate that this is useful right now.

When a chunk is full (or a flush() occurs), the memory backed by the chunk is registered with librdmacm and pinned in memory on both sides using the aforementioned protocol.

After pinning, an RDMA Write is generated and transmitted for the entire chunk.

Chunks are also transmitted in batches: This means that we do not request that the hardware signal the completion queue for the completion of *every* chunk. The current batch size is about 64 chunks (corresponding to 64 MB of memory). Only the last chunk in a batch must be signaled. This helps keep everything as asynchronous as possible and helps keep the hardware busy performing RDMA operations.
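
In ibverbs terms, batching simply means requesting a signalled completion (IBV_SEND_SIGNALED) only on the final write of each batch, roughly as sketched below; the chunk bookkeeping and completion polling are omitted and the names are illustrative:

<pre>
/* Hedged sketch of batched, mostly-unsignaled RDMA writes: only the last
 * chunk of a batch asks the hardware for a completion. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define CHUNKS_PER_BATCH 64   /* ~64 chunks of ~1 MB each */

static int post_chunk_write(struct ibv_qp *qp, struct ibv_mr *mr,
                            void *chunk, size_t len,
                            uint64_t remote_addr, uint32_t rkey,
                            int chunk_index)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)chunk,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Signal only the last chunk of each batch; the completions of the
     * other chunks are never reported, which keeps the CQ quiet and the
     * hardware busy performing RDMA operations. */
    if ((chunk_index + 1) % CHUNKS_PER_BATCH == 0) {
        wr.send_flags = IBV_SEND_SIGNALED;
    }
    return ibv_post_send(qp, &wr, &bad_wr);
}
</pre>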

== Error-handling ==

Infiniband has what is called a "Reliable, Connected" link (one of 4 choices). This is the mode we use for RDMA migration.

If a *single* message fails, the decision is to abort the migration entirely, clean up all the RDMA descriptors and unregister all the memory.

After cleanup, the Virtual Machine is returned to normal operation the same way it would be if the TCP socket were broken during a non-RDMA-based migration.

== TODO ==

1. Currently, cgroups swap limits for *both* TCP and RDMA on the sender-side are broken. This is more pressing for RDMA because RDMA requires memory registration. Fixing this requires infiniband page registrations to be zero-page aware, and this does not yet work properly.
2. Currently, overcommit for the *receiver* side of TCP works, but not for RDMA. While dynamic page registration *does* work, it is only useful if the is_zero_page() capability remains enabled (which it is by default). However, leaving this capability turned on *significantly* slows down the RDMA throughput, particularly on hardware capable of transmitting faster than 10 gbps (such as 40 gbps links).
3. Use of the recent /proc/<pid>/pagemap interface would likely solve some of these problems.
4. Some form of balloon-device usage tracking would also help alleviate some of these issues.

== Links ==

* [http://www.canturkisci.com/ETC/papers/IBMJRD2011/preprint.pdf Original RDMA Live Migration Paper]