Features/DirtyRateCalc

From QEMU

Introduction

QEMU provides a few ways to measure the dirty rate of a guest. This can be used for VM monitoring purposes, and it can provide a clue on how hard it would be to migrate the VM with live migration.

There are currently three modes supported for dirty rate calculation:

  • Page sampling
  • Dirty bitmap
  • Dirty ring

The page sampling mode can be used at any time, while the dirty bitmap and dirty ring modes depend on which dirty tracking mechanism is enabled for the specific virtual machine.

For QMP, one can use the command "calc-dirty-rate" to start a measurement with specific parameters, then use "query-dirty-rate" to check the results. The corresponding HMP commands are "calc_dirty_rate" and "info dirty_rate".

Modes of Operation

Page Sampling Mode

The page sampling mode was the first mode supported for dirty rate calculation. The algorithm is based on hash values of small pages.

When the tracking is triggered, the hypervisor selects a number of pages (specified by the sample-pages= parameter, with a default of 512 pages per GB), calculates a hash value for each of these pages and remembers them. The hypervisor then waits for a specific length of time (specified by calc-time=) and redoes the hash calculation. If any page's data now produces a different hash value, that page has changed during the period.
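The idea can be sketched as follows. This is a simplified illustration, not QEMU's implementation: the page size, hash function, and "guest memory" layout here are arbitrary choices for the sketch.

```python
import hashlib
import random

PAGE_SIZE = 4096

def sample_hashes(memory, sample_indices):
    """Hash the sampled pages of a bytearray-backed 'guest memory'."""
    hashes = {}
    for idx in sample_indices:
        page = bytes(memory[idx * PAGE_SIZE:(idx + 1) * PAGE_SIZE])
        hashes[idx] = hashlib.sha1(page).digest()
    return hashes

def count_dirty(memory, before):
    """Re-hash the sampled pages; a changed hash means the page was dirtied."""
    after = sample_hashes(memory, before.keys())
    return sum(1 for idx in before if after[idx] != before[idx])

# 256 pages of zeroed "guest memory"; randomly sample 32 of them
memory = bytearray(256 * PAGE_SIZE)
samples = random.sample(range(256), 32)
before = sample_hashes(memory, samples)

# The guest dirties one sampled page during the calc period
memory[samples[0] * PAGE_SIZE] = 0xff

print(count_dirty(memory, before))  # 1 sampled page changed
```

Dividing the changed-page count by the sampling ratio and the elapsed time then gives the dirty rate estimate.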

An example to start the sampling with 1024 sample pages per GB and sample period of 3 seconds:

(QEMU) calc-dirty-rate calc-time=3 mode=page-sampling sample-pages=1024
{                            
   "arguments": {          
       "calc-time": 3,     
       "mode": "page-sampling",
       "sample-pages": 1024
   },                                                                                        
   "execute": "calc-dirty-rate"
}
{
   "return": {}
}

Before the 3 seconds elapse, the query will show that the measurement is still in progress:

(QEMU) query-dirty-rate                         
{
   "arguments": {},
   "execute": "query-dirty-rate"
}
{
   "return": {
       "calc-time": 3,
       "mode": "page-sampling",
       "sample-pages": 1024,
       "start-time": 59478,
       "status": "measuring"
   }
}

After that, we should see the status reported as "measured", with the dirty rate value alongside it.

(QEMU) query-dirty-rate
{
   "arguments": {},
   "execute": "query-dirty-rate"
}
{
   "return": {
       "calc-time": 3,
       "dirty-rate": 200,
       "mode": "page-sampling",
       "sample-pages": 1024,
       "start-time": 59478,
       "status": "measured"
   }
}

Dirty Bitmap Mode

Dirty bitmap mode of dirty rate measurement can be used when dirty bitmap based dirty tracking is enabled for the guest (i.e. no "-accel kvm,dirty-ring-size=N" specified on the QEMU command line).

Dirty bitmap based dirty tracking uses a bitmap to represent guest memory. When a page is written, the corresponding bit in the bitmap is set, marking that page as dirty.

The bitmap is per-VM, which means vCPUs are not tracked separately: any vCPU that writes to a page sets that page's bit. Once set, a bit stays set for the whole dirty rate calculation.
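As a toy model of this accounting (illustrative only; in reality the bitmap lives in KVM and is synchronized to userspace), note how writes from different vCPUs to the same page are only ever counted once:

```python
# One bit per guest page; any vCPU write sets the page's bit, and the
# bit stays set for the whole calculation period.
NUM_PAGES = 16
bitmap = 0

def mark_dirty(page):
    global bitmap
    bitmap |= 1 << page

mark_dirty(3)   # vcpu 0 writes page 3
mark_dirty(3)   # vcpu 1 writes page 3 too
mark_dirty(7)   # vcpu 2 writes page 7

dirty_pages = bin(bitmap).count("1")
print(dirty_pages)  # 2: the per-VM bitmap never double counts a page
```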

To start dirty rate measurement with dirty bitmap mode:

(QEMU) calc-dirty-rate calc-time=3 mode=dirty-bitmap
{
   "arguments": {
       "calc-time": 3,
       "mode": "dirty-bitmap"
   },
   "execute": "calc-dirty-rate"
}
{
   "return": {}
}

Results:

(QEMU) query-dirty-rate
{
   "arguments": {},
   "execute": "query-dirty-rate"
}
{
   "return": {
       "calc-time": 3,
       "dirty-rate": 202,
       "mode": "dirty-bitmap",
       "sample-pages": 0,
       "start-time": 60679,
       "status": "measured"
   }
}

Dirty Ring Mode

Dirty ring mode can provide a finer grained dirty rate measurement on a per-vCPU basis. It can only be used when the dirty ring is enabled for the guest (with "-accel kvm,dirty-ring-size=N" specified on the QEMU command line).

The dirty ring is a per-vCPU structure containing an array of PFNs (Page Frame Numbers) of guest memory. This means, firstly, that each vCPU has its own ring to record the pages it dirties; meanwhile, one dirty page can appear in more than one ring. For example, if two vCPUs write to the same page while it is still clean, each vCPU may push the page's PFN to its own dirty ring, and both entries will be reported to userspace (QEMU) as dirty pages. When accounted, the same page can then be counted more than once. This does not always happen: if the second vCPU writes only after the first vCPU's write has had its page fault resolved, then only one dirty PFN will be recorded, and only in the first vCPU's ring.
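A toy model of this behavior (illustrative only, not KVM's data structures): each vCPU pushes a PFN onto its own ring when it takes a write fault on a still-clean page, so summing ring entries across vCPUs can count one page twice.

```python
# Per-vCPU dirty rings: each entry is a PFN pushed when that vCPU
# takes a write fault on a page that was still write-protected.
rings = {0: [], 1: []}

def vcpu_write(vcpu, pfn, page_still_clean):
    # A PFN is only pushed if the write actually faults, i.e. the
    # page had not yet been dirtied when this vCPU touched it.
    if page_still_clean:
        rings[vcpu].append(pfn)

# Both vCPUs fault on page 42 "at the same time": it lands in both rings
vcpu_write(0, 42, page_still_clean=True)
vcpu_write(1, 42, page_still_clean=True)

total = sum(len(r) for r in rings.values())
print(total)  # 2 entries for a single guest page: counted twice
```

In the other ordering described above, the second write would find the page already dirty (no fault), and only the first vCPU's ring would hold the PFN.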

To kickoff a dirty-ring based calculation:

(QEMU) calc-dirty-rate calc-time=3 mode=dirty-ring
{
   "arguments": {
       "calc-time": 3,
       "mode": "dirty-ring"
   },
   "execute": "calc-dirty-rate"
}
{
   "return": {}
}

The result is shown in both a per-VM way (the original "dirty-rate" field) and a per-vCPU way (the "vcpu-dirty-rate" list):

(QEMU) query-dirty-rate
{
   "arguments": {},
   "execute": "query-dirty-rate"
}
{
   "return": {
       "calc-time": 3,
       "dirty-rate": 185,
       "mode": "dirty-ring",
       "sample-pages": 0,
       "start-time": 60901,
       "status": "measured",
       "vcpu-dirty-rate": [
           {
               "dirty-rate": 0,
               "id": 0
           },
           {
               "dirty-rate": 0,
               "id": 1
           },
           {
               "dirty-rate": 0,
               "id": 2
           },
           {
               "dirty-rate": 200,
               "id": 3
            }
        ]
   }
}

Misc

One thing to mention is that the page sampling approach can be inaccurate, because only a portion of the system's pages is sampled, and the selection is random. In that respect, the dirty bitmap and dirty ring based approaches provide much more accurate results. However, page sampling has the benefit of not needing KVM dirty tracking at all: the measurement overhead is fully transparent to the guest and confined to a single host thread (ignoring processor cache pollution).
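The sampling error can be illustrated with a quick simulation. This is purely illustrative: the plain proportion scale-up estimator shown here is an assumption for the sketch and may differ from QEMU's exact arithmetic.

```python
import random

random.seed(1)
TOTAL_PAGES = 1 << 18          # 1 GB worth of 4K pages
SAMPLES = 512                  # the default sample-pages value per GB

# Suppose the guest really dirtied 10% of its pages this period
dirty = set(random.sample(range(TOTAL_PAGES), TOTAL_PAGES // 10))

# Estimate the dirty page count by scaling up the dirty fraction
# observed in a random sample of pages
sample = random.sample(range(TOTAL_PAGES), SAMPLES)
hits = sum(1 for p in sample if p in dirty)
estimate = hits * TOTAL_PAGES // SAMPLES

print(hits, estimate, len(dirty))
```

With only 512 samples the estimate carries noticeable statistical noise, and a run of unlucky (or lucky) sampling shifts the reported rate accordingly.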

On the other hand, both dirty bitmap and dirty ring mode measurements can impact guest workload performance. First, the overhead of starting/stopping the guest dirty page tracking mechanism can interfere with guest memory accesses; for example, starting dirty tracking requires splitting host huge pages mapped to the guest into smaller pages (so far KVM only tracks dirty pages at small page granularity, which is 4KB on x86_64). Second, to trap each write the vCPU may need to take a host page fault, depending on the host configuration (e.g. whether PML is enabled, which can greatly reduce the page fault overhead).