Multithreaded support in the TCG

Overview

QEMU can currently emulate a number of CPUs in parallel, but it does so in a single thread. Given that many modern hosts are multi-core and many targets likewise have multiple cores, a significant performance advantage can be realised by using multiple host threads to simulate multiple target cores.

This is a work in progress; we expect to publish results on this wiki page as progress is made.

Plan and problems to solve

The TCG today is close to being thread safe, but there is still some concern that issues remain. We will address this by first focusing on user-level TCG threads, as this seems a straightforward target; the wider case of system-level multi-threading will be looked at subsequently. The following is a currently incomplete list of issues to address:

Global TCG State

Currently there is no protection against two threads attempting to generate code into the translation buffer at the same time, which means you do see corrupted code generation from time to time in multi-threaded apps. There are a couple of approaches we could take, ranging from adding locking to the code generator so that only one thread at a time can generate code, to having a separate translation buffer for each thread of execution.

The key question here is whether the translated code cache should be per-guest-CPU (i.e. per-thread) or global (or at least shared between multiple guest CPUs). Having it per-guest-CPU means less possibility of lock contention, but more overhead generating code: every time the guest reschedules a process to another guest CPU we will have to translate all of its code again for the new CPU. A strictly global cache is not a great idea either, because it won't work if we eventually move to supporting heterogeneous systems (e.g. one ARM CPU and one SH4). But letting each guest CPU have a pointer to its TCG cache (which could then be shared between all the guest CPUs in a cluster) could be made to work, and would let you examine the performance impact of sharing the code cache or not with fairly minor tweaks: just instantiate N caches, or one, and set the pointers to match (see the sketch below).

There are also a number of global variables and assumptions in the various back-ends which will need to be audited. I suspect these values will need to be wrapped up in a portable TCGContext.
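As a rough sketch of the "pointer to a shareable code cache" idea, and of wrapping per-thread state in a context structure, something like the following could work. All of the names here (TBCache, GuestCPU, tb_cache_new and so on) are illustrative, not QEMU's actual data structures:

 /* Illustrative sketch only -- these are not QEMU's real structures.
  * Each guest CPU carries a pointer to a translated-code cache; the
  * caches can be private (one per CPU) or shared by all the CPUs in a
  * cluster, so comparing the two is just a matter of how many caches
  * are instantiated and where the pointers are set. */
 #include <pthread.h>
 #include <stdbool.h>
 #include <stdlib.h>

 typedef struct TBCache {
     pthread_mutex_t lock;   /* serialises code generation into this cache */
     void *code_buf;         /* the translation buffer itself */
     size_t code_size;
     /* ... hash of translated blocks, flush state, etc. ... */
 } TBCache;

 typedef struct GuestCPU {
     int index;
     TBCache *tb_cache;      /* private, or shared within a cluster */
     /* ... per-thread codegen state that is currently global
      *     (the sort of thing a per-thread TCGContext would hold) ... */
 } GuestCPU;

 static TBCache *tb_cache_new(size_t size)
 {
     TBCache *c = calloc(1, sizeof(*c));
     pthread_mutex_init(&c->lock, NULL);
     c->code_buf = malloc(size);
     c->code_size = size;
     return c;
 }

 /* Instantiate N caches or one; only the pointer assignment changes. */
 static void init_cpus(GuestCPU *cpus, int ncpus, bool shared)
 {
     TBCache *global = shared ? tb_cache_new(32u << 20) : NULL;
     for (int i = 0; i < ncpus; i++) {
         cpus[i].index = i;
         cpus[i].tb_cache = shared ? global : tb_cache_new(32u << 20);
     }
 }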

Dirty tracking

Currently we handle guest writes to code like this:

  • we have a set of bitmaps for tracking dirty memory of various kinds
  • for memory which is "clean" (meaning in this context "we've cached a translated code block for this address") we set the TLB entries up to force a slow-path write
  • slow-path writes and also DMA writes end up calling invalidate_and_set_dirty(), which does "if (this range is clean) { invalidate tbs in range; mark range as dirty; }"

So this is a fairly long sequence of operations (guest write; read bitmaps; update TB cache structures; update bitmaps) which is currently effectively atomic because of the single threading, and will need thought to avoid races. It's more complex than the "just add locks/per-core versions of data structures" parts mentioned above, because it cuts across several different data structures at once (QEMU TLB; global memory dirty bitmaps; TB caching).
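To make the race concrete, here is a minimal sketch of the check-then-act pattern described above, together with one obvious (if coarse) way to serialise it with a single lock covering the dirty bitmaps and the TB cache. The helper names are made up for illustration, not the real invalidate_and_set_dirty() internals:

 /* Illustrative sketch of the logic described above; the helper names
  * are hypothetical, not QEMU's actual API. */
 #include <pthread.h>
 #include <stdbool.h>
 #include <stdint.h>

 static pthread_mutex_t dirty_lock = PTHREAD_MUTEX_INITIALIZER;

 bool range_is_clean(uint64_t addr, uint64_t len);          /* reads the dirty bitmaps */
 void invalidate_tbs_in_range(uint64_t addr, uint64_t len); /* drops cached TBs */
 void mark_range_dirty(uint64_t addr, uint64_t len);        /* updates the dirty bitmaps */

 /* Called from slow-path guest writes and from DMA writes.  Today the
  * check and the two updates are effectively atomic because there is a
  * single thread; with multiple writers the whole check-then-act
  * sequence has to be protected, otherwise another thread could
  * translate new code from the range between the check and the
  * invalidation. */
 void invalidate_and_set_dirty_sketch(uint64_t addr, uint64_t len)
 {
     pthread_mutex_lock(&dirty_lock);
     if (range_is_clean(addr, len)) {
         invalidate_tbs_in_range(addr, len);
         mark_range_dirty(addr, len);
     }
     pthread_mutex_unlock(&dirty_lock);
 }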

NB: linux-user mode handles this a bit differently by marking memory as read-only and catching the signal on guest write attempts; the problems are probably slightly different there.
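For reference, a rough sketch of that linux-user style approach (mark pages containing translated code read-only, catch the fault, invalidate, then allow the write) might look like the following; it is illustrative only, not QEMU's actual implementation, and invalidate_tbs_in_page is a hypothetical helper:

 /* Rough sketch only.  Pages that have translated code are made
  * read-only; a guest write then faults, and the SIGSEGV handler
  * invalidates the affected translations and re-enables writes so the
  * faulting store can be restarted. */
 #include <signal.h>
 #include <stdint.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <unistd.h>

 void invalidate_tbs_in_page(uintptr_t page);   /* hypothetical helper */

 static void segv_handler(int sig, siginfo_t *si, void *ctx)
 {
     long pagesz = sysconf(_SC_PAGESIZE);
     uintptr_t page = (uintptr_t)si->si_addr & ~(uintptr_t)(pagesz - 1);

     invalidate_tbs_in_page(page);
     /* Allow the write and let the faulting instruction restart. */
     mprotect((void *)page, pagesz, PROT_READ | PROT_WRITE);
 }

 static void protect_translated_page(void *page)
 {
     /* Force the next guest write to this page into the fault path. */
     mprotect(page, sysconf(_SC_PAGESIZE), PROT_READ);
 }

 static void install_segv_handler(void)
 {
     struct sigaction sa;
     memset(&sa, 0, sizeof(sa));
     sa.sa_sigaction = segv_handler;
     sa.sa_flags = SA_SIGINFO;
     sigemptyset(&sa.sa_mask);
     sigaction(SIGSEGV, &sa, NULL);
 }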

Signal Handling

There are two types of signal we need to handle: synchronous (e.g. SIGBUS, SIGSEGV) and asynchronous (e.g. SIGSTOP, SIGINT, SIGUSR1). While any signal can be sent asynchronously, most of the common synchronous ones occur when there is an error in the translated code, so rectifying machine state for them is fairly well tested. For asynchronous signals there is a plethora of edge cases to deal with, especially around the handling of signals with respect to system calls. If they arrive during translated code their behaviour is fairly easy to handle; however, when we are in QEMU's own code, care has to be taken that syscalls respond correctly to EINTR (see the sketch below).
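As an illustration of the EINTR concern, here is a generic sketch (not QEMU's actual syscall-emulation code; guest_signal_pending and read_retry_eintr are hypothetical) of restarting a host syscall when an asynchronous signal interrupts it, while still letting a pending guest signal be delivered:

 /* Generic illustration: restart a host syscall on EINTR unless a guest
  * signal is pending, in which case report the interruption so the
  * caller can deliver the signal to the guest. */
 #include <errno.h>
 #include <signal.h>
 #include <unistd.h>

 extern volatile sig_atomic_t guest_signal_pending;  /* set by the signal handler */

 ssize_t read_retry_eintr(int fd, void *buf, size_t len)
 {
     for (;;) {
         ssize_t ret = read(fd, buf, len);
         if (ret >= 0 || errno != EINTR) {
             return ret;
         }
         if (guest_signal_pending) {
             errno = EINTR;      /* let the caller handle guest delivery */
             return -1;
         }
         /* No guest signal to deliver: just restart the call. */
     }
 }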

Memory consistency

The host and guest might implement different memory consistency models. While supporting a weak ordering model on a strongly ordered backend isn't a problem, it is going to be hard to support strong ordering on a weakly ordered backend.

  • Watch out for subtle differences; e.g. x86 is mostly strongly ordered, but a CPU's own stores can be forwarded to its later loads before they are globally visible, so stores can appear reordered with respect to a load by the same CPU.
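To show why the strong-on-weak direction is the expensive one, here is a conceptual sketch using the GCC/Clang __atomic builtins rather than TCG's actual emitted code; the point is that ordinary accesses of a strongly ordered guest need acquire/release semantics on a weakly ordered host:

 /* Conceptual sketch only: real TCG would emit host barrier instructions
  * in the backend rather than C builtins.  On a strongly ordered host
  * (e.g. x86) these compile to plain loads and stores, which is why
  * weak-on-strong is cheap and strong-on-weak is not. */
 #include <stdint.h>

 void guest_store_strong(uint32_t *addr, uint32_t val)
 {
     /* Release: earlier guest stores become visible before this one. */
     __atomic_store_n(addr, val, __ATOMIC_RELEASE);
 }

 uint32_t guest_load_strong(uint32_t *addr)
 {
     /* Acquire: later guest accesses cannot be hoisted above this load. */
     return __atomic_load_n(addr, __ATOMIC_ACQUIRE);
 }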

Instruction atomicity

Atomic instructions must be translated to an atomic host operation.

I'd suggest the following refinement sequence:

  • add the concept of memory coherence domains (MCD) in QEMU with a lock on each (can start with a system-wide MCD)
  • wrap every instruction with the lock of the corresponding MCD
  • remove locking for non-atomic instructions
    • Take care with the way that non-atomics interact with atomics: an architecture might define what non-atomic loads see, and what a non-atomic store does, during an atomic operation.
  • add atomic primitives in TCG (these should take the MCD as an argument), translating them to use the appropriate lock
  • optimize TCG atomics to use atomic instructions on the host (see the sketch after this list)
    • Somehow deal with ARM/MIPS/Power split load/store atomics (load-locked/store-conditional pairs) that might have arbitrary code in between.
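A rough sketch of the first and last steps of that sequence, using illustrative names rather than TCG's actual primitives: a per-MCD lock used as the correctness fallback, and a host-atomic fast path for a guest atomic add.

 /* Illustrative only: a memory coherence domain (MCD) lock as the
  * fallback, plus a host-atomic fast path. */
 #include <pthread.h>
 #include <stdint.h>

 typedef struct MCD {
     pthread_mutex_t lock;   /* can start with a single system-wide MCD */
 } MCD;

 /* Fallback: wrap the guest operation in the MCD lock.  This is correct
  * for any guest atomic, including LL/SC pairs with arbitrary code in
  * between, at the cost of serialising all contending guest CPUs. */
 uint32_t guest_atomic_add_locked(MCD *mcd, uint32_t *addr, uint32_t val)
 {
     pthread_mutex_lock(&mcd->lock);
     uint32_t old = *addr;
     *addr = old + val;
     pthread_mutex_unlock(&mcd->lock);
     return old;
 }

 /* Optimised path: translate the guest atomic directly to a matching
  * host atomic operation where one exists. */
 uint32_t guest_atomic_add_host(uint32_t *addr, uint32_t val)
 {
     return __atomic_fetch_add(addr, val, __ATOMIC_SEQ_CST);
 }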

How to get involved

Right now, there is a small dedicated team looking at this issue. They are:

  • Fred Konrad
  • Mark Burton
  • Pavel Dovgaluk

If you would like to be involved, please use the qemu-devel mailing list.

We will run phone conference calls as appropriate to co-ordinate activity, and we will feed back to the main QEMU mailing lists as progress is made.


Other Work

This is the most important section initially, and we welcome any and all comments and other work. If you know of any patch sets that may be of value, please let us know via the qemu-devel mailing list.

Proof of concept implementations

Below are all the proof of concept implementations we have found so far. It is highly likely that some of these patch sets can help us reach an upstreamable solution. At the very least, they provide some evidence that there is a performance improvement to be had.