Revision as of 11:01, 27 July 2016

MultiThreaded support in the TCG

This is work in progress. The most tested combination is ARMv7 running on an x86 backend; however, the general patches run for all architectures, depending on what the test case is doing. For full support, however, each front end (guest) and back end (TCG host) needs to be converted to have solutions for:

Atomic Instructions
Memory Coherence (honouring barriers)

The intention is to support all combinations where they make sense. See the bottom of the page for links, recent discussions and code.

Overview

QEMU can currently emulate a number of guest CPUs, but it runs them all in a single host thread. Given that many modern hosts are multi-core, and many targets equally use multiple cores, a significant performance advantage can be realised by using multiple host threads to simulate multiple target cores.
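As a rough illustration of the thread-per-core idea, the sketch below starts one host thread per guest CPU with POSIX threads. It is a minimal sketch, not QEMU code: the `VCpu` structure, `vcpu_thread_fn`, `run_vcpus` and the instruction "budget" are hypothetical names invented for this example, and the loop body is only a placeholder for executing translated code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_VCPUS 16

/* Hypothetical per-vCPU state; a real emulator keeps far more here. */
typedef struct {
    int index;               /* guest CPU number */
    unsigned long budget;    /* how many steps this vCPU should run */
    unsigned long retired;   /* private to the owning thread until joined */
} VCpu;

/* Body of one host thread: runs exactly one guest CPU. */
static void *vcpu_thread_fn(void *opaque)
{
    VCpu *cpu = opaque;

    for (unsigned long i = 0; i < cpu->budget; i++) {
        cpu->retired++;      /* placeholder for "execute one translated block" */
    }
    return NULL;
}

/* Run n guest CPUs, each on its own host thread; return the total steps run. */
static unsigned long run_vcpus(int n, unsigned long budget)
{
    pthread_t threads[MAX_VCPUS];
    VCpu cpus[MAX_VCPUS];
    unsigned long total = 0;

    assert(n > 0 && n <= MAX_VCPUS);

    for (int i = 0; i < n; i++) {
        cpus[i] = (VCpu){ .index = i, .budget = budget, .retired = 0 };
        int rc = pthread_create(&threads[i], NULL, vcpu_thread_fn, &cpus[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < n; i++) {
        /* Joining makes each thread's private counter safe to read here. */
        pthread_join(threads[i], NULL);
        total += cpus[i].retired;
    }
    return total;
}
```

Calling `run_vcpus(4, 1000)` runs four such threads to completion. Each thread touches only its own `VCpu`, so this toy needs no locking; the hard problems described in the rest of this page come from the state that real guest CPUs share.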

There was a talk at KVM Forum 2015 (video, slides) which acts as a useful primer. The general thread safety for system-emulation TCG builds on the work already done for linux-user emulation. Indeed, some of the work has already been merged and is making a difference to the linux-user code. The main focus is whole-system emulation.

The last design document was posted to the list in June 2016. The current work in progress can be found in Alex's Git tree.

Already Merged Work

Atomic patching of TranslationBlocks
Re-factoring of main cpu_exec loop
QHT based lookups of next TB

Ready to Merge

Lockless hot-path in cpu_exec (builds on QHT, in Paolo's tree for post 2.7)

Plan and problems to solve

There are three main groups of problems, plus the additional work of enabling the various front ends and back ends.

General Thread Safety

These are covered by the current "Base enabling patches for MTTCG" (v3, WIP Branch). This is an architecture-independent patch series which allows you to run multi-threaded test programs as long as they don't make any assumptions about:

Atomicity
Memory consistency
Cache flush behaviour (v4 should fix cputlb)

In practice this means dedicated test programs; see Alex's kvm-unit-tests.

Memory consistency

This is a current 2016 GSoC project.

Host and guest might implement different memory consistency models. While supporting a weak ordering model on a strongly ordered backend isn't a problem, it's going to be hard to support strong ordering on a weakly ordered backend.

Watch out for subtle differences; e.g. x86 is mostly strongly ordered, but a load can still be reordered ahead of an earlier store by the same CPU to a different address.
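The asymmetry between the two directions can be sketched by treating a memory model as the set of access orderings it guarantees. The models below are deliberately simplified assumptions made for illustration (an x86-like TSO model that guarantees everything except store-then-load order, and an ARM-like model that guarantees nothing without explicit barriers), and every name here (`ORD_*`, `MODEL_*`, `extra_ordering_needed`) is hypothetical rather than part of TCG.

```c
#include <assert.h>

/* Ordering kinds: "earlier access -> later access" pairs that stay in order. */
enum {
    ORD_LD_LD = 1 << 0,   /* load  then load  */
    ORD_LD_ST = 1 << 1,   /* load  then store */
    ORD_ST_LD = 1 << 2,   /* store then load  */
    ORD_ST_ST = 1 << 3,   /* store then store */
    ORD_ALL   = ORD_LD_LD | ORD_LD_ST | ORD_ST_LD | ORD_ST_ST
};

/* Simplified models, assumed for illustration only. */
#define MODEL_SC    ORD_ALL                              /* sequentially consistent */
#define MODEL_TSO   (ORD_LD_LD | ORD_LD_ST | ORD_ST_ST)  /* x86-like */
#define MODEL_WEAK  0                                    /* ARM-like */

/*
 * Orderings the guest relies on implicitly that the host does not
 * already provide; these are what the emulator must enforce itself,
 * for example with host barriers.
 */
static unsigned extra_ordering_needed(unsigned guest_model, unsigned host_model)
{
    assert((guest_model & ~(unsigned)ORD_ALL) == 0);
    assert((host_model & ~(unsigned)ORD_ALL) == 0);
    return guest_model & ~host_model;
}
```

Under these assumptions, `extra_ordering_needed(MODEL_WEAK, MODEL_TSO)` is empty, so a weakly ordered guest runs on a strongly ordered host for free, while `extra_ordering_needed(MODEL_TSO, MODEL_WEAK)` returns every ordering the TSO guest expects, each of which would have to be enforced around ordinary loads and stores. Explicit guest barrier instructions are a separate matter and still need translating in both cases.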

Instruction atomicity

There are a number of approaches being discussed on the list at the moment.
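To make the problem concrete, here is one possible shape for emulating a guest load-exclusive/store-exclusive pair on top of a host compare-and-swap, using C11 atomics. This is a sketch under stated assumptions and is not necessarily any of the approaches under discussion: `ExclusiveMonitor` and the `emul_*` functions are hypothetical names, and guest memory is modelled as a plain `_Atomic uint32_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU record of an outstanding load-exclusive. */
typedef struct {
    _Atomic uint32_t *addr;   /* monitored location, NULL if none */
    uint32_t value;           /* value seen by the load-exclusive */
} ExclusiveMonitor;

/* Guest load-exclusive: remember where we loaded from and what we saw. */
static uint32_t emul_load_exclusive(ExclusiveMonitor *m, _Atomic uint32_t *addr)
{
    assert(m != NULL && addr != NULL);
    m->addr = addr;
    m->value = atomic_load(addr);
    return m->value;
}

/*
 * Guest store-exclusive: store only if the location still holds the
 * value seen by the matching load-exclusive. Returns true on success.
 */
static bool emul_store_exclusive(ExclusiveMonitor *m, _Atomic uint32_t *addr,
                                 uint32_t newval)
{
    bool ok = false;

    if (m->addr == addr) {
        uint32_t expected = m->value;
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    m->addr = NULL;           /* the monitor is cleared either way */
    return ok;
}

/* A guest atomic increment written as the usual LL/SC retry loop. */
static uint32_t emul_atomic_inc(ExclusiveMonitor *m, _Atomic uint32_t *addr)
{
    uint32_t old;

    do {
        old = emul_load_exclusive(m, addr);
    } while (!emul_store_exclusive(m, addr, old + 1));
    return old;
}
```

The weakness of this shape is that compare-and-swap checks only the value: if another CPU writes the location and then restores the original value between the two halves, the store-exclusive here succeeds, whereas an exclusive monitor that tracks writes would make it fail. Arbitrary guest code can also sit between the load and the store, which is part of why split load/store atomics are harder to handle than single-instruction ones.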

How to get involved

Right now, there is a small dedicated team looking at this issue. They are:

Fred Konrad (Core MTTCG patch set)
Alvise Rigo (LL/SC work)
Alex Bennée (Review, testing)
Mark Burton
Pavel Dovgalyuk

Mailing List

If you would like to be involved, please use the mailing list: mttcg@listserver.greensocs.com

You can subscribe here: http://listserver.greensocs.com/wws/info/mttcg

If you send to this mailing list, please make sure to copy qemu-devel as well.

There is a fortnightly phone conference, with summary notes posted to the mailing lists (archives).

Current Code

Remember these trees are WORK-IN-PROGRESS and could be broken at any particular point. Branches may be rebased without notice.

MTTCG Work:

Latest Tree: https://github.com/stsquad/qemu (branch:mttcg/enable-mttcg-for-armv7-v1)
Fred's Tree: http://git.greensocs.com/fkonrad/mttcg.git (branch:multi_tcg_v8)

LL/SC Work

Alvise's Tree: https://git.virtualopensystems.com/dev/qemu-mt.git (branch:slowpath-for-atomic-v7-no-mttcg)

MTTCG Test Cases:

These are tests specifically designed to exercise the code, based on kvm-unit-tests:

https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5

Other Work

This is the most important section initially, and we welcome any and all comments and other work. If you know of any patch sets that may be of value, PLEASE let us know via the qemu-devel mailing list.

Proof of concept implementations

Below are all the proof of concept implementations we have found thus far. It is highly likely that some of these patch sets can help us to reach an upstreamable solution. At the very least, they provide some evidence that there is a performance improvement to be had.

HQEMU
- http://dl.acm.org/citation.cfm?id=2259030&CFID=454906387&CFTOKEN=60579010

PQEMU
- https://github.com/podinx/PQEMU
- http://www.cs.nthu.edu.tw/~ychung/conference/ICPADS2011.pdf

COREMU

Follow up work

There are some additional things that will need to be looked at for user-mode emulation.

Signal Handling

There are two types of signal we need to handle: synchronous (e.g. SIGBUS, SIGSEGV) and asynchronous (e.g. SIGSTOP, SIGINT, SIGUSR1). While any signal can be sent asynchronously, most of the common synchronous ones occur when there is an error in the translated code. As such, rectifying machine state is fairly well tested. For asynchronous signals there is a plethora of edge cases to deal with, especially around the handling of signals with respect to system calls. If they arrive during translated code, their behaviour is fairly easy to handle; however, when they arrive while QEMU's own code is running, care has to be taken that syscalls respond correctly to EINTR.
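For reference, the generic host-side pattern for a system call that a signal has interrupted looks like the sketch below. The helper name `read_retry_eintr` is hypothetical, and the sketch deliberately does not model the emulation-specific part: deciding, on the guest's behalf, whether the guest should see EINTR or have its syscall restarted once its own handler has run.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd, restarting the call if a signal
 * interrupted it before any data was transferred (read() returns -1
 * with errno set to EINTR). Any other result is passed through.
 */
static ssize_t read_retry_eintr(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

A blind retry like this is only appropriate when the signal is purely the host's business. When the interrupted call is being made for the guest, the emulator has to reproduce what the guest kernel would have done, which is where the edge cases mentioned above come from.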

@@ Line 1: / Line 1: @@
 =MultiThreaded support in the TCG=
-'''This is work in progress'''. Currently the only working combination is ARMv7 running on an x86 backend however the intention is to support all combinations where they make sense. See the bottom of the page for links, recent discussions and code.
+'''This is work in progress'''. The most tested combination is ARMv7 running on an x86 backend however the general patches run for all architectures depending on what the test case is doing. For full support however each Front End (guest) and Back End (tcg host) need to be converted to have solutions for:
+* Atomic Instructions
+* Memory Coherence (honouring barriers)
+The intention is to support all combinations where they make sense. See the bottom of the page for links, recent discussions and code.
 ==Overview==
@@ Line 7: / Line 12: @@
 Qemu can currently emulate a number of CPU’s in parallel, but it does so in a single thread. Given many modern hosts are multi-core, and many targets equally use multiple cores, a significant performance advantage can be realised by making use of multiple host threads to simulate multiple target cores.
-There was a talk at KVM Forum 2015 ([https://www.youtube.com/watch?v=KnSW0WjWHZI video] [http://www.linux-kvm.org/images/c/cf/02x02-Alex_Benee-Towards_Multithreaded_TCG.pdf slides]) which acts as a useful primer.
+There was a talk at KVM Forum 2015 ([https://www.youtube.com/watch?v=KnSW0WjWHZI video] [http://www.linux-kvm.org/images/c/cf/02x02-Alex_Benee-Towards_Multithreaded_TCG.pdf slides]) which acts as a useful primer. The general thread safety for system-emulation TCG builds on the work already done for linux-user emulation. Indeed some of the work has already been merged and is making a difference to the linux-user code. The main focus is working on whole system emulation.
-==Plan and problems to solve==
+The last design document was [https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg00928.html was posted to the list in June 2016]. The current work in progress can be found in [https://raw.githubusercontent.com/stsquad/qemu/mttcg/base-patches-v4/docs/multi-thread-tcg.txt Alex's GIT tree].
-The TCG today is close to being thread safe, but there is still some concern that there are remaining issues. The current work is focusing system-emulation TCG threads. The last design document was [https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg00757.html was posted to the list in April 2016].
+==Already Merged Work==
-The following is an ''currently incomplete'' list of issues to address:
+* Atomic patching of TranslationBlocks
+* Re-factoring of main cpu_exec loop
+* QHT based lookups of next TB
-===Global TCG State===
+==Ready to Merge==
-Currently there is little protection against two threads attempting to generate code at the same time into the translation buffer. The linux-user code has introduced some basic locking that offers some protection. This is built upon for system emulation mode.
+* Lockless hot-path in cpu_exec (build on QHT, in Paolo's tree for post 2.7)
-===Translation Cache===
+==Plan and problems to solve==
-Currently we operate with a single shared cache. This means generation has to be serialise but also care has to be taken when we exit the main run loop. As the cache is global the locking also need to safely protect the invalidation and patching of TranslationBlocks and all associated jump lookups and caches.
+There are 3 main groups of problems and the additional work of enabling the various front and backends.
-===Dirty tracking===
+====General Thread Safety===
-Currently we handle guest writes to code like this:
+These are covered by the current "Base enabling patches for MTTCG" ([https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg00922.html v3], [https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4 WIP Branch]). This is an architecture independent patch series which allows you to run multi-threaded test programs as long as they don't make any assumptions about:
-* we have a set of bitmaps for tracking dirty memory of various kinds
-* for memory which is "clean" (meaning in this context "we've cached a translated code block for this address") we set the TLB entries up to force a slow-path write
-* slow-path writes and also DMA writes end up calling invalidate_and_set_dirty(), which does "if (this range is clean) { invalidate tbs in range; mark range as dirty; }"
-So this is a fairly long sequence of operations (guest write; read bitmaps; update TB cache structures; update bitmaps) which is currently effectively atomic because of the single threading, and will need thought to avoid races. It's more complex than the "just add locks/per-core versions of data structures" parts mentioned above, because it cuts across several different data structures at once (QEMU TLB; global memory dirty bitmaps; TB caching). Also it's quite easy to forget because if it doesn't work then actually quite a lot of guest code still works fine...
-NB: linux-user mode handles this a bit differently by marking memory as read-only and catching the signal on guest write attempts; the problems are probably slightly different there.
+* Atomicity
+* Memory consistency
+* Cache flushes behaviour (v4 should fix cputlb)
+This basically means dedicated test programs [https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5 see Alex's kvm-unit-tests]
 ===Memory consistency===
-Host and guest might implement different memory consistency models. While supporting a weak ordering model on a strong ordering backend isn't a problem it's going to be hard supporting strong ordering on a weakly ordered backend. '''There is a 2016 GSoC project for this'''
+'''This is a current 2016 GSoC project'''
+Host and guest might implement different memory consistency models. While supporting a weak ordering model on a strong ordering backend isn't a problem it's going to be hard supporting strong ordering on a weakly ordered backend.
 * Watch out for subtle differences; e.g. x86 is mostly strong ordered but can reorder stores made by the same CPU doing the load.
-===Instruction atomicity===
+==Instruction atomicity===
-Atomic instructions must be translated to an atomic host operation.
-I'd suggest the following refinement sequence:
+There a number of approaches being discussed on the list at the moment
-* add the concept of memory coherence domains (MCD) in QEMU with a lock on each (can start with a system-wide MCD)
-* wrap every instruction with the lock of the corresponding MCD
-* remove locking for non-atomic instructions
-** Take care in the way that non-atomics interact with atomics, an architecture might define something about what   non-atomics see, and what a non-atomic store does during an atomic.
-* add atomic primitives in TCG (should have MCD as argument), translating them to using the appropriate lock
-* optimize TCG atomics to use atomic instructions on the host
-** Somehow deal with ARM/MIPS/Power split load/store atomics that might have arbitrary stuff inbetween.
 ==How to get involved==