Revision as of 11:43, 27 July 2016

MultiThreaded support in the TCG

This is work in progress. The most tested combination is ARMv7 running on an x86 backend; however, the general patches run for all architectures, depending on what the test case is doing. For full support, each front end (guest) and back end (TCG host) needs to be converted to have solutions for:

Atomic Instructions
Memory Coherence (honouring barriers)

The intention is to support all combinations where they make sense. See the bottom of the page for links, recent discussions and code.

Overview

QEMU can currently emulate a number of CPUs, but it does so in a single thread. Given that many modern hosts are multi-core, and many targets likewise use multiple cores, a significant performance advantage can be realised by using multiple host threads to simulate multiple target cores.

There was a talk at KVM Forum 2015 (video slides) which acts as a useful primer. The general thread safety for system-emulation TCG builds on the work already done for linux-user emulation. Indeed some of the work has already been merged and is making a difference to the linux-user code. The main focus is working on whole system emulation.

The last design document was posted to the list in June 2016. The current work in progress can be found in Alex's Git tree.

Already Merged Work

Atomic patching of TranslationBlocks
Re-factoring of main cpu_exec loop
QHT based lookups of next TB

Ready to Merge

Lockless hot-path in cpu_exec (builds on QHT, in Paolo's tree for post 2.7)
cpu-exec: Safe work in quiescent state (gives thread safe tb_flush, in Alex's tree)

Plan and problems to solve

There are 3 main groups of problems and the additional work of enabling the various front and back ends.

General Thread Safety

These are covered by the current "Base enabling patches for MTTCG" (v3, WIP Branch). This is an architecture independent patch series which allows you to run multi-threaded test programs as long as they don't make any assumptions about:

Atomicity
Memory consistency
Cache flush behaviour (v4 should fix cputlb)

In practice this means dedicated test programs; see Alex's kvm-unit-tests.

Memory consistency

This is a current 2016 GSoC project

Host and guest might implement different memory consistency models. While supporting a weakly ordered guest on a strongly ordered backend isn't a problem, it's going to be hard to support a strongly ordered guest on a weakly ordered backend.

Watch out for subtle differences; e.g. x86 is mostly strongly ordered, but a load can still be reordered ahead of an earlier store made by the same CPU to a different address.
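
The x86 subtlety above is usually demonstrated with the "store buffering" litmus test. The following is a minimal sketch in C11, not QEMU code: two host threads each store to one location and then load the other. The names `count_both_zero`, `cpu0`, `cpu1` and `use_fence` are made up for this illustration. Without a full fence, both loads may return 0 even on an x86 host; with a sequentially consistent fence between the store and the load, that outcome is forbidden by the C11 memory model.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int x, y;     /* the two shared locations */
static atomic_int ready;    /* start-line counter so both threads begin together */
static bool use_fence;      /* written before the threads start, read-only afterwards */
static int r0, r1;          /* what each "CPU" loaded */

static void wait_for_peer(void)
{
    atomic_fetch_add(&ready, 1);
    while (atomic_load(&ready) < 2) {
        /* spin until the other thread has arrived */
    }
}

static void *cpu0(void *unused)
{
    (void)unused;
    wait_for_peer();
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    if (use_fence) {
        atomic_thread_fence(memory_order_seq_cst);
    }
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *cpu1(void *unused)
{
    (void)unused;
    wait_for_peer();
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    if (use_fence) {
        atomic_thread_fence(memory_order_seq_cst);
    }
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Runs the litmus test `iterations` times and returns how many runs
 * ended with both loads observing 0 (the store/load reordering). */
int count_both_zero(int iterations, bool fenced)
{
    int both_zero = 0;

    use_fence = fenced;
    for (int i = 0; i < iterations; i++) {
        pthread_t t0, t1;

        atomic_store(&x, 0);
        atomic_store(&y, 0);
        atomic_store(&ready, 0);

        int rc0 = pthread_create(&t0, NULL, cpu0, NULL);
        int rc1 = pthread_create(&t1, NULL, cpu1, NULL);
        assert(rc0 == 0 && rc1 == 0);
        (void)rc0;
        (void)rc1;

        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        if (r0 == 0 && r1 == 0) {
            both_zero++;
        }
    }
    return both_zero;
}
```

Build with `-pthread`. Calling `count_both_zero(n, false)` may return a non-zero count, depending on the host and on timing, while `count_both_zero(n, true)` always returns 0. A guest that expects the fenced behaviour everywhere is the hard case described above when the backend is more weakly ordered.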

Instruction atomicity

There are a number of approaches being discussed on the list at the moment:

cmpxchg-based emulation of atomics

This work by Emilio Cota and Richard Henderson adds a number of atomic primitives which can be used in TCG code to emulate atomic instructions and paired load-link/store-conditional instructions.
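
As a rough illustration of the idea (a sketch under stated assumptions, not the code from that patch series), a load-link/store-conditional pair can be approximated on a host that only offers compare-and-swap: the load-link remembers the address and the value it saw, and the store-conditional succeeds only if a compare-and-swap against that remembered value succeeds. `ExclusiveMonitor`, `emulate_load_link`, `emulate_store_conditional` and `guest_atomic_add` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU state; not a QEMU structure. */
typedef struct {
    _Atomic uint32_t *addr;   /* location marked by the last load-link, or NULL */
    uint32_t value;           /* value observed by that load-link */
} ExclusiveMonitor;

/* Load-link: read the location and remember what was seen. */
uint32_t emulate_load_link(ExclusiveMonitor *m, _Atomic uint32_t *addr)
{
    assert(addr != NULL);
    m->addr = addr;
    m->value = atomic_load(addr);
    return m->value;
}

/* Store-conditional: store only if the location still holds the value
 * seen by the matching load-link. Returns true on success. */
bool emulate_store_conditional(ExclusiveMonitor *m, _Atomic uint32_t *addr,
                               uint32_t newval)
{
    bool ok = false;

    if (m->addr == addr) {
        uint32_t expected = m->value;
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    m->addr = NULL;   /* the monitor is cleared whether or not the store happened */
    return ok;
}

/* A guest-style atomic add written as an LL/SC retry loop.
 * Returns the value before the addition. */
uint32_t guest_atomic_add(ExclusiveMonitor *m, _Atomic uint32_t *addr,
                          uint32_t delta)
{
    uint32_t old;

    do {
        old = emulate_load_link(m, addr);
    } while (!emulate_store_conditional(m, addr, old + delta));
    return old;
}
```

One known limitation of this style of emulation is that compare-and-swap only compares values: if another CPU changes the location and then restores the original value between the load-link and the store-conditional, the emulated store-conditional succeeds where real hardware would fail (the ABA problem).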

Slow path for atomic instruction emulation

This work by Alvise Rigo tweaks the SoftMMU emulation to trigger a slow path in contended cases.

Front-end and Back-end conversions

Each front end will need to be converted to use MTTCG-aware atomics and to instrument its barrier instructions.

Each back end will need to support the generation of new TCGOps required to support the front ends.

How to get involved

Right now, there is a small dedicated team looking at this issue. Its members are:

Alex Bennée (Review, testing, base enabling tree)
Fred Konrad (Original core MTTCG patch set)
Alvise Rigo (LL/SC work)
Emilio Cota (QHT, cmpxchg atomics)
Mark Burton
Pavel Dovgalyuk

Mailing List

If you would like to be involved, please use the mailing list: mttcg@listserver.greensocs.com

You can subscribe here:

       http://listserver.greensocs.com/wws/info/mttcg

If you send to this mailing list, please make sure to copy qemu-devel as well.

There is a fortnightly phone conference, with summary notes posted to the mailing lists (archives).

Current Code

Remember these trees are WORK-IN-PROGRESS and could be broken at any particular point. Branches may be re-based without notice.

MTTCG Work:

Latest Tree: https://github.com/stsquad/qemu (branch:mttcg/enable-mttcg-for-armv7-v1)
Fred's Tree: http://git.greensocs.com/fkonrad/mttcg.git (branch:multi_tcg_v8)

LL/SC Work

Alvise's Tree: https://git.virtualopensystems.com/dev/qemu-mt.git (branch:slowpath-for-atomic-v8-no-mttcg)

MTTCG Test Cases:

These are tests specifically designed to exercise the code, based on kvm-unit-tests:

https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5

Other Work

This is the most important section initially, and we welcome any and all comments and other work. If you know of any patch sets that may be of value, PLEASE let us know via the qemu-devel mailing list.

Proof of concept implementations

Below are all the proof of concept implementations we have found thus far. It is highly likely that some of these patch sets can help us to reach an up-streamable solution. At the very least these provide some evidence that there is a performance improvement to be had.

HQEMU
- http://dl.acm.org/citation.cfm?id=2259030&CFID=454906387&CFTOKEN=60579010

PQEMU
- https://github.com/podinx/PQEMU
- http://www.cs.nthu.edu.tw/~ychung/conference/ICPADS2011.pdf

COREMU

- http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:coremu-ppopp11.pdf (COREMU: a Scalable and Portable Parallel Full-system Emulator)
- http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:reemu-ppopp13.pdf (Scalable Deterministic Replay in a Parallel Full-system Emulator)