Features/tcg-multithread: Difference between revisions

From QEMU
(Update as a proper feature now)
=MultiThreaded support in the TCG=
This is the feature that allows the Tiny Code Generator (TCG) to run one host thread per guest thread or guest vCPU (in system emulation mode). It was first introduced in QEMU [[ChangeLog/2.9|2.9]] for Alpha and ARM. Work to enable full multi-threading support in additional system emulations is ongoing.
 
'''This is work in progress'''. The most tested combination is ARMv7 running on an x86 backend however the general patches run for all architectures depending on what the test case is doing. For full support however each Front End (guest) and Back End (tcg host) need to be converted to have solutions for:
 
* Atomic Instructions
* Memory Coherence (honouring barriers)
 
The intention is to support all combinations where they make sense. See the bottom of the page for links, recent discussions and code.


==Overview==


QEMU can currently emulate a number of CPUs in parallel, but it does so in a single thread. Given that many modern hosts are multi-core, and many targets equally use multiple cores, a significant performance advantage can be realised by using multiple host threads to simulate multiple target cores.
QEMU's system emulation mode could always emulate multiple vCPUs, but it scheduled them in a single thread and executed each one in turn in a round-robin fashion. Switching to one host thread per vCPU required a number of changes to the core code as well as explicit support in each guest architecture. The design decisions are documented in {{src|path=docs/multi-thread-tcg.txt}}.
 
There was a talk at KVM Forum 2015 ([https://www.youtube.com/watch?v=KnSW0WjWHZI video] [http://www.linux-kvm.org/images/c/cf/02x02-Alex_Benee-Towards_Multithreaded_TCG.pdf slides]) which acts as a useful primer. The general thread safety for system-emulation TCG builds on the work already done for linux-user emulation. Indeed some of the work has already been merged and is making a difference to the linux-user code. The main focus is working on whole system emulation.
 
The last design document [https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg00928.html was posted to the list in June 2016]. The current work in progress can be found in [https://raw.githubusercontent.com/stsquad/qemu/mttcg/base-patches-v4/docs/multi-thread-tcg.txt Alex's GIT tree].
 
==Already Merged Work==
 
* Atomic patching of TranslationBlocks
* Re-factoring of main cpu_exec loop
* QHT based lookups of next TB
* Initial memory consistency support (GSoC 2016)


* Lockless hot-path in cpu_exec (build on QHT)
There was a talk at KVM Forum 2015 ([https://www.youtube.com/watch?v=KnSW0WjWHZI video] [http://www.linux-kvm.org/images/c/cf/02x02-Alex_Benee-Towards_Multithreaded_TCG.pdf slides]) which is a little out of date but acts as a useful primer on the challenges involved.
* cpu-exec: Safe work in quiescent state (gives thread safe tb_flush)


== Ready to Merge ==
==Controlling MTTCG==


* cmpxchg-based atomics
Once MTTCG is supported for a guest there should be no need to enable it explicitly. System emulation will enable it by default if the following conditions are met:


==Plan and problems to solve==
* The guest architecture has defined TARGET_SUPPORTS_MTTCG
* The host architecture's TCG_TARGET_DEFAULT_MO supports TCG_GUEST_DEFAULT_MO


There are 3 main groups of problems and the additional work of enabling the various front and back ends.
When this is not the case you can force MTTCG by specifying:


===General Thread Safety===
    $QEMU $OPTS --accel tcg,thread=multi


These are covered by the current "Base enabling patches for MTTCG" ([https://lists.gnu.org/archive/html/qemu-devel/2016-06/msg00922.html v3], [https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4 WIP Branch]). This is an architecture independent patch series which allows you to run multi-threaded test programs as long as they don't make any assumptions about:
although you are likely to get strange behaviour. If you suspect that guest emulation is incorrect you can revert to single threaded mode and re-run your test:


* Atomicity
    $QEMU $OPTS --accel tcg,thread=single
* Memory consistency
   
* Cache flushes behaviour (v4 should fix cputlb)
==Incompatibilities==


This basically means dedicated test programs [https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5 see Alex's kvm-unit-tests]
MTTCG is not compatible with -icount and enabling icount will force a single threaded run.


===Memory consistency===
==Developer Details==


Host and guest might implement different memory consistency models. While supporting a weak ordering model on a strong ordering back-end isn't a problem it's going to be hard supporting strong ordering on a weakly ordered back-end.
===Porting a guest architecture===


* Remaining Case: strong on weak, ex. emulating x86 memory model on ARM systems
Before MTTCG can be enabled for a guest the following changes must be made.


===Instruction atomicity===
* Port atomic primitives to use tcg_gen_atomic_
* Define TCG_GUEST_DEFAULT_MO
* Audit instructions that modify system state
  - generally this means taking BQL (e.g. HELPER(set_cp_reg))
* Audit MMU management functions
  - cputlb provides an API for various tlb_flush_FOO operations
* Audit power/reset sequences
  - see for example {{src|path=target/arm/powerctl}}


There are a number of approaches being discussed on the list at the moment:
The work queue API async_[safe_]run_on_cpu provides a mechanism for one vCPU to queue work on another.


==== cmpxchg-based emulation of atomics ====
===Testing===


This work by Emilio Cota and Richard Henderson adds a number of atomic primitives which can be used in TCG code to emulate atomic instructions and paired load-link store-conditionals.
Ideally you'll want a comprehensive set of tests to exercise the corner cases of system emulation behaviour. See [https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v7 Alex's kvm-unit-tests] for an example of how the ARM architecture is exercised.


==== Slow path for atomic instruction emulation ====
===Further Work===


This work by Alvise Rigo tweaks the SoftMMU emulation to trigger a slow path in contended cases.
* Enabling strong-on-weak memory consistency (e.g. emulate x86 on an ARM host)


===Front-end and Back-end conversions===
===People===


Each front end will need to be converted to use MTTCG aware atomics and instrument their barrier instructions.
Now that MTTCG is merged, it is supported by the TCG maintainers. However, the following people were involved:


Each back end will need to support the generation of new TCGOps required to support the front ends.


==How to get involved==
Right now, there is a small dedicated team looking at this issue. Those are:
* Alex Bennée (Review, testing, base enabling tree)
* Fred Konrad (Original core MTTCG patch set)
* Alex Bennée (ARM testing, base enabling tree)
* Alvise Rigo (LL/SC work)
* Emilio Cota (QHT, cmpxchg atomics)
* Mark Burton
* Pavel Dovgalyuk
=== Mailing List ===
If you would like to be involved, please use the mail list: mttcg@listserver.greensocs.com
You can subscribe here:
        http://listserver.greensocs.com/wws/info/mttcg
If you send to this mail list, please make sure to copy qemu-devel as well.
There is a once a fortnight phone conference with summary notes posted to the mailing lists ([http://lists.nongnu.org/archive/cgi-bin/namazu.cgi?query=MTTCG&submit=Search%21&idxname=qemu-devel&max=20&result=normal&sort=date%3Alate archives]).
===Current Code===
Remember these trees are '''WORK-IN-PROGRESS''' and could be broken at any particular point. Branches may be re-based without notice.
MTTCG Work:
* Latest Tree: https://github.com/stsquad/qemu ('''branch:'''mttcg/enable-mttcg-for-armv7-v1)
* Fred's Tree: http://git.greensocs.com/fkonrad/mttcg.git ('''branch:'''multi_tcg_v8)
LL/SC Work
* Alvise's Tree: https://git.virtualopensystems.com/dev/qemu-mt.git ('''branch:'''slowpath-for-atomic-v8-no-mttcg)
MTTCG Test Cases:
These are tests specifically designed to exercise the code, based on kvm-unit-tests:
* https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v5
==Other Work==
This is the most important section initially, and we welcome any, and all comments and other work.
If you know of any patch sets that may be of value, PLEASE let us know via the qemu-devel mail list.
===Proof of concept implementations===
Below are all the proof of concept implementations we have found thus far. Most of them seem to have bitrotted.
* HQEMU
** http://dl.acm.org/citation.cfm?id=2259030&CFID=454906387&CFTOKEN=60579010
* PQEMU
** https://github.com/podinx/PQEMU
** http://www.cs.nthu.edu.tw/~ychung/conference/ICPADS2011.pdf
* COREMU
** http://sourceforge.net/p/coremu/home/Home/
** [http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:coremu-ppopp11.pdf COREMU: a Scalable and Portable Parallel Full-system Emulator]
** [http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:reemu-ppopp13.pdf Scalable Deterministic Replay in a Parallel Full-system Emulator]

Revision as of 14:00, 12 July 2017

This is the feature that allows the Tiny Code Generator (TCG) to run one host thread per guest thread or guest vCPU (in system emulation mode). It was first introduced in QEMU 2.9 for Alpha and ARM. Work to enable full multi-threading support in additional system emulations is ongoing.

Overview

QEMU's system emulation mode could always emulate multiple vCPUs, but it scheduled them in a single thread and executed each one in turn in a round-robin fashion. Switching to one host thread per vCPU required a number of changes to the core code as well as explicit support in each guest architecture. The design decisions are documented in docs/multi-thread-tcg.txt.
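
The difference between the two scheduling models can be illustrated with a small sketch. This is a toy model only: the `VCPU` class and both `run_*` functions are hypothetical names invented for illustration and do not correspond to QEMU code, and "executing" a slice is reduced to incrementing a counter.

```python
import threading


class VCPU:
    """Hypothetical stand-in for a guest vCPU; not a QEMU type."""

    def __init__(self, index):
        self.index = index
        self.executed = 0  # count of pretend translated blocks that were run

    def run_slice(self, blocks):
        # Pretend to execute a few translated blocks of guest code.
        self.executed += blocks


def run_round_robin(vcpus, slices, blocks_per_slice):
    """Single-threaded model: one host thread visits each vCPU in turn."""
    order = []
    for _ in range(slices):
        for cpu in vcpus:
            cpu.run_slice(blocks_per_slice)
            order.append(cpu.index)
    return order


def run_threaded(vcpus, slices, blocks_per_slice):
    """Thread-per-vCPU model: every vCPU gets its own host thread."""

    def loop(cpu):
        for _ in range(slices):
            cpu.run_slice(blocks_per_slice)

    threads = [threading.Thread(target=loop, args=(cpu,)) for cpu in vcpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(threads)
```

In the round-robin model the interleaving of vCPUs is fixed by the scheduler loop. In the threaded model it is decided by the host, which is why shared state such as translated code, TLBs and device emulation has to be made thread safe.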

There was a talk at KVM Forum 2015 (video, slides) which is a little out of date but acts as a useful primer on the challenges involved.

Controlling MTTCG

Once MTTCG is supported for a guest there should be no need to enable it explicitly. System emulation will enable it by default if the following conditions are met:

  • The guest architecture has defined TARGET_SUPPORTS_MTTCG
  • The host architecture's TCG_TARGET_DEFAULT_MO supports TCG_GUEST_DEFAULT_MO
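
One way to read "supports" in the second condition is that every ordering guarantee the guest expects must also be provided by the host. The sketch below models that as a bitmask subset test. The flag names mirror QEMU's TCG_MO_* constants, but the numeric values, the `STRONG_MO` and `WEAK_MO` examples and both function names are assumptions made for illustration, not a copy of the QEMU implementation.

```python
# Memory-order flags modelled on QEMU's TCG_MO_* bits (values are assumed).
TCG_MO_LD_LD = 0x01  # an earlier load is ordered before a later load
TCG_MO_ST_LD = 0x02  # an earlier store is ordered before a later load
TCG_MO_LD_ST = 0x04  # an earlier load is ordered before a later store
TCG_MO_ST_ST = 0x08  # an earlier store is ordered before a later store
TCG_MO_ALL = 0x0F

# Hypothetical examples: a strongly ordered model that only lets a store be
# reordered past a later load, and a weak model with no default guarantees.
STRONG_MO = TCG_MO_ALL & ~TCG_MO_ST_LD
WEAK_MO = 0


def memory_orders_compatible(guest_mo, host_mo):
    """True if the host guarantees every ordering the guest relies on."""
    return (guest_mo & ~host_mo) == 0


def mttcg_enabled_by_default(guest_supports_mttcg, guest_mo, host_mo):
    """Model of the two conditions listed above."""
    return guest_supports_mttcg and memory_orders_compatible(guest_mo, host_mo)
```

Under these assumptions a weakly ordered guest on a strongly ordered host passes the check, while a strongly ordered guest on a weakly ordered host does not, which is the strong-on-weak case listed under Further Work.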

When this is not the case you can force MTTCG by specifying:

   $QEMU $OPTS --accel tcg,thread=multi

although you are likely to get strange behaviour. If you suspect that guest emulation is incorrect you can revert to single threaded mode and re-run your test:

   $QEMU $OPTS --accel tcg,thread=single
   

Incompatibilities

MTTCG is not compatible with -icount; enabling icount will force a single-threaded run.

Developer Details

Porting a guest architecture

Before MTTCG can be enabled for a guest the following changes must be made.

  • Port atomic primitives to use the tcg_gen_atomic_* operations
  • Define TCG_GUEST_DEFAULT_MO
  • Audit instructions that modify system state
      - generally this means taking the BQL (Big QEMU Lock), e.g. HELPER(set_cp_reg)
  • Audit MMU management functions
      - cputlb provides an API for various tlb_flush_FOO operations
  • Audit power/reset sequences
      - see for example target/arm/powerctl

The work queue API async_[safe_]run_on_cpu provides a mechanism for one vCPU to queue work on another.
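
The idea behind the work queue can be sketched as follows: instead of touching another vCPU's state directly, the requesting thread queues a function that the target vCPU's own thread runs later. This is a conceptual model only. `VCPU`, `model_async_run_on_cpu` and `flush_tlb` are hypothetical names, the real API is C and takes different arguments, and the "safe" variant (which waits until all vCPUs are quiescent) is not modelled here.

```python
import queue
import threading


class VCPU:
    """Hypothetical vCPU with a private work queue; not QEMU's CPUState."""

    def __init__(self, index):
        self.index = index
        self.work = queue.Queue()
        self.tlb_flushes = 0
        self.thread = threading.Thread(target=self._loop)

    def _loop(self):
        # The vCPU thread drains queued work; a real vCPU would also execute
        # guest code between work items.
        while True:
            item = self.work.get()
            if item is None:  # sentinel: stop this vCPU thread
                break
            func, data = item
            func(self, data)  # runs on the target vCPU's own thread


def model_async_run_on_cpu(cpu, func, data):
    """Queue func(cpu, data) for the target vCPU and return immediately."""
    cpu.work.put((func, data))


def flush_tlb(cpu, data):
    # Only ever called from cpu's own thread, so no lock is needed here.
    cpu.tlb_flushes += 1
```

A usage sketch: start each `cpu.thread`, call `model_async_run_on_cpu(cpus[1], flush_tlb, None)` from any thread, then put `None` on each queue and join the threads to shut down. Because the work always runs on the owning thread, per-vCPU state needs no extra locking in this model.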

Testing

Ideally you'll want a comprehensive set of tests to exercise the corner cases of system emulation behaviour. See Alex's kvm-unit-tests for an example of how the ARM architecture is exercised.

Further Work

  • Enabling strong-on-weak memory consistency (e.g. emulate x86 on an ARM host)

People

Now that MTTCG is merged, it is supported by the TCG maintainers. However, the following people were involved:

  • Fred Konrad (Original core MTTCG patch set)
  • Alex Bennée (ARM testing, base enabling tree)
  • Alvise Rigo (LL/SC work)
  • Emilio Cota (QHT, cmpxchg atomics)