Features/tcg-multithread: Difference between revisions

From QEMU
(Updated with links to kvm forum 2015 talk and other minor tweaks)
 
(19 intermediate revisions by 3 users not shown)
=MultiThreaded support in the TCG=
This is the feature that allows the Tiny Code Generator to run one host-thread per guest thread or guest vCPU (in system emulation mode). It was first introduced in QEMU [[ChangeLog/2.9|2.9]] for Alpha and ARM. Work to enable full multi-threading support in additional system emulations is ongoing.
 
'''This is work in progress''', see the bottom of the page for links, recent discussions and code.


==Overview==


QEMU can currently emulate a number of CPUs in parallel, but it does so in a single thread. Given that many modern hosts are multi-core, and many targets equally use multiple cores, a significant performance advantage can be realised by using multiple host threads to simulate multiple target cores.
QEMU's system emulation mode could always emulate multiple vCPUs but it scheduled them in a single thread and executed each one in turn in a round-robin fashion. Switching to a host-thread per vCPU required a number of changes to the core code as well as explicit support in each guest architecture. The design decisions are documented in {{src|path=docs/devel/multi-thread-tcg.txt}}.


There was a talk at KVM Forum 2015 ([https://www.youtube.com/watch?v=KnSW0WjWHZI video] [http://www.linux-kvm.org/images/c/cf/02x02-Alex_Benee-Towards_Multithreaded_TCG.pdf slides]) which acts as a useful primer.
There was a talk at KVM Forum 2015 ([https://www.youtube.com/watch?v=KnSW0WjWHZI video] [http://www.linux-kvm.org/images/c/cf/02x02-Alex_Benee-Towards_Multithreaded_TCG.pdf slides]) which is a little out of date but acts as a useful primer on the challenges involved.


==Plan and problems to solve==
==Controlling MTTCG==


The TCG today is close to being thread safe, but there is still some concern that there are remaining issues. The current work is focusing on system-emulation TCG threads. There is currently ongoing discussion on a design document which [https://lists.gnu.org/archive/html/qemu-devel/2015-06/msg03458.html was posted to the list in June].
Once a MTTCG guest is supported there should be no need to enable it explicitly. The system emulation will enable it if the following conditions are met:


The following is a ''currently incomplete'' list of issues to address:
* The guest architecture has defined TARGET_SUPPORTS_MTTCG
* The host architecture's TCG_TARGET_DEFAULT_MO supports TCG_GUEST_DEFAULT_MO


===Global TCG State===
When this is not the case you can force MTTCG by specifying:


Currently there is no protection against two threads attempting to generate code at the same time into the translation buffer. This means you do see corrupted code generation from time to time in multi-threaded apps. There are a couple of approaches we could take from adding locking to the code generator so only one thread at a time could generate code to having separate translation buffers for each thread of execution.
    $QEMU $OPTS --accel tcg,thread=multi


The key question here is whether the translated code cache should be per-guest-CPU (i.e. per-thread) or global (or at least shared between multiple guest CPUs). Having it per-guest-CPU means less possibility of locking contention, but means more overhead generating code -- every time the guest reschedules a process to another guest CPU we'll have to translate all its code all over again for the new CPU. A strictly global cache is not a great idea either because it won't work if we eventually move to supporting heterogeneous systems (e.g. one ARM CPU and one SH4). But letting each guest CPU have a pointer to its TCG cache (which could then be shared between all the guest CPUs in a cluster) could be made to work (and would let you examine the perf impact of sharing the code cache or not with fairly minor tweaks: just instantiate N caches, or one, and set the pointers to match).
although you are likely to get strange behaviour. If you suspect that guest emulation is incorrect you can revert to single threaded mode and re-run your test:


There are also a number of global variables and assumptions in the various back-ends which will need to be audited. I suspect these values will need to be wrapped up in a portable TCGContext.
    $QEMU $OPTS --accel tcg,thread=single
   
==Incompatibilities==


===Shared Cache===
MTTCG is not compatible with -icount and enabling icount will force a single threaded run.


Sharing a cache would allow us to take advantage of code that's translated by one core and then used on another. On the other hand with one per core you can perform updates on the caches with a lot less locking; however you've still got to be able to do invalidates across all the caches if any core does the write, and that could also get tricky (and expensive).
==Developer Details==


Having a per-core pointer to a qom TCGCacheClass seems attractive. However this might adversely affect the fast path. Nonetheless, it probably makes sense to refactor tb_* functions and such to have a TCGCache as first argument.
===Porting a guest architecture===


===Dirty tracking===
Before MTTCG can be enabled for a guest the following changes must be made.


Currently we handle guest writes to code like this:
* Correctly translate atomic/exclusive instructions (see tcg_gen_atomic_)
* we have a set of bitmaps for tracking dirty memory of various kinds
* Ensure the translation step correctly handles barrier instructions (tcg_gen_mb)
* for memory which is "clean" (meaning in this context "we've cached a translated code block for this address") we set the TLB entries up to force a slow-path write
* Define TCG_GUEST_DEFAULT_MO
* slow-path writes and also DMA writes end up calling invalidate_and_set_dirty(), which does "if (this range is clean) { invalidate tbs in range; mark range as dirty; }"
* Audit instructions that modify system state
So this is a fairly long sequence of operations (guest write; read bitmaps; update TB cache structures; update bitmaps) which is currently effectively atomic because of the single threading, and will need thought to avoid races. It's more complex than the "just add locks/per-core versions of data structures" parts mentioned above, because it cuts across several different data structures at once (QEMU TLB; global memory dirty bitmaps; TB caching). Also it's quite easy to forget because if it doesn't work then actually quite a lot of guest code still works fine...
** generally this means taking BQL (e.g. HELPER(set_cp_reg))
* Audit MMU management functions
** cputlb provides an API for various tlb_flush_FOO operations
** updates to the guest's page tables need to be atomic (e.g. dirty bits)
* Audit power/reset sequences
** see for example {{src|path=target/arm/arm-powerctl.c}}


NB: linux-user mode handles this a bit differently by marking memory as read-only and catching the signal on guest write attempts; the problems are probably slightly different there.
The work queue API async_[safe_]run_on_cpu provides a mechanism for one vCPU to queue work on another.


===Memory consistency===
Once this work is done your final patch can update configure and enable TARGET_SUPPORTS_MTTCG


Host and guest might implement different memory consistency models. While supporting a weak ordering model on a strong ordering backend isn't a problem it's going to be hard supporting strong ordering on a weakly ordered backend.
===Testing===
** Watch out for subtle differences; e.g. x86 is mostly strong ordered but can reorder stores made by the same CPU doing the load.


===Instruction atomicity===
Ideally you'll want a comprehensive set of tests to exercise the corner cases of system emulation behaviour. See [https://github.com/stsquad/kvm-unit-tests/tree/mttcg/current-tests-v7 Alex's kvm-unit-tests] for an example of how the ARM architecture is exercised.


Atomic instructions must be translated to an atomic host operation.
==Further Work==


I'd suggest the following refinement sequence:
* Enabling strong-on-weak memory consistency (e.g. emulate x86 on an ARM host)
* add the concept of memory coherence domains (MCD) in QEMU with a lock on each (can start with a system-wide MCD)
* wrap every instruction with the lock of the corresponding MCD
* remove locking for non-atomic instructions
** Take care in the way that non-atomics interact with atomics, an architecture might define something about what  non-atomics see, and what a non-atomic store does during an atomic.
* add atomic primitives in TCG (should have MCD as argument), translating them to using the appropriate lock
* optimize TCG atomics to use atomic instructions on the host
** Somehow deal with ARM/MIPS/Power split load/store atomics that might have arbitrary stuff in between.


==How to get involved==
==People==


Right now, there is a small dedicated team looking at this issue. Those are:
Now that MTTCG is merged it is supported by the TCG maintainers. However, the following people were involved:


* Fred Konrad (Core MTTCG patch set)
* Fred Konrad (Original core MTTCG patch set)
* Alex Bennée (ARM testing, base enabling tree)
* Alvise Rigo (LL/SC work)
* Alex Bennée (Review, testing)
* Emilio Cota (QHT, cmpxchg atomics)
* Mark Burton
* Pavel Dovgalyuk
 
=== Mailing List ===
If you would like to be involved, please use the mail list: mttcg@listserver.greensocs.com
 
You can subscribe here:
http://listserver.greensocs.com/wws/info/mttcg
 
If you send to this mail list, please make sure to copy qemu-devel as well.
 
There is a once a fortnight phone conference with summary notes posted to the mailing lists ([http://lists.nongnu.org/archive/cgi-bin/namazu.cgi?query=MTTCG&submit=Search%21&idxname=qemu-devel&max=20&result=normal&sort=date%3Alate archives]).
 
===Current Code===
 
Remember these trees are '''WORK-IN-PROGRESS''' and could be broken at any particular point. Branches may be re-based without notice.
 
MTTCG Work:
 
* Fred's Tree: http://git.greensocs.com/fkonrad/mttcg.git ('''branch:'''multi_tcg_v8)
* Alex's Tree: https://github.com/stsquad/qemu ('''branch:'''mttcg/multi_tcg_v8_ajb-r1)
 
LL/SC Work
 
* Alvise's Tree: https://git.virtualopensystems.com/dev/qemu-mt.git ('''branch:'''slowpath-for-atomic-v7-no-mttcg)
 
==Other Work==
 
This is the most important section initially, and we welcome any, and all comments and other work.
If you know of any patch sets that may be of value, PLEASE let us know via the qemu-devel mail list.
 
===Proof of concept implementations===
Below are all the proof of concept implementations we have found thus far. It is highly likely that some of these patch sets can help us to reach an up-streamable solution. At the very least these provide some evidence that there is a performance improvement to be had.
 
* HQEMU
** http://dl.acm.org/citation.cfm?id=2259030&CFID=454906387&CFTOKEN=60579010
 
* PQEMU
** https://github.com/podinx/PQEMU
** http://www.cs.nthu.edu.tw/~ychung/conference/ICPADS2011.pdf
 
* COREMU
** http://sourceforge.net/p/coremu/home/Home/
** [http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:coremu-ppopp11.pdf COREMU: a Scalable and Portable Parallel Full-system Emulator]
** [http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:reemu-ppopp13.pdf Scalable Deterministic Replay in a Parallel Full-system Emulator]
 
==Follow up work==
 
There are some additional things that will need to be looked at for ''user-mode'' emulation.


===Signal Handling===
==Other Reading==


There are two types of signal we need to handle: synchronous (e.g. SIGBUS, SIGSEGV) and asynchronous (e.g. SIGSTOP, SIGINT, SIGUSR). While any signal can be sent asynchronously, most of the common synchronous ones occur when there is an error in the translated code. As such, rectifying machine state is fairly well tested. For asynchronous signals there are a plethora of edge cases to deal with, especially around the handling of signals with respect to system calls. If they arrive during translated code their behaviour is fairly easy to handle; however, when in QEMU's own code, care has to be taken that syscalls respond correctly to the EINTR.
* Emilio's slides for his CGO17 paper - http://www.cs.columbia.edu/~cota/pubs/cota_cgo17-slides.pdf
* Cross-ISA Machine Emulation for Multicores - http://www.cs.columbia.edu/~cota/pubs/cota_cgo17.pdf DOI:10.1109/CGO.2017.7863741
* Cross-ISA Machine Instrumentation using Fast and Scalable Dynamic Binary Translation - https://dl.acm.org/doi/pdf/10.1145/3313808.3313811

Latest revision as of 10:14, 4 August 2021

This is the feature that allows the Tiny Code Generator to run one host-thread per guest thread or guest vCPU (in system emulation mode). It was first introduced in QEMU 2.9 for Alpha and ARM. Work to enable full multi-threading support in additional system emulations is ongoing.

Overview

QEMU's system emulation mode could always emulate multiple vCPUs but it scheduled them in a single thread and executed each one in turn in a round-robin fashion. Switching to a host-thread per vCPU required a number of changes to the core code as well as explicit support in each guest architecture. The design decisions are documented in docs/devel/multi-thread-tcg.txt.

There was a talk at KVM Forum 2015 (video slides) which is a little out of date but acts as a useful primer on the challenges involved.

Controlling MTTCG

Once a MTTCG guest is supported there should be no need to enable it explicitly. The system emulation will enable it if the following conditions are met:

  • The guest architecture has defined TARGET_SUPPORTS_MTTCG
  • The host architecture's TCG_TARGET_DEFAULT_MO supports TCG_GUEST_DEFAULT_MO

When this is not the case you can force MTTCG by specifying:

   $QEMU $OPTS --accel tcg,thread=multi

although you are likely to get strange behaviour. If you suspect that guest emulation is incorrect you can revert to single threaded mode and re-run your test:

   $QEMU $OPTS --accel tcg,thread=single
   

Incompatibilities

MTTCG is not compatible with -icount and enabling icount will force a single threaded run.
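
The selection rules from the two sections above can be sketched in C. This is only an illustration and not QEMU's code: the struct, the function names and the MO_* bit values are assumptions made for the example. It reads "the host memory order supports the guest memory order" as "every ordering the guest relies on is also provided by the host".

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative ordering bits, named "earlier access _ later access".
 * They mirror the idea behind the TCG_MO_* flags, but the names and
 * values used here are assumptions made for this sketch. */
enum {
    MO_LD_LD = 1 << 0,  /* loads stay ordered before later loads   */
    MO_ST_LD = 1 << 1,  /* stores stay ordered before later loads  */
    MO_LD_ST = 1 << 2,  /* loads stay ordered before later stores  */
    MO_ST_ST = 1 << 3,  /* stores stay ordered before later stores */
    MO_ALL   = MO_LD_LD | MO_ST_LD | MO_LD_ST | MO_ST_ST
};

/* Hypothetical description of one guest/host pairing. */
struct mttcg_config {
    bool     guest_supports_mttcg; /* stands in for TARGET_SUPPORTS_MTTCG */
    uint32_t guest_mo;             /* stands in for TCG_GUEST_DEFAULT_MO  */
    uint32_t host_mo;              /* stands in for TCG_TARGET_DEFAULT_MO */
    bool     icount_enabled;       /* -icount was requested               */
};

/* Orderings the guest relies on that the host does not give for free. */
static uint32_t missing_orderings(const struct mttcg_config *c)
{
    return c->guest_mo & ~c->host_mo;
}

/* True when multi-threaded TCG would be chosen without any user override. */
static bool mttcg_default_enabled(const struct mttcg_config *c)
{
    if (c->icount_enabled) {
        return false;   /* icount forces a single threaded run */
    }
    return c->guest_supports_mttcg && missing_orderings(c) == 0;
}
```

A weakly ordered guest on a strongly ordered host leaves missing_orderings() at zero, so MTTCG is the default. The reverse pairing leaves bits set, which is the case where thread=multi has to be forced and strange behaviour becomes likely.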

Developer Details

Porting a guest architecture

Before MTTCG can be enabled for a guest the following changes must be made.

  • Correctly translate atomic/exclusive instructions (see tcg_gen_atomic_)
  • Ensure the translation step correctly handles barrier instructions (tcg_gen_mb)
  • Define TCG_GUEST_DEFAULT_MO
  • Audit instructions that modify system state
    • generally this means taking BQL (e.g. HELPER(set_cp_reg))
  • Audit MMU management functions
    • cputlb provides an API for various tlb_flush_FOO operations
    • updates to the guest's page tables need to be atomic (e.g. dirty bits)
  • Audit power/reset sequences
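
To illustrate the first item in the list above, here is a minimal sketch of one way a guest exclusive load/store pair can be mapped onto a host compare-and-swap. It uses plain C11 atomics rather than TCG ops, and the struct and function names are hypothetical. It also assumes that comparing values is an acceptable stand-in for a hardware exclusive monitor, which means it cannot notice a location that was changed and then changed back.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vCPU monitor state for an exclusive load/store pair. */
struct excl_monitor {
    _Atomic uint32_t *addr;  /* location named by the last exclusive load */
    uint32_t          val;   /* value observed by that load               */
};

/* Guest "load exclusive": remember where we loaded from and what we saw. */
static uint32_t load_exclusive(struct excl_monitor *m, _Atomic uint32_t *addr)
{
    m->addr = addr;
    m->val = atomic_load(addr);
    return m->val;
}

/* Guest "store exclusive": succeeds only if the location still holds the
 * remembered value. The compare and the store are a single host atomic
 * operation, so another vCPU thread cannot write in between them. */
static bool store_exclusive(struct excl_monitor *m, _Atomic uint32_t *addr,
                            uint32_t newval)
{
    bool ok = false;

    if (m->addr == addr) {
        uint32_t expected = m->val;
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    m->addr = NULL;  /* the monitor is cleared whether or not we succeeded */
    return ok;
}

/* Guest atomic increment written as the usual retry loop.
 * Returns the value seen before the increment. */
static uint32_t guest_atomic_inc(struct excl_monitor *m, _Atomic uint32_t *addr)
{
    uint32_t old;

    do {
        old = load_exclusive(m, addr);
    } while (!store_exclusive(m, addr, old + 1));
    return old;
}
```

In a single-threaded round-robin scheduler a plain load followed by a plain store would have been enough, because no other vCPU could run in between. With one host thread per vCPU the store has to be conditional and atomic on the host as well.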

The work queue API async_[safe_]run_on_cpu provides a mechanism for one vCPU to queue work on another.
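
The idea behind such a work queue can be sketched as a per-vCPU list protected by a lock. This is a simplified illustration and not the async_[safe_]run_on_cpu implementation: every name below is hypothetical, and it does not model whatever extra guarantees the "safe" variant gives.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*work_fn)(void *opaque);

struct work_item {
    work_fn           fn;
    void             *opaque;
    struct work_item *next;
};

/* Hypothetical vCPU with a FIFO of pending work. */
struct vcpu {
    pthread_mutex_t   work_lock;
    struct work_item *head;
    struct work_item *tail;
};

static void vcpu_init(struct vcpu *cpu)
{
    pthread_mutex_init(&cpu->work_lock, NULL);
    cpu->head = cpu->tail = NULL;
}

/* Callable from any thread: append work for 'cpu'. Returns 0 on success. */
static int queue_work_on_vcpu(struct vcpu *cpu, work_fn fn, void *opaque)
{
    struct work_item *wi = malloc(sizeof(*wi));

    if (!wi) {
        return -1;
    }
    wi->fn = fn;
    wi->opaque = opaque;
    wi->next = NULL;

    pthread_mutex_lock(&cpu->work_lock);
    if (cpu->tail) {
        cpu->tail->next = wi;
    } else {
        cpu->head = wi;
    }
    cpu->tail = wi;
    pthread_mutex_unlock(&cpu->work_lock);
    return 0;
}

/* Called only by the thread that runs 'cpu', at a point where it is not
 * executing guest code. Returns the number of items that were run. */
static int process_queued_work(struct vcpu *cpu)
{
    int ran = 0;

    for (;;) {
        pthread_mutex_lock(&cpu->work_lock);
        struct work_item *wi = cpu->head;
        if (wi) {
            cpu->head = wi->next;
            if (!cpu->head) {
                cpu->tail = NULL;
            }
        }
        pthread_mutex_unlock(&cpu->work_lock);

        if (!wi) {
            break;
        }
        wi->fn(wi->opaque);  /* run the item without holding the lock */
        free(wi);
        ran++;
    }
    return ran;
}

/* Example work function: bump the integer counter passed as opaque. */
static void bump_counter(void *opaque)
{
    ++*(int *)opaque;
}
```

The point of the pattern is that state owned by one vCPU is only ever modified by that vCPU's own thread. Other threads hand over a request instead of touching the state directly.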

Once this work is done your final patch can update configure and enable TARGET_SUPPORTS_MTTCG

Testing

Ideally you'll want a comprehensive set of tests to exercise the corner cases of system emulation behaviour. See Alex's kvm-unit-tests for an example of how the ARM architecture is exercised.

Further Work

  • Enabling strong-on-weak memory consistency (e.g. emulate x86 on an ARM host)

People

Now that MTTCG is merged it is supported by the TCG maintainers. However, the following people were involved:

  • Fred Konrad (Original core MTTCG patch set)
  • Alex Bennée (ARM testing, base enabling tree)
  • Alvise Rigo (LL/SC work)
  • Emilio Cota (QHT, cmpxchg atomics)

Other Reading