Internships/ProjectIdeas/MTTCGEnhancements

MTTCG Performance Enhancements

Summary: The MTTCG Project is a project that converted the TCG engine from single threaded execution to multi-threaded execution to take advantage of all cores on a modern processor. With this conversion several performance bottlenecks were identified when running strongly ordered guests like x86 on weakly ordered hosts like ARM64. The first part of the project will be to quantify the identified bottlenecks for TCG performance. Based on this data, you need to prioritize one of the following sub-tasks.

Measure performance bottlenecks experimentally

 - Reasons for code flushes in the current code execution
 - Re-translation overhead for commonly used translation blocks
 - Consistency overhead caused by generating fence instructions for all loads/stores

Place TranslationBlock structures into the same memory block as code_gen_buffer

Consider what happens within every TB:

(1) We have one or more references to the TB address, via exit_tb.

For aarch64, this will normally require 2-4 insns.

 # alpha-softmmu
 0x7f75152114:  d0ffb320      adrp x0, #-0x99a000 (addr 0x7f747b8000)
 0x7f75152118:  91004c00      add x0, x0, #0x13 (19)
 0x7f7515211c:  17ffffc3      b #-0xf4 (addr 0x7f75152028)

 # alpha-linux-user
 0x00569500:  d2800260      mov x0, #0x13
 0x00569504:  f2b59820      movk x0, #0xacc1, lsl #16
 0x00569508:  f2c00fe0      movk x0, #0x7f, lsl #32
 0x0056950c:  17ffffdf      b #-0x84 (addr 0x569488)

We would reduce this to one insn, always, if the TB were close by, since the ADR instruction has a range of 1MB.

(2) We have zero to two references to a linked TB, via goto_tb.

Remove the 128MB translation cache size limit on ARM64.

The translation cache size for an ARM64 host is currently limited to 128 MB. This limitation is imposed by utilizing a branch instruction which encodes the jump offset and is limited by the number of bits it can use for the range of the offset. The performance impact by this limitation is severe and can be observed when you try to run large programs like a browser in the guest. The cache is flushed several times before the browser starts and the performance is not satisfactory. This limitation can be overcome by generating a branch-to-register instruction and utilizing that when the destination address is outside the range of what can be encoded in current branch instruction.

Based on the previous task of placing the translation structures within the code gen buffer, we can remove this 128 MB cache size limit as follows:

(i) Raise the maximum to 2GB by aligning an instruction pair, adrp+add, to compute the address; the following insn would branch. The update code would write a new destination by modifing the adrp+add with a single 64-bit store.

(ii) Eliminate the maximum altogether by referencing the destination directly in the TB. This is the !USE_DIRECT_JUMP path. It is normally not used on 64-bit targets because computing the full 64-bit address of the TB is harder, or just as hard, as computing the full 64-bit address of the destination.

However, if the TB is nearby, aarch64 can load the address from TB.jmp_target_addr in one insn, with LDR (literal). This pc-relative load also has a 1MB range.

This has the side benefit that it is much quicker to re-link TBs, both in the computation of the code for the destination as well as re-flushing the icache.

Implement an LRU translation block code cache.

In the current mechanism that it is not necessary to know how much code is going to be generated for a given set of TCG opcodes. When we reach the high-water mark, we flush everything and start over at the beginning of the buffer. We can improve this situation by not flushing the TBs that were recently used i.e., by implementing an LRU policy for freeing the blocks. If you manage the cache with an allocator, you'll need to know in advance how much code is going to be generated. This is going to require that you generate position-independent code into an external buffer and copy it into the code gen buffer after determining the size. We can then implement an LRU policy for removing unused blocks and saving the translation cache.

Avoid consistency overhead for strong memory model guests by generating load-acquire and store-release instructions.

To run a strongly ordered guest on a weakly ordered host using MTTCG, for example, x86 on ARM64, we have to generate fence instructions for all the guest memory accesses to ensure consistency. The overhead imposed by these fence instructions is significant (almost 3x when compared to a run without fence instructions). ARM64 provides load-acquire and store-release instructions which are sequentially consistent and can be used instead of generating fence instructions. Add support to generate these instructions in the TCG run-time to reduce the consistency overhead in MTTCG. You have to use the memory access auxiliary info tags to generate appropriate fences on the host architecture unlike the current situation, where only explicit guest fence instructions are translated.

Further Reading:

Richard Henderson's email on how to improve TCG performance
QEMU internals presentation
TCG documentation
The kernel has a detailed guide to memory barriers

Requirements: Working on this will require the student to develop a good understanding of the internals of tiny code generator (TCG) in QEMU. An understanding of compiler theory or previous knowledge of the TCG would also be beneficial to this work. Finally familiarity with GIT and being able to frequently re-base work on upstream master branch would be useful.

Details:

Skill level: intermediate
Language: C
Mentor: Alex Bennée <alex.bennee@linaro.org> (stsquad on IRC)
Suggested by: Pranith Kumar, Alex Bennée, and Richard Henderson