Internships/ProjectIdeas/MTTCGEnhancements

MTTCG Performance Enhancements

Summary: The MTTCG Project is a project that converted the TCG engine from single threaded execution to multi-threaded execution to take advantage of all cores on a modern processor. With this conversion several performance bottlenecks were identified when running strongly ordered guests like x86 on weakly ordered guests like ARM64.

Implement jump-to-register instruction on ARM64 to overcome the 128MB translation cache size limit.

The translation cache size for an ARM64 host is currently limited to 128 MB. This limitation is imposed by utilizing a branch instruction which encodes the jump offset and is limited by the number of bits it can use for the range of the offset. The performance impact by this limitation is severe and can be observed when you try to run large programs like a browser in the guest. The cache is flushed several times before the browser starts and the performance is not satisfactory. This limitation can be overcome by generating a branch-to-register instruction and utilizing that when the destination address is outside the range of what can be encoded in current branch instruction.

Eliminate the maximum altogether by referencing the destination directly in the TB. This is the !USE_DIRECT_JUMP path. It is normally not used on 64-bit targets because computing the full 64-bit address of the TB is harder, or just as hard, as computing the full 64-bit address of the destination. However, if the TB is nearby, aarch64 can load the address from TB.jmp_target_addr in one insn, with LDR (literal). This pc-relative load also has a 1MB range. This has the side benefit that it is much quicker to re-link TBs, both in the computation of the code for the destination as well as re-flushing the icache.

Implement an LRU translation block code cache.

In the current mechanism that it is not necessary to know how much code is going to be generated for a given set of TCG opcodes. When we reach the high-water mark, we flush everything and start over at the beginning of the buffer. We can improve this situation by not flushing the TBs that were recently used i.e., by implementing an LRU policy for freeing the blocks. If you manage the cache with an allocator, you'll need to know in advance how much code is going to be generated. This is going to require that you generate position-independent code into an external buffer and copy it into the code gen buffer after determining the size. We can then implement an LRU policy for removing unused blocks and saving the translation cache.

Avoid consistency overhead for strong memory model guests by generating load-acquire and store-release instructions.

To run a strongly ordered guest on a weakly ordered host using MTTCG, for example, x86 on ARM64, we have to generate fence instructions for all the guest memory accesses to ensure consistency. The overhead imposed by these fence instructions is significant (almost 3x when compared to a run without fence instructions). ARM64 provides load-acquire and store-release instructions which are sequentially consistent and can be used instead of generating fence instructions. I plan to add support to generate these instructions in the TCG run-time to reduce the consistency overhead in MTTCG.

Further Reading:

Richard Henderson's email on how to improve TCG performance
QEMU internals presentation
TCG documentation
The kernel has a detailed guide to memory barriers

Requirements: Working on this will require the student to develop a good understanding of the internals of tiny code generator (TCG) in QEMU. An understanding of compiler theory or previous knowledge of the TCG would also be beneficial to this work. Finally familiarity with GIT and being able to frequently re-base work on upstream master branch would be useful.

Details:

Skill level: intermediate
Language: C
Mentor: Alex Bennée <alex.bennee@linaro.org> (stsquad on IRC)
Suggested by: Pranith Kumar, Alex Bennée, and Richard Henderson