Measure Tiny Code Generation Quality

Status: Vanderson M. do Rosario <vandersonmr2@gmail.com> (vanderson on #qemu IRC) is working on this project for 2019 GSoC.

Mentor: Alex Bennée <alex.bennee@linaro.org> (stsquad on #qemu IRC)

Project Github: vandersonmr/gsoc-qemu [1]

Summary: In most applications, the majority of the execution time is spent in a very small portion of the code. Regions of code that execute with high frequency are called hot, while all other regions are called cold. As a direct consequence, emulators also spend most of their execution time emulating these hot regions, so dynamic compilers and translators need to pay extra attention to them. Generating high-quality code for these hot regions is fundamental to achieving high-performance emulation. Thus, one of the most important steps in tuning an emulator's performance is to identify the hot regions and measure their translation quality.

TBStatistics (TBStats)

Improving the code generation of the TCG backend is a hard task that involves reading through large amounts of text looking for anomalies in the generated code. It would be nice to have tools to more readily extract and parse code generation information. This would include options to dump:

The hottest Translation Blocks (TBs) and their execution counts (total and atomic).
Translation counters:
- The number of times a TB has been translated, uncached and spanned.
Code quality metrics:
- The number of TB guest, IR (TCG ops), and host instructions.
- The number of spills during register allocation.

For that reason, we collect all this information dynamically, for every TB or for a specific set of TBs, and store it in TBStatistics structures. Every TB can have one TBStatistics linked to it through a new field in the TranslationBlock structure[2]. Moreover, TBStatistics are not flushed during tb_flush: they outlive the TBs themselves and can be relinked to retranslated TBs by using their keys (phys_pc, pc, flags, cs_base) to match new TBs with their existing TBStats.

struct TBStatistics {
   tb_page_addr_t phys_pc;
   target_ulong pc;
   uint32_t     flags;
   /* cs_base isn't included in the hash but we do check for matches */
   target_ulong cs_base;

   /* Translation stats */
   struct {
       unsigned long total;
       unsigned long uncached;
       unsigned long spanning;
   } translations;

   /* Execution stats */
   struct {
       unsigned long total;
       unsigned long atomic;
   } executions;

   /* Code quality stats */
   struct {
       unsigned num_guest_inst;
       unsigned num_host_inst;
       unsigned num_tcg_inst;
       unsigned spills;
   } code;

   /* HMP information - used for referring to previous search */
   int display_id;
};
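
As an illustration of how these counters could be turned into a "hottest TBs" dump, the following is a minimal sketch rather than the project's actual code. The HotTB type and the functions hot_tb_cmp, sort_hottest and host_per_guest are hypothetical names introduced here; they only mirror the executions.total, num_guest_inst, num_host_inst and spills fields above. The sketch sorts entries by execution count and computes the number of host instructions emitted per guest instruction.

```c
#include <stdlib.h>

/* Hypothetical, flattened view of the counters kept in TBStatistics. */
typedef struct HotTB {
    unsigned long exec_total;   /* mirrors executions.total */
    unsigned num_guest_inst;    /* mirrors code.num_guest_inst */
    unsigned num_host_inst;     /* mirrors code.num_host_inst */
    unsigned spills;            /* mirrors code.spills */
} HotTB;

/* qsort comparator: entries with more executions come first. */
int hot_tb_cmp(const void *a, const void *b)
{
    const HotTB *x = a;
    const HotTB *y = b;

    if (x->exec_total == y->exec_total) {
        return 0;
    }
    return x->exec_total < y->exec_total ? 1 : -1;
}

/* Sort an array of entries from hottest to coldest. */
void sort_hottest(HotTB *tbs, size_t n)
{
    qsort(tbs, n, sizeof(*tbs), hot_tb_cmp);
}

/* Host instructions per guest instruction; 0.0 if no guest count is known. */
double host_per_guest(const HotTB *tb)
{
    if (tb->num_guest_inst == 0) {
        return 0.0;
    }
    return (double)tb->num_host_inst / tb->num_guest_inst;
}
```

After sort_hottest, the first few entries are the candidates worth inspecting; a high host_per_guest value or a large spill count on one of them hints at poor code generation for that block.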

Creating and Storing TBStats

When a TB is about to be created in tb_gen_code, we check whether CPU_LOG_HOT_TBS is set and whether the TB entry address (tb->pc) is in the range of interest[3]. If so, tb_get_stats is called, which either creates a new TBStatistics (if none with the same keys exists) or returns the existing one.

   if (qemu_loglevel_mask(CPU_LOG_HOT_TBS) && qemu_log_in_addr_range(tb->pc))
       tb->tb_stats = tb_get_stats(phys_pc, pc, cs_base, flags);
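
The get-or-create behaviour described above can be sketched as follows. This is a standalone illustration under stated assumptions, not the code from the project: every name prefixed with sketch_ or Sketch is hypothetical, sketch_addr_t stands in for tb_page_addr_t and target_ulong, and the fixed-size chained hash table is an assumption made to keep the example short (the real implementation may well use a different container). As in the comment inside the structure, cs_base is left out of the hash but is compared when looking for a match.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for tb_page_addr_t and target_ulong (assumed 64-bit here). */
typedef uint64_t sketch_addr_t;

/* Hypothetical, reduced version of TBStatistics: keys plus a chain link. */
typedef struct SketchTBStats {
    sketch_addr_t phys_pc;
    sketch_addr_t pc;
    uint32_t flags;
    sketch_addr_t cs_base;
    struct SketchTBStats *next;     /* next entry in the same bucket */
} SketchTBStats;

#define SKETCH_BUCKETS 256
static SketchTBStats *sketch_table[SKETCH_BUCKETS];

/* Hash over phys_pc, pc and flags only; cs_base is checked on lookup. */
unsigned sketch_hash(sketch_addr_t phys_pc, sketch_addr_t pc, uint32_t flags)
{
    uint64_t h = phys_pc;

    h = h * 31 + pc;
    h = h * 31 + flags;
    return (unsigned)(h % SKETCH_BUCKETS);
}

/* Return the entry matching all four keys, creating it if none exists. */
SketchTBStats *sketch_get_stats(sketch_addr_t phys_pc, sketch_addr_t pc,
                                sketch_addr_t cs_base, uint32_t flags)
{
    unsigned idx = sketch_hash(phys_pc, pc, flags);
    SketchTBStats *s;

    for (s = sketch_table[idx]; s != NULL; s = s->next) {
        if (s->phys_pc == phys_pc && s->pc == pc &&
            s->flags == flags && s->cs_base == cs_base) {
            return s;               /* existing stats: relink to them */
        }
    }

    s = calloc(1, sizeof(*s));
    assert(s != NULL);
    s->phys_pc = phys_pc;
    s->pc = pc;
    s->flags = flags;
    s->cs_base = cs_base;
    s->next = sketch_table[idx];    /* push onto the bucket chain */
    sketch_table[idx] = s;
    return s;
}
```

Because the table in this sketch is never cleared when translated code is discarded, a retranslated TB that presents the same keys receives the same entry again, which is how statistics can survive a tb_flush.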

Collecting/Filling TBStatistics Information

Controlling TBStatistics Collection

Dumping TBStatistics Information

Future Work and Schedule

Improving the code generation of the TCG backend is a hard task that involves reading through large amounts of text looking for anomalies in the generated code. It would be nice to have tools to more readily extract and parse code generation information. This would include:

Modifying code generator, dumping additional data

which blocks are hot (frequently run, hence more important performance-wise)
export block JIT information for the perf tool (the later version)

Tweaking -d op,out_asm output

how many fills/spills in a block (where register contents are moved due to register pressure)
number of host instructions for each guest instruction (JIT profiling has a basic version of this)
elide or beautify common blocks like softmmu access macros (which are always the same)

Modifying the HMP

support interactive exploration of translation state (system emulation)

QEMU currently only translates simple basic blocks with one or two exit paths. This work could be a precursor to supporting Internships/ProjectIdeas/Multi-exit Hot Blocks in the future.