Internships/ProjectIdeas/TCGCodeQuality: Difference between revisions

Revision as of 02:27, 8 July 2019

Measure Tiny Code Generation Quality

Status: Vanderson M. do Rosario <vandersonmr2@gmail.com> (vanderson on #qemu IRC) is working on this project for 2019 GSoC.

Mentor: Alex Bennée <alex.bennee@linaro.org> (stsquad on #qemu IRC)

Project Github: vandersonmr/gsoc-qemu [1]

Summary: In most applications, the majority of the execution time is spent in a very small portion of the code. Regions of code that execute frequently are called hot, while all other regions are called cold. As a direct consequence, emulators also spend most of their execution time emulating these hot regions, so dynamic compilers and translators need to pay extra attention to them. Generating high-quality code for these hot regions is fundamental to achieving high-performance emulation. Thus, one of the most important steps in tuning an emulator's performance is to identify the hot regions and measure their translation quality.

TBStatistics (TBStats)

Improving the code generation of the TCG backend is a hard task that involves reading through large amounts of text looking for anomalies in the generated code. It would be nice to have tools to more readily extract and parse code generation information. This would include options to dump:

  • The hottest Translation Blocks (TBs) and their execution counts (total and atomic).
  • Translation counters:
    • The number of times a TB has been translated, uncached and spanned.
  • Code quality metrics:
    • The number of guest, IR (TCG ops), and host instructions in a TB.
    • The number of spills during register allocation.

For that reason, we collect all this information dynamically for every TB or for a specific set of TBs and store it in TBStatistics structures. Every TB can have one TBStatistics linked to it through a new field inserted in the TranslationBlock structure[2]. Moreover, TBStatistics are not flushed during tb_flush; they outlive their TBs and can be relinked to retranslated TBs by matching their keys (phys_pc, pc, flags, cs_base).

struct TBStatistics {
    tb_page_addr_t phys_pc;
    target_ulong pc;
    uint32_t     flags;
    /* cs_base isn't included in the hash but we do check for matches */
    target_ulong cs_base;

    /* Translation stats */
    struct {
        unsigned long total;
        unsigned long uncached;
        unsigned long spanning;
    } translations;

    /* Execution stats */
    struct {
        unsigned long total;
        unsigned long atomic;
    } executions;

    struct {
        unsigned num_guest_inst;
        unsigned num_host_inst;
        unsigned num_tcg_inst;
        unsigned spills;
    } code;

    /* HMP information - used for referring to previous search */
    int display_id;
};

Creating and Storing TBStats

When a TB is about to be created in tb_gen_code, we check whether CPU_LOG_HOT_TBS is set and whether the TB entry address (tb->pc) is in the range of interest[3]. If so, tb_get_stats is called, which either creates a new TBStatistics (if none with the same keys exists) or returns the existing one. The tb_get_stats function stores the TBStatistics structures in a qht hash table in the TBContext[4].

   if (qemu_loglevel_mask(CPU_LOG_HOT_TBS) && qemu_log_in_addr_range(tb->pc))
       tb->tb_stats = tb_get_stats(phys_pc, pc, cs_base, flags);
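
The lookup-or-create behaviour described above can be sketched as follows. This is a simplified illustration, not the QEMU implementation: it assumes a small chained hash table in place of qht, uses plain integer types in place of tb_page_addr_t and target_ulong, and all names (TBStatsSketch, sketch_get_stats, and so on) are hypothetical. As in the structure comment, cs_base is left out of the hash but still compared.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for TBStatistics (not the QEMU type). */
typedef struct TBStatsSketch {
    uint64_t phys_pc, pc, cs_base;
    uint32_t flags;
    unsigned long exec_total;
    struct TBStatsSketch *next;   /* bucket chain */
} TBStatsSketch;

#define SKETCH_BUCKETS 256
static TBStatsSketch *sketch_table[SKETCH_BUCKETS];

/* Hash on phys_pc, pc and flags only; cs_base is compared but not hashed. */
static unsigned sketch_hash(uint64_t phys_pc, uint64_t pc, uint32_t flags)
{
    uint64_t h = phys_pc * 0x9E3779B97F4A7C15ULL;
    h ^= pc + 0x9E3779B97F4A7C15ULL + (h << 6) + (h >> 2);
    h ^= flags;
    return (unsigned)(h % SKETCH_BUCKETS);
}

/* Return the existing entry for this key, or create a zeroed one. */
TBStatsSketch *sketch_get_stats(uint64_t phys_pc, uint64_t pc,
                                uint64_t cs_base, uint32_t flags)
{
    unsigned b = sketch_hash(phys_pc, pc, flags);
    TBStatsSketch *s;

    for (s = sketch_table[b]; s != NULL; s = s->next) {
        if (s->phys_pc == phys_pc && s->pc == pc &&
            s->cs_base == cs_base && s->flags == flags) {
            return s;   /* a retranslated TB relinks to its old stats */
        }
    }

    s = calloc(1, sizeof(*s));
    if (s == NULL) {
        return NULL;
    }
    s->phys_pc = phys_pc;
    s->pc = pc;
    s->cs_base = cs_base;
    s->flags = flags;
    s->next = sketch_table[b];
    sketch_table[b] = s;
    return s;
}
```

Calling sketch_get_stats twice with the same four keys returns the same pointer, which is what lets counters accumulate across retranslations of the same block.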

Collecting/Filling TBStatistics Information

To fill the fields in the TBStatistics dynamically, different parts of the code need to be changed. We list here how we collect the information for each field.

Execution Count - Hotness

To collect the execution count of each TB, we instrument the beginning of each one, adding a call to a helper function called inc_exec_freq. This instrumentation is emitted by the gen_tb_exec_count function. The helper receives the address of the TBStatistics structure linked to the TB.

include/exec/gen-icount.h:

static inline void gen_tb_exec_count(TranslationBlock *tb)
{
    if (qemu_loglevel_mask(CPU_LOG_HOT_TBS) && tb->tb_stats) {
        TCGv_ptr ptr = tcg_const_ptr(tb->tb_stats);
        gen_helper_inc_exec_freq(ptr);
        tcg_temp_free_ptr(ptr);
    }
}

The helper function accesses the executions.total field of the TBStatistics structure and increments it atomically, counting one execution of the TB.

accel/tcg/tcg-runtime.c:

void HELPER(inc_exec_freq)(void *ptr)
{
    TBStatistics *stats = (TBStatistics *) ptr;
    g_assert(stats);
    atomic_inc(&stats->executions.total);
}
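
The increment has to be atomic because the same TB may be executed by more than one thread at a time, and a plain read-modify-write could then lose counts. The following is a minimal standalone sketch of the same idea, assuming C11 <stdatomic.h> as a stand-in for QEMU's atomic_inc macro; the type and function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the executions part of TBStatistics. */
typedef struct {
    atomic_ulong total;
} ExecCounterSketch;

/* Analogue of the helper: bump the counter once per TB execution. */
void sketch_inc_exec_freq(void *ptr)
{
    ExecCounterSketch *c = ptr;
    /* Relaxed ordering is enough for a pure statistics counter. */
    atomic_fetch_add_explicit(&c->total, 1, memory_order_relaxed);
}

/* Read the current count, for example when dumping statistics. */
unsigned long sketch_read_exec_freq(ExecCounterSketch *c)
{
    return atomic_load_explicit(&c->total, memory_order_relaxed);
}
```

The choice of relaxed memory ordering is an assumption of this sketch: the counter does not guard any other data, so only the increment itself needs to be indivisible.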


Spanning

Every time a translated TB spans two pages, we atomically increment the translations.spanning field.

accel/tcg/translate-all.c:

TranslationBlock *tb_gen_code(CPUState *cpu,
                            target_ulong pc, target_ulong cs_base,
                            uint32_t flags, int cflags)
   ...
   if ((pc & TARGET_PAGE_MASK) != virt_page2) {
        phys_page2 = get_page_addr_code(env, virt_page2);
        if (tb->tb_stats) {
            atomic_inc(&tb->tb_stats->translations.spanning);
        }
   }
   ...
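
The page check in the excerpt compares the page of the first byte of the TB with the page of its last byte. A standalone sketch of that test is shown below, assuming 4 KiB pages and a non-zero block size; the macro and function names are hypothetical and only mirror the role of TARGET_PAGE_MASK and virt_page2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_BITS 12
#define SKETCH_PAGE_MASK (~(((uint64_t)1 << SKETCH_PAGE_BITS) - 1))

/* True if a block of `size` bytes (size > 0) starting at `pc`
 * ends in a different page than it starts in. */
bool sketch_tb_spans_pages(uint64_t pc, uint64_t size)
{
    uint64_t first_page = pc & SKETCH_PAGE_MASK;
    uint64_t last_page = (pc + size - 1) & SKETCH_PAGE_MASK;
    return first_page != last_page;
}
```

For example, a 4-byte block at 0xffe covers 0xffe to 0x1001 and therefore spans two pages, while the same block at 0xffc ends at 0xfff and does not.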

Uncached

Guest Instructions

The guest instructions are already counted in the DisasContextBase in translator_loop, so we simply copy the count to code.num_guest_inst.

accel/tcg/translator.c:

void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
                     CPUState *cpu, TranslationBlock *tb, int max_insns)
   ....
   db->tb->tb_stats->code.num_guest_inst = db->num_insns;
   ....

Host Instructions

As there is no function that only counts instructions using the disassembler, we adapt the disas function into get_num_insts, which returns the number of instructions in a block of assembly for any supported architecture. The difference between the two functions is that get_num_insts uses a fake fprintf that does nothing, so it iterates over the assembly, counting the instructions without printing them.

accel/tcg/translate-all.c:

TranslationBlock *tb_gen_code(CPUState *cpu,
                             target_ulong pc, target_ulong cs_base,
                             uint32_t flags, int cflags)
   ...
   size_t code_size = gen_code_size;
   if (tcg_ctx->data_gen_ptr) {
        code_size = tcg_ctx->data_gen_ptr - tb->tc.ptr;
   }
   qemu_log_lock();
   atomic_set(&tb->tb_stats->code.num_host_inst,
                get_num_insts(tb->tc.ptr, code_size));
   qemu_log_unlock();
   ...

disas.c:

static int fprintf_fake(struct _IO_FILE *a, const char *b, ...)
{
   return 1;
}

unsigned get_num_insts(void *code, unsigned long size)
{
   CPUDebug s;
   ... 
   s.info.fprintf_func = fprintf_fake;
   unsigned num_insts = 0;
   for (pc = (uintptr_t)code; size > 0; pc += count, size -= count) {
       num_insts++;
       count = print_insn(pc, &s.info);
       if (count < 0) {
           break;
       }
   }
   return num_insts;
}
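
The technique of reusing a printing disassembler as a counter by swapping in a no-op print callback can be illustrated in isolation. The sketch below does not use QEMU's disassembler: it assumes a toy encoding in which the first byte of each instruction gives its length, and every name in it is hypothetical. Unlike the excerpt above, it only counts an instruction after decoding has succeeded.

```c
#include <stddef.h>
#include <stdio.h>

/* Same shape as fprintf, so either can be plugged in. */
typedef int (*sketch_printf_fn)(FILE *stream, const char *fmt, ...);

/* No-op replacement for fprintf: accepts the arguments, prints nothing. */
static int sketch_fprintf_fake(FILE *stream, const char *fmt, ...)
{
    (void)stream;
    (void)fmt;
    return 0;
}

/* Toy "disassembler": byte 0 of each instruction is its total length.
 * Returns the number of bytes consumed, or -1 on a malformed instruction. */
static int sketch_print_insn(const unsigned char *code, size_t remaining,
                             sketch_printf_fn out, FILE *stream)
{
    size_t len = code[0];
    if (len == 0 || len > remaining) {
        return -1;
    }
    out(stream, "insn of %zu bytes\n", len);
    return (int)len;
}

/* Walk the buffer with the fake printer and count decoded instructions. */
unsigned sketch_count_insns(const unsigned char *code, size_t size)
{
    unsigned n = 0;
    while (size > 0) {
        int len = sketch_print_insn(code, size, sketch_fprintf_fake, NULL);
        if (len < 0) {
            break;
        }
        n++;
        code += len;
        size -= (size_t)len;
    }
    return n;
}
```

Passing fprintf and stdout instead of the fake callback would turn the same walk back into a dump, which is the relationship between disas and get_num_insts described above.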


TCG Instructions

To count the number of TCG instructions, we only need to iterate over the ops in the TCGContext and then store the result in code.num_tcg_inst.

tcg/tcg.c:

int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
  ....
  int n = 0;
  QTAILQ_FOREACH(op, &s->ops, link) {
      n++;
  }
  tb->tb_stats->code.num_tcg_inst = n;
  ....

Controlling and Dumping TBStatistics

We added the -d hot_tbs option, which activates the collection of TB statistics. In linux-user mode, the hottest blocks are dumped at the end of execution, and their number can be limited by passing a value, as in -d hot_tbs:value.

In system emulation mode, the collection of statistics can be delayed and then started with an HMP command: start_stats.

The dfilter command can also be used to limit the address ranges for which statistics are collected and dumped.

To list the hottest TBs, the info tbs command can be used. It also accepts as parameters the number of TBs to dump and the metric by which they are sorted. An example of the output follows:

info tbs

TB1: phys:0x34d54 virt:0x0000000000034d54 flags:0x0000f0 (trans:1 uncached:0 exec:5202684 ints: g:3 op:34 h:55 h/g: 18.333334)
TB2: phys:0x34d0d virt:0x0000000000034d0d flags:0x0000f0 (trans:1 uncached:0 exec:5199468 ints: g:4 op:38 h:69 h/g: 17.250000)
TB3: phys:0xec1c1 virt:0x00000000000ec1c1 flags:0x0000b0 (trans:1 uncached:0 exec:872031 ints: g:2 op:26 h:23 h/g: 11.500000)
TB4: phys:0xec1c5 virt:0x00000000000ec1c5 flags:0x0000b0 (trans:1 uncached:0 exec:871841 ints: g:3 op:25 h:48 h/g: 16.000000)
TB5: phys:0x34cae virt:0x0000000000034cae flags:0x0000f0 (trans:1 uncached:0 exec:833787 ints: g:1 op:29 h:53 h/g: 53.000000)
TB6: phys:0x39aaf virt:0x0000000000039aaf flags:0x0000f0 (trans:1 uncached:0 exec:698880 ints: g:2 op:29 h:32 h/g: 16.000000)
TB7: phys:0x39aa1 virt:0x0000000000039aa1 flags:0x0000f0 (trans:1 uncached:0 exec:698334 ints: g:4 op:41 h:85 h/g: 21.250000)
TB8: phys:0x38d05 virt:0x0000000000038d05 flags:0x0000f0 (trans:1 uncached:0 exec:656640 ints: g:2 op:21 h:64 h/g: 32.000000)
TB9: phys:0xfe8fe virt:0x00000000000fe8fe flags:0x000040 (trans:1 uncached:0 exec:630784 ints: g:1 op:36 h:75 h/g: 75.000000)
TB10: phys:0x897d virt:0x000000000000897d flags:0x0000b0 (trans:1 uncached:0 exec:528246 ints: g:1 op:30 h:56 h/g: 56.000000)
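
In the listing above, the h/g column is consistent with the number of host instructions divided by the number of guest instructions (for TB1, 55 / 3). The sketch below reproduces that column under two assumptions that are inferred from the sample output rather than confirmed by the source: the division is done in single precision (which is what yields the trailing ...334), and a block with zero guest instructions reports 0. The function names are hypothetical.

```c
#include <stdio.h>

/* Host instructions generated per guest instruction.
 * The zero-guest guard is an assumption of this sketch. */
float sketch_host_per_guest(unsigned num_host_inst, unsigned num_guest_inst)
{
    if (num_guest_inst == 0) {
        return 0.0f;
    }
    return (float)num_host_inst / (float)num_guest_inst;
}

/* Format the ratio the way the h/g column appears in the listing. */
int sketch_format_ratio(char *buf, size_t len, unsigned h, unsigned g)
{
    return snprintf(buf, len, "h/g: %f", sketch_host_per_guest(h, g));
}
```

A lower ratio means fewer host instructions are needed per guest instruction, so blocks that are both hot and have a high h/g are natural candidates for a closer look at the generated code.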

To examine one of the listed TBs in more detail, the info tb id command can be used, where id is the number shown by info tbs. The following example dumps the guest instructions of TB 1; a second argument can be passed to also dump the TCG op instructions.

info tb

------------------------------
TB1: phys:0x34d54 virt:0x0000000000034d54 flags:0x0000f0 (trans:1 uncached:0 exec:6960245 ints: g:3 op:34 h:55 h/g: 18.333334)
----------------
IN: 
0x00034d54:  01 cb                    addl     %ecx, %ebx
0x00034d56:  3b 1d 30 00 00 00        cmpl     0x30, %ebx
0x00034d5c:  72 af                    jb       0x34d0d

Finally, there is an option to dump not the n hottest blocks but all the hot blocks needed to reach m% of the total execution count. This is also a useful metric for understanding the execution footprint of a program: denser applications need fewer blocks to reach m% of the execution.

info coverset

TB1: phys:0x34d54 virt:0x0000000000034d54 flags:0x0000f0 (trans:1 uncached:0 exec:5202684 ints: g:3 op:34 h:55 h/g: 18.333334)
TB2: phys:0x34d0d virt:0x0000000000034d0d flags:0x0000f0 (trans:1 uncached:0 exec:5199468 ints: g:4 op:38 h:69 h/g: 17.250000)
TB3: phys:0xec1c1 virt:0x00000000000ec1c1 flags:0x0000b0 (trans:1 uncached:0 exec:872031 ints: g:2 op:26 h:23 h/g: 11.500000)
------------------------------
# of TBs to reach 90% of the total exec count: 3
Total exec count: 12630686
------------------------------
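
The coverset idea can be sketched as: sort the blocks by execution count, hottest first, and accumulate counts until the running sum reaches m% of the total. The code below is a minimal illustration with hypothetical names; the exact threshold and rounding rule used by info coverset is not stated here, so the "cumulative count >= m% of total" test is an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator: larger execution counts first. */
static int sketch_cmp_desc(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/* Sorts exec[] in place (hottest first) and returns how many of the
 * hottest blocks are needed for their cumulative execution count to
 * reach `percent`% of the total. */
size_t sketch_coverset_size(unsigned long *exec, size_t n, unsigned percent)
{
    unsigned long long total = 0, covered = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += exec[i];
    }
    qsort(exec, n, sizeof(exec[0]), sketch_cmp_desc);

    for (i = 0; i < n; i++) {
        /* Integer form of covered / total >= percent / 100. */
        if (covered * 100 >= total * percent) {
            break;
        }
        covered += exec[i];
    }
    return i;
}
```

With hypothetical execution counts of 10, 500, 40, 400 and 50 (total 1000), two blocks cover 90% of the execution and three cover 95%. A program whose coverset stays small as m grows has a dense execution footprint.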

Future Work and Schedule

Future work would include:

  • elide or beautify common blocks like softmmu access macros (which are always the same)
  • how many fills/spills in a block (where register contents are moved due to register pressure)
  • export block JIT information for perf tool (the later version)
  • replace CONFIG_PROFILER by using the information in TBStatistics to reconstruct its results.
  • benchmark QEMU while collecting these statistics.

QEMU currently only translates simple basic blocks with one or two exit paths. This work could be a precursor to supporting Internships/ProjectIdeas/Multi-exit Hot Blocks in the future.