Measure Tiny Code Generation Quality
Status: Vanderson M. do Rosario <vandersonmr2@gmail.com> (vanderson on #qemu IRC) is working on this project for 2019 GSoC.
Mentor: Alex Bennée <alex.bennee@linaro.org> (stsquad on #qemu IRC)
Project Github: vandersonmr/gsoc-qemu [1]
Summary: In most applications, the majority of the execution time is spent in a very small portion of the code. Regions of code that execute with high frequency are called hot, while all other regions are called cold. As a direct consequence, emulators also spend most of their execution time emulating these hot regions, so dynamic compilers and translators need to pay extra attention to them. Guaranteeing that these hot regions are translated into high-quality code is fundamental to achieving high-performance emulation. Thus, one of the most important steps in tuning an emulator's performance is to identify the hot regions and to measure their translation quality.
TBStatistics (TBStats)
Improving the code generation of the TCG backend is a hard task that involves reading through large amounts of text looking for anomalies in the generated code. It would be nice to have tools to more readily extract and parse code generation information. This would include options to dump:
- The hottest Translation Blocks (TBs) and their execution count (total and atomic).
- Translation counters:
  - The number of times a TB has been translated, uncached and spanned.
- Code quality metrics:
  - The number of guest, IR (TCG ops), and host instructions in a TB.
  - The number of spills during register allocation.
For that reason, we collect all this information dynamically, for every TB or for a specific set of TBs, and store it in TBStatistics structures. Every TB can have one TBStatistics linked to it through a new field inserted in the TranslationBlock structure[2]. Moreover, TBStatistics are not flushed during tb_flush: they outlive the TBs they describe and can be relinked to retranslated TBs by using their keys (phys_pc, pc, flags, cs_base) to match new TBs with their TBStats.
struct TBStatistics {
    tb_page_addr_t phys_pc;
    target_ulong pc;
    uint32_t flags;
    /* cs_base isn't included in the hash but we do check for matches */
    target_ulong cs_base;

    /* Translation stats */
    struct {
        unsigned long total;
        unsigned long uncached;
        unsigned long spanning;
    } translations;

    /* Execution stats */
    struct {
        unsigned long total;
        unsigned long atomic;
    } executions;

    struct {
        unsigned num_guest_inst;
        unsigned num_host_inst;
        unsigned num_tcg_inst;
        unsigned spills;
    } code;

    /* HMP information - used for referring to previous search */
    int display_id;
};
Creating and Storing TBStats
When a TB is about to be created in tb_gen_code, we check whether CPU_LOG_HOT_TBS is set and whether the TB entry address (tb->pc) is in the range of interest[3]. If so, tb_get_stats is called, which either creates a new TBStatistics (if none with the same keys exists) or returns the existing one. The tb_get_stats function stores the TBStatistics structures it creates in a qht hash table in the TBContext[4].
if (qemu_loglevel_mask(CPU_LOG_HOT_TBS) && qemu_log_in_addr_range(tb->pc))
    tb->tb_stats = tb_get_stats(phys_pc, pc, cs_base, flags);
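The get-or-create behaviour described above can be illustrated with a minimal, standalone sketch. It does not use QEMU's qht API. DemoStats, demo_table, demo_hash and demo_get_stats are hypothetical names, and the fixed-size chained table is an assumption made only to keep the example self-contained. As in the comment on the TBStatistics structure, cs_base is left out of the hash but is still compared on lookup.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for TBStatistics: the key plus one counter. */
typedef struct DemoStats {
    uint64_t phys_pc, pc, cs_base;
    uint32_t flags;
    unsigned long translations;
    struct DemoStats *next;             /* bucket chain */
} DemoStats;

#define DEMO_BUCKETS 256
static DemoStats *demo_table[DEMO_BUCKETS];

/* Hash over phys_pc, pc and flags only; cs_base is checked on lookup. */
unsigned demo_hash(uint64_t phys_pc, uint64_t pc, uint32_t flags)
{
    uint64_t h = phys_pc * 0x9E3779B97F4A7C15ull;
    h ^= pc * 0xC2B2AE3D27D4EB4Full;
    h ^= flags;
    h ^= h >> 32;
    return (unsigned)(h % DEMO_BUCKETS);
}

/* Return the entry matching all four keys, or allocate a zeroed one. */
DemoStats *demo_get_stats(uint64_t phys_pc, uint64_t pc,
                          uint64_t cs_base, uint32_t flags)
{
    unsigned b = demo_hash(phys_pc, pc, flags);
    DemoStats *s;

    for (s = demo_table[b]; s; s = s->next) {
        if (s->phys_pc == phys_pc && s->pc == pc &&
            s->flags == flags && s->cs_base == cs_base) {
            return s;                   /* relink to existing stats */
        }
    }
    s = calloc(1, sizeof(*s));
    if (!s) {
        return NULL;
    }
    s->phys_pc = phys_pc;
    s->pc = pc;
    s->cs_base = cs_base;
    s->flags = flags;
    s->next = demo_table[b];
    demo_table[b] = s;
    return s;
}
```

Because entries are never removed from the table in this sketch, a second call with the same keys returns the same structure. This is the property that lets statistics survive a flush and be attached to a retranslated TB.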
Collecting/Filling TBStatistics Information
To fill the fields in the TBStatistics dynamically, different parts of the code need to be changed. We list here how we collect the information for each field.
Execution Count - Hotness
include/exec/gen-icount.h:
static inline void gen_tb_exec_count(TranslationBlock *tb)
{
    if (qemu_loglevel_mask(CPU_LOG_HOT_TBS) && tb->tb_stats) {
        TCGv_ptr ptr = tcg_const_ptr(tb->tb_stats);
        gen_helper_inc_exec_freq(ptr);
        tcg_temp_free_ptr(ptr);
    }
}
accel/tcg/tcg-runtime.c:
void HELPER(inc_exec_freq)(void *ptr)
{
    TBStatistics *stats = (TBStatistics *) ptr;
    g_assert(stats);
    atomic_inc(&stats->executions.total);
}
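The counting pattern used by the helper can be tried outside QEMU with C11 atomics. This is only a sketch: DemoExecStats, demo_inc_exec_freq and demo_exec_count are hypothetical names, and the relaxed memory ordering is an assumption that suits a pure statistics counter, not necessarily what QEMU's atomic_inc provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the executions part of TBStatistics. */
typedef struct {
    atomic_ulong total;
} DemoExecStats;

/* Same shape as the helper: called once per execution of a block,
 * receiving the statistics structure as an opaque pointer. */
void demo_inc_exec_freq(void *ptr)
{
    DemoExecStats *stats = ptr;
    assert(stats != NULL);
    /* Several threads may execute the same block concurrently, so
     * the increment has to be atomic to avoid lost updates. */
    atomic_fetch_add_explicit(&stats->total, 1, memory_order_relaxed);
}

/* Read the current execution count. */
unsigned long demo_exec_count(DemoExecStats *stats)
{
    return atomic_load_explicit(&stats->total, memory_order_relaxed);
}
```

The pointer to the statistics is fixed when the block is translated, so the per-execution cost in this scheme is one helper call and one atomic add.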
Spanning
accel/tcg/translate-all.c:
TranslationBlock *tb_gen_code(CPUState *cpu,
                              target_ulong pc, target_ulong cs_base,
                              uint32_t flags, int cflags)
...
    if ((pc & TARGET_PAGE_MASK) != virt_page2) {
        phys_page2 = get_page_addr_code(env, virt_page2);
        if (tb->tb_stats) {
            atomic_inc(&tb->tb_stats->translations.spanning);
        }
    }
...
Uncached
Guest Instructions
accel/tcg/translator.c:
void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
                     CPUState *cpu, TranslationBlock *tb, int max_insns)
....
    db->tb->tb_stats->code.num_guest_inst = db->num_insns;
....
Host Instructions
accel/tcg/translate-all.c:
TranslationBlock *tb_gen_code(CPUState *cpu,
                              target_ulong pc, target_ulong cs_base,
                              uint32_t flags, int cflags)
...
    size_t code_size = gen_code_size;
    if (tcg_ctx->data_gen_ptr) {
        code_size = tcg_ctx->data_gen_ptr - tb->tc.ptr;
    }
    qemu_log_lock();
    atomic_set(&tb->tb_stats->code.num_host_inst,
               get_num_insts(tb->tc.ptr, code_size));
    qemu_log_unlock();
...
disas.c:
static int fprintf_fake(struct _IO_FILE *a, const char *b, ...)
{
    return 1;
}

unsigned get_num_insts(void *code, unsigned long size)
{
    CPUDebug s;
    ...
    s.info.fprintf_func = fprintf_fake;
    unsigned num_insts = 0;
    for (pc = (uintptr_t)code; size > 0; pc += count, size -= count) {
        num_insts++;
        count = print_insn(pc, &s.info);
        if (count < 0) {
            break;
        }
    }
    return num_insts;
}
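The idea behind get_num_insts is to walk the generated code with a disassembler whose output is discarded, counting how many instructions it decodes. The sketch below shows the same loop in isolation. The decoder callback type demo_insn_len_fn, the function demo_count_insns and the toy length-prefixed encoding are all hypothetical and stand in for print_insn and real host code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder callback: returns the length in bytes of the
 * instruction at 'code', or a value <= 0 if it cannot be decoded. */
typedef int (*demo_insn_len_fn)(const uint8_t *code, size_t avail);

/* Count the instructions in a buffer of variable-length encodings. */
size_t demo_count_insns(const uint8_t *code, size_t size,
                        demo_insn_len_fn insn_len)
{
    size_t n = 0;

    while (size > 0) {
        int len = insn_len(code, size);
        /* Stop on a decode failure or on an instruction that would
         * run past the end, so 'size' can never wrap around. */
        if (len <= 0 || (size_t)len > size) {
            break;
        }
        n++;
        code += len;
        size -= (size_t)len;
    }
    return n;
}

/* Toy encoding for this sketch only: the first byte of each
 * instruction holds its total length in bytes. */
int demo_toy_len(const uint8_t *code, size_t avail)
{
    (void)avail;
    return code[0];
}
```

Unlike the excerpt above, this sketch increments the counter only after an instruction has been decoded successfully and checks the decoded length against the remaining size. Both are choices made for the example, not a description of the original patch.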
TCG Instructions
tcg/tcg.c:
int tcg_gen_code(TCGContext *s, TranslationBlock *tb)
....
    int n = 0;
    QTAILQ_FOREACH(op, &s->ops, link) {
        n++;
    }
    tb->tb_stats->code.num_tcg_inst = n;
....
Controlling TBStatistics Collection
Dumping TBStatistics Information
TB1: phys:0x34d54 virt:0x0000000000034d54 flags:0x0000f0 (trans:1 uncached:0 exec:5202684 ints: g:3 op:34 h:55 h/g: 18.333334)
TB2: phys:0x34d0d virt:0x0000000000034d0d flags:0x0000f0 (trans:1 uncached:0 exec:5199468 ints: g:4 op:38 h:69 h/g: 17.250000)
TB3: phys:0xec1c1 virt:0x00000000000ec1c1 flags:0x0000b0 (trans:1 uncached:0 exec:872031 ints: g:2 op:26 h:23 h/g: 11.500000)
TB4: phys:0xec1c5 virt:0x00000000000ec1c5 flags:0x0000b0 (trans:1 uncached:0 exec:871841 ints: g:3 op:25 h:48 h/g: 16.000000)
TB5: phys:0x34cae virt:0x0000000000034cae flags:0x0000f0 (trans:1 uncached:0 exec:833787 ints: g:1 op:29 h:53 h/g: 53.000000)
TB6: phys:0x39aaf virt:0x0000000000039aaf flags:0x0000f0 (trans:1 uncached:0 exec:698880 ints: g:2 op:29 h:32 h/g: 16.000000)
TB7: phys:0x39aa1 virt:0x0000000000039aa1 flags:0x0000f0 (trans:1 uncached:0 exec:698334 ints: g:4 op:41 h:85 h/g: 21.250000)
TB8: phys:0x38d05 virt:0x0000000000038d05 flags:0x0000f0 (trans:1 uncached:0 exec:656640 ints: g:2 op:21 h:64 h/g: 32.000000)
TB9: phys:0xfe8fe virt:0x00000000000fe8fe flags:0x000040 (trans:1 uncached:0 exec:630784 ints: g:1 op:36 h:75 h/g: 75.000000)
TB10: phys:0x897d virt:0x000000000000897d flags:0x0000b0 (trans:1 uncached:0 exec:528246 ints: g:1 op:30 h:56 h/g: 56.000000)
------------------------------
TB1: phys:0x34d54 virt:0x0000000000034d54 flags:0x0000f0 (trans:1 uncached:0 exec:6960245 ints: g:3 op:34 h:55 h/g: 18.333334)
----------------
IN:
0x00034d54:  01 cb                addl     %ecx, %ebx
0x00034d56:  3b 1d 30 00 00 00    cmpl     0x30, %ebx
0x00034d5c:  72 af                jb       0x34d0d
TB1: phys:0x34d54 virt:0x0000000000034d54 flags:0x0000f0 (trans:1 uncached:0 exec:5202684 ints: g:3 op:34 h:55 h/g: 18.333334)
TB2: phys:0x34d0d virt:0x0000000000034d0d flags:0x0000f0 (trans:1 uncached:0 exec:5199468 ints: g:4 op:38 h:69 h/g: 17.250000)
TB3: phys:0xec1c1 virt:0x00000000000ec1c1 flags:0x0000b0 (trans:1 uncached:0 exec:872031 ints: g:2 op:26 h:23 h/g: 11.500000)
------------------------------
# of TBs to reach 90% of the total exec count: 3
Total exec count: 12630686
------------------------------
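The two derived figures in these dumps, the h/g ratio and the number of TBs needed to reach a given share of the total execution count, can be computed as in the following sketch. DemoTB, demo_host_per_guest and demo_tbs_to_cover are hypothetical names. The code assumes that h/g is host instructions divided by guest instructions, which matches the values printed above (for example 55/3 = 18.333334), and that coverage is measured over the TBs sorted from hottest to coldest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-TB record holding only the fields needed here. */
typedef struct {
    uint64_t exec;          /* execution count */
    unsigned guest_insts;   /* "g" in the dump */
    unsigned host_insts;    /* "h" in the dump */
} DemoTB;

/* Host instructions emitted per guest instruction ("h/g"). */
double demo_host_per_guest(const DemoTB *tb)
{
    return tb->guest_insts ? (double)tb->host_insts / tb->guest_insts : 0.0;
}

/* qsort comparator: larger execution counts first. */
static int demo_cmp_exec_desc(const void *a, const void *b)
{
    const DemoTB *x = a, *y = b;
    return (x->exec < y->exec) - (x->exec > y->exec);
}

/* Sort the TBs hottest first and return how many of them are needed
 * to reach 'percent' of all executions; the total goes to *total_out. */
size_t demo_tbs_to_cover(DemoTB *tbs, size_t n, unsigned percent,
                         uint64_t *total_out)
{
    uint64_t total = 0, running = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        total += tbs[i].exec;
    }
    if (total_out) {
        *total_out = total;
    }
    if (total == 0) {
        return 0;
    }
    qsort(tbs, n, sizeof(*tbs), demo_cmp_exec_desc);
    for (i = 0; i < n; i++) {
        running += tbs[i].exec;
        /* running / total >= percent / 100, in integer arithmetic */
        if (running * 100 >= total * percent) {
            return i + 1;
        }
    }
    return n;
}
```

With made-up counts of 100, 500 and 400 executions, the two hottest TBs account for 900 of 1000 executions, so demo_tbs_to_cover reports that 2 TBs reach 90%.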
Future Work and Schedule
Improving the code generation of the TCG backend is a hard task that involves reading through large amounts of text looking for anomalies in the generated code. It would be nice to have tools to more readily extract and parse code generation information. This would include:
Modifying code generator, dumping additional data
- which blocks are hot (frequently run, hence more important performance-wise)
- export block JIT information for perf tool (the later version)
Tweaking -d op,out_asm output
- how many fills/spills in a block (where register contents are moved due to register pressure)
- number of host instructions for each guest instruction (JIT profiling has a basic version of this)
- elide or beautify common blocks like softmmu access macros (which are always the same)
Modifying the HMP
- support interactive exploration of translation state (system emulation)
QEMU currently only translates simple basic blocks with one or two exit paths. This work could be a precursor to supporting Internships/ProjectIdeas/Multi-exit Hot Blocks in the future.