深入研究QEMU：微代码生成器 (TCG)，第一部分

深入研究QEMU：微代码生成器 (TCG)，第一部分
A deep dive into QEMU: The Tiny Code Generator (TCG), part 1 (2021)

原始链接: https://airbus-seclab.github.io/qemu_blog/tcg_p1.html

## QEMU TCG 引擎：概要 QEMU 微型代码生成器 (TCG) 是负责在宿主机上执行目标指令的引擎。它采用前端/后端模型：前端生成中间表示 (IR) 代码，而后端将此 IR 翻译成宿主机特定的汇编代码。翻译从 `tb_gen_code` 开始，它分配一个翻译块 (TB) 并使用 `gen_intermediate_code` 生成 IR。此函数依赖于架构，在通用的 `translator_loop` 中利用目标特定的“翻译器操作符”（例如，PowerPC 的 `ppc_tr_ops`）。该循环由这些操作符引导，处理核心翻译过程。 TB 包括序言和尾声以进行优化（例如，块链）。反汇编依赖于通用的 `DisasContextBase` 和目标特定的 `DisasContext`，捕获 CPU 状态以进行上下文翻译。单个指令由诸如 `ppc_tr_translate_insn` 之类的函数进行翻译，这些函数会查阅操作码处理程序以生成相应的 IR。例如，PowerPC 的 `cmp` 指令被翻译成一系列执行比较的 TCG 指令。最终，TCG 生成的 IR 代码*仍然*需要翻译成宿主机汇编。示例显示了 PowerPC 指令的 IR 等效项，包括用于内存访问和系统寄存器操作的特殊操作码，这些由 TCG “助手”处理（详见另一篇文章）。最终的 TB 包括用于管理执行流程的指令，可能链接到下一个 TB 或返回到 QEMU 翻译器。

Hacker News 新闻 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交登录深入 QEMU：微码生成器 (TCG)，第一部分 (2021) (airbus-seclab.github.io) 82 分，由 costco 1 天前发布 | 隐藏 | 过去 | 收藏 | 2 条评论 epilys 1 天前 | 下一个 [–] (2021)。请记住，TCG API 和内部结构没有稳定的保证，因此如果你与当前 QEMU 代码进行交叉引用，肯定会发现差异。回复 bionsystem 1 天前 | 上一个 [–] > 友善地倒带这让我想起了那部优秀的电影，我会再看一遍（与帖子或主题无关，除了章节标题）。回复指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系搜索：

原文

This blog post details some internals of the QEMU TCG engine, the machinery responsible for executing target instructions on the host.

You should have already read Execution loop and Breakpoints handling blog posts to have some pointers.

Be kind rewind

The vCPU thread executes instructions through tcg_cpu_exec which finds and/or generates translated blocks.

As previously explained, tb_gen_code will generate intermediate representation (IR) code thanks to gen_intermediate_code and then host architecture assembly instructions with tcg_gen_code:

TranslationBlock *tb_gen_code(CPUState *cpu,
                              target_ulong pc, target_ulong cs_base,
                              uint32_t flags, int cflags)
{
    tb = tb_alloc(pc);
...
    /* generate IR code */
    gen_intermediate_code(cpu, tb, max_insns);
...
    /* generate machine code */
    gen_code_size = tcg_gen_code(tcg_ctx, tb);
...
}

The QEMU TCG has a notion of frontend and backend operations. The frontend-ops are the generated intermediate code which is what QEMU will execute. The backend-ops are the operations implemented on the host CPU, which is where the code is executed.

The QEMU git tree has a README introduction about the TCG which details the IR language. It’s an interesting read for sure.

Overview

The gen_intermediate_code function is a VM architecture dependent wrapper to the translator_loop generic function. Let’s imagine we were emulating PowerPC code on an Intel x86 host.

The gen_intermediate_code would look like:

static const TranslatorOps ppc_tr_ops = {
    .init_disas_context = ppc_tr_init_disas_context,
    .tb_start           = ppc_tr_tb_start,
    .insn_start         = ppc_tr_insn_start,
    .breakpoint_check   = ppc_tr_breakpoint_check,
    .translate_insn     = ppc_tr_translate_insn,
    .tb_stop            = ppc_tr_tb_stop,
    .disas_log          = ppc_tr_disas_log,
};

void gen_intermediate_code(CPUState *cs, TranslationBlock *tb, int max_insns)
{
    DisasContext ctx;

    translator_loop(&ppc_tr_ops, &ctx.base, cs, tb, max_insns);
}

While the translator_loop remains the same for any architecture, it relies on target specific translator operators. In our case, the PowerPC ones (ppc_tr_ops).

void translator_loop(const TranslatorOps *ops, DisasContextBase *db,
                     CPUState *cpu, TranslationBlock *tb, int max_insns)
{
    ops->init_disas_context(db, cpu);
...
    gen_tb_start(db->tb);
    ops->tb_start(db, cpu);

    while (true) {
        ops->translate_insn(db, cpu);
    }
...
    ops->tb_stop(db, cpu);
    gen_tb_end(db->tb, db->num_insns - bp_insn);
}

Each TB has a prologue (tb_start), and an epilogue (tb_end) with a generic and target specific part (if needed). Usually epilogues are placeholders for block chaining, which is an optimization feature of the TCG that enables TBs to be called successively after execution, without the need to get back to the QEMU code and look for the next TB to execute.

Disassembly context

The architecture dependent DisasContext is created alongside a generic DisasContextBase.

For the PPC target, the ppc_tr_init_disas_context handler will record current cpu state information. This means that TBs are highly contextual and might not be reused in any situation where we execute at a previously translated location.

static void ppc_tr_init_disas_context(DisasContextBase *dcbase, CPUState *cs)
{
    DisasContext *ctx = container_of(dcbase, DisasContext, base);
    CPUPPCState *env = cs->env_ptr;
    int bound;

    ctx->exception = POWERPC_EXCP_NONE;
    ctx->spr_cb = env->spr_cb;
    ctx->pr = msr_pr;
    ctx->mem_idx = env->dmmu_idx;
    ctx->dr = msr_dr;
...
    ctx->insns_flags = env->insns_flags;
    ctx->insns_flags2 = env->insns_flags2;
    ctx->access_type = -1;
    ctx->need_access_type = !(env->mmu_model & POWERPC_MMU_64B);
    ctx->le_mode = !!(env->hflags & (1 << MSR_LE));
    ctx->default_tcg_memop_mask = ctx->le_mode ? MO_LE : MO_BE;
    ctx->flags = env->flags;
...

Translated Block prologue/epilogue

The generic TB prologue is generated from gen_tb_start

static inline void gen_tb_start(TranslationBlock *tb)
{
    TCGv_i32 count, imm;

    tcg_ctx->exitreq_label = gen_new_label();
    if (tb_cflags(tb) & CF_USE_ICOUNT) {
        count = tcg_temp_local_new_i32();
    } else {
        count = tcg_temp_new_i32();
    }

    tcg_gen_ld_i32(count, cpu_env,
                   -ENV_OFFSET + offsetof(CPUState, icount_decr.u32));

    if (tb_cflags(tb) & CF_USE_ICOUNT) {
        imm = tcg_temp_new_i32();
        /* We emit a movi with a dummy immediate argument. Keep the insn index
         * of the movi so that we later (when we know the actual insn count)
         * can update the immediate argument with the actual insn count.  */
        tcg_gen_movi_i32(imm, 0xdeadbeef);
        icount_start_insn = tcg_last_op();

        tcg_gen_sub_i32(count, count, imm);
        tcg_temp_free_i32(imm);
    }

    tcg_gen_brcondi_i32(TCG_COND_LT, count, 0, tcg_ctx->exitreq_label);

    if (tb_cflags(tb) & CF_USE_ICOUNT) {
        tcg_gen_st16_i32(count, cpu_env,
                         -ENV_OFFSET + offsetof(CPUState, icount_decr.u16.low));
    }

    tcg_temp_free_i32(count);
}

It injects instructions to check for instruction count and an exit condition.

The epilogue is generated from gen_tb_end. It injects instructions to exit from the TB (tcg_gen_exit_tb).

static inline void gen_tb_end(TranslationBlock *tb, int num_insns)
{
    if (tb_cflags(tb) & CF_USE_ICOUNT) {
        /* Update the num_insn immediate parameter now that we know
         * the actual insn count.  */
        tcg_set_insn_param(icount_start_insn, 1, num_insns);
    }

    gen_set_label(tcg_ctx->exitreq_label);
    tcg_gen_exit_tb(tb, TB_EXIT_REQUESTED);
}

The TB_EXIT_REQUESTED is a special value that tells QEMU to use the exitreq_label created during the prologue. This label will eventually be updated with the address of the next TB to be executed.

Translating instructions

The operator used to translate target instructions to IR is translate_insn (ppc_tr_translate_insn for PowerPC). This function makes use of the target CPU opcodes handlers table which implements IR generation for every target native instructions.

static void ppc_tr_translate_insn(DisasContextBase *dcbase, CPUState *cs)
{
    opc_handler_t **table, *handler;

    table = cpu->opcodes;
    handler = table[opc1(ctx->opcode)];
...
    (*(handler->handler))(ctx);
}

struct PowerPCCPU {
...
    /* Those resources are used only during code translation */
    /* opcode handlers */
    opc_handler_t *opcodes[PPC_CPU_OPCODES_LEN];
...
}

static opcode_t opcodes[] = {
GEN_HANDLER(cmp, 0x1F, 0x00, 0x00, 0x00400000, PPC_INTEGER),
GEN_HANDLER(cmpi, 0x0B, 0xFF, 0xFF, 0x00400000, PPC_INTEGER),
GEN_HANDLER(cmpl, 0x1F, 0x00, 0x01, 0x00400001, PPC_INTEGER),
...
GEN_LDS(lbz, ld8u, 0x02, PPC_INTEGER)
GEN_LDS(lha, ld16s, 0x0A, PPC_INTEGER)
...
};

For the PowerPC cmp instruction, the associated handler will be gen_cmp which will emit the following IR instructions:

static void gen_cmp(DisasContext *ctx)
{
    if ((ctx->opcode & 0x00200000) && (ctx->insns_flags & PPC_64B)) {
        gen_op_cmp(cpu_gpr[rA(ctx->opcode)], cpu_gpr[rB(ctx->opcode)],
                   1, crfD(ctx->opcode));
    } else {
        gen_op_cmp32(cpu_gpr[rA(ctx->opcode)], cpu_gpr[rB(ctx->opcode)],
                     1, crfD(ctx->opcode));
    }
}

static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
{
    TCGv t0 = tcg_temp_new();
    TCGv t1 = tcg_temp_new();
    TCGv_i32 t = tcg_temp_new_i32();

    tcg_gen_movi_tl(t0, CRF_EQ);
    tcg_gen_movi_tl(t1, CRF_LT);
    tcg_gen_movcond_tl((s ? TCG_COND_LT : TCG_COND_LTU),
                       t0, arg0, arg1, t1, t0);
    tcg_gen_movi_tl(t1, CRF_GT);
    tcg_gen_movcond_tl((s ? TCG_COND_GT : TCG_COND_GTU),
                       t0, arg0, arg1, t1, t0);

    tcg_gen_trunc_tl_i32(t, t0);
    tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);
    tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], t);

    tcg_temp_free(t0);
    tcg_temp_free(t1);
    tcg_temp_free_i32(t);
}

Example PowerPC basic block translation

The PowerPC code here after shows 3 types of operations:

simple instructions: arithmetic, immediate operands (lis, ori, xor)
a memory write (stw)
a system register write (mtmsr)

0xfff00100:  lis    r1,1

0xfff00104:  ori    r1,r1,0x409c

0xfff00108:  xor    r0,r0,r0

0xfff0010c:  stw    r0,4(r1)

0xfff00110:  mtmsr  r0

The following code gives the TCG IR equivalent:

0xfff00100:  movi_i32    r1,$0x10000

0xfff00104:  movi_i32    tmp0,$0x409c
             or_i32      r1,r1,tmp0

0xfff00108:  movi_i32    r0,$0x0

0xfff0010c:  movi_i32    tmp1,$0x4
             add_i32     tmp0,r1,tmp1
             qemu_st_i32 r0,tmp0,beul,3

0xfff00110:  movi_i32    nip,$0xfff00114
             mov_i32     tmp0,r0
             call        store_msr,$0,tmp0
             movi_i32    nip,$0xfff00114
             exit_tb     $0x0
             set_label   $L0
             exit_tb     $0x7f5a0caf8043

We notice that the memory write operation as well as the MSR access are translated into strange IR opcodes:

qemu_st_i32
call store_msr

We will detail them in a blog post dedicated to TCG helpers.

Also notice the operands of the exit_tb opcode. The first one is $0x0 and might eventually be fixed with the absolute host address of the next TB to be executed.

The last one is also a host address ($0x7f5a0caf8043) where code is directly executed by the physical CPU. In this case, it goes back to the QEMU translator.

At this point however, no target translated code is still executed. We only end up with IR code which needs a final translation step to the host architecture.