Inlining – The Ultimate Optimisation

原始链接: https://xania.org/202512/17-inlining-the-ultimate-optimisation

## Inlining: The Compiler-Enabling Optimisation

This article explores compiler inlining, a powerful and often underestimated optimisation technique. While traditionally focused on eliminating call overhead, inlining's real strength lies in *enabling* further optimisations.

By inserting a function's code directly at the call site, the compiler gains a local copy to analyse and transform. This makes possible constant propagation (e.g. simplifying code when a value is known to always be true), dead-code elimination, and branch-prediction improvements — none of which are possible with a shared function body. A worked example converts a string to uppercase: inlining lets the compiler avoid a branch when checking case, operating directly on the character's ASCII value.

Excessive inlining, however, leads to code bloat. Compilers use heuristics to balance performance gains against size increases, sometimes making surprising decisions. Function visibility also matters: the compiler needs a function's *definition*, not merely its declaration, in order to inline it.

Ultimately, inlining is not just about saving a few cycles; it gives the compiler the freedom to optimise deeply at the point where code is used.

## Inlining and Compiler Optimisation — Hacker News Discussion Summary

An article on inlining as an optimisation technique sparked a Hacker News discussion of related compiler strategies. A central question was how to duplicate a function so different call sites can be optimised differently *without* fully copying it at every call site — essentially creating specialised versions of the function.

Users identified several terms for this practice: **IPA cloning** (from GCC's interprocedural analysis), **specialisation**, and **function cloning**. Specialisation was highlighted as particularly relevant; machine-learning compilers are known for aggressively specialising kernels based on constants and types, a concept akin to **JIT compilation**.

The conversation also touched on the downsides of functions that are too short or too long, the effect of comments on JavaScript inlining (engines historically counted AST nodes), and the performance implications of **polymorphism** (often slower because it blocks inlining, though it can be optimised via static dispatch, as in Rust). Finally, commenters noted that compiler attributes such as `force inline` and `flatten` matter for getting consistent inlining decisions.

Original article

Written by me, proof-read by an LLM.
Details at end.

Sixteen days in, and I’ve been dancing around what many consider the fundamental compiler optimisation: inlining. Not because it’s complicated - quite the opposite! - but because inlining is less interesting for what it does (copy-paste code), and more interesting for what it enables.

Initially inlining was all about avoiding the expense of the call itself, but nowadays inlining enables many other optimisations to shine.

We’ve already encountered inlining (though I tried to limit it until now): On day 8 to get the size of a vector, we called its .size() method. I completely glossed over the fact that while size() is a method on std::vector, we don’t see a call in the assembly code, just the subtraction and shift.
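For concreteness, here is a rough sketch (my own, not from the post) of what the inlined `size()` boils down to, assuming the common begin/end pointer layout and 4-byte `int`s:

```cpp
#include <cstddef>
#include <cstdint>

// What an inlined std::vector<int>::size() roughly compiles to: a pointer
// subtraction, which the compiler lowers to a byte difference followed by
// a shift of log2(sizeof(int)) — the "subtraction and shift" in the assembly.
size_t sketch_size(const int *begin, const int *end) {
    // byte distance >> 2, because sizeof(int) == 4 on this target (assumed)
    return static_cast<size_t>(reinterpret_cast<uintptr_t>(end) -
                               reinterpret_cast<uintptr_t>(begin)) >> 2;
}
```

No call instruction survives: once the body is visible at the call site, it is just two arithmetic operations.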

So, how does inlining enable other optimisations? Using ARMv7, let’s convert a string to uppercase. We might have a utility change_case function that turns a single character either from upper to lower, or from lower to upper, and we’ll use it in our code:
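The original listing isn’t reproduced here, but a plausible reconstruction — based on the declaration `char change_case(char c, bool upper)` mentioned later in the post — might look like this:

```cpp
#include <cstddef>

// Hypothetical utility: convert one character's case.
// upper == true means lower -> upper; otherwise upper -> lower.
char change_case(char c, bool upper) {
    if (upper) {
        if (c >= 'a' && c <= 'z') return static_cast<char>(c - 32);
    } else {
        if (c >= 'A' && c <= 'Z') return static_cast<char>(c + 32);
    }
    return c;
}

// Uppercase a buffer in place. Once change_case is inlined here, the
// compiler can see that `upper` is always true and discard the other path.
void make_upper(char *string, size_t length) {
    for (size_t i = 0; i < length; ++i)
        string[i] = change_case(string[i], true);
}
```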

The compiler decides to inline change_case into make_upper, and then seeing that upper is always true, it can simplify the whole code to:

.LBB0_1:
  ldrb r2, [r0]         ; read next `c`; c = *string;
  sub r3, r2, #97       ; tmp = c - 'a'
  uxtb r3, r3           ; tmp = tmp & 0xff
  cmp r3, #26           ; check tmp against 26
  sublo r2, r2, #32     ; if lower than 26 then c = c - 32
                        ; c = ((c - 'a') & 0xff) < 26 ? c - 32 : c;
  strb r2, [r0], #1     ; store 'c' back; *string++ = c
  subs r1, r1, #1       ; reduce counter
  bne .LBB0_1           ; loop if not zero

There’s no trace left of the !upper case, and the compiler, having inlined the function, has a fresh copy of the code it can further modify to take advantage of things it knows are true. It does a neat trick to avoid a branch when checking whether the character is lowercase: if (c - 'a') & 0xff is less than 26, the character must be lowercase. It then conditionally subtracts 32, which turns a into A.

Inlining gives the compiler the ability to make local changes: the implementation can be special-cased at the inline site because, by definition, no other callers share that copy of the code. The special-casing can include propagating values known to be constants (like the upper bool above) and eliminating code paths that are unused.
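As a toy illustration (my own, not from the post) of constant propagation after inlining:

```cpp
// A helper whose second argument varies per call site.
static int scale(int x, int factor) {
    return x * factor;
}

// After inlining scale here, the compiler sees factor is the constant 8
// and can strength-reduce the multiply to a shift: x << 3. No general
// multiply, and no trace of `factor`, remains in the generated code.
int times_eight(int x) {
    return scale(x, 8);
}
```

The shared body of `scale` must handle any factor; the inlined copy need only handle 8.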

Inlining has some drawbacks though: if it’s overused, the code size of your program can grow quite substantially. The compiler has to make its best guess as to whether to inline a function (and the functions that it calls… and so on), based on heuristics about the code-size increase and whether the perceived benefit is worth it. Ultimately, though, it’s a guess.

In rare cases accepting the cost of calling a common routine can be a benefit: if there is an unavoidable branch in the routine that’s globally predictable, sometimes having one shared branch site can be better for the branch predictor. In many cases, though, the reverse is true: if there’s a branch in code that’s inlined many times across the codebase then sometimes the (more local) branch history for the many copies of that branch can yield more predictability. It’s…complex.

An important consideration for inlining is the visibility of the definition of the function you’re calling (that is, the body of the function). If the compiler has only seen the declaration of a function (e.g. in the case above just char change_case(char c, bool upper);), then it can’t inline it: there’s nothing to inline! In modern C++ with templates and a lot of code in headers, this usually isn’t a problem, but if you’re trying to minimise build times and interdependency this can be an issue.

Inlining is also one of the most heuristic-driven optimisations, with different compilers making reasonable but different guesses about which functions should be inlined. This can be frustrating: adding a single line to one function can have ripple effects throughout a codebase, changing inlining decisions far from the edit.
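When consistency matters more than the compiler’s guesswork, attributes can override the heuristics. In GCC and Clang they’re spelled like this (a sketch; other compilers differ, e.g. MSVC’s `__forceinline`):

```cpp
// always_inline: inline this function at every call site, regardless of
// the compiler's cost heuristics.
__attribute__((always_inline)) inline int add(int a, int b) {
    return a + b;
}

// flatten: inline every call made *inside* this function's body,
// recursively, so the whole call tree collapses into one routine.
__attribute__((flatten)) int sum3(int a, int b, int c) {
    return add(add(a, b), c);
}
```

Both are requests about code generation only; the functions behave identically either way.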

All that said: Inlining is the ultimate enabling optimisation. On its own, copying function bodies into call sites might save a few cycles here and there. But give the compiler a fresh copy of code at the call site, and suddenly it can propagate constants, eliminate dead branches, and apply transformations that would be impossible with a shared function body. Who said copy paste was always bad?

See the video that accompanies this post.


This post is day 17 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.


This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
