When Compiler Optimizations Hurt Performance

原始链接: https://nemanjatrifunovic.substack.com/p/when-compiler-optimizations-hurt

A recent benchmark comparing techniques for calculating UTF-8 sequence lengths produced a surprising performance gap. A hardware-assisted approach using `std::countl_one` and a `switch` statement, intended to improve efficiency, processed data at only 438-462 MB/s, far below a simpler, "naive" branching method that exceeds 2000 MB/s.

Investigation with the Linux `perf` tool pointed to memory stalls, but the root cause was that the compiler (clang++) generated a suboptimal jump table for the `switch` statement. Disabling jump-table generation with `-fno-jump-tables` restored performance to a level comparable with the naive approach.

Interestingly, g++ never emitted a jump table in the first place, so the flag had no effect there. This matches earlier observations (for example, Julian Squires's 2017 article) that jump tables are not *always* the fastest solution and can sometimes introduce overhead of their own. The experiment underlines how important it is to understand compiler optimizations and their effect on how code actually executes.

## Compiler Optimizations: A Performance Paradox

A recent discussion on Hacker News highlighted a surprising point: compiler optimizations do not *always* improve performance. While modern compilers excel both at micro-optimizations and at large-scale structural transformations, there is a "no man's land" in between where optimization becomes unreliable and can even *degrade* performance.

This stems from the difficulty of predicting how an optimization will interact with specific hardware and data sets. An optimization that helps on one CPU architecture (such as modern x86) may hurt on another (such as the N64). Even within a single architecture, factors like cache sizes and concurrent processes introduce unpredictable variables.

The core problem is that a compiler has to optimize arbitrary pieces of code without the full context that perfect optimization would require. A seemingly beneficial optimization can also introduce unintended side effects, such as increased register pressure. The general consensus is that optimization demands careful measurement: assumptions about performance are often wrong, and what worked in the past is no guarantee that it still works today. Ultimately, compilers are not omniscient, and deterministic rules do not always produce faster code.

Original article

Recently I have been working on benchmarks comparing techniques for calculating UTF-8 sequence lengths.

The method I described in Decoding UTF-8. Part IV: Determining Sequence Length - Counting Leading Bits uses hardware-assisted counting of the lead byte's leading one bits:

#include <bit>
int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
        case 0: return 1;
        case 2: return 2;
        case 3: return 3;
        case 4: return 4;
        default: return 0; // invalid lead
    }
}

I was pretty disappointed with the performance of this function. It was processing between 438 MB/s and 462 MB/s of text data, which was far less than the “naive” method described in Decoding UTF-8. Part II: Determining Sequence Length - a Straightforward Approach.
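That post's function isn't reproduced here; as a rough illustration (my reconstruction, not necessarily the exact Part II code), a branch-based check of the lead byte's high bits looks like this:

// A plausible "naive" branching version: inspect the lead byte's high bits
// directly. This is a reconstruction for illustration, not the Part II code.
int utf8_sequence_length_naive(unsigned char lead_byte) {
    if (lead_byte < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead_byte & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead_byte & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead_byte & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                                 // continuation or invalid lead byte
}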

As we saw in the Part IV post, the compiler (clang++ 18.1.3, aarch64) emits a small lookup table to avoid branching:

        adrp    x9, .Lswitch.table.utf8_sequence_length(unsigned char)
        add     x9, x9, :lo12:.Lswitch.table.utf8_sequence_length(unsigned char)
        ldr     w0, [x9, w8, uxtw #2]
...
.Lswitch.table.utf8_sequence_length(unsigned char):
        .word   1
        .word   0
        .word   2
        .word   3
        .word   4

The lookup table generated by the compiler is a degenerate case of a jump table, which is a common optimization for switch-case statements. The “naive” code from Part II uses branching instructions instead and processes over 2000 MB/s. Obviously, the branch predictor did its magic for the case of pure ASCII text, but the performance gap still surprised me.
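The benchmark harness itself isn't shown here; as a minimal sketch of the kind of throughput loop involved (the 64 MB pure-ASCII buffer, the timer, and the MB/s calculation are my assumptions, not the author's actual setup), it might look like this:

#include <bit>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Same mapping as the switch-based function shown above.
int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
        case 0: return 1;
        case 2: return 2;
        case 3: return 3;
        case 4: return 4;
        default: return 0; // invalid lead
    }
}

int main() {
    // Hypothetical input: 64 MB of pure ASCII, the easy case for the branch predictor.
    std::vector<unsigned char> data(64u * 1024 * 1024, 'a');

    auto start = std::chrono::steady_clock::now();
    unsigned long long consumed = 0; // keeps the loop from being optimized away
    for (std::size_t i = 0; i < data.size();) {
        int len = utf8_sequence_length(data[i]);
        i += len > 0 ? static_cast<std::size_t>(len) : 1; // skip invalid leads one byte at a time
        consumed += static_cast<unsigned long long>(len);
    }
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("consumed %llu bytes, %.1f MB/s\n",
                consumed, data.size() / (1024.0 * 1024.0) / seconds);
}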

Playing with the Linux perf tool on a WSL Linux installation turned out to be a pretty frustrating experience. It did point me to a very high number of cycles stalled in the back-end, which often indicates that the CPU is waiting for memory, but I couldn’t get perf annotate to work.

In any case, what I had was enough to let me try disabling jump tables; with

clang++ -O2 -fno-jump-tables --std=c++20 

the switch-case part of the function compiles to:

        cmp     w0, #2
        b.gt    .LBB0_4
        cbz     w0, .LBB0_6
        cmp     w0, #2
        b.eq    .LBB0_5
.LBB0_3:
        mov     w0, wzr
        ret
.LBB0_4:
        cmp     w0, #3
        ccmp    w0, #4, #4, ne
        b.ne    .LBB0_3
.LBB0_5:
        ret
.LBB0_6:
        mov     w0, #1
        ret

Lo and behold, the performance became comparable to the “naive” function from Part II.

I repeated the experiment using GNU g++ for AArch64. In that case, -fno-jump-tables had no effect because g++ never emitted a lookup table.

Not surprisingly, it turned out I was not the first to make this observation. For instance, there is an excellent article by Julian Squires from 2017 on the same topic, but focusing on the x86-64 platform: Are Jump Tables Always Fastest?
