When Compiler Optimizations Hurt Performance

原始链接: https://nemanjatrifunovic.substack.com/p/when-compiler-optimizations-hurt

A recent benchmark comparing techniques for calculating UTF-8 sequence lengths produced a surprising performance gap. A hardware-assisted approach using `std::countl_one` and a `switch` statement, intended to improve efficiency, processed data at only 438-462 MB/s, far below a simpler, "naive" branching method that exceeds 2000 MB/s.

Investigation with the Linux `perf` tool pointed to memory stalls, but the root cause was that the compiler (clang++) generated a suboptimal jump table for the `switch` statement. Disabling jump-table generation with `-fno-jump-tables` restored performance to a level comparable with the naive approach.

Interestingly, g++ never emitted a jump table in the first place, so the flag had no effect there. This matches earlier observations (for example, Julian Squires's 2017 article) that jump tables are not *always* the fastest solution and can sometimes introduce overhead of their own. The experiment underlines how important it is to understand compiler optimizations and their effect on how code actually executes.

## Compiler Optimizations: A Performance Paradox

A recent discussion on Hacker News highlighted a surprising point: compiler optimizations do not *always* improve performance. While modern compilers excel both at micro-optimizations and at large-scale structural transformations, there is a "no man's land" in between where optimization becomes unreliable and can even *degrade* performance.

This stems from the difficulty of predicting how an optimization will interact with specific hardware and data sets. An optimization that helps on one CPU architecture (such as modern x86) may hurt on another (such as the N64). Even within a single architecture, factors like cache sizes and concurrent processes introduce unpredictable variables.

The core problem is that a compiler has to optimize arbitrary pieces of code without the full context that perfect optimization would require. A seemingly beneficial optimization can also introduce unintended side effects, such as increased register pressure. The general consensus is that optimization demands careful measurement: assumptions about performance are often wrong, and what worked in the past is no guarantee that it still works today. Ultimately, compilers are not omniscient, and deterministic rules do not always produce faster code.

Original article

Recently I have been working on benchmarks comparing techniques for calculating UTF-8 sequence lengths.

The method I described in Decoding UTF-8. Part IV: Determining Sequence Length - Counting Leading Bits uses hardware-assisted counting of the lead byte's leading one bits:

#include <bit>
int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
        case 0: return 1;
        case 2: return 2;
        case 3: return 3;
        case 4: return 4;
        default: return 0; // invalid lead
    }
}

I was pretty disappointed with the performance of this function. It was processing between 438 MB/s and 462 MB/s of text data, which was far less than the “naive” method described in Decoding UTF-8. Part II: Determining Sequence Length - a Straightforward Approach.
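That post's function isn't reproduced here; as a rough illustration (my reconstruction, not necessarily the exact Part II code), a branch-based check of the lead byte's high bits looks like this:

// A plausible "naive" branching version: inspect the lead byte's high bits
// directly. This is a reconstruction for illustration, not the Part II code.
int utf8_sequence_length_naive(unsigned char lead_byte) {
    if (lead_byte < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((lead_byte & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead_byte & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead_byte & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                                 // continuation or invalid lead byte
}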

As we saw in the Part IV post, the compiler (clang++ 18.1.3, aarch64) emits a small lookup table to avoid branching:

        adrp    x9, .Lswitch.table.utf8_sequence_length(unsigned char)
        add     x9, x9, :lo12:.Lswitch.table.utf8_sequence_length(unsigned char)
        ldr     w0, [x9, w8, uxtw #2]
...
.Lswitch.table.utf8_sequence_length(unsigned char):
        .word   1
        .word   0
        .word   2
        .word   3
        .word   4

The lookup table generated by the compiler is a degenerate case of a jump table, which is a common optimization for switch-case statements. The “naive” code from Part II uses branching instructions instead and processes over 2000 MB/s. Obviously, the branch predictor did its magic for the case of pure ASCII text, but the performance gap still surprised me.
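The benchmark harness itself isn't shown here; as a minimal sketch of the kind of throughput loop involved (the 64 MB pure-ASCII buffer, the timer, and the MB/s calculation are my assumptions, not the author's actual setup), it might look like this:

#include <bit>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Same mapping as the switch-based function shown above.
int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
        case 0: return 1;
        case 2: return 2;
        case 3: return 3;
        case 4: return 4;
        default: return 0; // invalid lead
    }
}

int main() {
    // Hypothetical input: 64 MB of pure ASCII, the easy case for the branch predictor.
    std::vector<unsigned char> data(64u * 1024 * 1024, 'a');

    auto start = std::chrono::steady_clock::now();
    unsigned long long consumed = 0; // keeps the loop from being optimized away
    for (std::size_t i = 0; i < data.size();) {
        int len = utf8_sequence_length(data[i]);
        i += len > 0 ? static_cast<std::size_t>(len) : 1; // skip invalid leads one byte at a time
        consumed += static_cast<unsigned long long>(len);
    }
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    std::printf("consumed %llu bytes, %.1f MB/s\n",
                consumed, data.size() / (1024.0 * 1024.0) / seconds);
}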

Playing with the Linux perf tool on a WSL Linux installation turned out to be a pretty frustrating experience. It did point me to a very high number of cycles stalled in the back-end, which often indicates that the CPU is waiting for memory, but I couldn’t get perf annotate to work.

In any case, what I had was enough to let me try disabling jump tables; with

clang++ -O2 -fno-jump-tables --std=c++20 

the switch-case part of the function compiles to:

        cmp     w0, #2
        b.gt    .LBB0_4
        cbz     w0, .LBB0_6
        cmp     w0, #2
        b.eq    .LBB0_5
.LBB0_3:
        mov     w0, wzr
        ret
.LBB0_4:
        cmp     w0, #3
        ccmp    w0, #4, #4, ne
        b.ne    .LBB0_3
.LBB0_5:
        ret
.LBB0_6:
        mov     w0, #1
        ret

Lo and behold, the performance became comparable to the “naive” function from Part II.

I repeated the experiment using GNU g++ for AArch64. In that case, -fno-jump-tables had no effect because g++ never emitted a lookup table.

Not surprisingly, it turned out I was not the first to make this observation. For instance, there is an excellent article by Julian Squires from 2017 on the same topic, but focusing on the x86-64 platform: Are Jump Tables Always Fastest?
