Both GCC and Clang generate strange/inefficient code

Original link: https://codingmarginalia.blogspot.com/2026/02/both-gcc-and-clang-generate.html

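This post walks through the surprising and inconsistent assembly that Clang and GCC generate for a simple C++ function. The function checks whether a `std::array` contains only zeros by comparing the input array against a zero-initialized `std::array`.

GCC's output changes with the array size. For size 1 it uses a `test` instruction, in effect checking whether the single element is zero by bitwise-ANDing it with itself. For size 2 it compares directly against zero. Size 3 produces a mixed approach, with both a comparison and seemingly redundant instructions that set a register to zero.

Clang shows a different inefficiency. Size 1 compiles cleanly, but sizes 2 and 3 needlessly initialize the `allZeros` array on the stack even though the value is never read. For size 3, Clang uses a bitwise OR to detect nonzero elements, a clever approach that is still inefficient.

The author concludes that even with optimizations enabled (-O3), compilers do not always generate optimal or predictable code, even for a change as small as the array size. This underscores the importance of reading the generated assembly rather than blindly trusting the compiler.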

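A Hacker News discussion stressed that GCC and Clang can both produce surprising and potentially inefficient code when compiling C++, sometimes differing markedly from what they produce for C code. A key point was that C++ abstractions can get in the way of compiler optimizations, forcing the compiler to generate more general code that is harder to reduce to efficient machine instructions. Although the original post includes no benchmarks, commenters agreed that some of the generated code, such as the unnecessary zeroing of memory, is likely suboptimal. Others cautioned against assuming that "strange" code is *always* inefficient, noting that special cases involving pipelining and memory usage can make it beneficial. One commenter linked to a detailed investigation showing known issues in GCC's code generator. The discussion underlined how complex compilation is, with high-level code transformed through many optimization passes, and that the process does not always favor the most intuitive machine code.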

Original text

I ran into some surprisingly weird output of both Clang and gcc on a simple code snippet, and I thought I'd share it.

Consider the following C++ function which, in a roundabout way, checks whether an std::array passed as an argument only contains zeros:

#include <array>

static constexpr int arraySize = 1;

bool isAllZeros (const std::array<int, arraySize> &array) {
    std::array<int, arraySize> allZeros {};

    return array == allZeros;
}

In case you're wondering, the reason why this is correct is that initializing an array with { } results in each element of the array being value initialized, or in other words set to zero.
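As a quick check of that claim, here is a minimal sketch, assuming C++17 or later. The helper name valueInitIsAllZeros is made up for illustration and does not appear in the original snippet.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not from the original post): builds a std::array with
// empty braces and reports whether every element ended up as zero.
template <std::size_t N>
constexpr bool valueInitIsAllZeros() {
    std::array<int, N> values {};   // empty braces: each int is value-initialized
    for (int v : values) {
        if (v != 0) {
            return false;
        }
    }
    return true;
}

// Evaluated at compile time, so the build fails if the claim were wrong.
static_assert(valueInitIsAllZeros<1>());
static_assert(valueInitIsAllZeros<3>());
```

Because the check runs in a static_assert, it costs nothing at run time; it only documents the language rule the function relies on.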

What happens if we compile this with gcc? Using godbolt and the latest gcc version (15.2), with optimizations on ("-O3") we get the following x86-64 Assembly code:

isAllZeros(std::array<int, 1ul> const&):
        mov     eax, DWORD PTR [rdi]
        test    eax, eax
        sete    al
        ret

Already here we get fairly non-intuitive output! We have set arraySize to 1, so we're effectively checking whether a single integer value is 0. The generated code does this by loading the integer value, ANDing it with itself via test (which leaves the value unchanged and only updates the flags), and then setting the return value of the function to the CPU's zero flag. This is not what you'd get by translating the C++ literally, although test reg, reg is a common x86 idiom for a zero check, and it's still easy enough to understand (if perhaps wasteful-looking).
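To make the flag logic concrete, here is a rough C++ model of the test/sete pair. It is an illustration of the instruction semantics only, not compiler output, and the function name testThenSete is invented for this sketch.

```cpp
#include <cstdint>

// Models:
//   test eax, eax   ; computes eax & eax, sets the zero flag if the result is 0
//   sete al         ; al = zero flag
constexpr bool testThenSete(std::uint32_t eax) {
    const std::uint32_t anded = eax & eax;   // always the same value as eax
    const bool zeroFlag = (anded == 0);      // what "test" records in ZF
    return zeroFlag;                         // what "sete al" hands back
}

static_assert(testThenSete(0));    // a zero element: function returns true
static_assert(!testThenSete(7));   // a nonzero element: function returns false
```

In other words, the sequence computes the same thing as `array[0] == 0`, just phrased through the flags register.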

Let's see what happens if we set arraySize to 2:

isAllZeros(std::array<int, 2ul> const&):
        cmp     QWORD PTR [rdi], 0
        sete    al
        ret

That's more like it! Now we're simply comparing a QWORD-sized block of memory (8 bytes, which corresponds to two integers) against 0 and setting the return value to the result of the comparison. This is a lot more intuitive than the arraySize = 1 case, and it's not clear why gcc didn't use the same approach there.
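The same idea can be sketched in C++, assuming a platform where int is 4 bytes and std::array<int, 2> has no padding (true on x86-64, but an assumption nonetheless). The name bothZeroAsQword is hypothetical.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Sketch of the 8-byte comparison: view both ints as one 64-bit block.
bool bothZeroAsQword(const std::array<int, 2> &array) {
    static_assert(sizeof(std::array<int, 2>) == sizeof(std::uint64_t),
                  "sketch assumes two 4-byte ints with no padding");

    std::uint64_t block;
    // memcpy is the well-defined way to reinterpret the bytes; optimizing
    // compilers typically turn it into a plain load.
    std::memcpy(&block, array.data(), sizeof(block));

    // All 64 bits are zero exactly when both ints are zero.
    return block == 0;
}
```

This works because an int is zero exactly when all of its bits are zero, so two ints are both zero exactly when the combined 64 bits are.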

How about arraySize = 3, meaning a 12-byte block?

isAllZeros(std::array<int, 3ul> const&):
        cmp     QWORD PTR [rdi], 0
        je      .L5
.L2:
        mov     eax, 1
        test    eax, eax
        sete    al
        ret
.L5:
        mov     eax, DWORD PTR [rdi+8]
        test    eax, eax
        jne     .L2
        xor     eax, eax
        test    eax, eax
        sete    al
        ret

Now things are getting really hectic. This time gcc decided to use a mixture of both strategies, using a cmp instruction to check whether the first 8 bytes are zero, and the test instruction to check the remaining 4 bytes.

That's not the weirdest part though. The strangest bit is the block in between the ".L2" and ".L5" labels, which as far as I can tell is using a very odd sequence of instructions to simply set eax to the value 0: it loads 1, tests it (which clears the zero flag), and then sete overwrites the low byte with 0. A nearly identical sequence is used at the end of the code to set the return value to 1.
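To double-check that reading of the two sequences, here is a toy C++ model of eax and the zero flag. The Cpu struct and all function names are invented for illustration; this is a sketch of the instruction semantics, not anything gcc emits.

```cpp
#include <cstdint>

// Toy model of the one register and one flag involved.
struct Cpu {
    std::uint32_t eax = 0;
    bool zf = false;   // zero flag
};

// test eax, eax: AND eax with itself and set ZF when the result is zero.
constexpr void testEaxEax(Cpu &cpu) {
    cpu.zf = ((cpu.eax & cpu.eax) == 0);
}

// sete al: write ZF (0 or 1) into the low byte of eax, keeping the upper bytes.
constexpr void seteAl(Cpu &cpu) {
    cpu.eax = (cpu.eax & 0xFFFFFF00u) | (cpu.zf ? 1u : 0u);
}

// Block at .L2: mov eax, 1 / test eax, eax / sete al
constexpr std::uint32_t pathAtL2() {
    Cpu cpu;
    cpu.eax = 1;          // mov eax, 1
    testEaxEax(cpu);      // ZF = 0, because 1 is nonzero
    seteAl(cpu);          // low byte becomes 0
    return cpu.eax;
}

// Tail after .L5: xor eax, eax / test eax, eax / sete al
constexpr std::uint32_t pathAtEnd() {
    Cpu cpu;
    cpu.eax ^= cpu.eax;   // xor eax, eax
    testEaxEax(cpu);      // ZF = 1, because eax is zero
    seteAl(cpu);          // low byte becomes 1
    return cpu.eax;
}

static_assert(pathAtL2() == 0);    // the "return false" path
static_assert(pathAtEnd() == 1);   // the "return true" path
```

Under this model, each three-instruction sequence could be replaced by a single instruction that loads the constant directly, which is what makes the output look so odd.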

How about clang? Surely we won't see two different compilers behaving oddly here? Again we use "-O3" and the latest version on godbolt, which is Clang 21.1.0.

With arraySize = 1:

isAllZeros(std::array<int, 1ul> const&):
        cmp     dword ptr [rdi], 0
        sete    al
        ret

Phew! Looks good and normal.

How about with size 2?

isAllZeros(std::array<int, 2ul> const&):
        mov     qword ptr [rsp - 8], 0
        cmp     qword ptr [rdi], 0
        sete    al
        ret

Mmmm. The actual comparison code looks just as good, but there's a new inefficiency here which gcc avoided: specifically, the first instruction which is writing a zero to the stack. This corresponds to initializing the allZeros variable on the stack, even though the rest of the Assembly code never reads this value.

Lastly and for completeness, here is the output from Clang with size 3:

isAllZeros(std::array<int, 3ul> const&):
        mov     dword ptr [rsp - 8], 0
        mov     qword ptr [rsp - 16], 0
        mov     eax, dword ptr [rdi + 8]
        or      rax, qword ptr [rdi]
        sete    al
        ret

Here we see Clang using bit manipulation with the or instruction to generate a value that is nonzero when any of the input elements are nonzero, and then setting the result of the comparison based on the result of the bitwise or operation. Clever really! However, the unnecessary initialisation of allZeros on the stack is still present. One last thing which is unclear is: why did clang not perform these unnecessary writes when arraySize was just 1?
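The or trick can be written out in C++ as a sketch, again assuming 4-byte ints and no padding in std::array<int, 3>. The function name allZeroViaOr is hypothetical, and this mirrors only the comparison part of Clang's output, not the stack writes.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Sketch of the bitwise-or approach for three ints.
bool allZeroViaOr(const std::array<int, 3> &array) {
    static_assert(sizeof(std::array<int, 3>) == 12,
                  "sketch assumes three 4-byte ints with no padding");

    // Roughly "mov eax, dword ptr [rdi + 8]": load the third int,
    // zero-extended to 64 bits.
    std::uint32_t third;
    std::memcpy(&third, array.data() + 2, sizeof(third));

    // Roughly the memory operand of "or rax, qword ptr [rdi]":
    // the first two ints viewed as one 64-bit block.
    std::uint64_t firstTwo;
    std::memcpy(&firstTwo, array.data(), sizeof(firstTwo));

    // The or is zero only if no bit is set in any of the three ints.
    return (firstTwo | std::uint64_t{third}) == 0;
}
```

A set bit anywhere in the input survives the or, so a single zero check at the end covers all twelve bytes without any branches.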

Moral of the story? As advanced as compilers are, we certainly can't trust them to generate optimal code, or even to be predictable when doing seemingly trivial changes to code (such as changing the size of an array).

Note: If I missed anything I apologise in advance - please send any feedback to: the name of this blog (see the URL) @gmail.com
