# Bypassing the Branch Predictor

Original link: https://nicula.xyz/2025/03/10/bypassing-the-branch-predictor.html

## Optimizing a rarely executed branch

This exploration focuses on optimizing code in which the critical path (sending a transaction) is rarely executed, leading to branch mispredictions and degraded performance. The scenario is a financial system where most transactions are abandoned, but the 'send' path *must* be fast.

The core problem is that the CPU's branch predictor learns to predict the common 'abandon' path, which adds latency when the rare 'send' path is finally needed. Direct control over branch prediction is limited: older x86 processors (Pentium 4) offered instruction prefixes for prediction hints, but modern CPUs ignore them. C++20's `[[likely]]` and `[[unlikely]]` attributes mainly reorder code and do not reliably influence the branch predictor.

One workable solution is to 'warm' the branch predictor with mocked transaction data that *always* takes the 'send' path, biasing the predictor toward the fast path. Discarding these mock transactions at the network layer adds negligible overhead. Carl Cook demonstrated this technique, gaining 5 microseconds in a high-frequency-trading setting. It beats trying to manipulate predictions directly, because it ensures the entire 'send' path is warm in the cache.

## Bypassing the Branch Predictor - summary

This Hacker News discussion centers on mitigating branch-misprediction penalties in performance-critical code, particularly in high-frequency trading (HFT). A mispredicted branch stalls the CPU pipeline and hurts speed.

Several approaches were explored. Simple conditional moves can sometimes help, but may also inhibit speculation. More elaborate solutions include borrowing anti-speculation tricks used in the kernel, or deliberately raising an exception on the failure path to prevent prediction. Another idea is to flood the branch target predictor with many potential targets to induce mispredictions.

The discussion also touched on the limits of CPU branch-prediction hints (usually ignored) and on LLMs hallucinating nonexistent APIs or instructions. A key point was to consider cache behavior and keep branch history consistent in order to 'mistrain' the predictor. Ultimately, avoiding the branch altogether, for example by invalidating transactions on the less frequent path, was suggested as a potentially effective strategy.

The conversation underscores the intricate relationship between code, CPU architecture, and performance optimization.
Original article

A couple of days ago I was thinking about what you can do when the branch predictor is effectively working against you, and thus pessimizing your program instead of optimizing it.

Let’s work with something relatively simple & concrete: consider that we want to write some kind of financial system (maybe a trading system) and all of our transaction requests arrive at a certain function before being either (a) sent out to some authoritative server, or (b) abandoned. Let’s also assume that:

  1. The vast majority of our transaction requests end up being abandoned at the last step.
  2. We care A LOT about the speed of the ‘send’ path and we want to make it as fast as possible.
  3. We don’t care at all about the speed of the ‘abandon’ path.

The code would look roughly like this:

```cpp
/* Defined somewhere else */
struct Transaction;
bool should_send(Transaction *t);
void send(Transaction *t);
void abandon(Transaction *t);

void resolve(Transaction *t)
{
    if (should_send(t)) {
        send(t);
    } else {
        abandon(t);
    }
}
```

The implication of assumption #1 is that the branch predictor will be heavily primed to predict that should_send(t) returns false. What this means is that when we finally get a transaction that needs to be sent out, we’ll likely have to pay a branch misprediction penalty (~20 cycles); plus, our send() function may not be in the instruction cache just yet, because it’s so rarely executed. Also, we won’t get the advantage of pipelining whatever instructions are needed for send().

Since we only care about the speed of the send() path, I was wondering if there’s a way of telling the CPU that we don’t actually want to rely on the branch predictor at all when executing send() or abandon(): we always want to assume that send() will be executed.

## A low-level solution?


I asked Claude if there is such a way to basically hard-code branch prediction rules into the machine code, and the answer was that there’s no way to do this on x86, but there is a way on ARM: the BEQP (predict branch taken) and BEQNP (predict branch not taken) instructions.

Those ARM instructions are just hallucinated, and the reality is actually the other way around: ARM doesn’t have a way of hard-coding ‘predictions’, but x86 does. More accurately: some old x86 processors do. On the Pentium 4 series, those rules/hints are encoded as instruction prefixes:

  • 0x2E (branch not taken);
  • 0x3E (branch taken).

So if a jump instruction came in with the prefix 0x3E, then the processor would assume that the branch is taken.

On modern x86 processors, those instruction prefixes are simply ignored, so compilers obviously won’t even bother generating them when you’re targeting such a CPU.

Another 'low-level' approach, one that doesn't work here, is the [[likely]] and [[unlikely]] attributes introduced in C++20. These attributes typically just reorder some labels/paths so that the path marked [[likely]] requires fewer jumps. Whether we mark the send() branch as likely or leave the code as-is, Clang and GCC generate the same assembly:

```asm
resolve(Transaction*):
    push    rbx
    mov     rbx, rdi
    call    should_send(Transaction*)
    mov     rdi, rbx
    test    al, al
    je      .L2
    pop     rbx
    jmp     send(Transaction*)
.L2:
    pop     rbx
    jmp     abandon(Transaction*)
```

It won't really matter that the send() branch doesn't need to take the `je .L2` jump at line 7, because the branch predictor will be primed to assume the abandon() branch nonetheless. If you're writing code for a modern x86 processor, you can treat [[likely]] and [[unlikely]] as completely unrelated to the CPU branch predictor, because the microarchitecture exposes no way to override its predictions. So those attributes are just a reordering mechanism. As noted in the proposal doc, however, on some architectures the compiler could use the likely/unlikely hints to override the branch predictor:

Some of the possible code generation improvements from using branch probability hints include:

  • […]
  • Some microarchitectures, such as Power, have branch hints which can override dynamic branch prediction.

## A higher-level solution?


So if we can’t bypass the branch predictor on modern x86 processors with some fine-grained and guaranteed mechanism, and we also don’t want to go buy a bunch of Pentium 4 CPUs and run our code on them, we need to think about a higher-level solution that works well enough; it doesn’t need to be guaranteed to work 100% of the time.

One such solution that I know of is one that Carl Cook talked about during his CppCon 2017 talk: we can fill our system with mocked transaction data for which should_send(t) returns true. We have to do this often enough that the mocked transactions vastly outnumber the real ones, so the branch predictor will be primed to assume that the send() path executes on practically every resolve() call. Overall, this may actually be a better approach than hard-coding prediction rules, because those rules wouldn't really guarantee that the whole send() path gets executed: the hinted prediction may lead only to partial, speculative execution until the should_send(t) condition is actually evaluated and the pipeline is flushed, so important code and data might still never land in the instruction/data caches.

For getting rid of those mocked transactions after they get ‘sent out’ from our program, Carl Cook mentions that network cards are capable of identifying and discarding such messages without adding significant overhead to the real transactions, so we can just leave it at that.

As mentioned in his talk, this whole ‘mocked/dummy transaction’ system gives him a 5 microsecond speed-up which, as the talk’s title suggests, matters a lot for his use case.

Thanks for reading!

You can check out the r/cpp discussion regarding this blog post here.
