通过价值投机击败 L1 缓存（2021）

通过价值投机击败 L1 缓存（2021）
Beating the L1 cache with value speculation (2021)

原始链接: https://mazzo.li/posts/value-speculation.html

本文介绍了一种称为“价值推测”的技术，该技术可以提高计算系统的性能。它的具体目标是提高将存储在链表中的大量数字序列相加的程序的执行速度。问题在于现代计算机中央处理单元 (CPU) 内指令的并行执行。尽管 CPU 同时处理多个指令，但分支会产生阻碍这种并行化的障碍。只要存在决策点，例如“if”语句、“while”或“for”循环，就会发生分支。分支会导致问题，因为 CPU 需要在前进之前确定是采用一条路径还是另一条路径，从而导致延迟。为了解决这些问题，本文提出了“价值投机”方法。通过仔细考虑CPU如何处理数据，作者建议猜测循环期间可能访问的数据值。系统可以使用分支预测器来尝试预测分支将采取的方向，以确定猜测是否正确。在前面提到的链表问题的上下文中，系统猜测下一个节点值，递增当前指针，并检查新指针是否与预期值匹配。如果猜测不正确，系统会相应调整指针。由于链表条目是相邻的，这种方法大大降低了分支预测的频率，使CPU能够更快地执行计算。然而，必须注意的是，这种方法需要仔细处理，以确保与不同软件编译器的兼容性。编译器有时会忽略这种看似多余的猜测机制，从而导致删除旨在提高性能的优化。因此，作者提供了解决方案的几种变体，以帮助说服编译器生成最佳的机器代码。在测试数据大小和实现的各种组合后，发现了显着的改进，16 KB 数据集的处理速率高达每秒 44.96 GB。当处理 CPU 缓存无法容纳的较大数据集时，处理速度被限制在每秒 15 GB 左右，但速度仍然快得令人印象深刻。总的来说，本文展示了一种巧妙的技术，通过仔细考虑底层硬件机制和软件工具施加的限制来提高计算系统的效率。

该用户讨论了一种用于在 C 和 C++ 中遍历数组和链表的方法。当递增指向数组元素的指针时，它指向内存中该类型的下一个对象。通过这样做，用户可以轻松地迭代数组。然而，他们指出，在链表上下文中向指针加 1 的工作方式类似，其中下一个链表节点可能是下一个数组元素，因为它们共享连续内存空间。用户提到现代 CPU 利用管道来加速同时执行多个指令的过程。当指令之间存在依赖性时就会出现瓶颈，从而导致称为管道停顿的错误。内存加载通常会引入这些瓶颈，因为加载内存通常比其他操作慢。为了克服这个问题，处理器可以乱序执行指令，尽管为了防止混乱，它们必须看起来好像正在执行。这允许 CPU 在确认内存中的下一个节点是否对应于实际列表中的下一个节点之前，推测性地开始处理循环的下一个迭代。如果预测结果与实际情况相符，则 CPU 已成功降低管道停顿的可能性。相反，如果预测失败，则不会提交任何结果，并且 CPU 按预期顺序向前移动。最后，用户承认，考虑到连续数组通常可以使用向量有效地执行此任务，该技术看起来很复杂并且似乎没有必要。他们建议，涉及频繁更改链表的特定场景可能需要利用这种方法来最大限度地减少昂贵操作的影响。此外，用户注意到，由于更好的缓存局部性和减少的指针追踪问题，在单个大分配中重新排列链表的内存布局可能会提高性能。或者，可以实现分段数组，其中链表元素单独存储，以实现类似的好处。然而，在实践中，对于典型的应用程序来说，此类技术可能过于复杂，因此需要更简单的解决方案，例如展开链表或通过向量内的索引表示连接。

原文

If we have a heuristic to guess some value cheaply, we can remove a data dependency in a tight loop using the branch predictor. This allows the CPU to run more instructions in parallel, increasing performance. If this explanation does not make much sense to you, keep reading to learn about some of the magic making your CPU fast!

Per Vognsen’s twitter feed is full of neat low-level curiosities, usually leveraging CPU features for some performance benefit.

Recently he tweeted about a trick that I had never heard of – value speculation. The trick exploits the branch predictor to guess values, enabling more instruction parallelism and therefore removing a bottleneck on the L1 cache. Note that the bottleneck is not due to L1 cache misses, but on L1 cache hits introducing unwanted data dependencies.

In this post I explain the machinery involved, including a primer on branch prediction and CPU caches, so that anybody with a passing knowledge of C and how code is executed on CPUs should be able to follow.

The code for the post is available here. All the numbers are from a Xeon E5-1650 v3, an Intel Haswell processor with L1 / L2 / L3 cache of 32kB, 256kB, and 15MB respectively. The code was compiled with clang -O3, and not with gcc, for reasons explained later.

Before starting, I’d like to stress that L1 cache hits are almost certainly not the bottleneck of your application! This is just a very neat trick that illuminates some CPU features, not a guide on how to improve the performance of your average piece of C code.

The setup – summing linked lists #

We have a simple linked list data type, and a function summing all the elements of a given linked list:

typedef struct Node {
  uint64_t value;
  struct Node *next; // NULL for the last node
} Node;

uint64_t sum1(Node* node) {
  uint64_t value = 0;
  while (node) {
    value += node->value;
    node = node->next;
  }
  return value;
}

// Allocate 5MB of linked list nodes, and link them sequentially, with
// random data in the `value`s.
uint64_t n = 312500llu; // 312500 * sizeof(Node) = 312500 * 16 bytes = 5000000 bytes
Node *nodes = malloc(n * sizeof(Node));
for (uint64_t i = 0; i < n - 1; i++) {
  nodes[i].value = random_uint64();
  nodes[i].next = &nodes[i+1];
}
nodes[n-1].value = random_uint64();
nodes[n-1].next = NULL;

// Now sum.
sum1(&nodes[0]);

On a server with a relatively old Xeon E5-1650 v3, running sum1 with the sample data takes 0.36 milliseconds, which means that we’re processing our linked list at roughly 14GB/s. In the rest of the post will will identify the bottleneck and get around it with value speculation, bringing the throughput for this dataset to 30GB/s.

The impact of the fix varies depending on the size of the dataset. If it is already entirely in the CPU cache, the improvement is much more pronounced, since otherwise we are quickly constrained by how fast we can read data from RAM. This graph shows the performance improvement over differently sized datasets (higher is better):

The chart shows the performance of sum1 together with the performance of two improved functions, sum2 and sum3. We go from a throughput of 14GB/s in sum1 to more than 45GB/s in sum3 if the data fits entirely in the L1 cache (the 16kB dataset), with the performance decreasing slightly for datasets fitting in the L2 and L3 cache (128kB and 5MB datasets). If the dataset does not fit entirely in any CPU cache (~4GB dataset) we go from 10GB/s to 15GB/s, which is as fast as the RAM allows.

#

Modern CPUs do not process instructions serially, but rather handle many at the same time. They read many instructions at once, break them down in stages, and then try to fill all the computation units they have with as many tasks from as many instructions as possible. For instance, modern Intel processors are designed for a throughput of 4 instructions per clock cycle, and AMD Zen processors for up to 5 or 6.

However, branches pose a challenge when wanting to execute instructions in parallel. Let’s go back to our function sum1:

uint64_t sum1(Node* node) {
  uint64_t value = 0;
  while (node) {
    value += node->value;
    node = node->next;
  }
  return value;
}

; rdi = node and rax = value.
; rax is the return value register (we're returning value)
sum1:
  xor     rax, rax                 ; value = 0
  test    rdi, rdi                 ; if node is NULL, exit, otherwise start loop
  je      end
loop:
  add     rax, qword ptr [rdi]     ; value += node->value
  mov     rdi, qword ptr [rdi + 8] ; node = node->next
  test    rdi, rdi                 ; if node is not NULL, repeat loop,
  jne     loop                     ; otherwise exit
end:
  ret

The loop body is made out of 4 instructions, the last of which a jump. Without special measures, every instruction up to the jne must be executed before proceeding to the next instruction, since we need to know if we’ll go to the beginning of the loop or continue. In other words the conditional jump would introduce a barrier in the instruction level parallelism internal to the CPU.

However, executing many instructions at once is so important that dedicated hardware – the branch predictor – is present in all modern CPUs to make an educated guess on which way we’ll go at every conditional jump. The details of how this works are beyond the scope of this blog post, but conceptually your CPU observes your program as it runs and tries to predict which branch will be taken by remembering what happened in the past.

Even without knowing much about the branch prediction, we expect the predictor to do a great job for our test case – we always go back to the beginning of the loop apart from when we stop consuming the list. On Linux, we can verify that this is the case with perf stat:

$ perf stat ./value-speculation-linux
...
         2,507,580      branch-misses             #    0.04% of all branches

The branch predictor gets it right 99.96% of the time. So the CPU can parallelize our instructions with abandon, right? …right?

#

Let’s focus on the loop body of sum1:

; rdi = node and rax = value.
loop:
  add     rax, qword ptr [rdi]     ; value += node->value
  mov     rdi, qword ptr [rdi + 8] ; node = node->next
  test    rdi, rdi                 ; if node is not NULL, repeat loop,
  jne     loop                     ; otherwise exit

To increment value (rax), we need to know the value of node (rdi), which depends on the mov in the previous iteration of the loop. The same is true for the mov itself – it is also dependent on the result of the previous mov to operate. So there’s a data dependency between each iteration of the loop: we must have finished reading node->next ([rdi + 8]) at iteration n before we can start executing the add and mov at iteration n+1.

Moreover, reading the node->next ([rdi + 8]) is slower than you might think.

Modern CPUs are a lot better at adding numbers than reading from memory. For this reason, a series of fast caches exist between the CPU and main memory. All reading and writing from main memory normally goes through the cache – if the data we are interested in is not already present, the CPU will load a chunk of memory (a “cache line”, 64 bytes on x86) which contains our desired data into the cache. The fastest cache is usually called L1 (successive caching layers being predictably called L2, L3, …).

Our setup is the best-case scenario when it comes to CPU caches – we read a bunch of memory sequentially, utilizing every byte along the way. However, even if the L1 cache is very fast, it is not free: it takes around 4 CPU cycles to read from it. This will make our mov and add take at least 4 cycles to complete. The other two instructions, je and test, will take only one cycle.

So the number of cycles needed to go through a single loop iteration is bounded by the 4 cycles it takes to read from L1 cache. The data I get from the Xeon I tested the program with is roughly consistent with this:

16kB, 10000 iterations
  sum1:  8465052097154389858,  1.12us,  14.25GB/s,  3.91 cycles/elem,  1.03 instrs/cycle,  3.48GHz,  4.01 instrs/elem
128kB, 10000 iterations
  sum1:  6947699366156898439,  9.06us,  14.13GB/s,  3.95 cycles/elem,  1.01 instrs/cycle,  3.49GHz,  4.00 instrs/elem
5000kB, 100 iterations
  sum1:  2134986631019855758,  0.36ms,  14.07GB/s,  3.96 cycles/elem,  1.01 instrs/cycle,  3.48GHz,  4.00 instrs/elem
4294MB, 1 iterations
  sum1: 15446485409674718527,  0.43 s,   9.94GB/s,  5.60 cycles/elem,  0.71 instrs/cycle,  3.48GHz,  4.00 instrs/elem

The important numbers are cycles/elem and instrs/cycle. We spend roughly 4 cycles per list element (that is to say, per loop iteration), corresponding to a throughput of roughly 1 instruction per cycle. Given that the CPU in question is designed for a throughput of 4 instructions per cycle, we’re wasting a lot of the CPU magic at our disposal, because we’re stuck waiting on the L1 cache.

#

We finally get to the trick. As discussed, we are stuck waiting on reading what the next node address is. However, in our setup we allocate the list in a contiguous block of memory, and therefore the nodes are always next to each other.

So here’s the key idea: try to guess the next node by just bumping the previous value. If the guess is wrong, set the node to the “real” next value. In C, this is how it would look like:

uint64_t faster_sum(Node* node) {
  uint64_t value = 0;
  Node* next = NULL;
  while (node) {
    value += node->value;
    next = node->next;
    // Guess the next value
    node++;
    // But fix it up if we guessed wrong (in case the nodes are not
    // next to each other).
    if (node != next) {
      node = next;
    }
  }
  return value;
}

This looks quite bizarre. We are still reading node->next in the comparison node != next to make sure our guess is right. So at first glance this might not seem like an improvement.

This is where the branch predictor comes in. In the case of lists where most nodes are next to each other (as is the case in our test code), the branch predictor will guess that the if (node != next) { ... } branch is not taken, and therefore we’ll go through loop iterations without having to wait for the L1 read.

Note that when the branch predictor is wrong (for example when the list ends, or if we have non-contiguous nodes) the CPU will need to backtrack and re-run from the failed branch prediction, which is costly (15 to 20 cycles on our processor). However, if the list is mostly contiguous, the trick works and makes our function 50-200% faster.

However there is one last challenge remaining to reach the final code and show you numbers – convincing compilers that our code is worth compiling.

#

Let’s go back to the code we showed for value speculation in C:

uint64_t faster_sum(Node* node) {
  uint64_t value = 0;
  Node* next = NULL;
  while (node) {
    value += node->value;
    next = node->next;
    node++;
    if (node != next) {
      node = next;
    }
  }
  return value;
}

Both gcc and clang easily deduce that the guessing is semantically pointless, and compile our trick away, making the compiled version of faster_sum the same as sum1. This is an instance where the compiler smartness undoes human knowledge about the underlying platform we’re compiling for.

Per Vognsen’s gist uses the following trick to get compilers to behave – this is the first improvement to our sum1, sum2:

static uint64_t sum2(Node *node) {
  uint64_t value = 0;
  while (node) {
    value += node->value;
    Node *predicted_next = node + 1;
    Node *next = node->next;
    if (next == predicted_next) {
      // Prevent compilers optimizing this apparently meaningless branch away
      // by making them think we're changing predicted_next here.
      //
      // This trick, however, does not work with GCC, only with clang. GCC here
      // derives that `next` and `predicted_next` are the same, and therefore
      // merges them into the same variable, which re-introduces the data
      // dependency we wanted to get rid of.
      asm("" : "+r"(predicted_next));
      node = predicted_next;
    } else {
      node = next;
    }
  }
  return value;
}

; rax = value, rcx = next, rdi = node
; Note that rax is the return value register (we are returning the value)
sum3:
  xor     rax, rax                   ; value = 0
  xor     rcx, rcx                   ; next = NULL
  test    rdi, rdi                   ; if node is null, go to the end,
  je      end                        ; otherwise start loop
loop_body:
  add     rax, qword ptr [rdi]       ; value += node->value
  mov     rcx, qword ptr [rdi + 8]   ; next = node->next
  add     rdi, 16                    ; node++
  cmp     rcx, rdi                   ; if node is equal to next,
  je      loop_body                  ; restart loop, otherwise fix up node
  mov     rdi, rcx                   ; node = next
  test    rdi, rdi                   ; if node is not NULL restart the loop,
  jne     loop_body                  ; otherwise exit.
end:
  ret

#

These are the final numbers for our four functions:

16kB, 10000 iterations
  sum1:  8465052097154389858,  1.12us,  14.25GB/s,  3.91 cycles/elem,  1.03 instrs/cycle,  3.48GHz,  4.01 instrs/elem
  sum2:  8465052097154389858,  0.57us,  27.97GB/s,  1.99 cycles/elem,  5.02 instrs/cycle,  3.48GHz, 10.01 instrs/elem
  sum3:  8465052097154389858,  0.36us,  44.96GB/s,  1.24 cycles/elem,  4.05 instrs/cycle,  3.48GHz,  5.01 instrs/elem
128kB, 10000 iterations
  sum1:  6947699366156898439,  9.05us,  14.14GB/s,  3.95 cycles/elem,  1.01 instrs/cycle,  3.49GHz,  4.00 instrs/elem
  sum2:  6947699366156898439,  4.51us,  28.38GB/s,  1.97 cycles/elem,  5.09 instrs/cycle,  3.49GHz, 10.00 instrs/elem
  sum3:  6947699366156898439,  3.79us,  33.80GB/s,  1.65 cycles/elem,  3.03 instrs/cycle,  3.49GHz,  5.00 instrs/elem
5000kB, 100 iterations
  sum1:  2134986631019855758,  0.35ms,  14.09GB/s,  3.95 cycles/elem,  1.01 instrs/cycle,  3.48GHz,  4.00 instrs/elem
  sum2:  2134986631019855758,  0.19ms,  26.27GB/s,  2.12 cycles/elem,  4.72 instrs/cycle,  3.48GHz, 10.00 instrs/elem
  sum3:  2134986631019855758,  0.17ms,  28.93GB/s,  1.93 cycles/elem,  2.60 instrs/cycle,  3.48GHz,  5.00 instrs/elem
4294MB, 1 iterations
  sum1: 15446485409674718527,  0.44 s,   9.66GB/s,  5.76 cycles/elem,  0.69 instrs/cycle,  3.48GHz,  4.00 instrs/elem
  sum2: 15446485409674718527,  0.33 s,  13.19GB/s,  4.22 cycles/elem,  2.37 instrs/cycle,  3.48GHz, 10.00 instrs/elem
  sum3: 15446485409674718527,  0.30 s,  14.20GB/s,  3.91 cycles/elem,  1.28 instrs/cycle,  3.47GHz,  5.00 instrs/elem

The numbers are provided by the Linux perf_event_open syscall.

The first three datasets are meant to fit in the L1 / L2 / L3 cache. In those cases, the improvements are very pronounced, and sum3 is crunching the data at around 4 instructions per cycle, which should be close to the limit on the processor I tested the code on. When the data does not fit in the cache, the bottleneck becomes filling it, and we process the data at roughly 15 GB/s.

I believe that this is as fast as one can go with “simple” single-threaded reading from RAM, and it’s consistent with data from sysbench:

$ sysbench memory --memory-block-size=1G --memory-oper=read --threads=1 run
...
102400.00 MiB transferred (15089.75 MiB/sec)
...

The RAM-reading speed could probably be improved using SIMD streaming instructions or by reading from multiple threads, although the implementation would be significantly more complicated.

And so we complete our journey into this low-level trick! If you want more of this, I can’t reccomend Per’s account enough – figuring out how his tricks works has been very educational.

Thanks to Alexandru Scvortov, Niklas Hambüchen, Alex Appetiti, and Carter T Schonwald for reading drafts of this post. Niklas also clarified some details regarding RAM speeds, and suggested sysbench to measure single threaded RAM reading speed in particular. Also thanks to Per Vognsen and Jason Rohem for spotting a few typos, and to Rihalto for pointing out a better sum3 and some misleading wording.

Bonus track – a compiler friendly C version #

Alexander Monakov suggested a more robust C function which works well with both gcc and clang, performs as well as sum3, and does not resort to any assembly:

uint64_t sum5(Node *node) {
  uint64_t value = 0;
  Node *next = NULL;
  for (; node; node = node->next) {
    for (;;) {
      value += node->value;
      if (node + 1 != node->next) {
        break;
      }
      node++;
    }
  }
  return value;
}