Beating the L1 cache with value speculation (2021)

原始链接: https://mazzo.li/posts/value-speculation.html

This article describes a technique called "value speculation" that can improve performance. Its specific goal is to speed up a program that sums a long sequence of numbers stored in a linked list. The problem lies in how instructions are executed in parallel inside a modern CPU: although the CPU works on many instructions at once, branches get in the way of this parallelism. A branch occurs at every decision point, such as an "if" statement or a "while"/"for" loop, and branches cause trouble because the CPU has to determine which path to take before it can move on, which introduces delays. To address this, the article proposes value speculation. By reasoning carefully about how the CPU handles the data, the author suggests guessing the values that will be accessed during the loop and letting the branch predictor determine whether the guess was correct. In the linked-list problem above, the program guesses the next node by incrementing the current pointer and then checks whether the guess matches the pointer actually loaded from memory; if the guess is wrong, the pointer is corrected. Because the list nodes sit next to each other in memory, the guess is almost always right, which removes the data dependency that was stalling the pipeline and lets the CPU run the computation much faster. The technique needs careful handling, however, to stay compatible with different compilers: a compiler will sometimes see the guessing machinery as redundant and optimize it away, undoing the intended speed-up, so the author provides several variants of the solution to convince the compiler to emit the desired machine code. Testing various combinations of data sizes and implementations shows significant improvements, with the 16 kB data set processed at up to 44.96 GB per second. For data sets too large to fit in the CPU caches, throughput is limited to roughly 15 GB per second, which is still impressively fast. Overall, the article demonstrates a clever technique for making programs more efficient by paying close attention to the underlying hardware and to the constraints imposed by the software toolchain.

The commenters discuss how traversing arrays and linked lists works in C and C++. When a pointer to an array element is incremented, it points to the next object of that type in memory, so incrementing the pointer is an easy way to iterate over an array. They note that adding 1 to a pointer works the same way in a linked-list context when the next list node happens to be the next array element, i.e. when the nodes share a contiguous block of memory. Modern CPUs use pipelining to execute multiple instructions at the same time; bottlenecks appear when there are dependencies between instructions, causing so-called pipeline stalls, and memory loads often introduce these bottlenecks because loading from memory is usually slower than other operations. To overcome this, processors can execute instructions out of order, although, to keep the results well defined, they must appear to execute in order. This allows the CPU to speculatively start working on the next iteration of the loop before confirming whether the next node in memory really is the next node in the list. If the prediction matches reality, the CPU has successfully avoided a pipeline stall; if it fails, no results are committed and execution proceeds in the intended order. Some commenters find the technique elaborate and arguably unnecessary, since contiguous data can usually be processed efficiently with plain arrays or vectors; they suggest it mainly pays off in scenarios where a linked list is modified frequently and expensive operations need to be minimized. Others note that rearranging a linked list's memory layout into a single large allocation can improve performance thanks to better cache locality and less pointer chasing, and that a segmented array, where list elements are stored in chunks, can achieve similar benefits. In practice, such techniques may be overkill for typical applications, which are better served by simpler solutions such as unrolled linked lists or representing links as indices into a vector.

Original article

If we have a heuristic to guess some value cheaply, we can remove a data dependency in a tight loop using the branch predictor. This allows the CPU to run more instructions in parallel, increasing performance. If this explanation does not make much sense to you, keep reading to learn about some of the magic making your CPU fast!


Per Vognsen’s twitter feed is full of neat low-level curiosities, usually leveraging CPU features for some performance benefit.

Recently he tweeted about a trick that I had never heard of – value speculation. The trick exploits the branch predictor to guess values, enabling more instruction parallelism and therefore removing a bottleneck on the L1 cache. Note that the bottleneck is not due to L1 cache misses, but to L1 cache hits introducing unwanted data dependencies.

In this post I explain the machinery involved, including a primer on branch prediction and CPU caches, so that anybody with a passing knowledge of C and how code is executed on CPUs should be able to follow.

The code for the post is available here. All the numbers are from a Xeon E5-1650 v3, an Intel Haswell processor with L1 / L2 / L3 cache of 32kB, 256kB, and 15MB respectively. The code was compiled with clang -O3, and not with gcc, for reasons explained later.

Before starting, I’d like to stress that L1 cache hits are almost certainly not the bottleneck of your application! This is just a very neat trick that illuminates some CPU features, not a guide on how to improve the performance of your average piece of C code.

The setup – summing linked lists #

We have a simple linked list data type, and a function summing all the elements of a given linked list:

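A minimal sketch of what such a data type and summing function can look like (the field names and exact signature are assumptions rather than the post's verbatim code):

#include <stdint.h>

// A list node: a 64-bit value plus a pointer to the next node.
typedef struct node {
    uint64_t value;
    struct node *next;
} node;

// sum1: the straightforward version. Follow the next pointers and
// accumulate the values until we fall off the end of the list.
uint64_t sum1(const node *n) {
    uint64_t sum = 0;
    while (n) {
        sum += n->value;
        n = n->next;
    }
    return sum;
}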

Modern CPUs do not process instructions serially, but rather handle many at the same time. They read many instructions at once, break them down into stages, and then try to keep all their execution units busy with work from as many instructions as possible. For instance, modern Intel processors are designed for a throughput of 4 instructions per clock cycle, and AMD Zen processors for up to 5 or 6.

However, branches pose a challenge when we want to execute instructions in parallel. Let's go back to our function sum1 shown above.


Let’s focus on the loop body of sum1:

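As a rough annotation of what each iteration of that loop has to do (a sketch based on the C above, not the post's actual compiled listing):

while (n) {
    sum += n->value;   // load #1: needs the address held in n
    n    = n->next;    // load #2: produces the address used by the NEXT iteration
}
// Load #2 chains the iterations together: the address of every load depends on
// the result of the previous iteration's load of n->next. So even when every
// access hits the L1 cache (a load-to-use latency of about 4 cycles on Haswell),
// the loop cannot run faster than one L1 latency per element. That matches the
// roughly 4 cycles/elem measured for sum1 below, even though the CPU could
// otherwise sustain 4 instructions per cycle.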

We finally get to the trick. As discussed, we are stuck waiting on reading what the next node address is. However, in our setup we allocate the list in a contiguous block of memory, and therefore the nodes are always next to each other.

So here’s the key idea: try to guess the next node by just bumping the previous value. If the guess is wrong, set the node to the “real” next value. In C, this is roughly how it would look:

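A sketch of that idea in C follows; the function name is made up here, and the post's own listing may differ in details:

// Value speculation: guess that the next node sits right after the current one
// in memory, and fall back to the loaded pointer only when the guess is wrong.
uint64_t sum_speculated(const node *n) {
    uint64_t sum = 0;
    while (n) {
        sum += n->value;
        const node *next  = n->next;  // the "real" next node (a load)
        const node *guess = n + 1;    // our guess: the adjacent node in memory
        if (guess == next) {
            // Guess was right: `guess` was computed without waiting for the
            // load, and the branch predictor quickly learns that this branch
            // is almost always taken.
            n = guess;
        } else {
            // Guess was wrong: resynchronize with the value actually loaded.
            n = next;
        }
    }
    return sum;
}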

There is a catch with the C code above for value speculation, though: a compiler is allowed to notice that, on the path where the guess is correct, the guessed pointer and the loaded pointer are equal, and to collapse the whole dance back into a plain node = node->next, silently removing the speculation. This is why the benchmarks were compiled with clang rather than gcc, and why the faster variants (sum2 and sum3 in the numbers below) resort to small amounts of inline assembly to keep the guessed pointer opaque to the compiler.

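One common way to keep a compiler from collapsing such a guess, shown purely as an illustration and not necessarily what the post's sum2 does, is an empty inline-asm statement that makes the guessed pointer opaque to the optimizer:

// The empty asm with a "+r" constraint tells the compiler that the register
// holding `p` may have changed, so it can no longer prove that the guessed
// pointer equals the loaded one and cannot delete the speculation.
static inline const node *launder(const node *p) {
    __asm__("" : "+r"(p));
    return p;
}

With n = launder(guess); in the taken branch, the guess should survive optimization, at the cost of the compiler treating the two pointers as unrelated values.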

These are the final numbers for sum1, sum2, and sum3:

16kB, 10000 iterations
  sum1:  8465052097154389858,  1.12us,  14.25GB/s,  3.91 cycles/elem,  1.03 instrs/cycle,  3.48GHz,  4.01 instrs/elem
  sum2:  8465052097154389858,  0.57us,  27.97GB/s,  1.99 cycles/elem,  5.02 instrs/cycle,  3.48GHz, 10.01 instrs/elem
  sum3:  8465052097154389858,  0.36us,  44.96GB/s,  1.24 cycles/elem,  4.05 instrs/cycle,  3.48GHz,  5.01 instrs/elem
128kB, 10000 iterations
  sum1:  6947699366156898439,  9.05us,  14.14GB/s,  3.95 cycles/elem,  1.01 instrs/cycle,  3.49GHz,  4.00 instrs/elem
  sum2:  6947699366156898439,  4.51us,  28.38GB/s,  1.97 cycles/elem,  5.09 instrs/cycle,  3.49GHz, 10.00 instrs/elem
  sum3:  6947699366156898439,  3.79us,  33.80GB/s,  1.65 cycles/elem,  3.03 instrs/cycle,  3.49GHz,  5.00 instrs/elem
5000kB, 100 iterations
  sum1:  2134986631019855758,  0.35ms,  14.09GB/s,  3.95 cycles/elem,  1.01 instrs/cycle,  3.48GHz,  4.00 instrs/elem
  sum2:  2134986631019855758,  0.19ms,  26.27GB/s,  2.12 cycles/elem,  4.72 instrs/cycle,  3.48GHz, 10.00 instrs/elem
  sum3:  2134986631019855758,  0.17ms,  28.93GB/s,  1.93 cycles/elem,  2.60 instrs/cycle,  3.48GHz,  5.00 instrs/elem
4294MB, 1 iterations
  sum1: 15446485409674718527,  0.44 s,   9.66GB/s,  5.76 cycles/elem,  0.69 instrs/cycle,  3.48GHz,  4.00 instrs/elem
  sum2: 15446485409674718527,  0.33 s,  13.19GB/s,  4.22 cycles/elem,  2.37 instrs/cycle,  3.48GHz, 10.00 instrs/elem
  sum3: 15446485409674718527,  0.30 s,  14.20GB/s,  3.91 cycles/elem,  1.28 instrs/cycle,  3.47GHz,  5.00 instrs/elem

The numbers are provided by the Linux perf_event_open syscall.

The first three datasets are meant to fit in the L1 / L2 / L3 cache. In those cases, the improvements are very pronounced, and sum3 is crunching the data at around 4 instructions per cycle, which should be close to the limit on the processor I tested the code on. When the data does not fit in the cache, the bottleneck becomes filling it, and we process the data at roughly 15 GB/s.

I believe that this is as fast as one can go with “simple” single-threaded reading from RAM, and it’s consistent with data from sysbench:

$ sysbench memory --memory-block-size=1G --memory-oper=read --threads=1 run
...
102400.00 MiB transferred (15089.75 MiB/sec)
...

The RAM-reading speed could probably be improved using SIMD streaming instructions or by reading from multiple threads, although the implementation would be significantly more complicated.

And so we complete our journey into this low-level trick! If you want more of this, I can’t recommend Per’s account enough – figuring out how his tricks work has been very educational.

Thanks to Alexandru Scvortov, Niklas Hambüchen, Alex Appetiti, and Carter T Schonwald for reading drafts of this post. Niklas also clarified some details regarding RAM speeds, and suggested sysbench to measure single threaded RAM reading speed in particular. Also thanks to Per Vognsen and Jason Rohem for spotting a few typos, and to Rihalto for pointing out a better sum3 and some misleading wording.

Bonus track – a compiler friendly C version #

Alexander Monakov suggested a more robust C function which works well with both gcc and clang, performs as well as sum3, and does not resort to any assembly:
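As a sketch of how a pure-C formulation along those lines can work (a reconstruction of the idea rather than Monakov's exact code): the inner loop advances by the guess, n + 1, and only resynchronizes with the loaded pointer when the guess stops matching.

uint64_t sum_guessing(const node *n) {
    uint64_t sum = 0;
    while (n) {
        const node *next;
        do {
            sum += n->value;
            next = n->next;   // the load still happens every iteration...
            n = n + 1;        // ...but the next address does not wait for it
        } while (n == next);  // stay on the fast path while the guess holds
        n = next;             // guess failed (or list ended): take the real next
    }
    return sum;
}

The intent of this shape is that there is no redundant guess-versus-loaded-pointer branch for the optimizer to collapse, so the speculation can survive without any inline assembly.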