I/O is no longer the bottleneck? (2022)

Original link: https://stoppels.ch/2022/11/27/io-is-no-longer-the-bottleneck.html

## I/O vs. CPU in Typical Interview Problems

A recent discussion sparked by Ben Hoyt's blog post argues that, contrary to the common assumption, I/O is *not* the bottleneck in typical programming interview problems such as counting word frequencies. Sequential read speed has improved enormously (1.6-12.8 GB/s in the tests here), while CPU speed has stagnated.

The author tried to make a C word-frequency counter keep up with those read speeds, without success. The optimized code reached only 278 MB/s, held back by branches that prevented compiler vectorization. Removing the branches raised this to 330 MB/s, still far below disk speed. Even the simpler task of counting words with `wc -w` performed poorly, at 245.2 MB/s, due to its different definition of whitespace.

The core problem seems to be the difficulty of turning scalar, branchy code into efficient vectorized instructions (such as AVX2). A hand-optimized, AVX2-vectorized word counter reached 1.45 GB/s, still only ~11% of the peak disk read speed. This suggests that disk speed really has improved dramatically, but unlocking that potential requires overcoming CPU-side limitations through effective compiler optimization or manual vectorization.

The author's code is available on GitHub for further exploration and optimization.

## I/O Is No Longer the Bottleneck? (2022) - Summary

A recent blog post ([stoppels.ch](https://stoppels.ch/2022/11/30/io-is-no-longer-the-bottleneck/)) sparked a discussion challenging the conventional view that I/O is the main limit on modern CPU performance. The core argument is that CPU performance is increasingly limited by the rate at which data can be fed to a single core, roughly 6 GB/s on x86 and roughly 20 GB/s on Apple M-series chips, which is essentially the speed of `memcpy()`.

While advertised memory bandwidth may be higher, that figure is the *total* bandwidth across all cores, not per-core performance. This limit affects parsing and serialization formats such as JSON and Protobuf, which require full parsing before the data can be accessed. Zero-copy formats, however, can sidestep the bottleneck by skipping data that is not needed.

The author highlights their Lite³ serialization format, claiming significant performance gains from exploiting this principle (up to 120x faster than simdjson in some benchmarks). Further discussion suggests that real-world performance depends heavily on factors such as memory caching, CPU architecture, and the efficiency of DMA operations. The debate also touches on the impact of memory bandwidth on garbage collection, and on the potential for new architectures to further blur the line between memory and I/O.

Original article

Nov 27th, 2022

Recently Ben Hoyt published a blog post claiming that contrary to popular belief, I/O is not the bottleneck in typical programming interview problems such as counting word frequencies from a stream. Sequential read speed has come a long way, while CPU speed has stagnated.

Sequential reads are indeed incredibly fast. Using the same method as linked in Ben Hoyt's post, I'm getting 1.6 GB/s sequential reads on a cold cache, and 12.8 GB/s on a warm cache (best of five).
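For reference, here is a sketch of how such a number can be obtained (my own illustration; the actual measurement uses the method linked from Ben Hoyt's post): read the file sequentially in large blocks and divide the size by the elapsed time.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    enum { BLOCK = 1 << 20 };                /* read in 1 MiB blocks */
    char *buf = malloc(BLOCK);
    size_t total = 0, n;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = fread(buf, 1, BLOCK, f)) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%zu bytes in %.3f s = %.2f GB/s\n", total, secs, total / secs / 1e9);
    free(buf);
    return fclose(f) ? 1 : 0;
}

Dropping the page cache before a run gives the cold cache number; repeating the run immediately gives the warm cache number.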

But it should be possible to count word frequencies at a speed of 1.6 GB/s even on a single thread, right?

(For the impatient: code is available on GitHub.)

The optimized C implementation

Ben Hoyt's blog refers to an earlier post which includes a faster C version of the word frequency counter. I compiled optimized.c with GCC 12, using -O3 -march=native flags, and ran it on the 425MB input file (100 copies of the King James Version of the Bible).

The result was surprisingly bad:

$ time ./optimized < bible-100.txt > /dev/null

real    0m1.525s
user    0m1.477s
sys     0m0.048s

That is only 278 MB/s on a warm cache.

Vectorization

Looking at the code I realized one of the hot loops had many branches, including an early exit, which prevents the compiler from vectorizing:

for (; i < num_process; i++) {
    char c = buf[i];
    if (c <= ' ') {               // early exit: whitespace ends the word
        break;
    }
    if (c >= 'A' && c <= 'Z') {   // branchy ASCII lowercasing
        c += ('a' - 'A');
        buf[i] = c;
    }
    hash *= FNV_PRIME;            // FNV hash update
    hash ^= (uint64_t)c;
}

My initial attempt to improve performance was to move this lowercase logic out of the loop, like so:

for (int i = 0; i < BUF_SIZE; ++i) {
    buf[i] = buf[i] >= 'A' && buf[i] <= 'Z' ?  buf[i] - 'A' + 'a' : buf[i];
}

This simple change improved performance to 330 MB/s (using clang for better vectorization). Funnily enough, just adding these 3 lines before the loop, without deleting the original code, gives comparable speed; strictly more work, but branch prediction does its job. Still, it's about a factor of 5 away from cold cache sequential read speed.
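For illustration, here is roughly what that vectorized lowercase pass boils down to, written out as AVX2 intrinsics (a sketch of my own, assuming AVX2 and ASCII input; it is not code from either post):

#include <immintrin.h>

/* Sketch: ASCII-lowercase 32 bytes at once, roughly what the autovectorized
   branchless loop does per iteration (signed byte compares are fine here
   because 'A'..'Z' are all positive). */
static inline __m256i ascii_tolower32(__m256i c) {
    __m256i ge_A  = _mm256_cmpgt_epi8(c, _mm256_set1_epi8('A' - 1)); /* c >= 'A' */
    __m256i le_Z  = _mm256_cmpgt_epi8(_mm256_set1_epi8('Z' + 1), c); /* c <= 'Z' */
    __m256i upper = _mm256_and_si256(ge_A, le_Z);                    /* 0xFF where uppercase */
    __m256i delta = _mm256_and_si256(upper, _mm256_set1_epi8('a' - 'A'));
    return _mm256_add_epi8(c, delta);                                /* add 32 only to A..Z */
}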

Trying a simpler problem

At this point I thought it was unlikely I could squeeze a lot of performance out of the word frequency counter. Sure, there are cache misses in the hash map, so maybe it could be optimized for better cache locality of common words. Or potentially short words can benefit from perfect hashing on the stack. But what will that give? Another 20%?

Instead, let's look at an informative baseline. Just count words without keeping track of frequencies; no tedious hash maps.
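A minimal scalar baseline might look like the sketch below (my own illustration, not code from either post), treating any byte less than or equal to ' ' as a separator, just like the loop shown earlier:

#include <stdio.h>

/* Sketch: count words by counting separator-to-word transitions. */
int main(void) {
    static char buf[1 << 16];
    size_t n;
    long words = 0;
    int in_word = 0;
    while ((n = fread(buf, 1, sizeof buf, stdin)) > 0) {
        for (size_t i = 0; i < n; i++) {
            int is_sep = (unsigned char)buf[i] <= ' ';
            words += !is_sep && !in_word;  /* a new word starts here */
            in_word = !is_sep;
        }
    }
    printf("%ld\n", words);
    return 0;
}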

In fact there's a program for that: wc -w. Such a single-purpose tool must be fast, right?

$ time wc -w < bible-100.txt > /dev/null 

real    0m1.758s
user    0m1.718s
sys     0m0.040s

Unexpectedly, the performance is terrible... 245.2 MB/s. Why? Well, the man page says it's doing a different thing: Ben Hoyt's code only splits on the ' ' space character, whereas wc splits on ' ', '\n', '\t', ... and even locale-specific characters.

How fast can word count be?

If the premise is that disk speed has caught up in the last decade, we should really be using new CPU features from that period. And that basically means: vectorize all the things. AVX2 is almost a decade old already. AVX-512 was available for the common people in 2017, but I'm on znver2, so I'll stick to AVX2.

Unfortunately, the compiler has a hard time autovectorizing word count. Maybe that proves the point of Ben Hoyt: disks got orders of magnitude faster "for free", but modern compilers don't magically generate machine code orders of magnitude faster. It's just difficult to translate branchy scalar programs into vectorized machine code.

Masks

Part of word count can trivially be auto-vectorized: suppose for simplicity we have a register size of 128 bits, in which we can store 16 consecutive characters. It's easy to locate the whitespace by broadcasting the whitespace character into a register ahead of time, and then doing a single VPCMPEQB comparison operation:

       | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 |
input: | h o w   m a n y   w o r d s   a | r e   h e r e ?
 mask: | 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 |

But after getting a mask in a vector register, how do you go about and count words? The only thing I could think of is using a Move Byte Mask trick I've seen in Cosmopolitan libc's strlen implementation. The relevant instruction PMOVMSKB moves the long bit mask into a 32 bit int, and then you do your usual bit tricks.
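Concretely, the broadcast, compare, and move-mask steps look like this with intrinsics (a sketch using 32-byte AVX2 chunks and only the ' ' character; the real program has to check several separators):

#include <immintrin.h>
#include <stdint.h>

/* Sketch: one bit per input byte, set where the byte equals ' '
   (VPBROADCASTB + VPCMPEQB + VPMOVMSKB). */
static inline uint32_t space_mask32(const char *p) {
    __m256i chunk = _mm256_loadu_si256((const __m256i *)p); /* 32 input bytes   */
    __m256i space = _mm256_set1_epi8(' ');                  /* broadcast ' '    */
    __m256i eq    = _mm256_cmpeq_epi8(chunk, space);        /* 0xFF where equal */
    return (uint32_t)_mm256_movemask_epi8(eq);              /* top bit per byte */
}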

Bit tricks

What are the usual bit tricks? One great candidate is Find First Set or ffs in short — this is a great name given how tedious it is to get bit tricks right. This instruction can be used to iterate over set bits, like so:

#include <stdio.h>

int main() {
    int mask = 0b0100000100001000; // bits 3, 8, 14 set: the whitespace positions from the mask above
    int prev = 0;
    while (mask) {
        int curr = __builtin_ffs(mask); // 1-based index of the lowest set bit (0 if none)
        if (curr > prev + 1)            // a gap means non-whitespace bytes in between
            printf("Word start at %d\n", curr);
        prev = curr;
        if (curr == 32) // don't ask, sigh.
            break;
        mask = (mask >> curr) << curr;  // clear bits up to and including curr
    }
}

It outputs the following, corresponding to the mask example above:

Word start at 4
Word start at 9
Word start at 15

Putting it together

I ended up writing this code explicitly using immintrin.h which is an absolutely dreadful experience. Next time I'll use a high-level API (in the past I've had a lot of fun vectorizing things interactively in the Julia REPL with VectorizationBase.jl). But at least I felt like I had some control over the generated machine code.

Using AVX2 with 256-bit registers, you need to align your data on 32 bytes, which I did (of course not without messing it up first). I reserved 6 of the registers to hold all broadcasted whitespace characters. Then I explicitly unrolled the vectorized loop 4 times, so in every iteration we process 128 bytes of data.
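For the counting step itself, one way to consume each chunk's 32-bit whitespace mask without looping over its bits (an alternative bit trick, not necessarily what wc-avx2 does) is to popcount the whitespace-to-word transitions:

#include <stdint.h>

/* Sketch: count word starts in one 32-byte chunk. ws_mask has 1-bits at
   whitespace positions; a word starts at any non-whitespace byte whose
   predecessor (possibly in the previous chunk) was whitespace. */
static inline int count_words_in_chunk(uint32_t ws_mask, int *prev_was_ws) {
    uint32_t pred_ws = (ws_mask << 1) | (uint32_t)*prev_was_ws; /* whitespace at i-1 */
    uint32_t starts  = ~ws_mask & pred_ws;                      /* word-start positions */
    *prev_was_ws     = (ws_mask >> 31) & 1;                     /* carry to next chunk */
    return __builtin_popcount(starts);
}

The carry handles words that straddle a chunk boundary; initializing it to 1 makes a word at the very start of the input count as well.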

It took an awful lot of time to fix the off-by-one bugs, but in the end I managed to get a working program, tested against a non-zero number of text files on my computer:

$ ./wc-avx2 < bible-100.txt 
82113300
$ wc -w < bible-100.txt 
82113300

So, how fast?!

$ time ./wc-avx2 < bible-100.txt 
82113300

real    0m0.227s
user    0m0.186s
sys     0m0.041s

That comes down to 1.45 GB/s (on a warm cache).

Sigh. So hand-optimized, single-threaded word count is only getting about 11% of the sequential disk read speed. And on a cold cache?

$ sysctl -w vm.drop_caches=3
$ time ./wc-avx2 < bible-100.txt
82113300

real    0m0.395s
user    0m0.196s
sys     0m0.117s

Still more time in user than sys :(. So yeah, maybe the claim that disk speed has caught up is indeed true.

Get the code

I've put my code up on GitHub. If you know better bit-tricks, feel free to submit a pull request.
