tolower() with AVX-512

Original link: https://dotat.at/@/2024-07-28-tolower-avx512.html

In this article, the author discusses using SIMD instructions for fast string handling. The inspiration came from Olivier Giniaux's article on unsafe reads "beyond death" for a fast hash function written in Rust, which tackles the problem of handling small strings efficiently. The author notes a long-standing frustration with moving short strings between memory and vector registers. On further reading, however, they found that some SIMD instruction sets do provide masked loads and stores with byte granularity, which are well suited to string processing. These include ARM SVE, available on recent big ARM Neoverse cores (such as Amazon Graviton) but not on Apple Silicon, and AVX-512-BW, available on recent AMD Zen processors; AVX-512 is a sprawling collection of extensions whose availability varies, and support on Intel systems is particularly inconsistent. Having an AMD Zen 4 machine, the author decided to explore AVX-512-BW and wrote a basic tolower() function that processes 64 bytes at a time, using the * wildcard in the Intel intrinsics guide's search box (for example "mm512*") to locate the relevant functions. The implementation loads registers with 64 copies of the bytes 'A' and 'Z' and of the number to add to convert upper case to lower case, compares the input characters against 'A' and 'Z', combines the resulting masks, and performs a masked add. Follow-up steps wrap the kernel in convenient functions, such as copying a string while converting it to lower case, using unaligned vector loads and stores for the bulk of long strings and masked unaligned loads and stores for small strings and the tails of long strings. The author concludes that AVX-512-BW shows almost none of the performance troughs of the other methods tested, making it very efficient for short strings, hopes that AVX-512-BW and SVE become more widely available, and expects overall string handling performance to improve as a result. The full source code for the project is available in the author's Git repository.

The accompanying discussion covers x86 assembly code that uses AVX-512 instructions to convert small and large ASCII strings to upper or lower case. It explains how two different compilers, GCC and Clang, produce different results from the same code: GCC generates simpler but less efficient code that processes one string at a time, achieving roughly 32 bytes per cycle on an Ice Lake processor, while Clang generates more complex but more efficient code that processes several strings at once, at roughly 42.67 bytes per clock cycle. It also mentions an equivalent C# implementation that uses vector (SSE) instructions but is limited to vectors of at most 256 bits, which constrains its performance on current CPUs such as Zen 3 and Zen 4, and it notes some performance differences between the C# version and the assembly code, particularly for shorter strings. Finally, it touches on licensing, asking what licence the published material is under.

Original article

A couple of years ago I wrote about tolower() in bulk at speed using SWAR tricks. A couple of days ago I was intrigued by Olivier Giniaux’s article about unsafe read beyond of death, an optimization for handling small strings with SIMD instructions, for a fast hash function written in Rust.

I’ve long been annoyed that SIMD instructions can easily eat short strings whole, but it’s irritatingly difficult to transfer short strings between memory and vector registers. Oliver’s post caught my eye because it seemed like a fun way to avoid the problem, at least for loads. (Stores remain awkward!)

Actually, to be frank, Olivier nerdsniped me.

Reading more around the topic, I learned that some SIMD instruction sets do, in fact, have useful masked loads and stores that are suitable for string processing, that is, they have byte granularity. They are:

  • ARM SVE, which is available on recent big-ARM Neoverse cores, such as Amazon Graviton, but not Apple Silicon.

  • AVX-512-BW, the bytes and words extension, which is available on recent AMD Zen processors. AVX-512 is a complicated mess of extensions that might or might not be available; support on Intel is particularly random.

I have an AMD Zen 4 box, so I thought I would try a little AVX-512-BW.
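
Because availability is so patchy, code that has to run anywhere needs to pick this path at run time; with GCC and Clang the check can be as small as this sketch:

    #include <stdbool.h>

    // Pick the AVX-512-BW path at run time; __builtin_cpu_supports()
    // is a GCC/Clang extension on x86.
    bool have_avx512bw(void) {
        return __builtin_cpu_supports("avx512bw");
    }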

Using the Intel intrinsics guide I wrote a basic tolower() function that can munch 64 bytes at once.

Top tip: You can use * as a wildcard in the search box, so I made heavy use of mm512*epi8 to find byte-wise AVX-512 functions (epi8 is an obscure alias for byte).

First, we fill a few registers with 64 copies of some handy bytes.

We need the letters A and Z:

    __m512i A = _mm512_set1_epi8('A');
    __m512i Z = _mm512_set1_epi8('Z');

We need a number to add to uppercase letters to make them lowercase:

    __m512i to_lower = _mm512_set1_epi8('a' - 'A');

We compare our input characters c with A and Z. The result of each comparison is a 64 bit mask which has bits set for the bytes where the comparison is true:

    __mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
    __mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);

If it’s greater than or equal to A, and less than or equal to Z, then it is upper case. (AVX mask registers have names beginning with k.)

    __mmask64 is_upper = _kand_mask64(ge_A, le_Z);

Finally, we do a masked add. We pass c twice: bytes from the first c are copied to the result when is_upper is false, and when is_upper is true the result is c + to_lower.

    return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
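
Put together, the whole 64-byte kernel looks something like this (assembled from the snippets above):

    #include <immintrin.h>

    // Convert 64 ASCII bytes to lower case in one go,
    // assembled from the snippets above.
    __m512i tolower64(__m512i c) {
        __m512i A = _mm512_set1_epi8('A');
        __m512i Z = _mm512_set1_epi8('Z');
        __m512i to_lower = _mm512_set1_epi8('a' - 'A');
        __mmask64 ge_A = _mm512_cmpge_epi8_mask(c, A);
        __mmask64 le_Z = _mm512_cmple_epi8_mask(c, Z);
        __mmask64 is_upper = _kand_mask64(ge_A, le_Z);
        return _mm512_mask_add_epi8(c, is_upper, c, to_lower);
    }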

The tolower64() kernel in the previous section needs to be wrapped up in more convenient functions such as copying a string while converting it to lower case.

For long strings, the bulk of the work uses unaligned vector load and store instructions:

	__m512i src_vec = _mm512_loadu_epi8(src_ptr);
	__m512i dst_vec = tolower64(src_vec);
	_mm512_storeu_epi8(dst_ptr, dst_vec);

Small strings and the stub end of long strings use masked unaligned loads and stores.

This is the magic! Here is the reason I wrote this blog post!

The mask has its lowest len bits set (its first len bits in little-endian order). I wrote these two lines with perhaps more ceremony than required, but I thought it was helpful to indicate that the mask is not any old 64 bit integer: it has to be loaded into one of the SIMD unit’s mask registers.

	uint64_t len_bits = (~0ULL) >> (64 - len);
	__mmask64 len_mask =  _cvtu64_mask64(len_bits);

The load and store look fairly similar to the full-width versions, but with the mask stuff added. The z in maskz means zero the destination register when the mask is clear, as opposed to copying from another register (like in mask_add above).

	__m512i src_vec = _mm512_maskz_loadu_epi8(len_mask, src_ptr);
	__m512i dst_vec = tolower64(src_vec);
	_mm512_mask_storeu_epi8(dst_ptr, len_mask, dst_vec);

That’s the essence of it: you can see the complete version of copytolower64() in my git repository.

To see how well it works, I benchmarked several similar functions. Here’s a chart of the results, compiled with Clang 16 on Debian 11, and run on an AMD Ryzen 9 7950X.

The benchmark measures the time to copy about 1 MiByte, in chunks of various lengths from 1 byte to 1 kilobyte. I wanted to take into account differences in alignment in the source and destination strings, so there are a few bytes between each source and destination string, which are not counted as part of the megabyte.

On this CPU the L2 cache is 1 MiB per core, so I expect each run of the test spills into the L3 cache.

To be sure I was measuring what I thought I was, I compiled each function separately to avoid interference from inlining and code motion. In real code it’s more likely that you would want to encourage inlining, not prevent it!

[Chart: benchmark results]

  • The pink tolower64 line is the code described in this blog post. It is consistently near the fastest of all the functions under test. (It drops a little at 65 bytes long, where it spills into a second vector.)

    The interesting feature of the line for my purposes is that it rises fast and lacks deep troughs, showing that the masked loads and stores were effective at handling small string fragments quickly.

  • The green copybytes64 line is a version of memcpy using AVX-512 in a similar manner to tolower64 (there is a sketch after this list). It is (maybe surprisingly) not much faster. I had to compile copybytes64 with Clang 11 because more recent versions are able to recognise what the function does and rewrite it completely.

  • The orange copybytes1 line is a byte-by-byte version of memcpy again compiled using Clang 11. It illustrates that Clang 11 had relatively poor autovectorizer heuristics and was pretty bad for string fragments less than 256 bytes long.

  • The very slow red tolower line calls the standard tolower() from <ctype.h> to provide a baseline.

  • The purple tolower1 line is a simple byte-by-byte version of tolower() compiled with Clang 16. It shows that Clang 16 has a much better autovectorizer than Clang 11, but it is slower and much more complicated than my hand-written version. It is very spiky because the autovectorizer did not handle short string fragments as well as tolower64 does.

  • The brown tolower8 line is the SWAR tolower() from my previous blog post. Clang valiantly tries to autovectorize it, but the result is not great because the function is too complicated. (It has the Clang-11-style 256-byte performance cliffs despite being compiled with Clang 16.)

  • The blue memcpy line calls glibc’s memcpy. There’s something curious going on here: it starts off fast but drops off to about half the speed of copybytes64. Dunno why!
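
For illustration, copybytes64 is essentially the copytolower64 sketch from earlier with the conversion step removed; roughly:

    #include <stddef.h>
    #include <immintrin.h>

    // A rough sketch of copybytes64(): the same wrapper loop as
    // copytolower64() with the tolower64() step removed.
    void copybytes64(char *dst, const char *src, size_t len) {
        while (len >= 64) {
            _mm512_storeu_epi8(dst, _mm512_loadu_epi8(src));
            src += 64; dst += 64; len -= 64;
        }
        if (len == 0) return;
        __mmask64 len_mask = _cvtu64_mask64((~0ULL) >> (64 - len));
        _mm512_mask_storeu_epi8(dst, len_mask,
                                _mm512_maskz_loadu_epi8(len_mask, src));
    }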

So, AVX-512-BW is very nice indeed for working with strings, especially short strings. On Zen 4 it’s very fast, and the intrinsic functions are reasonably easy to use.

The most notable thing is AVX-512-BW’s smooth performance: there’s very little sign of the performance troughs that the autovectorizer suffers from as it shifts to scalar code for small string fragments.

I don’t have convenient access to an ARM box with SVE support, so I have not investigated it in detail. It’ll be interesting to see how well SVE works for short strings.

I would like both of these instruction set extensions to be much more widely available. They should improve the performance of string handling tremendously.

The code for this blog post is available from my web site.


Thanks to LelouBil on Hacker News for pointing out a variable was named backwards. Ooops!
