SIMD City: Auto-Vectorisation

Original link: https://xania.org/202512/20-simd-city

## Auto-Vectorisation: Unleashing the Power of SIMD

Modern CPUs use SIMD ("Single Instruction, Multiple Data") to perform the same operation on multiple data points at once, greatly accelerating array-based computation. Compilers can now *auto-vectorise* code to exploit this, but they need some guidance. Effective auto-vectorisation relies on well-structured data: contiguous arrays are ideal, and separating data fields into distinct arrays also helps. Initially, a simple element-wise maximum over two arrays gains nothing from vectorisation. Raising the optimisation level to -O3 and specifying a target CPU architecture (-march=skylake), however, lets the compiler use vector instructions to process 8 integers at a time. This involves clever techniques such as a "mask move" to handle conditional updates; rewriting the code to update values unconditionally yields further gains, letting the compiler use the built-in vector "max" instruction. A final wrinkle is potential data overlap between the arrays, which can prevent vectorisation; the compiler adds a check for this and falls back to a slower element-by-element loop when necessary. Ultimately, with the right flags and data layout, the compiler can turn simple loops into powerful batch-processing code.


Written by me, proof-read by an LLM.
Details at end.

It’s time to look at one of the most sophisticated optimisations compilers can do: autovectorisation. Most “big data” style problems boil down to “do this maths to huge arrays”, and the limiting factor isn’t the maths itself, but the feeding of instructions to the CPU, along with the data it needs.

To help with this problem, CPU designers came up with SIMD: “Single Instruction, Multiple Data”. One instruction tells the CPU what to do with a whole chunk of data. These chunks could be 2, 4, 8, 16 or similar units of integers or floating point values, all treated individually. Initially the only way to use this capability was to write assembly language directly, but luckily for us, compilers are now able to help.

To take advantage of SIMD we need to ensure our data is laid out in a nice long line, like an array, or vector. It’s also much better to store different “fields” of the data in separate arrays. We’re going to start with some integer maths - let’s update an array x to be element-wise the max of x and y:
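The source listing has not survived in this copy; here is a plausible reconstruction of the loop being compiled, inferred from the assembly below (the function name `max_into` is my invention):

```c
#include <stddef.h>

// Hypothetical reconstruction: update x in place to the element-wise
// max of x and y, one element at a time, with a conditional store.
void max_into(int *x, const int *y) {
    for (size_t i = 0; i < 65536; ++i) {
        if (y[i] > x[i]) {
            x[i] = y[i];
        }
    }
}
```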

In this -O2-compiled code, we don’t see any vectorisation, just the nice tight loop code we’ve come to expect:

.L3:
  mov edx, DWORD PTR [rsi+rax*4]    ; read y[i]
  cmp edx, DWORD PTR [rdi+rax*4]    ; compare with x[i]
  jle .L2                           ; if y[i] <= x[i], skip the store
  mov DWORD PTR [rdi+rax*4], edx    ; x[i] = y[i]
.L2:
  add rax, 1                        ; ++i
  cmp rax, 65536                    ; are we at 65536?
  jne .L3                           ; keep looping if not

To ramp up to the next level, we’ll need to add two flags: first we need to turn up the optimiser to -O3, and then we need to tell the compiler to target a CPU that has the appropriate SIMD instruction with something like -march=skylake.

Our inner loop has gotten a little more complex, but spot the cool part:

.L4:
  ; Reads 8 integers into ymm1, y[i..i+7]
  vmovdqu ymm1, YMMWORD PTR [rsi+rax]
  ; Compares that with 8 integers x[i..i+7]
  vpcmpgtd ymm0, ymm1, YMMWORD PTR [rdi+rax]
  ; Were all 8 values less or equal?
  vptest ymm0, ymm0
  ; if so, skip the write
  je .L3
  ; mask move x[i..i+7] with the y values that were greater
  vpmaskmovd YMMWORD PTR [rdi+rax], ymm0, ymm1
.L3:
  add rax, 32           ; move forward 32 bytes
  cmp rax, 262144       ; are we at the end? (65536 ints * 4 bytes)
  jne .L4               ; keep looping if not

With a very similar number of instructions we’re now processing 8 integers at a time! This is the power of SIMD - we’d likely be limited by the bandwidth to memory rather than the housekeeping of working out what to do with them.

There are a couple of things to talk about though. Firstly, what’s all this “mask move” stuff? Well, with SIMD we can act on “multiple data” but only with a “single instruction”. Our original code has a conditional update: not every array element is necessarily processed in the same way. The compiler has used a “mask move” to write back only the elements that compared “greater than”, which is pretty clever of it. We can help it further if we don’t mind unconditionally writing to memory:
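The rewritten source is also missing from this copy; a sketch of what the unconditional version plausibly looks like (the function name `max_into` is assumed):

```c
#include <stddef.h>

// Hypothetical sketch: always store, picking the larger value with a
// ternary, so every iteration does identical work regardless of the data.
void max_into(int *x, const int *y) {
    for (size_t i = 0; i < 65536; ++i) {
        x[i] = y[i] > x[i] ? y[i] : x[i];
    }
}
```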

By unconditionally writing to x[i] we can reduce the loop to:

.L3:
  ; Read 8 integers from x[i..i+7]
  vmovdqu ymm0, YMMWORD PTR [rdi+rax]
  ; "max" the integers with y[i..i+7]
  vpmaxsd ymm0, ymm0, YMMWORD PTR [rsi+rax]
  ; Store them back
  vmovdqu YMMWORD PTR [rdi+rax], ymm0
  add rax, 32       ; move forward 32 bytes
  cmp rax, 262144   ; are we at the end?
  jne .L3           ; keep looping if not

The compiler has spotted our ternary operator is actually exactly the same as the built-in vector “max” instruction! No explicit comparisons or branches now - fantastic.

There’s one last thing to cover - the eagle-eyed amongst you might have spotted that there are actually two loops, one using vector instructions, and one doing regular one-at-a-time operations. The regular loop is jumped to conditionally after this check:

  lea rax, [rdi-4]      ; rax = x - 4
  sub rax, rsi          ; rax = x - 4 - y
  cmp rax, 24           ; compare rax with 24
  mov eax, 0            ; (set up loop counter for below)
  jbe .L2               ; if rax <= 24, jump to the slow case
; falls through to the vectorised loop here

So what’s going on here? We’re checking x and y - not the values of the two arrays, but the addresses. That somewhat awkward sequence is essentially saying “do the two arrays overlap with each other in a way that would prevent us being able to vectorise”. If they overlap by 28 or more bytes, then we can’t read and write in 8 int (32-byte) chunks: working in batches like this would give different results as the overlap means writing to x[i] might affect y[i+2] (or similar). So, the compiler has to add this check and generate a fall-back case that does one at a time.
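The hazard is worth seeing concretely. This demonstration (my own, not from the article) contrasts an element-at-a-time loop with a pretend 4-wide “vector” loop that batches its reads and writes; on overlapping buffers the two disagree:

```c
#include <string.h>

// One element at a time: each write to x is visible to later reads of y
// when the buffers overlap.
static void max_scalar(int *x, const int *y, int n) {
    for (int i = 0; i < n; ++i)
        if (y[i] > x[i]) x[i] = y[i];
}

// Pretend 4-wide "vector" version: read a whole chunk of y and x before
// writing anything back. Assumes n is a multiple of 4.
static void max_chunked(int *x, const int *y, int n) {
    for (int i = 0; i < n; i += 4) {
        int yc[4], xc[4];
        memcpy(yc, y + i, sizeof yc);   // batch read of y[i..i+3]
        memcpy(xc, x + i, sizeof xc);   // batch read of x[i..i+3]
        for (int j = 0; j < 4; ++j)
            if (yc[j] > xc[j]) xc[j] = yc[j];
        memcpy(x + i, xc, sizeof xc);   // batch write of x[i..i+3]
    }
}
```

With y = buf and x = buf + 1, the scalar loop cascades buf[0] along the whole array, while the chunked loop has already read all its y values before any write lands - hence the compiler's runtime overlap check and scalar fallback.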

With some non-standard trickery we tell the compiler to ignore vector dependencies for the loop:
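The snippet itself is missing from this copy; GCC's `#pragma GCC ivdep` is one such piece of non-standard trickery, asserting that the loop has no loop-carried dependencies (a sketch; the function name is assumed):

```c
#include <stddef.h>

void max_into(int *x, const int *y) {
    // Non-standard: promise GCC there are no vector dependencies, so it
    // can drop the overlap check and the scalar fallback loop. If the
    // arrays actually overlap, the results are our problem, not GCC's.
    #pragma GCC ivdep
    for (size_t i = 0; i < 65536; ++i) {
        x[i] = y[i] > x[i] ? y[i] : x[i];
    }
}
```

The standard-C alternative is to declare the pointer parameters `restrict`, which makes the same no-aliasing promise through the type system.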

That’s it for today: with the right flags and a little care about data layout, the compiler can turn your one-at-a-time loops into batch-processing powerhouses! Tomorrow we’ll look at floating point vectorisation and its own peculiar quirks.

See the video that accompanies this post.


This post is day 20 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.


This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
