Bytes before FLOPS: your algorithm is (mostly) fine, your data isn't

Original link: https://www.bitsdraumar.is/bytes-before-flops/

## Data-Oriented Optimization: Summary

This article lays out a practical approach to performance optimization built around a data-first mindset. Drawing on the author's experience with compilers and vectorized code, it argues for understanding *how* data flows through a program before reaching for algorithmic improvements. The core process has five steps:

1. **Profile:** identify the bottleneck - it is often not where you first expect it.
2. **Specialize the algorithm:** go beyond generic implementations by tailoring the algorithm to the specific data being processed - "do less".
3. **Make it cache friendly:** optimize data access patterns for temporal and spatial locality, possibly using streaming stores for data that won't be touched again soon, and restructure data layouts for better cache utilization.
4. **Make it SIMD:** use single-instruction-multiple-data instructions for in-core parallelism, potentially via compiler autovectorization.
5. **Make it parallel:** spread independent streams of data across multiple cores, while watching out for false sharing.

The author stresses that improperly granulated optimization is more harmful than premature optimization. Systems languages (C/C++/Rust) offer the most control, but the principles apply to every language, with varying payoff. Ultimately, performance comes from managing and transforming data effectively - all programming is, at bottom, data manipulation.

## Bytes before FLOPS: Performance Optimization Summary

A recent Hacker News discussion centered on a blog post arguing that, for performance, algorithmic efficiency (FLOPS) usually matters less than efficient data handling (bytes): many performance problems stem from suboptimal memory access patterns rather than slow algorithms. In C#, optimizations such as `Vector` and `Span` can noticeably improve performance because they exploit efficient memory layout. While the initial gains come from general heuristics (cache sizes, sequential access), squeezing out the last of the performance requires tuning code to the specific CPU's cache characteristics - a level of optimization that, the author notes, is usually unnecessary, since fixing basic memory inefficiency tends to deliver the biggest wins. Interestingly, a "flat" profile (time spread evenly) can signal a harder optimization problem, pointing to something fundamental such as inefficient memory management rather than a clear bottleneck, in contrast to a "spiky" profile, which offers a direct path to improvement.

Original article

Written by David Miličević

A small, but deep dive into performance and data-oriented optimization.

Foreword

So, earlier this year I did a presentation at a cool (get it?) little Norwegian company called Airthings.

It was a two-parter. The first part was about how the compilation pipeline works - from ASCII to assembly, greatly simplified of course. It came from an old presentation I did many years ago. The second part is the interesting one. It's about performance and some of the low-hanging fruit you can pick to speed things up as much as possible.

Now, that part actually takes a while, but in short it's basically a distillation of my experience working on compilers, assembly and writing vectorized code. The gist of it is - data.

It occurred to me that a small summary would make a good blog post. There's plenty of material on data-oriented design and performance, but this might be a nice overview.

I'll style it as a written representation of how I approach optimization.

When I approach an optimization effort, the first thing I do is try to visualize the data: where each bit is coming from and going to, what operations are being done on it, what the difference is between the initial and final state, and whether we can get there with fewer or simpler steps.

If the data is coming in in a suboptimal form, transform it.
There is no algorithm in the world that will save you from inefficiently coded information.

Usually the steps, in my experience, go like this:

1. Profile

Before anything else, profile.

Sometimes you can intuit where the bottleneck is, but most of the time, you would be surprised at the real place the algorithm falls apart. Sometimes it won't be a single bottleneck, maybe it'll be spread out over multiple places (worst case scenario being the flat profile where program time is roughly evenly distributed).

A lot of it depends not only on understanding low-level programming but also on the domain you're working in.
There's only so much you can optimize without enough information on what that specific algorithm/code/subsystem is supposed to be doing, rather than what it is actually (slowly) doing.

Also, once the lowest-hanging fruit has been picked, we're not talking just function-level but instruction-level profiling as well, where, depending on the machine, you can be frontend-bound, backend-bound, memory-bound... a whole new world opens up.
See VTune metrics and analysis guides, lots of useful stuff in there.

2. Specialize the algorithm

This step is not directly related to data, but changing data often requires changing the algorithm. Not only do we need to change access patterns and the shape of the data, we often need a different algorithm, one more suited to the new form of the data we're processing.

Regardless, this is a crucial part, as this is where you apply one of the most important principles of optimization - "do less". Usually when we implement algorithms, we just try to make them work, a.k.a. the "make it work -> make it fast -> make it beautiful" paradigm. In the process of development we change the data and the algorithm, the code goes through many iterations, and a lot is left on the table.

Specializing an algorithm means taking away the genericity and leaving in only the parts that are relevant to the data being processed. If that means specializing the data too, that's fair game as well - garbage in, garbage out. As part of development we tend to reach for existing algorithms, whether from a standard library, an external one, or even just a different module in the program.

These existing algorithms are by definition not bespoke and don't take into account the data or the manipulation done on it. In many cases you can rewrite them to suit your use-case, in others you can simply use a more specific algorithm, e.g. radixsort instead of quicksort.

People are often afraid of peeking into the libraries, thinking it's magic. It's not, just look at the code, learn what's being used and adapt it to your own case.

The specific example given in the presentation was specializing a generic image downscaling algorithm to handle integer-based, 32-bit ARGB images specifically, which was used to great effect. It sped up the implementation and allowed for further optimizations.

Small and (very) trivial example would be something like this:

// we're trying to parse a validated, 2-digit, positive integer
#include <stdlib.h>
const char* ptr = "42";  // example input, assumed validated upstream

// the generic way (atoi)
// overhead: function call, loop setup, branch prediction, locale checks
int val_generic = atoi(ptr);

// the specialized way
// overhead: none really, just a few instructions that the CPU can blaze through
int val_specialized = (ptr[0] - '0') * 10 + (ptr[1] - '0');

3. Make it cache friendly

This is where the meat of the matter is. Think about the data writes and reads, on both the temporal and spatial axes. Every load and store needs to be accounted for and studied. Is it needed, where does it write, when should it write, are there other writes/reads targeting nearby memory? For reads, likewise, consider not just the location in the code, but the memory location they touch.

To put it plainly, try to access data that's either close to data you've already accessed or data that you've recently accessed.

If your algorithm is only accessing part of your data - e.g. you have an array of structs and you're only reading the first 8 bytes of a 128-byte struct - you're wasting a lot of CPU. Separate out the relevant data and process it independently. The relevant data stays hot in cache and you reap the benefits.
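
To make that concrete, here's a minimal sketch of that hot/cold split (Record, its field names and the 8-vs-120 byte split are made up for illustration):

#include <array>
#include <cstdint>
#include <vector>

// Before: the scan only needs `key`, but every record drags 120 cold bytes
// through the cache with it.
struct Record {
    uint64_t key;
    std::array<char, 120> payload;   // 128 bytes per record in total
};

// After: the hot keys live in their own contiguous array; the cold payloads
// are only touched once a key actually matches.
struct Records {
    std::vector<uint64_t> keys;
    std::vector<std::array<char, 120>> payloads;  // payloads[i] belongs to keys[i]
};

// Scanning `keys` now reads 8 bytes per record instead of 128, so one
// 64-byte cache line covers 8 records instead of half of one.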

One nice trick: if you're writing to a location that you know you won't access again soon, you can use streaming, non-temporal writes/stores (the _mm_stream_* intrinsics on x86). They let the CPU bypass the data caches and speed up the whole process. Useful when memory bandwidth is constrained as well.
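
A minimal sketch of what that looks like, assuming x86 with SSE and a 16-byte-aligned destination (fill_streaming is a hypothetical function name):

#include <immintrin.h>
#include <cstddef>

// Fill a large buffer that we won't read again soon; count must be a
// multiple of 4 and dst 16-byte aligned for _mm_stream_ps.
void fill_streaming(float* dst, std::size_t count, float value) {
    const __m128 v = _mm_set1_ps(value);
    for (std::size_t i = 0; i < count; i += 4) {
        // Non-temporal store: goes (more or less) straight to memory instead
        // of evicting hot data from the caches to make room for it.
        _mm_stream_ps(dst + i, v);
    }
    _mm_sfence();  // make the streamed stores visible before anyone reads dst
}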

Once you've done all that (or ideally before - premature optimization is not a thing on this blog; improperly granulated optimization, on the other hand, is), you can see what the actual values will be for each piece of data and compress the types if possible. It saves space, you can keep more data in cache, and thus speed up the whole thing.
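
A tiny sketch of that kind of type compression (the particle structs and their value ranges are hypothetical):

#include <cstdint>

// Before: "just in case" sizing, 24 bytes per particle.
struct ParticleWide {
    double  x, y;    // profiling showed float precision is plenty here
    int64_t ttl;     // measured values never exceed a few thousand
};

// After: types sized to the values actually observed, 12 bytes per particle,
// so roughly twice as many fit in every cache line.
struct ParticleNarrow {
    float    x, y;
    uint16_t ttl;
    uint16_t flags;  // the reclaimed space can even absorb another hot field
};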

Structure padding can also be a significant factor. Take a look at this example, generated by a very useful tool called pahole.

struct bad_order {
	char                       c;                    /*     0     1 */

	/* XXX 7 bytes hole, try to pack */

	double                     d;                    /*     8     8 */
	int                        i;                    /*    16     4 */

	/* size: 24, cachelines: 1, members: 3 */
	/* sum members: 13, holes: 1, sum holes: 7 */
	/* padding: 4 */
	/* last cacheline: 24 bytes */
};
struct good_order {
	double                     d;                    /*     0     8 */
	int                        i;                    /*     8     4 */
	char                       c;                    /*    12     1 */

	/* size: 16, cachelines: 1, members: 3 */
	/* padding: 3 */
	/* last cacheline: 16 bytes */
};

As seen above, even though bad_order has the same members as good_order, it is 8 bytes larger simply due to the ordering of its members.
In practice, on x86-64, this means that you can fit 4 good_order structs in a 64-byte cache line, whereas you can fit only 2 bad_order structs (with a third spilling over).
Which is a bit bonkers if you think about it: simply reordering a few lines in an editor can take a good chunk off a runtime (depending on the algorithm, of course).

You can use __attribute__((packed)) instead of manually reordering members, but I'd only use it for protocol/storage formats, as unaligned members can cause issues on some architectures - just one more thing to keep in mind.
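
For completeness, a small sketch of the protocol/storage case where packed is actually the right tool (wire_header is a made-up format):

#include <cstdint>

// The on-wire layout must match the spec byte for byte, so padding is
// removed explicitly: sizeof(wire_header) == 7 instead of 12.
struct __attribute__((packed)) wire_header {
    uint8_t  version;
    uint32_t length;    // unaligned on purpose; memcpy it into a local
    uint16_t checksum;  // before doing arithmetic on it on picky architectures
};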

Finally, if you're in a loop where you know exactly which addresses you're going to access next (provided it's not a simple strided memory access pattern), you can use software prefetching (e.g. the __builtin_prefetch builtin in GCC/Clang). But again, unless the pattern is highly unpredictable (for the pretty good hardware prefetchers of modern CPUs), you should just trust the CPU.
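
A minimal sketch of the kind of loop where that pays off - a data-dependent gather the hardware prefetcher can't predict (sum_gather and the look-ahead distance are made up for illustration):

#include <cstddef>
#include <cstdint>

// Sum table entries selected by an index array. The next address depends on
// data, so we prefetch it ourselves a few iterations ahead.
uint64_t sum_gather(const uint64_t* table, const uint32_t* idx, std::size_t n) {
    constexpr std::size_t kAhead = 8;  // far enough ahead to hide memory latency
    uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n)
            __builtin_prefetch(&table[idx[i + kAhead]], /*rw=*/0, /*locality=*/1);
        sum += table[idx[i]];
    }
    return sum;
}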

4. Make it SIMD

Vectorizing an algorithm can, in simpler cases, fall out of the step above, since the result is usually serial processing of data, which is highly amenable to vectorization (and parallelization). In that case, simply grouping the data into 8/16/32-wide groups and using vector instructions to process them can be enough.
In those cases autovectorization can also kick in, where the compiler happily recognizes that a loop is vectorizable and does the whole process for you.

In other cases, you have to take a good hard look at what your algorithm is doing, separate out independent streams of data processing, rearrange the data to fit the vectorized version so it's laid out nicely and contiguously for the SIMD instructions to speed through, and try to vectorize the whole loop(s). If you have some complex data movement going on, it can be a challenge to shuffle and permute all the bytes around, but a fun challenge nonetheless.
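
As a minimal, hand-written example of the simple end of that spectrum (scale_avx2 is hypothetical; assumes AVX2 and contiguous floats):

#include <immintrin.h>
#include <cstddef>

// Scale an array in place, 8 floats per iteration, plus a scalar tail.
void scale_avx2(float* data, std::size_t count, float factor) {
    const __m256 f = _mm256_set1_ps(factor);
    std::size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m256 v = _mm256_loadu_ps(data + i);            // unaligned load
        _mm256_storeu_ps(data + i, _mm256_mul_ps(v, f));
    }
    for (; i < count; ++i)                                // leftover elements
        data[i] *= factor;
}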

One thing to be aware of is that not all instruction sets are available on all CPU microarchitectures (looking at you, Intel, with AVX512), so code you write can simply crash (typically with an illegal-instruction fault) on some machines. To avoid that, branch based on the CPU feature set, with a common fallback.
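
A minimal sketch of such a dispatch, using the GCC/Clang __builtin_cpu_supports builtin (scale_avx2 is the hypothetical kernel from the sketch above, scale_scalar the fallback):

#include <cstddef>

void scale_avx2(float* data, std::size_t count, float factor);     // AVX2 kernel (see above)

void scale_scalar(float* data, std::size_t count, float factor) {  // portable fallback
    for (std::size_t i = 0; i < count; ++i) data[i] *= factor;
}

// Only ever execute AVX2 instructions on CPUs that report supporting them.
void scale(float* data, std::size_t count, float factor) {
    if (__builtin_cpu_supports("avx2"))
        scale_avx2(data, count, factor);
    else
        scale_scalar(data, count, factor);
}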

For autovectorization, you can use function multiversioning, which automatically does all of that for you. It makes multiple copies of a function, one per target, and dispatches to the right one at runtime - very nifty.
Example:
Example:

__attribute__((target_clones("avx512f", "avx2", "default")))
void process_data(float* data, int count) {
    for (int i = 0; i < count; ++i) {
        data[i] = data[i] * 0.5f;
    }
}

For more information on this (and data-oriented design in general), research SoA (structures of arrays) and AoSoA (arrays of structures of arrays). This could honestly be a whole series of blog posts, so for the sake of terseness, take a look at the materials section at the bottom for sources.
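
Just to pin down the terms, here's what the same (hypothetical) particle data looks like in AoS versus SoA form:

#include <vector>

// AoS (array of structures): fields are interleaved, so a loop that only
// needs x strides over y, z and mass on every element.
struct ParticleAoS { float x, y, z, mass; };
using ParticlesAoS = std::vector<ParticleAoS>;

// SoA (structure of arrays): each field is contiguous, so 8 consecutive x
// values load straight into one AVX register.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};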

5. Make it parallel

So now, parallelization at this stage should come at almost no cost: you've already separated the data processing into independent streams, so parallelizing should be just a matter of using OpenMP or rayon, in the case of C/C++ and Rust respectively. Combined with the steps above, this should easily net you two orders of magnitude worth of speedup.
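
A minimal sketch of that last step with OpenMP (scale_parallel is hypothetical and mirrors the scaling loops above; build with -fopenmp):

#include <cstddef>

// Each iteration is independent, so one pragma spreads them across cores.
void scale_parallel(float* data, std::size_t count, float factor) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(count); ++i)
        data[i] *= factor;
}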

One thing you should be aware of, however, is false sharing.
In short, if multiple threads write to (or one writes and others read) the same cache line, the line keeps getting invalidated and bounced between cores, stalling every access.
Ironically, it can be fixed by adding padding, which makes less data fit in the cache but lets the parallelization do its thing - a net positive. Small example:

struct packed {
    std::atomic<int64_t> a;
    std::atomic<int64_t> b; 
};

struct padded {
    std::atomic<int64_t> a;
    // force "b" onto a new cache line
    alignas(64) std::atomic<int64_t> b; 
};

In the packed struct, both members fit in a single cache line and will therefore exhibit false sharing in a multithreaded scenario.

Conclusion

That is more or less it. Note that this list is not applicable in every case (how on earth would you parallelize an interpreter?), nor does it include every optimization worth doing (rewriting a subroutine in assembly, for example).

It should cover most of it though, as, with programming, all we're doing is shuffling data from one end to the other with some transformation in between.

How do GC/scripting languages fit in?

Short answer - they don't.

Long answer - it depends. It depends on the language and its support for value types and the ability to control the data layout.

  • Java has Project Valhalla, which is an unreleased experimental project to add value types to Java.

  • Go is solid all-around, but writing vectorized code often (ideally) requires writing go assembly.

  • Python defers any perf-related code to C/C++.

  • Lua similarly should not be used to implement perf-sensitive parts of a program, which its embeddable nature supports quite naturally.

  • C# has an awesome situation here with its support for value types (structs and ref structs), slices (spans), stack allocation, and SIMD intrinsics (including AVX512!). You can even go bare-metal and GC-free with bflat.

Despite the outliers, if you truly have non-IO-bottlenecked, data-intensive code that needs to run as fast as possible, systems languages like C/C++/Rust are your best bet.

Materials

  • hyperfine - benchmarking
  • perf - profiling (ETW for windows)
  • VTune - in-depth profiling
  • DTrace - profiling, debugging, tracing
  • Cachegrind - cache tracing
  • Perfetto - tracing visualization
  • godbolt.org - a fantastic website where you can see in detail how your code gets compiled down to assembly (plus analysis and various tools)

The now legendary intro to data-oriented design: CppCon 2014: Mike Acton "Data-Oriented Design and C++"

Much more about data-oriented design: www.dataorienteddesign.com/dodbook

An in-depth overview of performance analysis and optimization: Performance Analysis and Tuning on Modern CPUs by Denis Bakhvalov
