(comments)

Original link: https://news.ycombinator.com/item?id=40870345

The discussion emphasizes that significant performance gains can often be achieved with minimal effort in most software, especially compared with math matrix multiplication (matmul) libraries, which already receive heavy optimization effort. Common improvements include:

1. Choosing an effective algorithm: eliminate redundant work before reaching for sophisticated optimization techniques.
2. Minimizing round trips to the kernel: replace large numbers of "malloc" calls with a custom allocator to cut down on kernel round trips and similarly heavy operations.
3. Vectorization: organize data so it can be processed efficiently, either with explicit vector instructions or by restructuring the data layout.
4. Cache efficiency: optimize data layout and access patterns to maximize cache hit rates.
5. Hardware-specific optimization: use compiler intrinsics and write low-level assembly tailored to a particular architecture.

Commenters argue that eliminating unnecessary work, for example from a badly designed query, yields dramatic gains and is the main route to overall improvement; an example is given of a database query that generated excessive work by retrieving large amounts of data that was never used. Other explanations offered for the benchmark result include OpenBLAS not being heavily tuned for this particular architecture, NumPy overhead, suboptimal benchmarking, and an inefficient mask-generation method.


Original


If the point of this article is that there's generally performance left on the table, it's if anything understating how much room for improvement there usually is, considering how much more effort goes into matmul libraries than into most other software.

Getting a 10-1000x or more improvement on existing code is very common without putting in a ton of effort if the code was not already heavily optimized. These are listed roughly in order of importance, but performance is often such a non-consideration from most developers that a little effort goes a long way.

1. Most importantly, is the algorithm a good choice? Can we eliminate some work entirely? (this is what algo interviews are testing for)

2. Can we eliminate round trips to the kernel and similar heavy operations? The most common huge gain here is replacing tons of malloc calls with a custom allocator.

3. Can we vectorize? Explicit vector intrinsics like in the blog post are great, but you can often get the same machine code by reorganizing your data into plain arrays / structs of arrays rather than arrays of structs (a sketch follows this list).

4. Can we optimize for cache efficiency? If you already reorganized for vectors this might already be handled, but this can get more complicated with parallel code if you can't isolate data to one thread (false sharing, etc.)

5. Can we do anything else that's hardware specific? This can be anything from using intrinsics to hand-coding assembly.
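
To make #3 concrete, here's a minimal sketch of the array-of-structs vs struct-of-arrays reorganization; the Particle type and field names are hypothetical, not from the article. With the SoA layout the inner loop becomes a contiguous stream that compilers can typically auto-vectorize:

    /* Array of structs: the x values are strided in memory, which tends to
       defeat auto-vectorization of the update loop. */
    typedef struct { float x, y, z, mass; } Particle;

    void update_aos(Particle *p, int n, float dt) {
        for (int i = 0; i < n; i++)
            p[i].x += dt * p[i].y;
    }

    /* Struct of arrays: each field is contiguous, so the same loop becomes a
       straightforward, vectorizable stream of loads, adds, and stores. */
    typedef struct { float *x, *y, *z, *mass; } Particles;

    void update_soa(Particles *p, int n, float dt) {
        for (int i = 0; i < n; i++)
            p->x[i] += dt * p->y[i];
    }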



Don't forget the impact of the network. I managed to get a several-hundred-times performance improvement on one occasion because I found a distributed query that was pulling back roughly 1M rows over the network and then doing a join that dropped all but 5 - 10 of them. I restructured the query so the join occurred on the remote server and only 5 - 10 rows were sent over the network and, boom, suddenly it's fast.

There's always going to be some fixed overhead and latency (and there's a great article about the impact of latency on performance called "It's the latency, stupid" that's worth a read: http://www.stuartcheshire.org/rants/latency.html) but sending far more data than is needed over a network connection will sooner or later kill performance.

Overall though, I agree with your considerations, and in roughish terms the order of them.



I meant this kind of thing to fall under #1. "Don't do work that can be avoided" includes pulling 1M rows * a bunch of columns you don't need over the network.

From your description though, it doesn't sound like something I'd classify as a network issue. That's just classic orm nonsense. I guess I don't know what you mean by "distributed query", but it sounds terrible.

The most classic network performance issue is forgetting to disable Nagle's algorithm.
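
For reference, on POSIX sockets that's a one-line setsockopt call; a minimal sketch (the function name is mine):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm so small writes go out immediately instead of
       being coalesced while waiting for outstanding ACKs. */
    int disable_nagle(int fd) {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }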

The most classic sql performance issue is not using an index.



A distributed query is something you execute over multiple instances of your DBMS. I actually would disagree with you that this specific issue was an instance of (1). There wasn't anything wrong with the query per se but rather the issue was with where the bulk of the work in the query was being done: move that work to the right place and the query becomes fast.

When considering performance issues, in my experience it's a mistake not to explicitly consider the network.



> I actually would disagree with you that this specific issue was an instance of (1). There wasn't anything wrong with the query per se

I think OP's #1 agrees that there's nothing "technically" wrong with such a query (or an algo). It just generated work you didn't have to do. Work takes time. So you used time you didn't have to use. I also think this is the number 1 way of improving performance in general (computer, life).

A perfectly valid query of fetching 1M rows turned into 99.xxx% unnecessary work when you only needed a handful of rows. The query wasn't slow, it was just generating more work than you actually needed. The network also wasn't slow, it simply had to transfer (even at peak theoretical efficiency) a lot of data you never used.

You then used an equally valid query that wasn't even necessarily fast, it just generated much less work. This query (quote from #1) "eliminate[d] some work entirely", the work of carrying over unnecessary data.



The papers referenced in the BLIS repo are the authoritative reference to understand this stuff. I've no idea why people think optimized BLASes aren't performant; you should expect >90% of CPU peak for sufficiently large matrices, and serial OpenBLAS was generally on a par with MKL last I saw. (BLAS implements GEMM, not matmul, as the basic linear algebra building block.) I don't understand using numpy rather than the usual benchmark frameworks, and on Zen surely you should compare with AMD's BLAS (based on BLIS). BLIS had a better parallelization story than OpenBLAS last I knew. AMD BLIS also has the implementation switch for "small" dimensions, and I don't know if current OpenBLAS does.

SIMD intrinsics aren't needed for micro-kernel vectorization, as a decent C compiler will fully vectorize and unroll it. BLIS' pure C micro-kernel gets >80% of the performance of the hand-optimized implementation on Haswell with appropriate block sizes. The difference is likely to be due to prefetch, but I don't properly understand it.
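
For illustration, a BLIS-style micro-kernel in plain C looks roughly like the sketch below; the MR/NR values and packing layout are illustrative, not BLIS's actual code. With fixed loop bounds and something like -O3 -march=native, compilers will generally vectorize and unroll the accumulator loops:

    #define MR 6
    #define NR 16

    /* C[MR x NR] += A_packed[MR x kc] * B_packed[kc x NR], where A is packed
       MR elements per k-step and B is packed NR elements per k-step. */
    void microkernel(int kc, const float *restrict a, const float *restrict b,
                     float *restrict c, int ldc) {
        float acc[MR][NR] = {{0}};
        for (int p = 0; p < kc; p++)
            for (int i = 0; i < MR; i++)
                for (int j = 0; j < NR; j++)
                    acc[i][j] += a[p * MR + i] * b[p * NR + j];
        for (int i = 0; i < MR; i++)
            for (int j = 0; j < NR; j++)
                c[i * ldc + j] += acc[i][j];
    }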



Good writeup and commendable of you to make your benchmark so easily repeatable. On my 16-core Xeon(R) W-2245 CPU @ 3.90GHz matmul.c takes about 1.41 seconds to multiply 8192x8192 matrices when compiled with gcc -O3 and 1.47 seconds when compiled with clang -O2, while NumPy does it in 1.07 seconds. I believe an avx512 kernel would be significantly faster. Another reason for the lackluster performance may be omp. IME, you can reduce overhead by managing the thread pool explicitly with pthreads (and use sysconf(_SC_NPROCESSORS_ONLN) instead of hard-coding).
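
A sketch of the sysconf query mentioned above (the wrapper name and fallback value are my own choices):

    #include <unistd.h>

    /* Number of online cores at run time, instead of a hard-coded NTHREADS. */
    static long online_threads(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        return n > 0 ? n : 1;  /* fall back to 1 if the query fails */
    }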



This looks like a nice write-up and implementation. I'm left wondering what is the "trick"? How does it manage to beat OpenBLAS, which is assembly+C optimized over decades for this exact problem? It goes into detail about caching, etc. - is BLAS not taking advantage of these things, or is this more tuned to this specific processor, etc.?



- OpenBLAS isn't _that_ optimized for any specific modern architecture.

- The matrices weren't that big. Numpy has cffi overhead.

- The perf difference was much more noticeable with _peak_ throughput rather than _mean_ throughput, which matters for almost no applications (a few, admittedly, but even where "peak" is close to the right measure you usually want something like the mean of the top-k results or the proportion with under some latency, ...).

- The benchmarking code they displayed runs through Python's allocator for numpy and is suggestive of not going through any allocator for the C implementation. Everything might be fine, but that'd be the first place I checked for microbenchmarking errors or discrepancies (most numpy routines allow in-place operations; given that that's known to be a bottleneck in some applications of numpy, I'd be tempted to explicitly examine benchmarks for in-place versions of both).

- Numpy has some bounds checking and error handling code which runs regardless of the underlying implementation. That's part of why it's so bleedingly slow for small matrices compared to even vanilla Python lists (they tested bigger matrices too, so this isn't the only effect, but I'll mention it anyway). It's hard to make something faster when you add a few thousand cycles of pure overhead.

- This was a very principled approach to saturating the relevant caches. It's "obvious" in some sense, but clear engineering improvements are worth highlighting in discussions like this, in the sense that OpenBLAS, even with many man-hours, likely hasn't thought of everything.

And so on. Anyone can rattle off differences. A proper explanation requires an actual deep-dive into both chunks of code.



np.matmul just uses whatever BLAS library your NumPy distribution was configured for/shipped with.

Could be MKL (I believe the conda version comes with it), but it could also be an ancient version of OpenBLAS you already had installed. So yeah, being faster than np.matmul probably just means your NumPy is not installed optimally.



There is no reason to burden one side with Python while the other side is C, when they could just as easily have performed an apples-to-apples comparison where both sides are written in C, one calling a BLAS library while the other calls this other implementation.



Python is the right thing to compare to here, because it is easily the most popular way to perform these computations in the modern day. Specifically using numpy. The overhead isn't that high, but as mentioned elsewhere in this thread, calling it correctly is important. Pitting naive numpy code against tuned C code is definitely not a fair comparison.



> Python is the right thing to compare to here, because it is easily the most popular way to perform these computations in the modern day. Specifically using numpy.

By that reasoning, wouldn't it make more sense to wrap their C code and maybe even make it operate on numpy's array representation, so it can be called from Python?



Popular does not mean best.

Suppose that this blog post were part of a series that questions the axiom (largely bolstered by academic marketing) that one needs Python to do array computing. Then it is valid to compare C directly to NumPy.

It isn't even far fetched. The quality of understanding something after having implemented it in C is far greater than the understanding gained by rearranging PyTorch or NumPy snippets.

That said, the Python overhead should not be very high if M=1000, N=1000, K=1000 was used. The article is a bit silent on the array sizes; these figures come from somewhere in the middle of the article.



Python is popular precisely because non-programmers are able to rearrange snippets and write rudimentary programs without a huge time investment in learning the language or tooling. It’s a very high level language with syntactic sugar that has a lot of data science libraries which can call C code for performance, which makes it great for data scientists.

It would be a huge detriment and time sink for these data scientist to take the time to learn to write an equivalent C program if their ultimate goal is to do data science.



I think it’s okay to say “This is the benchmark, now I’m going to compare it against something else.” It’s up to the reader to decide if a 3% (or 300%) improvement is worth the investment if it involves learning a whole other language.



It's a muddy comparison given that NumPy is commonly used with other BLAS implementations, which the author even lists, but doesn't properly address. Anaconda defaults to Intel oneAPI MKL, for example, and that's a widely used distribution. Not that I think MKL would do great on AMD hardware, BLIS is probably a better alternative.

The author also says "(...) implementation follows the BLIS design", but then proceeds to compare *only* with OpenBLAS. I'd love to see a more thorough analysis, and using C directly would make it easier to compare multiple BLAS libs.



If that was the goal, they should have compared NumPy to BLAS. What they did was compare OpenBLAS wrapped in NumPy with their C code. It is not a reasonable comparison to make.

Look, I'm trying to be charitable to the authors, hard as that might be.



There is some reason in this comparison. You might want to answer the question: "if I pick the common approach to matrix multiplication in the world of Data Science (numpy), how far off is the performance from some potential ideal reference implementation?"

I actually do have that question niggling in the back of my mind when I use something like NumPy. I don't necessarily care exactly _where_ the overhead comes from, I might just be interested whether it's close to ideal or not.



If that was your question, you would compare against a number of BLAS libraries, which are already well optimized.

What they are doing here is patting themselves on the back after handicapping the competition. Not to mention that they have given themselves the chance to cherry pick the very best hyperparameters for this particular comparison while BLAS is limited to using heuristics to guess which of their kernels will suit this particular combination of hardware and parameters.

The authors need to be called out for this contrived comparison.



Though not at all part of the hot path, the inefficiency of the mask generation ('bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing constant vector {0,1,2,3,4,...} with broadcasted m and m-8.
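
For example, the compare-against-an-index-vector variant could look something like this for one 8-lane mask (a sketch of the idea, not the article's code; the function name is mine):

    #include <immintrin.h>

    /* AVX2 mask whose first m lanes (0 <= m <= 8) are all-ones: lanes whose
       index is below the broadcast of m compare true and become -1. */
    static __m256i tail_mask(int m) {
        const __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm256_cmpgt_epi32(_mm256_set1_epi32(m), idx);
    }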

Very moot nitpick though, given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (esp. store, which is still slow on Zen 4[1] despite the AVX-512 instruction (whose only difference is taking the mask in a mask register) being 6x faster), and clang autovectorizes the shifting anyways (maybe like 2-3x slower than my suggestions).

[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...



Hi! I'm the author of the article. It's really my first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but I'm willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.

Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.

"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!



> creating a global constant array

Note you can keep int8_t elements in that array, and sign extend bytes into int32_t while loading. The _mm_loadu_si64 / _mm256_cvtepi8_epi32 combo should compile into a single vpmovsxbd instruction with a memory operand. This way the entire constant array fits in a single cache line, as long as it’s aligned properly with alignas(32)

This is a good fit for the OP's use case because they need two masks, so the second vpmovsxbd instruction will be a guaranteed L1D cache hit.
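
A sketch of that layout, handling the same 0..8-lane tail mask as the example above (names are mine):

    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdint.h>

    /* 16 bytes of -1 followed by 16 bytes of 0; the whole table sits in one
       32-byte-aligned cache line, and each mask is an 8-byte load from it. */
    static alignas(32) const int8_t mask_bytes[32] = {
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
    };

    /* Mask whose first m lanes (0 <= m <= 8) are -1: load 8 sign bytes at
       offset 16 - m, then sign-extend them to 8 int32 lanes (one vpmovsxbd). */
    static __m256i tail_mask_from_table(int m) {
        __m128i bytes = _mm_loadu_si64(mask_bytes + 16 - m);
        return _mm256_cvtepi8_epi32(bytes);
    }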



vpmovsxbd ymm,[…] still presumably decomposes back into two uops (definitely does on intel, but uops.info doesn't show memory uops for AMD); still better than broadcast+compare though (which does still have a load for the constant; and, for that matter, the original shifting version also has multiple loads). Additionally, the int8_t elements mean no cacheline-crossing loads. (there's the more compressed option of only having a {8×-1, 8×0} array, at the cost of more scalar offset computation)



What is the point of making the matrix multiplication itself multithreaded (other than benchmarking)? Wouldn't it be more beneficial in practice to have the multithreadedness in the algorithm that use the multiplication?



That's indeed what's typically done in HPC. However, simply substituting a parallel BLAS can help the right sort of R code, for instance, but HPC codes typically aren't bottlenecked on GEMM.



I've only skimmed so far, but this post has a lot of detail and explanation. Looks like a pretty great description of how fast matrix multiplications are implemented to take into account architectural concerns. Goes on my reading list!



I don’t save articles often, but maybe one time in a few months I see something I know I will enjoy reading again even after 1 or 2 years. Keep up the great work OP!



In the README, it says:

> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.

I think I'll need a TL;DR on what to change all these values to.

I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.

So I took a look at the other values present in benchmark.c, but MC and NC are already calculated from the number of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits the 7700(X) (the author's CPU) and how I'd need to adjust it for the 7950X (with the information from the article).



Very nice write up. Those are the kind of matrix sizes that MKL is fairly good at, might be worth a comparison as well?

Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.



For very small sizes on amd64, you can, and likely should, use libxsmm. MKL's improved performance in that region was due to libxsmm originally showing it up, but this isn't an Intel CPU anyway.



This was a great read, thanks a lot! On a side note, does anyone have a good guess what tool/software they used to create the visualisations for the matrix multiplications or the memory outline?



Does it make sense to compare a C executable with an interpreted Python program that calls a compiled library? Is the difference due to the algorithm or the call stack?



Python can efficiently call C libs if you use ctypes and native pointers, which numpy uses. Of course it depends on the expected layout.

If you want to convert to Python lists it's going to take time. Not sure about Python arrays.



> #define min(x, y) ((x) < (y) ? (x) : (y))

Cursed code. I checked and every single invocation is just `int`, so why do this? You can just write a function:

    func min(x, y int) int {
       if x < y { return x }
       return y
    }
and keep type safety
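
In the C of matmul.c, the equivalent type-safe alternative would be something like (a sketch; the name min_int is mine):

    /* A plain function (or static inline in a header) keeps the int type
       checked by the compiler and avoids double-evaluating macro arguments. */
    static inline int min_int(int x, int y) {
        return x < y ? x : y;
    }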


> This is my first time writing a blog post. If you enjoy it, please subscribe and share it!

Great job! Self-publishing things like this was a hallmark of the early internet that I, for one, sorely miss.



The article claims this is portable C. Given the use of Intel intrinsics, what happens if you try to compile it for ARM64?
