Don't forget the impact of the network. I once got a several-hundred-times performance improvement because I found a distributed query that was pulling back roughly 1M rows over the network and then doing a join that dropped all but 5-10 of them. I restructured the query so the join occurred on the remote server and only 5-10 rows were sent over the network, and, boom, suddenly it was fast.
There's always going to be some fixed overhead and latency (there's a great article on the impact of latency on performance, "It's the latency, stupid", that's worth a read: http://www.stuartcheshire.org/rants/latency.html), but sending far more data than is needed over a network connection will sooner or later kill performance. Overall, though, I agree with your considerations and, roughly, with their order.
Though not at all part of the hot path, the inefficiency of the mask generation (the 'bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing the constant vector {0,1,2,3,4,...} against broadcasted m and m-8.
A very moot nitpick, though, given that this is for only one column of the matrix: the following loops of maskload/maskstore will take significantly more time (especially the store, which is still slow on Zen 4 [1] even though the equivalent AVX-512 instruction, whose only difference is taking the mask in a mask register, is 6x faster), and clang autovectorizes the shifting anyway (maybe 2-3x slower than my suggestions).

[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...
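For concreteness, here is a minimal sketch of both suggestions, reduced to the single-vector 8-lane AVX2 case (the table size, offsets, and helper names are my own simplification of the commenter's 16-and-8-lane scheme, not code from the article):

```c
#include <immintrin.h>
#include <stdint.h>

/* Eight all-ones lanes followed by eight zero lanes. Loading 8 elements
 * at offset 8 - m yields a mask whose first m lanes are -1, rest 0. */
static const int32_t mask_table[16] = {
    -1, -1, -1, -1, -1, -1, -1, -1,
     0,  0,  0,  0,  0,  0,  0,  0
};

/* Table variant: one unaligned load, no per-call shifting. 0 <= m <= 8. */
static inline __m256i tail_mask_load(int m) {
    return _mm256_loadu_si256((const __m256i *)(mask_table + (8 - m)));
}

/* Compare variant: lane i becomes -1 iff i < m. */
static inline __m256i tail_mask_cmp(int m) {
    const __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    return _mm256_cmpgt_epi32(_mm256_set1_epi32(m), idx);
}
```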
That's indeed what's typically done in HPC. However, substituting a parallel BLAS is a simple way to help the right sort of R code, for instance, though HPC codes typically aren't bottlenecked on GEMM.
I don't save articles often, but maybe once every few months I see something I know I will enjoy reading again even after a year or two. Keep up the great work, OP!
For very small sizes on amd64, you can, and likely should, use libxsmm. MKL's improved performance in that regime was originally due to libxsmm showing it up, but this isn't an Intel CPU anyway.
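Roughly, libxsmm gets used for tiny GEMMs by JIT-dispatching a kernel once for a fixed (m, n, k) shape and then calling it in a loop. This sketch uses the classic dispatch API; the exact signatures have varied across libxsmm versions, so treat it as an assumption to verify against your installed headers:

```c
#include <libxsmm.h>

/* Sketch: JIT-dispatch a double-precision kernel specialized for one
 * small (m, n, k) shape, then reuse it for every call. The NULLs request
 * the defaults (tight leading dimensions, alpha = beta = 1, so the
 * kernel computes c += a * b). */
void small_gemm(const double *a, const double *b, double *c,
                libxsmm_blasint m, libxsmm_blasint n, libxsmm_blasint k)
{
    libxsmm_dmmfunction kernel = libxsmm_dmmdispatch(
        m, n, k,
        NULL, NULL, NULL,   /* lda, ldb, ldc   */
        NULL, NULL,         /* alpha, beta     */
        NULL, NULL);        /* flags, prefetch */
    if (kernel != NULL)
        kernel(a, b, c);    /* specialized small-GEMM kernel */
}
```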
This was a great read, thanks a lot! On a side note, does anyone have a good guess what tool/software they used to create the visualisations of the matrix multiplications and the memory layout?
Does it make sense to compare a C executable with an interpreted Python program that calls a compiled library? Is the difference due to the algorithm or the call stack?
> This is my first time writing a blog post. If you enjoy it, please subscribe and share it!

Great job! Self-published pieces like this were a hallmark of the early internet that I, for one, sorely miss.
The article claims this is portable C. Given the use of Intel intrinsics, what happens if you try to compile it for ARM64?
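Presumably it won't compile as-is on ARM64 (immintrin.h is x86-only); the usual remedy is a preprocessor guard with a scalar fallback, or translating the intrinsics with a header such as SIMDe or sse2neon. A minimal sketch of the guard pattern, not taken from the article:

```c
#include <stddef.h>

#if defined(__AVX__)
#include <immintrin.h>

/* x86 path: process 8 floats per iteration with AVX intrinsics. */
static void scale(float *x, float s, size_t n) {
    const __m256 vs = _mm256_set1_ps(s);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vs));
    for (; i < n; i++)      /* scalar tail */
        x[i] *= s;
}
#else
/* Portable fallback for ARM64 and everything else; compilers will
 * usually autovectorize this loop to NEON on their own. */
static void scale(float *x, float s, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] *= s;
}
#endif
```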
Getting a 10-1000x or more improvement on existing code is very common without putting in a ton of effort, provided the code was not already heavily optimized. The points below are listed roughly in order of importance, but performance is often such a non-consideration for most developers that a little effort goes a long way.
1. Most importantly, is the algorithm a good choice? Can we eliminate some work entirely? (this is what algo interviews are testing for)
2. Can we eliminate round trips to the kernel and similar heavy operations? The most common huge gain here is replacing tons of malloc calls with a custom allocator (see the arena sketch after this list).
3. Can we vectorize? Explicit vector intrinsics like in the blog post are great, but you can often get the same machine code by reorganizing your data into flat arrays / a struct of arrays rather than an array of structs (see the SoA sketch after this list).
4. Can we optimize for cache efficiency? If you already reorganized for vectors this might already be handled, but it can get more complicated with parallel code if you can't isolate data to one thread (false sharing, etc.; see the padding sketch after this list).
5. Can we do anything else that's hardware specific? This can be anything from using intrinsics to hand-coding assembly.
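On point 2, a bump/arena allocator is the classic shape of that custom-allocator win: one large malloc up front, pointer arithmetic per allocation, and one reset instead of thousands of frees. A minimal sketch with made-up names:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    char  *base;   /* one big upfront allocation     */
    size_t used;   /* bump pointer: bytes handed out */
    size_t cap;    /* total capacity of the block    */
} Arena;

static int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

/* Each allocation is just an add and a compare; no syscalls, no locks. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t off = (a->used + 15) & ~(size_t)15;  /* 16-byte alignment */
    if (off + n > a->cap)
        return NULL;                            /* out of space */
    a->used = off + n;
    return a->base + off;
}

static void arena_reset(Arena *a)   { a->used = 0; }  /* free all at once */
static void arena_destroy(Arena *a) { free(a->base); a->base = NULL; }
```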
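On point 3, the array-of-structs vs. struct-of-arrays difference is easiest to see in code: in the SoA layout each field is one contiguous, unit-stride stream that compilers vectorize readily. A sketch with illustrative names:

```c
#include <stddef.h>

/* Array-of-structs: fields are interleaved, so a loop that only reads
 * x strides over y, z, and mass too, wasting bandwidth and resisting
 * autovectorization. */
typedef struct { float x, y, z, mass; } ParticleAoS;

/* Struct-of-arrays: each field is one contiguous array. */
typedef struct {
    float *x, *y, *z, *mass;
    size_t n;
} ParticlesSoA;

/* Unit-stride loop over a single field; compilers typically turn this
 * into packed SIMD loads, multiplies, and stores. */
static void integrate_x(ParticlesSoA *p, const float *vx, float dt) {
    for (size_t i = 0; i < p->n; i++)
        p->x[i] += vx[i] * dt;
}
```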
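And on point 4's false-sharing aside: when per-thread data must live in one shared array, padding each slot out to its own cache line keeps cores from invalidating each other's lines. A tiny sketch (64-byte lines assumed, names illustrative):

```c
#include <stdalign.h>

/* One counter per thread; alignas(64) gives each slot its own cache
 * line, so concurrent increments don't ping-pong a shared line. */
typedef struct { alignas(64) long count; } PaddedCounter;

PaddedCounter counters[8];  /* e.g., one slot per worker thread */
```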