For basically any high-performance computation, memory layout and access pattern are critical. Common wisdom is that linear, contiguous memory performs best and should almost always be preferred. However, it should be intuitively clear that this has diminishing returns: processing a single 32 GB block vs processing two 16 GB blocks will not meaningfully differ in performance. Working with smaller blocks enables some interesting data structures, so I've set out to experimentally determine what block size is needed to effectively capture the full performance.
Findings
Setup and detailed analysis below, but my personal takeaway is:
- 1 MB blocks are enough for basically any workload of this kind
- 128 kB blocks suffice once you have at least ~1 cycle per processed byte
- 4 kB blocks are already enough once you're above roughly 10 cycles per processed byte
(for raw data processing, not necessarily if there are other per-block costs)
This is the full results chart for my Ryzen 9 7950X3D, effectively showing the block sizes needed for peak performance across different workloads. The rest of this post will go over the setup and discuss a few isolated graphs.
Code and results are available here: github.com/solidean/bench-linear-access
Setup
The memory hierarchy of modern CPUs is famously complex, so I've tried to create an experimental setup where we can isolate and control the effects well enough to make our results generalize.
Our main question is: when we have to process a data set in blocks of linear memory, how large do the individual blocks have to be so that any penalty in jumping between blocks is amortized?
For that, our "processing kernel" takes a span<span<float const> const> (or float**) and returns an opaque uint64_t result (to make sure the optimizer cannot elide our computation).
This is effectively the non-owning version of "vector of blocks".
For example, the scalar_stats kernel is
// Scalar stats kernel: computes running stats over all blocks.
// Returns a mini hash (no quality requirements) to prevent elision.
uint64_t kernel_scalar_stats(std::span<std::span<float const> const> data)
{
struct stats
{
float m0 = 0, m1 = 0, m2 = 0;
float min = +std::numeric_limits<float>::max();
float max = -std::numeric_limits<float>::max();
};
stats s;
for (auto block : data)
for (auto d : block)
{
s.m0 += 1;
s.m1 += d;
s.m2 += d * d;
if (d < s.min)
s.min = d;
if (d > s.max)
s.max = d;
}
auto b = [](float f) { return std::bit_cast<uint32_t>(f); };
return b(s.m0) ^ b(s.m1) ^ b(s.m2) ^ b(s.min) ^ b(s.max);
}
This represents a lightweight but scalar computation, reaching ~7 GB/s on my system (for large blocks). So roughly 3 cycles per float, or 0.75 cycles per byte.
The return value is simply some value that depends on the entire computation.
We do a similar aggregation over all experiments and write the resulting "hash" into the .csv.
That constrains the optimizer enough without resorting to volatile.
A single run is a single call of the kernel function on a set of blocks. Each block is contiguous and has a fixed "block size". The total size of all blocks that we execute on is the "working set". We can partition the same working set using different block sizes, which is the degree of freedom we have in practice.
Benchmarking on a modern system is notoriously noisy, so we need to repeat the same run multiple times. We chose to report medians to guard against outliers on either end. However, if the working set fits into a cache, we would measure warm-ish timings while our real use case might work on cold data.
To counteract this, we place the blocks in random order and random positions inside a sufficiently large "backing memory" (4 GB here).
Or, to put it graphically:
We still clobber the cache before each run:
// folds a 256 MB block of memory into the seed and returns a dependent result
uint64_t clobber_cache(uint64_t seed)
{
static std::vector<uint64_t> clobber_data;
if (clobber_data.empty())
{
clobber_data.resize((256uLL << 20) / sizeof(uint64_t));
std::default_random_engine rng;
for (auto& d : clobber_data)
d = rng();
}
// slightly dependency heavy but very hard to optimize away
for (auto d : clobber_data)
seed = seed * d + 7;
return seed;
}
(Not the highest-perf way to clobber, but the dependency chain is obvious enough that compilers have no real choice but to read the whole 256 MB between consuming the input seed and producing the result value.)
Clobbering is not part of the timing. Its purpose is simply to ensure no useful data remains in the cache hierarchy before each run.
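Put together, a single measurement in the randomized scenario could be structured like this (hypothetical harness sketch, not the repo's exact code):

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical sketch of the "randomized" scenario: before EVERY timed
// run we re-randomize the layout and clobber the cache (both inside
// prepare(), excluded from timing), then report the median over K runs.
double median_run_seconds(int K, std::function<void()> const& prepare, std::function<void()> const& run)
{
    std::vector<double> times;
    for (int i = 0; i < K; ++i)
    {
        prepare(); // randomize layout + clobber cache; not timed
        auto const t0 = std::chrono::steady_clock::now();
        run(); // the kernel call being measured
        auto const t1 = std::chrono::steady_clock::now();
        times.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(times.begin(), times.end());
    return times[times.size() / 2]; // median (K odd, e.g. K = 11)
}
```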
The scalar_stats kernel across block sizes
The first experiment runs the above scalar_stats kernel across varying block sizes (32 byte to 2 MB) and working sets (1 MB to 64 MB).
We plot throughput in MB/s by block size; each working set gets its own line.
While there is some noise in the measurements, this graph shows a few things:
- Our layout randomization and clobbering are effective. The cache hierarchy has basically no effect here, and the results are effectively independent of the working set size.
- Beyond 128 kB there is little benefit to increasing the block size.
- We converge to slightly beyond 7 GB/s here, which is ~3 cycles per float (or ~0.75 cycles per byte).
Each individual run starts with a randomized layout and a cold cache, so the curve shape cleanly captures the contiguity vs. jumping around in memory tradeoff.
This is also a kind of worst-case behavior, so any less antagonistic scenario (more predictable layout, partially prewarmed cache, etc.) will reach peak performance earlier.
The repeated variant of the previous experiment
We get one of these scenarios by relaxing the "coldness" a bit.
In the original setup, we execute each run K times (e.g. K = 11) and report the median, clobbering and randomizing the block layout before each individual run.
The "repeated" (instead of "randomized") scenario only clobbers and chooses a random layout at the start of each set of K runs (with the same block and working set size).
This models a scenario where we hit the data repeatedly and any data or TLB cache that fits can be reused.
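In harness terms, the preparation simply moves outside the K-loop (a hypothetical sketch, not the repo's exact code):

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical sketch of the "repeated" scenario: layout choice and cache
// clobbering happen ONCE per group of K runs, so later runs can reuse
// whatever data and TLB entries still fit in the caches.
double median_repeated_seconds(int K, std::function<void()> const& prepare, std::function<void()> const& run)
{
    prepare(); // randomize layout + clobber cache once, before the whole group
    std::vector<double> times;
    for (int i = 0; i < K; ++i)
    {
        auto const t0 = std::chrono::steady_clock::now();
        run();
        auto const t1 = std::chrono::steady_clock::now();
        times.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(times.begin(), times.end());
    return times[times.size() / 2]; // median now reflects a pre-warmed state
}
```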
The median will thus report a pre-warmed state:
Unsurprisingly, smaller working sets can be (partially) kept in the cache hierarchy, which leads to earlier peaking. Thus the graph shapes now depend on the working set size. Smaller sizes fit into faster caches and reach peak performance earlier. The conservative 128 kB works for all sizes but you might get away with 4..16 kB blocks if your total data size is small (and nothing else competes for the caches).
Comparing different workloads
Up until now we only have hard values for the scalar_stats kernel, which is neither the fastest nor the slowest kernel.
This raises the question of what the curve looks like for the most extreme SIMD kernel.
And what would be a guideline for heavier processing?
~3 cycles per float is extremely fast compared to, for example, much of our geometry processing.
To represent the high-end SIMD processing, we have simd_sum, an AVX2 (or NEON) sum over all floats:
uint64_t kernel_simd_sum(std::span<std::span<float const> const> data)
{
__m256 acc = _mm256_setzero_ps();
for (auto block : data)
{
auto const* ptr = block.data();
auto const count = block.size();
for (size_t i = 0; i < count; i += 8)
acc = _mm256_add_ps(acc, _mm256_load_ps(ptr + i));
}
// reduce: bitcast to uint32 array and XOR
auto words = std::bit_cast<std::array<uint32_t, 8>>(acc);
uint32_t h = 0;
for (auto w : words)
h ^= w;
return uint64_t(h);
}
The experimental setup ensures all blocks are a multiple of 8 floats (aka 32 B) in size and aligned to 32 B.
The NEON implementation does two vaddq_f32 per iteration.
This reaches 50+ GB/s on my Ryzen (55 GB/s on a Macbook Air M4), which is very good for a single-threaded program.
That's roughly 0.1 cycles/byte.
On the other side of the spectrum we have heavy_sin, a repeated v = sin(v + data[i]) loop:
uint64_t kernel_heavy_sin(std::span<std::span<float const> const> data)
{
float v = 0.f;
for (auto block : data)
for (auto d : block)
v = std::sin(v + d);
return uint64_t(std::bit_cast<uint32_t>(v));
}
This reaches ~450 MB/s on my system, which is ~40 cycles per float or ~10 cycles per byte. (Amazing that we reached close-to-correctly-rounded sin/cos on the order of 1/5th of the latency of a single uncached main-memory load)
Here is what all three kernels look side-by-side:
Qualitatively, the curves are similar but with different thresholds.
simd_sum needs at least 1 MB for peak performance, scalar_stats only 128 kB, and heavy_sin is already great at 4 kB blocks.
It's a bit hard to read due to the logarithmic y-axis but we'll see a better plot next section.
Even though I called heavy_sin a "heavy" kernel, depending on your domain, 40 cycles per float can be quite cheap.
That simply means those workloads need to worry even less about large block sizes.
Putting it all together
We'll now combine all experiments (all 3 kernels, randomized and repeated, different working sets) in a single graph. To make it easier to compare visually, we group the experiments by (kernel, randomized/repeated, working set) and normalize throughput (1.0 means largest achieved throughput in that group). Randomized is solid line, repeated is dashed. Kernels are color coded and the working set is coded light-to-dark.
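The normalization step itself is simple (sketch with a hypothetical helper):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of the per-group normalization: divide every
// throughput in a (kernel, scenario, working set) group by that group's
// maximum, so 1.0 marks the best-performing block size of the group.
std::vector<double> normalize_group(std::vector<double> const& throughput_mbs)
{
    double const peak = *std::max_element(throughput_mbs.begin(), throughput_mbs.end());
    std::vector<double> normalized;
    normalized.reserve(throughput_mbs.size());
    for (auto t : throughput_mbs)
        normalized.push_back(t / peak);
    return normalized;
}
```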
These are the results on my Ryzen 9 7950X3D (this is the plot from the beginning again):
On the Macbook Air M4 of a colleague:
The Macbook has an overall similar shape (with more jitter because we couldn't control power modes).
simd_sum required 1 MB blocks for both but scalar_stats peaked earlier on the mac.
Interestingly, while heavy_sin still shows a visible drop below 1 kB or so, it already reached 90% of its peak performance even at the lowest measured block size of 32 B.
Conclusion
I originally wanted to investigate ideal block sizes for some internals of our mesh booleans. While the user-provided inputs are contiguous, we have to omit, emit, invertedly-emit, and emit-cut-up geometry a lot. All these different sources are unpredictable and are later merged and fed to post-processing (e.g. for restoring topology). This makes "streams of chunks" the only sensible data structure.
In the end, I didn't really get a universal answer (that would have been quite surprising considering how "hedged" everything is in performance engineering). But, the takeaways from the start are quite useful already.
The hyper-efficient SIMD sum is a good upper bound. That means more than 1 MB block sizes are probably overkill in most applications (of per-block linear computation). Most of the processing in our case takes more than 1 cycle per byte, so the 128 kB size is already plenty for me.
To put this all into perspective a bit:
A "workload" for these experiments means going linearly over the data in each block, not skipping any bytes.
You'll get different characteristics if you're striding over each block or if you need to pull in a lot of other data during processing.
We also only measured processing overhead and each block was provided as a span<float const>.
If you need to allocate blocks on demand or have other fixed per-block costs, this will shift the curves.
Finally, there are benefits to having the whole data set in a single contiguous block: every API can then accept a span<T const>, and subsetting as well as random access are trivial.
But once you're forced into chunks/blocks, the required block size for peak performance is quite a bit smaller than I expected.
Contributing
That being said, I haven't tested this on many systems yet.
The code (and my raw results) are hosted at github.com/solidean/bench-linear-access.
Feel free to run the benchmark on your system (uv run create-charts.py can create the graphs for your results).
It would be nice to get a bit more coverage and see whether the guideline generalizes to as many relevant modern systems as possible.
Regarding different experiments, I think it would be interesting to see:
- Does this stay the same when processing from multiple threads?
- What's the impact of striding?
- What's the impact of pulling from multiple blocks at the same time (think vertex attributes stored in different buffers)?
- What about combined read-and-write workloads?
Also, right now the measurement is relatively noisy. Is that inherent or can we make this more stable (without just running everything 100x as often)? Right now the whole benchmark takes 1-3 minutes.