For basically any high-performance computation, memory layout and access pattern are critical. Common wisdom is that linear, contiguous memory performs best and should almost always be preferred. However, it should be intuitively clear that this has diminishing returns: processing a single 32 GB block vs processing two 16 GB blocks will not meaningfully differ in performance. Working with smaller blocks enables some interesting data structures, so I've set out to experimentally determine what block size is needed to effectively capture the full performance.
Findings
Setup and detailed analysis below, but my personal takeaway is:
- 1 MB blocks are enough for basically any workload of this kind
- 128 kB blocks suffice once you have at least ~1 cycle per processed byte
- 4 kB blocks are already enough once you're above roughly 10 cycles per processed byte
(for raw data processing, not necessarily if there are other per-block costs)
This is the full results chart for my Ryzen 9 7950X3D, effectively showing the block sizes needed for peak performance across different workloads. The rest of this post will go over the setup and discuss a few isolated graphs.
Code and results are available here: github.com/solidean/bench-linear-access
Setup
The memory hierarchy of modern CPUs is famously complex, so I've tried to create an experimental setup where we can isolate and control the effects well enough to make our results generalize.
Our main question is: when we have to process a data set in blocks of linear memory, how large do the individual blocks have to be so that any penalty in jumping between blocks is amortized?
For that, our "processing kernel" takes a span<span<float const> const> (or float**) and returns an opaque uint64_t result (to make sure the optimizer cannot elide our computation).
This is effectively the non-owning version of "vector of blocks".
For example, the scalar_stats kernel is
// Scalar stats kernel: computes running stats over all blocks.
// Returns a mini hash (no quality requirements) to prevent elision.
uint64_t kernel_scalar_stats(std::span<std::span<float const> const> data)
{
struct stats
{
float m0 = 0, m1 = 0, m2 = 0;
float min = +std::numeric_limits<float>::max();
float max = -std::numeric_limits<float>::max();
};
stats s;
for (auto block : data)
for (auto d : block)
{
s.m0 += 1;
s.m1 += d;
s.m2 += d * d;
if (d < s.min)
s.min = d;
if (d > s.max)
s.max = d;
}
auto b = [](float f) { return std::bit_cast<uint32_t>(f); };
return b(s.m0) ^ b(s.m1) ^ b(s.m2) ^ b(s.min) ^ b(s.max);
}
This represents a lightweight but scalar computation, reaching ~7 GB/s on my system (for large blocks). So roughly 3 cycles per float, or 0.75 cycles per byte.
The return value is simply some value that depends on the entire computation.
We do a similar aggregation over all experiments and write the resulting "hash" into the .csv.
That constrains the optimizer enough without resorting to volatile.
A single run is a single call of the kernel function on a set of blocks. Each block is contiguous and has a fixed "block size". The total size of all blocks that we execute on is the "working set". We can partition the same working set using different block sizes, which is the degree of freedom we have in practice.
Benchmarking on a modern system is notoriously noisy, so we need to repeat the same run multiple times. We chose to report medians to guard against outliers on either end. However, if the working set fits into a cache, we would measure warm-ish timings while our real use case might work on cold data.
To counteract this, we place the blocks in random order and random positions inside a sufficiently large "backing memory" (4 GB here).
Or, to put it graphically:
We still clobber the cache before each run:
// folds a 256 MB block of memory into the seed and returns a dependent result
uint64_t clobber_cache(uint64_t seed)
{
static std::vector<uint64_t> clobber_data;
if (clobber_data.empty())
{
clobber_data.resize((256uLL << 20) / sizeof(uint64_t));
std::default_random_engine rng;
for (auto& d : clobber_data)
d = rng();
}
// slightly dependency heavy but very hard to optimize away
for (auto d : clobber_data)
seed = seed * d + 7;
return seed;
}
(Not the highest-perf way to clobber, but the dependency chain is obvious enough that compilers have no real choice but to read the whole 256 MB between consuming the input seed and producing the result value.)
Clobbering is not part of the timing. Its purpose is simply to ensure no useful data remains in the cache hierarchy before each run.
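Put together, a single measurement in the randomized scenario could be structured like this (hypothetical harness sketch, not the repo's exact code):

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical sketch of the "randomized" scenario: before EVERY timed
// run we re-randomize the layout and clobber the cache (both inside
// prepare(), excluded from timing), then report the median over K runs.
double median_run_seconds(int K, std::function<void()> const& prepare, std::function<void()> const& run)
{
    std::vector<double> times;
    for (int i = 0; i < K; ++i)
    {
        prepare(); // randomize layout + clobber cache; not timed
        auto const t0 = std::chrono::steady_clock::now();
        run(); // the kernel call being measured
        auto const t1 = std::chrono::steady_clock::now();
        times.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(times.begin(), times.end());
    return times[times.size() / 2]; // median (K odd, e.g. K = 11)
}
```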
The scalar_stats kernel across block sizes
The first experiment runs the above scalar_stats kernel across varying block sizes (32 byte to 2 MB) and working sets (1 MB to 64 MB).
We plot throughput in MB/s by block size; each working set gets its own line.
While there is some noise in the measurements, this graph shows a few things:
- Our layout randomization and clobbering are effective. The cache hierarchy has basically no effect here, and the results are effectively independent of the working set size.
- Beyond 128 kB there is little benefit to increasing the block size.
- We converge to slightly beyond 7 GB/s here, which is ~3 cycles per float (or ~0.75 cycles per byte).
Each individual run starts with a randomized layout and a cold cache, so the curve shape cleanly captures the contiguity vs. jumping around in memory tradeoff.
This is also a kind of worst-case behavior, so any less antagonistic scenario (more predictable layout, partially prewarmed cache, etc.) will reach peak performance earlier.
The repeated variant of the previous experiment
We get one of these scenarios by relaxing the "coldness" a bit.
In the original setup, we execute each run K times (e.g. K = 11) and report the median, clobbering and randomizing the block layout before each individual run.
The "repeated" (instead of "randomized") scenario only clobbers and chooses a random layout at the start of each set of K runs (with the same block and working set size).
This models a scenario where we hit the data repeatedly and any data or TLB cache that fits can be reused.
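In harness terms, the preparation simply moves outside the K-loop (a hypothetical sketch, not the repo's exact code):

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical sketch of the "repeated" scenario: layout choice and cache
// clobbering happen ONCE per group of K runs, so later runs can reuse
// whatever data and TLB entries still fit in the caches.
double median_repeated_seconds(int K, std::function<void()> const& prepare, std::function<void()> const& run)
{
    prepare(); // randomize layout + clobber cache once, before the whole group
    std::vector<double> times;
    for (int i = 0; i < K; ++i)
    {
        auto const t0 = std::chrono::steady_clock::now();
        run();
        auto const t1 = std::chrono::steady_clock::now();
        times.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(times.begin(), times.end());
    return times[times.size() / 2]; // median now reflects a pre-warmed state
}
```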
The median will thus report a pre-warmed state:
Unsurprisingly, smaller working sets can be (partially) kept in the cache hierarchy, which leads to earlier peaking. Thus the graph shapes now depend on the working set size. Smaller sizes fit into faster caches and reach peak performance earlier. The conservative 128 kB works for all sizes but you might get away with 4..16 kB blocks if your total data size is small (and nothing else competes for the caches).
Comparing different workloads
Up until now we only have hard values for the scalar_stats kernel, which is neither the fastest nor the slowest kernel.
This raises the question of what the curve looks like for the most extreme SIMD kernel.
And what would be a guideline for heavier processing?
~3 cycles per float is extremely fast compared to, for example, much of our geometry processing.
To represent the high-end SIMD processing, we have simd_sum, an AVX2 (or NEON) sum over all floats:
uint64_t kernel_simd_sum(std::span<std::span<float const> const> data)
{
__m256 acc = _mm256_setzero_ps();
for (auto block : data)
{
auto const* ptr = block.data();
auto const count = block.size();
for (size_t i = 0; i < count; i += 8)
acc = _mm256_add_ps(acc, _mm256_load_ps(ptr + i));
}
// reduce: bitcast to uint32 array and XOR
auto words = std::bit_cast<std::array<uint32_t, 8>>(acc);
uint32_t h = 0;
for (auto w : words)
h ^= w;
return uint64_t(h);
}
The experimental setup ensures all blocks are a multiple of 8 floats (aka 32 B) in size and aligned to 32 B.
The NEON implementation does two vaddq_f32 per iteration.
This reaches 50+ GB/s on my Ryzen (55 GB/s on a Macbook Air M4), which is very good for a single-threaded program.
That's roughly 0.1 cycles/byte.
On the other side of the spectrum we have heavy_sin, a repeated v = sin(v + data[i]) loop:
uint64_t kernel_heavy_sin(std::span<std::span<float const> const> data)
{
float v = 0.f;
for (auto block : data)
for (auto d : block)
v = std::sin(v + d);
return uint64_t(std::bit_cast<uint32_t>(v));
}
This reaches ~450 MB/s on my system, which is ~40 cycles per float or ~10 cycles per byte. (Amazing that we reached close-to-correctly-rounded sin/cos on the order of 1/5th of the latency of a single uncached main-memory load)
Here is what all three kernels look side-by-side:
Qualitatively, the curves are similar but with different thresholds.
simd_sum needs at least 1 MB for peak performance, scalar_stats only 128 kB, and heavy_sin is already great at 4 kB blocks.
It's a bit hard to read due to the logarithmic y-axis but we'll see a better plot next section.
Even though I called heavy_sin a "heavy" kernel, depending on your domain, 40 cycles per float can be quite cheap.
That simply means those workloads need to worry even less about large block sizes.
Putting it all together
We'll now combine all experiments (all 3 kernels, randomized and repeated, different working sets) in a single graph. To make it easier to compare visually, we group the experiments by (kernel, randomized/repeated, working set) and normalize throughput (1.0 means largest achieved throughput in that group). Randomized is solid line, repeated is dashed. Kernels are color coded and the working set is coded light-to-dark.
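The normalization step itself is simple (sketch with a hypothetical helper):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch of the per-group normalization: divide every
// throughput in a (kernel, scenario, working set) group by that group's
// maximum, so 1.0 marks the best-performing block size of the group.
std::vector<double> normalize_group(std::vector<double> const& throughput_mbs)
{
    double const peak = *std::max_element(throughput_mbs.begin(), throughput_mbs.end());
    std::vector<double> normalized;
    normalized.reserve(throughput_mbs.size());
    for (auto t : throughput_mbs)
        normalized.push_back(t / peak);
    return normalized;
}
```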
These are the results on my Ryzen 9 7950X3D (this is the plot from the beginning again):
On the Macbook Air M4 of a colleague:
The Macbook has an overall similar shape (with more jitter because we couldn't control power modes).
simd_sum required 1 MB blocks for both but scalar_stats peaked earlier on the mac.
Interestingly, while heavy_sin still shows a visible drop below 1 kB or so, it already reached 90% of its peak performance even at the lowest measured block size of 32 B.
Conclusion
I originally wanted to investigate ideal block sizes for some internals of our mesh booleans. While the user-provided inputs are contiguous, we have to omit, emit, invertedly-emit, and emit-cut-up geometry a lot. All these different sources are unpredictable and are later merged and fed to post-processing (e.g. for restoring topology). This makes "streams of chunks" the only sensible data structure.
In the end, I didn't really get a universal answer (that would have been quite surprising considering how "hedged" everything is in performance engineering). But, the takeaways from the start are quite useful already.
The hyper-efficient SIMD sum is a good upper bound. That means more than 1 MB block sizes are probably overkill in most applications (of per-block linear computation). Most of the processing in our case takes more than 1 cycle per byte, so the 128 kB size is already plenty for me.
To put this all into perspective a bit:
A "workload" for these experiments means going linearly over the data in each block, not skipping any bytes.
You'll get different characteristics if you're striding over each block or if you need to pull in a lot of other data during processing.
We also only measured processing overhead and each block was provided as a span<float const>.
If you need to allocate blocks on demand or have other fixed per-block costs, this will shift the curves.
Finally, there are benefits to having the whole data set in a single contiguous block: every API can then accept a span<T const>, and subsetting as well as random access are trivial.
But once you're forced into chunks/blocks, the required block size for peak performance is quite a bit smaller than I expected.
Contributing
That being said, I haven't tested this on many systems yet.
The code (and my raw results) are hosted at github.com/solidean/bench-linear-access.
Feel free to run the benchmark on your system (uv run create-charts.py can create the graphs for your results).
It would be nice to get a bit more coverage and see whether the guideline generalizes to as many relevant modern systems as possible.
Regarding different experiments, I think it would be interesting to see:
- Does this stay the same when processing from multiple threads?
- What's the impact of striding?
- What's the impact of pulling from multiple blocks at the same time (think vertex attributes stored in different buffers)?
- What about combined read-and-write workloads?
Also, right now the measurement is relatively noisy. Is that inherent or can we make this more stable (without just running everything 100x as often)? Right now the whole benchmark takes 1-3 minutes.