Memory-Level Parallelism: Apple M2 vs. Apple M4

Original link: https://lemire.me/blog/2025/07/09/memory-level-parallelism-apple-m2-vs-apple-m4/

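This analysis compares the single-core random-access memory performance of the Apple M2 (2022) and Apple M4 (2024) processors, both ARM-based SoCs with unified memory. The benchmark uses a "pointer chasing" scheme that simulates complex, pointer-heavy data structures. It tests the processor's ability to handle several memory loads at once (memory-level parallelism). By splitting the pointer chase into several independent "channels", the test measures how well each processor exploits parallel memory requests. The number of channels is capped at 28 because measurements become too noisy beyond that point. Although the M4 uses faster LPDDR5X memory compared with the M2's LPDDR5, the performance difference observed in the benchmark is small, about 15% in favor of the M4. Both processors proved able to sustain 28 parallel memory-access channels. Effective bandwidth is estimated by assuming that each memory access loads one cache line (128 bytes).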

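This Hacker News thread discusses the lemire.me blog post comparing memory-level parallelism (MLP) between the Apple M2 and M4 chips. Commenters found the post interesting but terse, more like a well-formed "musing" that presents data for readers to interpret. One commenter felt that the benchmark (available on GitHub) is valuable and should be included in broader benchmark suites such as Phoronix. Another commenter highlighted their own parallel-reduction benchmark, which focuses on CPU/GPU throughput, and proposed extending MLP-style benchmarks to GPUs to better understand the cost of non-coalesced, pointer-chasing memory access patterns. One user asked about the limited range of the chart, and another replied by pointing to the author's explanation that noise is what limits the range tested.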

Original text

The Apple M2, introduced in 2022, and the Apple M4, launched in 2024, are both ARM-based system-on-chip (SoC) designs featuring a unified memory architecture. That is, they use the same memory for both graphics (GPU) and main computations (CPU). The M2 processor relies on LPDDR5 memory whereas the M4 relies on LPDDR5X, which should provide slightly more bandwidth.

The exact bandwidth you get from an Apple system depends on your configuration. But I am interested in single-core random access performance. To measure this performance, I construct a large array of indexes. These indexes form a random loop: starting from any element, if you read its value, treat it as an index, move to this index and so forth, you will visit each and every element in the large array. This type of benchmark is often described as ‘pointer chasing’ since it simulates what happens when your software is filled with pointers to data structures which themselves are made of pointers, and so forth.
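The random loop described above can be sketched as follows. This is a minimal Python illustration, not the code behind the post: the names `make_random_cycle` and `chase` are hypothetical, and Sattolo's algorithm is one assumed way to build an index array that forms a single cycle through every element.

```python
import random


def make_random_cycle(n, seed=0):
    """Return a list `nxt` such that following i -> nxt[i] visits all n
    slots exactly once before coming back to the starting slot."""
    rng = random.Random(seed)
    nxt = list(range(n))
    # Sattolo's algorithm: swapping position i with a strictly earlier
    # position yields a permutation made of one single cycle.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, start, steps):
    """Follow the chain for `steps` hops. Each load depends on the result
    of the previous one, so the hops cannot overlap."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx
```

Calling `chase(nxt, 0, len(nxt))` on such an array returns to slot 0 after touching every element. In Python the interpreter overhead hides the memory latency, so this shows the structure of the benchmark rather than a way to time it.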

When loading any value from memory, there is a latency of many cycles. Thankfully, modern processors can sustain many such loads at the same time. How many depends on the processor, but modern processors can sustain tens of memory requests at any given time. This phenomenon is part of what we call memory-level parallelism: the ability of the memory subsystem to sustain many tasks at once.
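One way to reason about this is Little's law: the rate of completed loads is roughly the number of requests in flight divided by the latency of one request. The sketch below is a simplified model under that assumption. The function names are hypothetical, and `latency_ns` and `max_in_flight` are placeholders to be filled in by the reader, not values measured in the post.

```python
def loads_per_second(in_flight, latency_ns):
    """Little's law: throughput = requests in flight / latency per request."""
    if in_flight <= 0 or latency_ns <= 0:
        raise ValueError("in_flight and latency_ns must be positive")
    return in_flight / (latency_ns * 1e-9)


def ideal_speedup(channels, max_in_flight):
    """Speedup over a single dependent chain if independent loads overlap
    perfectly, up to an assumed hardware limit on outstanding requests."""
    if channels <= 0 or max_in_flight <= 0:
        raise ValueError("channels and max_in_flight must be positive")
    return min(channels, max_in_flight)
```

Under this model, throughput grows linearly with the number of independent loads until the assumed limit is reached, then flattens.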

Thus we can split the pointer-chasing benchmark into channels. Instead of starting at just one place, you can start at two locations at once, one at the ‘beginning’ and the other at the midpoint. And so forth. I refer to each such division as a ‘channel’. So it is one channel, two channels and so forth. Obviously, the more channels you have, the faster you can go, at least until the processor cannot sustain any more simultaneous requests. From how fast you can go, you can estimate the effective bandwidth by assuming that each hit in the array is equivalent to loading a cache line (128 bytes).
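The channel idea and the bandwidth estimate can be sketched like this. It is an illustration under assumptions, not the post's benchmark: the names `lane_starts`, `chase_channels` and `effective_bandwidth` are hypothetical, `nxt` is assumed to be an index array forming one cycle, and the starting points are assumed to be spaced evenly along that cycle.

```python
def lane_starts(nxt, channels):
    """Walk the cycle once from slot 0 and record `channels` starting
    points spaced `steps` hops apart, so the lanes cover disjoint segments."""
    steps = len(nxt) // channels
    if channels <= 0 or steps == 0:
        raise ValueError("need 1 <= channels <= len(nxt)")
    starts = []
    idx = 0
    for hop in range(channels * steps):
        if hop % steps == 0:
            starts.append(idx)
        idx = nxt[idx]
    return starts, steps


def chase_channels(nxt, starts, steps):
    """Advance every lane by one hop per round. Loads in different lanes
    do not depend on each other, so hardware may overlap them."""
    cursors = list(starts)
    for _ in range(steps):
        for k in range(len(cursors)):
            cursors[k] = nxt[cursors[k]]
    return cursors


def effective_bandwidth(total_loads, seconds, line_bytes=128):
    """Bytes per second, counting each load as one full cache line."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_loads * line_bytes / seconds
```

To use it, time a call to `chase_channels` (for example with `time.perf_counter`) and pass `len(starts) * steps` as `total_loads`. In a compiled language, the inner loop over lanes is what allows an out-of-order core to issue the loads together; in Python, interpreter overhead dominates, so the numbers would not reflect the memory subsystem.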

I run my benchmarks on two processors (Apple M2 and Apple M4). I have to limit the number of channels since beyond a certain point, there is too much noise. A maximum of 28 channels works well.

Maybe unsurprisingly, I find that the difference between the M4 and the M2 is not enormous (about 15%). Both processors can visibly sustain 28 channels.
