对 Arm CPU 上 TSO 内存模型的支持 (2024)

对 Arm CPU 上 TSO 内存模型的支持 (2024)
Support for the TSO memory model on Arm CPUs (2024)

## Arm CPU、x86 模拟和内存模型 LWN.net 报道了 Linux 内核开发社区内关于在 Arm CPU 上模拟 x86 内存模型（总存储顺序 - TSO）的争论。CPU 处理内存操作时具有不同程度的优化自由度，而 x86 的 TSO 简化了并发编程。Arm 的模型更灵活，但需要更谨慎的编程。一些 Arm CPU（NVIDIA、富士通、Apple）提供硬件级别的 TSO 模拟，这对于运行 x86 模拟器，特别是游戏，是有益的。一个补丁系列旨在通过新的 `prctl()` 操作将此功能暴露给用户空间，允许程序请求 TSO。然而，核心内核维护者，如 Will Deacon，强烈反对，担心这会分割用户空间代码。他们认为开发者会启用 TSO 作为一种变通方法，而没有完全理解其含义，从而导致在非 TSO Arm 系统上出现错误。支持者，如 Hector Martin，认为分割不太可能发生，并指出现有的硬件差异（如页面大小）。他们还指出该功能已经在 Asahi Linux 等下游发行版中被广泛使用。讨论中的替代方案包括将该功能保留在内核之外、无条件地在 Apple CPU 上启用 TSO（但会带来性能成本），或将 TSO 限制在虚拟机中。这场争论凸显了最大化硬件利用率与维护内核稳定性和可移植性之间的紧张关系。

## Arm CPU 与 TSO 内存模型 - Hacker News 摘要 Hacker News 上正在讨论 Arm CPU 最近对总存储顺序 (TSO) 内存模型的支持。核心争论在于，考虑到潜在的碎片化以及对苹果公司未文档化的实现的依赖，添加 TSO 是否值得。许多评论员认为最初的碎片化问题被夸大了，因为主要用例现在是支持 x86 模拟（如苹果芯片上的 Rosetta2）。然而，一个重要的担忧是苹果公司缺乏对其 TSO 实现的正式文档，这引发了对长期稳定性和兼容性的质疑。如果没有详细的规范，依赖此功能可能会导致不可预测的行为并破坏用户空间应用程序。讨论强调了更广泛的问题，即首字母缩略词 (TLAs) 以及计算中对清晰、标准化定义的需求。“TSO”并非通用标准，不同的 CPU 供应商可能会以不同的方式实现它。最终，共识倾向于谨慎，质疑建立在未文档化的硬件功能之上的做法是否明智。

原文

Did you know...?
LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

By Jonathan Corbet
April 26, 2024

At the CPU level, a memory model describes, among other things, the amount of freedom the processor has to reorder memory operations. If low-level code does not take the memory model into account, unpleasant surprises are likely to follow. Naturally, different CPUs offer different memory models, complicating the portability of certain types of concurrent software. To make life easier, some Arm CPUs offer the ability to emulate the x86 memory model, but efforts to make that feature available in the kernel are running into opposition.

CPU designers will do everything they can to improve performance. With regard to memory accesses, "everything" can include caching operations, executing them out of order, combining multiple operations into one, and more. These optimizations do not affect a single CPU running in isolation, but they can cause memory operations to be visible to other CPUs in a surprising order. Unwary software running elsewhere in the system may see memory operations in an order different from what might be expected from reading the code; this article describes one simple scenario for how things can go wrong, and this series on lockless algorithms shows in detail some of the techniques that can be used to avoid problems related to memory ordering.

The x86 architecture implements a model that is known as "total store ordering" (TSO), which guarantees that writes (stores) will be seen by all CPUs in the order they were executed. Reads, too, will not be reordered, but the ordering of reads and writes relative to each other is not guaranteed. Code written for a TSO architecture can, in many cases, omit the use of expensive barrier instructions that would otherwise be needed to force a specific ordering of operations.

The Arm memory model, instead, is weaker, giving the CPU more freedom to move operations around. The benefits from this design are a simpler implementation and the possibility for better performance in situations where ordering guarantees are not needed (which is most of the time). The downsides are that concurrent code can require a bit more care to write correctly, and code written for a stricter memory model (such as TSO) will have (possibly subtle) bugs when run on an Arm CPU.

The weaker Arm model is rarely a problem, but it seems there is one situation where problems arise: emulating an x86 processor. If an x86 emulator does not also emulate the TSO memory model, then concurrent code will likely fail, but emulating TSO, which requires inserting memory barriers, creates a significant performance penalty. It seems that there is one type of concurrent x86 code — games — that some users of Arm CPUs would like to be able to run; those users, strangely, dislike the prospect of facing the orc hordes in the absence of either performance or correctness.

TSO on Arm

As it happens, some Arm CPU vendors understand this problem and have, as Hector Martin described in this patch series, implemented TSO memory models in their processors. Some NVIDIA and Fujitsu CPUs run with TSO at all times; Apple's CPUs provide it as an optional feature that can be enabled at run time. Martin's purpose is to make this capability visible to, and controllable by, user space.

The series starts by adding a couple of new prctl() operations. PR_GET_MEM_MODEL will return the current memory model implemented by the CPU; that value can be either PR_SET_MEM_MODEL_DEFAULT or PR_SET_MEM_MODEL_TSO. The PR_SET_MEM_MODEL operation will attempt to enable the requested memory model, with the return code indicating whether it was successful; it is allowed to select a stricter memory model than requested. For the always-TSO CPUs, requesting TSO will obviously succeed. For Apple CPUs, requesting TSO will result in the proper CPU bits being set. Asking for TSO on a CPU that does not support it will, as expected, fail.

Martin notes that the code is not new: "This series has been brewing in the downstream Asahi Linux tree for a while now, and ships to thousands of users". Interestingly, Zayd Qumsieh had posted a similar patch set one day earlier, but that version only implemented the feature for Linux running in virtual machines on Apple CPUs.

Unfortunately for people looking forward to faster games on Apple CPUs, neither patch set is popular with the maintainers of the Arm architecture code in the kernel. Will Deacon expressed his "strong objection", saying that this feature would result in a fragmentation of user-space code. Developers, he said, would just enable the TSO bit if it appears to make problems go away, resulting in code that will fail, possibly in subtle ways, on other Arm CPUs. Catalin Marinas, too, indicated that he would block patches making this sort of implementation-defined feature available.

Martin responded that fragmentation is unlikely to be a problem, and pointed to the different page sizes supported by some processors (including Apple's) as an example of how these incompatibilities can be dealt with. He said that, so far, nobody has tried to use the TSO feature for anything that is not an emulator, so abuse in other software seems unlikely. Keeping it out, he said, will not improve the situation:

There's a pragmatic argument here: since we need this, and it absolutely will continue to ship downstream if rejected, it doesn't make much difference for fragmentation risk does it? The vast majority of Linux-on-Mac users are likely to continue running downstream kernels for the foreseeable future anyway to get newer features and hardware support faster than they can be upstreamed. So not allowing this upstream doesn't really change the landscape vis-a-vis being able to abuse this or not, it just makes our life harder by forcing us to carry more patches forever.

Deacon, though, insisted that, once a feature like this is merged, it will find uses in other software "and we'll be stuck supporting it".

If this patch is not acceptable, it is time to think about alternatives. One is to, as Martin described, just keep it out-of-tree and ship it on the distributions that actually run on that hardware. A long history of addition by distributions can, at times, eventually ease a patch's way past reluctant maintainers. Another might be to just enable TSO unconditionally on Apple CPUs, but that comes with an overall performance penalty — about 9%, according to Martin. Another possibility was mentioned by Marc Zyngier, who suggested that virtual machines could be started with TSO enabled, making it available to applications running within while keeping the kernel out of the picture entirely.

This seems like the kind of discussion that does not go away quickly. One of the many ways in which Linux has stood out over the years is in its ability to allow users to make full use of their hardware; refusing to support a useful hardware feature runs counter to that history. The concerns about potential abuse of this feature are also based in long experience, though. This is a case where the development community needs to repeat another part of its long history by finding a solution that makes the needed functionality available in a supportable way.

对 Arm CPU 上 TSO 内存模型的支持 (2024) Support for the TSO memory model on Arm CPUs (2024)

TSO on Arm

对 Arm CPU 上 TSO 内存模型的支持 (2024)
Support for the TSO memory model on Arm CPUs (2024)