在 dsymutil 中采用并行 DWARF 链接器

在 dsymutil 中采用并行 DWARF 链接器
Adopting the Parallel DWARF linker in dsymutil

原始链接: https://jonasdevlieghere.com/post/dsymutil-parallel-linker/

Apple 的 `dsymutil` 工具对于创建独立的调试包至关重要，它负责执行复杂的 DWARF 链接和类型去重。从历史上看，由于需要输出二进制一致的结果以便于简单验证，其性能一直受限于单线程设计。为了提高性能，Apple 正在转向并行 DWARF 链接器。这一转变带来了两个主要挑战：**非确定性**和**验证（Qualification）**。并行处理会导致类型去重过程中的竞态条件，从而产生不可复现的构建结果。Apple 通过引入基于链接顺序的确定性优先级系统解决了这一问题。验证则更为复杂；由于新的链接器无法产生二进制一致的输出，标准的 diff 工具无法使用。为此，Apple 开发了一种语义 DWARF 比对工具，用以比较输出结果的底层图结构，而非原始字节。向并行链接器的过渡正在进行中，并辅以三层验证策略：执行现有的测试套件、运行 LLDB 调试测试，以及通过语义比对进行人工检查。尽管并行链接器带来了显著的性能提升——将 Clang 的编译时间从三分钟缩短至 40 秒——但核心重点依然是确保所生成调试信息的准确性和可靠性。

抱歉。

原文

On Apple platforms, the development experience was designed around making the compile-link-debug cycle as fast as possible. For debugging, that means that rather than processing large amounts of DWARF to link it into the final binary, the linker leaves the debug info in the object files and records a debug map that tells the debugger where to find it. When you’re debugging locally, that’s all you need. But if you want to archive the debug info for crash reporting or remote debugging, you need a way to produce a self-contained bundle. That’s where dsymutil comes in.

dsymutil is more than a DWARF concatenator. It’s an optimizing linker that leverages the One Definition Rule (ODR) to deduplicate types across compilation units. In a C++ project, every translation unit that includes a header gets its own copy of the DWARF for every type defined in that header.^{dsymutil identifies equivalent types and keeps only one canonical copy. For large C++ projects, this makes the difference between fitting within Mach-O’s 4GB limit or not. In order to do these optimizations, dsymutil needs to parse and semantically analyze the DWARF, which is where most of the time goes.}

The classic DWARF linking algorithm was single-threaded by design. Debug info for a large project can easily reach hundreds of gigabytes. To avoid loading all of it into memory at once, dsymutil processes one compile unit at a time and streams the output. That constraint makes parallelizing the core linking loop nontrivial. Over the years we’ve made incremental improvements, like processing architectures in parallel and running the analysis and cloning phases in lockstep on separate threads, but the fundamental bottleneck remained: the ODR uniquing happens on one thread. A parallel DWARF linker that can unique types across threads has existed in LLVM for a while. Building that was an incredible effort, but unfortunately it wasn’t quite production-ready due to some major limitations.

The Qualification Problem

The biggest challenge with dsymutil has always been qualification. When we upstreamed dsymutil to LLVM, we qualified it by generating bug-for-bug identical DWARF. We did the same thing when we rewrote the cloning phase to use the lockstep algorithm. Having binary-identical output meant we could run diff on two dSYMs to convince ourselves a change was truly NFC.

The parallel linker can’t produce binary-identical output. It processes compile units concurrently, so the order in which types are encountered and deduplicated is different. The output is semantically the same (or should be), but the DWARF structure, and hence the bytes, differ. That means the binary compatibility approach that qualified every previous dsymutil change doesn’t apply here.

Without a way to compare the output semantically, we had no way to confirm the correctness of the DWARF generated by the parallel linker. It’s relatively easy to spot-check small things in tests, but that doesn’t scale to even medium-sized projects. The really tricky issues only surface at debug-time when the debugger starts misbehaving. In order to even consider the parallel linker in dsymutil, we needed a tool that could tell us, concretely, how the parallel linker’s output differs from the classic linker’s.

Semantic DWARF Diffing

Although DWARF looks like a tree of tags and attributes, it’s really a directed acyclic graph. Attributes can reference DIEs in other parts of the tree. A variable references its type, a type references its members’ types, a subprogram references its parameter types, and so on. Comparing two DWARF outputs means matching nodes across two graphs and verifying that their attributes and reachable subgraphs are equivalent.

You can’t do this by diffing dwarfdump text. The offsets are different, the ordering of DIEs may differ, and the cross-references point to different positions. That’s without even considering that the dwarfdump output for any real-world project is too big to handle for most tools. What you need is to anchor the comparison on stable identifiers like linkage names, declaration coordinates, and type signatures, then walk the graph from there, comparing attributes and children structurally.

We prototyped a semantic diffing tool and ran it on clang, comparing the classic and parallel linker output. Out of roughly 5 million DIEs, it identified about 50,000 differences. We haven’t verified all of those results, and the tool itself is far from production-ready, but it was sufficient to give us a concrete picture of where the two linkers diverge.

Determinism

The single biggest blocker towards adopting the parallel linker was its non-determinism. Reproducible builds are non-negotiable for any serious build tool. Without them, you lose the ability to cache, bisect, and verify your artifacts.

The non-determinism came from how the parallel linker selects canonical DIEs during ODR uniquing. When multiple compile units define the same type, threads race to claim the canonical copy. Whichever thread gets there first wins. Since thread scheduling varies between runs, different runs pick different canonical DIEs.

The fix assigns each compile unit a priority based on its position in the link order. When a thread wants to register a canonical DIE, it only overwrites the current one if its priority is strictly higher, meaning it appears earlier in the input. This guarantees the same canonical DIE regardless of scheduling, while preserving the parallelism.

Switching the Default

The differ is one piece of the qualification strategy, but not the whole picture. It tells us whether the two linkers agree, not whether either one is correct. It’s also the least automatable part of the process. In order to switch the default, we need a comprehensive qualification strategy to ensure correctness.

The first step is running dsymutil’s existing test suite with --linker=parallel. These tests exercise specific DWARF constructs and edge cases, and any test that passes with the classic linker should also pass with the parallel one. Getting all of these to pass, or understanding why they don’t, is a prerequisite for switching the default.

The second step is running the LLDB test suite against dSYMs produced by the parallel linker. LLDB’s tests cover a wide range of debugging scenarios. By default, it runs multiple variants of every test: once using the DWARF in the object files and once with a dSYM. This will allow us to both identify regressions between the classic and parallel linker and between DWARF in the object files and the dSYM generated by the parallel linker.

The final step is using the diffing tool to manually inspect the differences between the dSYMs generated by the classic and parallel linker for increasingly large projects.

This work is tracked and ongoing.

Performance

To put all of this in perspective: running dsymutil on clang takes roughly 3 minutes with the classic linker compared to about 40 seconds with the parallel linker. The speedup was always the expected outcome. The parallel linker was designed to be faster. The work described here was about building the confidence to actually use it.