Handling Long Branches

原始链接: https://maskray.me/blog/2026-01-25-handling-long-branches

## 软件工具链中的长跳转问题 大多数计算机架构限制了单个分支指令可达的距离。当函数调用或跳转超过此范围(例如,在AArch64上调用距离超过128MB的函数时),工具链必须采用策略以确保正确执行。这涉及编译器、汇编器和链接器的协同工作。 **编译器**处理函数内部过远的跳转,通常通过反转条件并添加无条件跳转来实现。**汇编器**根据已知节内的距离优化指令编码。然而,**链接器**面临着最复杂的任务,它需要解析跨节和目标文件中的跳转,而这些距离在最终布局确定之前是未知的。 当跳转仍然超出范围时,链接器会生成**范围扩展跳转(thunk)**——作为中间媒介的小段代码,以到达目标位置。像RISC-V这样的架构提供**链接器放松(linker relaxation)**,在可能的情况下*缩小*指令序列,优化代码大小。 不同的架构具有不同的分支范围限制(AArch64:±128MiB,RISC-V:`jal`的±1MiB),这会影响thunk生成的频率。像lld/ELF和mold这样的链接器采用不同的算法来创建thunk,以平衡复杂性和效率。诊断“超出范围的重定位”错误涉及检查链接器输出、映射文件,并考虑函数节划分和链接时优化等因素。

一个Hacker News讨论围绕着maskray.me的一篇博客文章,详细介绍了编译器中“处理长跳转”的技术。该文章(现位于[https://maskray.me/blog/2026-01-25-long-branches-in-compiler](https://maskray.me/blog/2026-01-25-long-branches-in-compiler))讨论了分支指令范围的限制,这在较旧的架构(如Z80)中尤其相关。 用户分享了克服这些限制的经验,使用了诸如反转条件和使用“thunk”(小型跳转指令以重定向执行)等策略。有趣的是,一些链接器会自动为超出指令范围的跳转插入这些thunk,而不是在编译期间需要更大的代码模型。讨论还强调了不同链接器和架构中thunk创建算法的变化。该文章和后续评论提供了关于低级代码优化以及编译器/链接器行为的宝贵见解。
相关文章

原文

Branch instructions on most architectures use PC-relative addressing with a limited range. When the target is too far away, the branch becomes "out of range" and requires special handling.

Consider a large binary where main() at address 0x10000 calls foo() at address 0x8010000-over 128MiB away. On AArch64, the bl instruction can only reach ±128MiB, so this call cannot be encoded directly. Without proper handling, the linker would fail with an error like "relocation out of range." The toolchain must handle this transparently to produce correct executables.

This article explores how compilers, assemblers, and linkers work together to solve the long branch problem.

  • Compiler (IR to assembly): Handles branches within a function that exceed the range of conditional branch instructions
  • Assembler (assembly to relocatable file): Handles branches within a section where the distance is known at assembly time
  • Linker: Handles cross-section and cross-object branches discovered during final layout

Branch range limitations

Different architectures have different branch range limitations. Here's a quick comparison of unconditional branch/call ranges:

Architecture Unconditional Branch Conditional Branch Notes
AArch64 ±128MiB ±1MiB Range extension thunks
AArch32 (A32) ±32MiB ±32MiB Range extension and interworking veneers
AArch32 (T32) ±16MiB ±1MiB Thumb has shorter ranges
PowerPC64 ±32MiB ±32KiB Range extension and TOC/NOTOC interworking thunks
RISC-V ±1MiB (jal) ±4KiB Linker relaxation
x86-64 ±2GiB ±2GiB Code models or thunk extension

The following subsections provide detailed per-architecture information, including relocation types relevant for linker implementation.

AArch32

In A32 state:

  • Branch (b/b<cond>), conditional branch and link (bl<cond>) (R_ARM_JUMP24): ±32MiB
  • Unconditional branch and link (bl/blx, R_ARM_CALL): ±32MiB

Note: R_ARM_CALL is for unconditional bl/blx which can be relaxed to BLX inline; R_ARM_JUMP24 is for branches which require a veneer for interworking.

In T32 state:

  • Conditional branch (b<cond>, R_ARM_THM_JUMP8): ±256 bytes
  • Short unconditional branch (b, R_ARM_THM_JUMP11): ±2KiB
  • ARMv5T branch and link (bl/blx, R_ARM_THM_CALL): ±4MiB
  • ARMv6T2 wide conditional branch (b<cond>.w, R_ARM_THM_JUMP19): ±1MiB
  • ARMv6T2 wide branch (b.w, R_ARM_THM_JUMP24): ±16MiB
  • ARMv6T2 wide branch and link (bl/blx, R_ARM_THM_CALL): ±16MiB. R_ARM_THM_CALL can be relaxed to BLX.

AArch64

  • Test and compare branches (tbnz/tbz/cbnz/cbz): ±32KiB
  • Conditional branches (b.<cond>): ±1MiB
  • Unconditional branches (b/bl): ±128MiB

PowerPC

  • Conditional branch (bc/bcl, R_PPC64_REL14): ±32KiB
  • Unconditional branch (b/bl, R_PPC64_REL24/R_PPC64_REL24_NOTOC): ±32MiB

RISC-V

  • Compressed c.beqz: ±256 bytes
  • Compressed c.jal: ±2KiB
  • jalr (I-type immediate): ±2KiB
  • Conditional branches (beq/bne/blt/bge/bltu/bgeu, B-type immediate): ±4KiB
  • jal (J-type immediate, PseudoBR): ±1MiB
  • PseudoJump (using auipc + jalr): ±2GiB

Qualcomm uC Branch Immediate extension (Xqcibi):

  • qc.beqi/qc.bnei/qc.blti/qc.bgei/qc.bltui/qc.bgeui (32-bit, 5-bit compare immediate): ±4KiB
  • qc.e.beqi/qc.e.bnei/qc.e.blti/qc.e.bgei/qc.e.bltui/qc.e.bgeui (48-bit, 16-bit compare immediate): ±4KiB

Qualcomm uC Long Branch extension (Xqcilb):

  • qc.e.j/qc.e.jal (48-bit, R_RISCV_VENDOR(QUALCOMM)+R_RISCV_QC_E_CALL_PLT): ±2GiB

SPARC

  • Compare and branch (cxbe, R_SPARC_5): ±64 bytes
  • Conditional branches (bcc, R_SPARC_WDISP19): ±1MiB
  • call (R_SPARC_WDISP30): ±2GiB

Note: lld does not implement range extension thunks for SPARC.

x86-64

  • Short conditional jump (Jcc rel8): -128 to +127 bytes
  • Short unconditional jump (JMP rel8): -128 to +127 bytes
  • Near conditional jump (Jcc rel32): ±2GiB
  • Near unconditional jump (JMP rel32): ±2GiB

With a ±2GiB range for near jumps, x86-64 rarely encounters out-of-range branches in practice. A single text section would need to exceed 2GiB before thunks become necessary. For this reason, most linkers (including lld) do not implement range extension thunks for x86-64.

Compiler: branch relaxation

The compiler typically generates branches using a form with a large range. However, certain conditional branches may still go out of range within a function.

The compiler measures branch distances and relaxes out-of-range branches. In LLVM, this is handled by the BranchRelaxation pass, which runs just before AsmPrinter.

Different backends have their own implementations:

  • BranchRelaxation: AArch64, AMDGPU, AVR, RISC-V
  • HexagonBranchRelaxation: Hexagon
  • PPCBranchSelector: PowerPC
  • SystemZLongBranch: SystemZ
  • MipsBranchExpansion: MIPS
  • MSP430BSel: MSP430

For a conditional branch that is out of range, the pass typically inverts the condition and inserts an unconditional branch:

1
2
3
4
5
6
7
# Before relaxation (out of range)
beq .Lfar_target # ±4KiB range on RISC-V

# After relaxation
bne .Lskip # Inverted condition, short range
j .Lfar_target # Unconditional jump, ±1MiB range
.Lskip:

Assembler: instruction relaxation

The assembler converts assembly to machine code. When the target of a branch is within the same section and the distance is known at assembly time, the assembler can select the appropriate encoding. This is distinct from linker thunks, which handle cross-section or cross-object references where distances aren't known until link time.

Assembler instruction relaxation handles two cases (see Clang -O0 output: branch displacement and size increase for examples):

  • Span-dependent instructions: Select a larger encoding when the displacement exceeds the range of the smaller encoding. For x86, a short jump (jmp rel8) can be relaxed to a near jump (jmp rel32).
  • Conditional branch transform: Invert the condition and insert an unconditional branch. On RISC-V, a blt might be relaxed to bge plus an unconditional branch.

The assembler uses an iterative layout algorithm that alternates between fragment offset assignment and relaxation until all fragments become legalized. See Integrated assembler improvements in LLVM 19 for implementation details.

Linker: range extension thunks

When the linker resolves relocations, it may discover that a branch target is out of range. At this point, the instruction encoding is fixed, so the linker cannot simply change the instruction. Instead, it generates range extension thunks (also called veneers, branch stubs, or trampolines).

A thunk is a small piece of linker-generated code that can reach the actual target using a longer sequence of instructions. The original branch is redirected to the thunk, which then jumps to the real destination.

Range extension thunks are one type of linker-generated thunk. Other types include:

Short range vs long range thunks

A short range thunk (see lld/ELF's AArch64 implementation) contains just a single branch instruction. Since it uses a branch, its reach is also limited by the branch range—it can only extend coverage by one branch distance. For targets further away, multiple short range thunks can be chained, or a long range thunk with address computation must be used.

Long range thunks use indirection and can jump to (practically) arbitrary locations.

1
2
3
4
5
6
7
8
9
// Short range thunk: single branch, 4 bytes
__AArch64AbsLongThunk_dst:
b dst // ±128MiB range

// Long range thunk: address computation, 12 bytes
__AArch64ADRPThunk_dst:
adrp x16, dst // Load page address (±4GiB range)
add x16, x16, :lo12:dst // Add page offset
br x16 // Indirect branch

Thunk examples

AArch32 (PIC) (see Linker notes on AArch32):

1
2
3
4
5
__ARMV7PILongThunk_dst:
movw ip, :lower16:(dst - .) ; ip = intra-procedure-call scratch register
movt ip, :upper16:(dst - .)
add ip, ip, pc
bx ip

PowerPC64 ELFv2 (see Linker notes on Power ISA):

1
2
3
4
5
__long_branch_dst:
addis 12, 2, .branch_lt@ha # Load high bits from branch lookup table
ld 12, .branch_lt@l(12) # Load target address
mtctr 12 # Move to count register
bctr # Branch to count register

Thunk impact on debugging and profiling

Thunks are transparent at the source level but visible in low-level tools:

  • Stack traces: May show thunk symbols (e.g., __AArch64ADRPThunk_foo) between caller and callee
  • Profilers: Samples may attribute time to thunk code; some profilers aggregate thunk time with the target function
  • Disassembly: objdump or llvm-objdump will show thunk sections interspersed with regular code
  • Code size: Each thunk adds bytes; large binaries may have thousands of thunks

lld/ELF's thunk creation algorithm

lld/ELF uses a multi-pass algorithm in finalizeAddressDependentContent:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
assignAddresses();
for (pass = 0; pass < 30; ++pass) {
if (pass == 0)
createInitialThunkSections();
bool changed = false;
for (relocation : all_relocations) {
if (pass > 0 && normalizeExistingThunk(rel))
continue;
if (!needsThunk(rel)) continue;
Thunk *t = getOrCreateThunk(rel);
ts = findOrCreateThunkSection(rel, src);
ts->addThunk(t);
rel.sym = t->getThunkTargetSym();
changed = true;
}
mergeThunks();
if (!changed) break;
assignAddresses();
}

Key details:

  • Multi-pass: Iterates until convergence (max 30 passes). Adding thunks changes addresses, potentially putting previously-in-range calls out of range.
  • Pre-allocated ThunkSections: On pass 0, createInitialThunkSections places empty ThunkSections at regular intervals (thunkSectionSpacing). For AArch64: 128 MiB - 0x30000 ≈ 127.8 MiB.
  • Thunk reuse: getThunk returns existing thunk if one exists for the same target; normalizeExistingThunk checks if a previously-created thunk is still in range.
  • ThunkSection placement: getISDThunkSec finds a ThunkSection within branch range of the call site, or creates one adjacent to the calling InputSection.

lld/MachO's thunk creation algorithm

lld/MachO uses a single-pass algorithm in TextOutputSection::finalize:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
for (callIdx = 0; callIdx < inputs.size(); ++callIdx) {

while (finalIdx < endIdx && fits_in_range(inputs[finalIdx]))
finalizeOne(inputs[finalIdx++]);


for (Relocation &r : reverse(isec->relocs)) {
if (!isBranchReloc(r)) continue;
if (targetInRange(r)) continue;
if (existingThunkInRange(r)) { reuse it; continue; }

createThunk(r);
}
}

Key differences from lld/ELF:

  • Single pass: Addresses are assigned monotonically and never revisited
  • Slop reservation: Reserves slopScale * thunkSize bytes (default: 256 × 12 = 3072 bytes on ARM64) to leave room for future thunks
  • Thunk naming: <function>.thunk.<sequence> where sequence increments per target

Thunk starvation problem: If many consecutive branches need thunks, each thunk (12 bytes) consumes slop faster than call sites (4 bytes apart) advance. The test lld/test/MachO/arm64-thunk-starvation.s demonstrates this edge case. Mitigation is increasing --slop-scale, but pathological cases with hundreds of consecutive out-of-range callees can still fail.

mold's thunk creation algorithm

mold uses a two-pass approach: first pessimistically over-allocate thunks, then remove unnecessary ones.

Intuition: It's safe to allocate thunk space and later shrink it, but unsafe to add thunks after addresses are assigned (would create gaps breaking existing references).

Pass 1 (create_range_extension_thunks): Process sections in batches using a sliding window. The window tracks four positions:

1
2
3
4
5
6
7
8
9
Sections:   [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] ...
^ ^ ^ ^
A B C D
| |_______| |
| batch |
| |
earliest thunk
reachable placement
from C
  • [B, C) = current batch of sections to process (size ≤ branch_distance/5)
  • A = earliest section still reachable from C (for thunk expiration)
  • D = where to place the thunk (furthest point reachable from B)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

while (b < sections.size()) {

while (d < size && thunk_at_d_reachable_from_b)
assign_address(sections[d++]);


c = b + 1;
while (c < d && sections[c] < sections[b] + batch_size) c++;


while (a < b && sections[a] + branch_distance < sections[c]) a++;

for (; t < thunks.size() && thunks[t].offset < sections[a]; t++)
for (sym in thunks[t].symbols) sym->flags = 0;



auto &thunk = thunks.emplace_back(new Thunk(offset));
parallel_for(b, c, [&](i64 i) {
for (rel in sections[i].relocs) {
if (requires_thunk(rel)) {
Symbol &sym = rel.symbol;
if (!sym.flags.test_and_set()) {
lock_guard lock(mu);
thunk.symbols.push_back(&sym);
}
}
}
});
offset += thunk.size();
b = c;
}

Pass 2 (remove_redundant_thunks): After final addresses are known, remove thunk entries for symbols actually in range.

Key characteristics:

  • Pessimistic over-allocation: Assumes all out-of-section calls need thunks; safe to shrink later
  • Batch size: branch_distance/5 (25.6 MiB for AArch64, 3.2 MiB for AArch32)
  • Parallelism: Uses TBB for parallel relocation scanning within each batch
  • Single branch range: Uses one conservative branch_distance per architecture. For AArch32, uses ±16 MiB (Thumb limit) for all branches, whereas lld/ELF uses ±32 MiB for A32 branches.
  • Thunk size not accounted in D-advancement: The actual thunk group size is unknown when advancing D, so the end of a large thunk group may be unreachable from the beginning of the batch.
  • No convergence loop: Single forward pass for address assignment, no risk of non-convergence

Comparing thunk algorithms

Aspect lld/ELF lld/MachO mold
Passes Multi-pass (max 30) Single-pass Two-pass
Strategy Iterative refinement Greedy Greedy
Thunk placement Pre-allocated at intervals Inline with slop reservation Batch-based at intervals
Convergence Always (bounded iterations) Almost always Almost always
Range handling Per-relocation type Single conservative range Single conservative range
Parallelism Sequential Sequential Parallel (TBB)

Linker relaxation (RISC-V)

RISC-V takes a different approach: instead of only expanding branches, it can also shrink instruction sequences when the target is close enough.

Consider a function call using the call pseudo-instruction, which expands to auipc + jalr:

1
2
3
4
5
# Before linking (8 bytes)
call ext
# Expands to:
# auipc ra, %pcrel_hi(ext)
# jalr ra, ra, %pcrel_lo(ext)

If ext is within ±1MiB, the linker can relax this to:

1
2
# After relaxation (4 bytes)
jal ext

This is enabled by R_RISCV_RELAX relocations that accompany R_RISCV_CALL relocations. The R_RISCV_RELAX relocation signals to the linker that this instruction sequence is a candidate for shrinking.

Example object code before linking:

1
2
3
4
5
6
7
8
9
0000000000000006 <foo>:
6: 97 00 00 00 auipc ra, 0
R_RISCV_CALL ext
R_RISCV_RELAX *ABS*
a: e7 80 00 00 jalr ra
e: 97 00 00 00 auipc ra, 0
R_RISCV_CALL ext
R_RISCV_RELAX *ABS*
12: e7 80 00 00 jalr ra

After linking with relaxation enabled, the 8-byte auipc+jalr pairs become 4-byte jal instructions:

1
2
3
4
5
6
0000000000000244 <foo>:
244: 41 11 addi sp, sp, -16
246: 06 e4 sd ra, 8(sp)
248: ef 00 80 01 jal ext
24c: ef 00 40 01 jal ext
250: ef 00 00 01 jal ext

When the linker deletes instructions, it must also adjust:

  • Subsequent instruction offsets within the section
  • Symbol addresses
  • Other relocations that reference affected locations
  • Alignment directives (R_RISCV_ALIGN)

This makes RISC-V linker relaxation more complex than thunk insertion, but it provides code size benefits that other architectures cannot achieve at link time.

Diagnosing out-of-range errors

When you encounter a "relocation out of range" error, here are some diagnostic steps:

  1. Check the error message: lld reports the source location, relocation type, and the distance. For example:

    1
    ld.lld: error: a.o:(.text+0x1000): relocation R_AARCH64_CALL26 out of range: 150000000 is not in [-134217728, 134217727]

  2. Use --verbose or -Map: Generate a link map to see section layout and identify which sections are far apart.

  3. Consider -ffunction-sections: Splitting functions into separate sections gives the linker more flexibility in placement, potentially reducing distances.

  4. Check for large data in .text: Embedded data (jump tables, constant pools) can push functions apart. Some compilers have options to place these elsewhere.

  5. LTO considerations: Link-time optimization can dramatically change code layout. If thunk-related issues appear only with LTO, the optimizer may be creating larger functions or different inlining decisions.

Summary

Handling long branches requires coordination across the toolchain:

Stage Technique Example
Compiler Branch relaxation pass Invert condition + add unconditional jump
Assembler Instruction relaxation Short jump to near jump
Linker Range extension thunks Generate trampolines
Linker Linker relaxation Shrink auipc+jalr to jal (RISC-V)

The linker's thunk generation is particularly important for large programs where cross-compilation-unit calls may exceed branch ranges. Different linkers use different algorithms with various tradeoffs between complexity, optimality, and robustness.

RISC-V's linker relaxation is unique in that it can both expand and shrink code, optimizing for both correctness and code size.

联系我们 contact @ memedata.com