二进制文件

二进制文件
Binaries

原始链接: https://fzakaria.com/2025/12/28/huge-binaries

## 2GiB 二进制障碍在博士研究期间，作者在发表仅在极大型代码库（如谷歌）中出现的问题的研究时遇到了困难，原因是缺乏公开的可引用证据。一个关键问题是二进制文件大小的增长，由于静态链接以提高启动速度和简化部署，导致其达到 25GiB 以上。这导致了“2GiB 障碍”。x86_64 架构使用 32 位相对偏移量进行函数调用。这限制了跳转距离约为 2GiB。当函数距离更远时，链接器会失败，导致“重定位溢出”错误。存在解决方案，例如切换到“大型代码模型”，它使用绝对跳转而不是相对跳转。然而，这会由于更长的指令而增加二进制文件的大小，并给 CPU 寄存器带来压力。作者暗示将探索更多策略来克服这一障碍，并在大型项目中保持高效的代码大小，并承诺在未来的文章中提供更多细节。

黑客新闻新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交登录二进制文件 (fzakaria.com) 4 分，by todsacerdoti 58 分钟前 | 隐藏 | 过去 | 收藏 | 讨论指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请YC | 联系搜索：

原文

A problem I experienced when pursuing my PhD and submitting academic articles was that I had built solutions to problems that required dramatic scale to be effective and worthwhile. Responses to my publication submissions often claimed such problems did not exist; however, I had observed them during my time within industry, such as at Google, but I couldn’t cite it!

One problem that is only present at these mega-codebases is massive binaries. What’s the largest binary (ELF file) you’ve ever seen? I had observed binaries beyond 25GiB, including debug symbols. How is this possible? These companies prefer to statically build their services to speed up startup and simplify deployment. Statically including all code in some of the world’s largest codebases is a recipe for massive binaries.

Similar to the sound barrier, there is a point at which code size becomes problematic and we must re-think how we link and build code. For x86_64, that is the 2GiB “Relocation Barrier.”

Why 2GiB? 🤔

Well let’s take a look at how position independent code is put-together.

Let’s look at a simple example.

extern void far_function();

int main() {
    far_function();
    return 0;
}

If we compile this gcc -c simple-relocation.c -o simple-relocation.o we can inspect it with objdump.

> objdump -dr simple-relocation.o

0000000000000000 <main>:
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	b8 00 00 00 00       	mov    $0x0,%eax
   9:	e8 00 00 00 00       	call   e <main+0xe>
			a: R_X86_64_PLT32	far_function-0x4
   e:	b8 00 00 00 00       	mov    $0x0,%eax
  13:	5d                   	pop    %rbp
  14:	c3                   	ret

There’s a lot going on here, but one important part is e8 00 00 00 00. e8 is the CALL opcode [ref] and it takes a 32bit signed relative offset, which happens to be 0 (four bytes of 0) right now. objdump also lets us know there is a “relocation” necessary to fixup this code when we finalize it. We can view this relocation with readelf as well.

Note If you are wondering why we need -0x4, it’s because the offset is relative to the instruction-pointer which has already moved to the next instruction. The 4 bytes is the operand it has skipped over.

> readelf -r simple-relocation.o -d

Relocation section '.rela.text' at offset 0x170 contains 1 entry:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000a  000400000004 R_X86_64_PLT32    0000000000000000 far_function - 4

This is additional information embedded in the binary which tells the linker in susbsequent stages that it has code that needs to be fixed. Here we see the address 00000000000a, and a is 9 + 1, which is the offset of the start of the operand for our CALL instruction.

Let’s now create the C file for our missing function.

We will now compile it and link the two object files together using our linker.

> gcc simple-relocation.o far-function.o -o simple-relocation

Let’s now inspect that same callsite and see what it has.

> objdump -dr simple-relocation

0000000000401106 <main>:
  401106:	55                   	push   %rbp
  401107:	48 89 e5             	mov    %rsp,%rbp
  40110a:	b8 00 00 00 00       	mov    $0x0,%eax
  40110f:	e8 07 00 00 00       	call   40111b <far_function>
  401114:	b8 00 00 00 00       	mov    $0x0,%eax
  401119:	5d                   	pop    %rbp
  40111a:	c3                   	ret

000000000040111b <far_function>:
  40111b:	55                   	push   %rbp
  40111c:	48 89 e5             	mov    %rsp,%rbp
  40111f:	90                   	nop
  401120:	5d                   	pop    %rbp
  401121:	c3                   	ret

We can see that the linker did the right thing with the relocation and calculated the relative offset of our symbol far_function and fixed the CALL instruction.

Okay cool…🤷 What does this have to do with huge binaries?

Notice that this call instruction, e8, only takes 32bits signed which means it’s limited to 2^31 bits. This means a callsite can only jump roughly 2GiB forward or 2GiB backward. The “2GiB Barrier” represents the total reach of a single relative jump.

What happens if our callsite is over 2GiB away?

Let’s build a synthetic example by asking our linker to place far_function really really far away. We can do this using a “linker script”.

SECTIONS
{
    /* 1. Start with standard low-address sections */
    . = 0x400000;
    
    /* Catch everything except our specific 'far' object */
    .text : { 
        simple-relocation.o(.text.*) 
    }
    .rodata : { *(.rodata .rodata.*) }
    .data   : { *(.data .data.*) }
    .bss    : { *(.bss .bss.*) }

    /* 2. Move the cursor for the 'far' island */
    . = 0x120000000; 
    
    .text.far : { 
        far-function.o(.text*) 
    }
}

If we now try to link our code we will a “relocation overflow”. I used lld from LLVM because the error messages are a bit prettier.

> gcc simple-relocation.o far-function.o -T overflow.lds -o simple-relocation-overflow -fuse-ld=lld

ld.lld: error: <internal>:(.eh_frame+0x6c):
relocation R_X86_64_PC32 out of range:
5364513724 is not in [-2147483648, 2147483647]; references section '.text'
ld.lld: error: simple-relocation.o:(function main: .text+0xa):
relocation R_X86_64_PLT32 out of range:
5364514572 is not in [-2147483648, 2147483647]; references 'far_function'
>>> referenced by simple-relocation.c
>>> defined in far-function.o

When we hit this problem what solutions do we have? Well this is a complete other subject on “code models”, and it’s a little more nuanced depending on whether we are accessing data (i.e. static variables) or code that is far away. A great blog post that goes into this is the following by @maskray who wrote lld.

The simplest solution however is to use -mcmodel=large which changes all the relative CALL instructions to absolute JMP.

> gcc simple-relocation.o far-function.o -T overflow.lds -o simple-relocation-overflow

> gcc -c simple-relocation.c -o simple-relocation.o -mcmodel=large -fno-asynchronous-unwind-tables

> gcc simple-relocation.o far-function.o -T overflow.lds -o simple-relocation-overflow

./simple-relocation-overflow

Note I needed to add -fno-asynchronous-unwind-tables to disable some additional data that might cause overflow for the purpose of this demonstration.

What does the disassembly look like now?

> objdump -dr simple-relocation-overflow 

0000000120000000 <far_function>:
   120000000:	55                   	push   %rbp
   120000001:	48 89 e5             	mov    %rsp,%rbp
   120000004:	90                   	nop
   120000005:	5d                   	pop    %rbp
   120000006:	c3                   	ret

00000000004000e6 <main>:
  4000e6:	55                   	push   %rbp
  4000e7:	48 89 e5             	mov    %rsp,%rbp
  4000ea:	b8 00 00 00 00       	mov    $0x0,%eax
  4000ef:	48 ba 00 00 00 20 01 	movabs $0x120000000,%rdx
  4000f6:	00 00 00 
  4000f9:	ff d2                	call   *%rdx
  4000fb:	b8 00 00 00 00       	mov    $0x0,%eax
  400100:	5d                   	pop    %rbp
  400101:	c3                   	ret

There is no longer a sole CALL instruction, it has become MOVABS & CALL 😲. This changed the instructions from 5 (opcode + 4 bytes for 32bit relative offset) to a whopping 12 bytes (2 bytes for ABS opcode + 8 bytes for absolute 64 bit address + 2 bytes for CALL).

This has notable downsides among others:

Instruction Bloat: We’ve gone from 5 bytes per call to 12. In a binary with millions of callsites, this can add up.
Register Pressure: We’ve burned a general-purpose register, %rdx, to perform the jump.

Caution I had a lot of trouble building a benchmark that demonstrated a worse lower IPC (instructions per-cycle) for the large mcmodel, so let’s just take my word for it. 🤷

We would like to keep our small code-model. What other strategies can we pursue?

More to come in subsequent writings.

二进制文件 Binaries

二进制文件
Binaries