Static Allocation for Compilers

Original link: https://matklad.github.io/2025/12/23/static-allocation-compilers.html

The author explores applying "static allocation", the technique the TigerBeetle database uses to avoid runtime memory allocation, to compiler design. TigerBeetle's approach is not truly static: it allocates all necessary memory at startup, sized by command-line arguments, and performs no further allocation afterwards. The key insight is that static allocation is feasible when inputs and outputs are finite. Although compilers handle code and produce output of arbitrary size, the author suggests viewing compilation as stream processing: process finite chunks of code and accumulate results into a pre-allocated "output arena". This bounds the memory footprint of the *process* itself even for large codebases (with a configurable per-file limit, e.g. 4 MiB). Output data can use indexes instead of pointers, which simplifies persistence. The author argues that separating O(N) output from O(1) intermediate processing can yield a simpler, more robust compiler architecture, with the same benefits seen in database systems like TigerBeetle.

## Static Allocation for Compilers - Discussion Summary

A recent Hacker News discussion centered on the concept of static allocation in compiler design: eliminating dynamic memory allocation in pursuit of faster, more predictable performance. The core idea is to predefine memory limits for the compiler process and handle chunks of code that fit within those limits.

Commenters pointed out several challenges. Naively capping sizes (e.g. limiting strings to 256 bytes) either wastes space, if most strings are shorter, or fails outright when strings are longer. Modern compilers typically mitigate this with small-string optimizations and allocation pools.

Concerns were raised about the complexity of handling cross-file dependencies and semantic analysis without dynamic allocation, with references to historically slower compiler designs. Others questioned whether the approach genuinely simplifies code, arguing that it adds memory-management and bookkeeping overhead.

The discussion also touched on practical limits: some suggested capping codebases at 2 GB, while others emphasized the performance benefits of very large compilation units (8 GB+) in projects such as AAA game engines. Ultimately, the idea was explored as a potential optimization, with its trade-offs and complexity acknowledged.

## Original Article

TigerBeetle famously uses “static allocation”. Infamously, the use of the term is idiosyncratic: what is meant is not static arrays, as found in embedded development, but rather a weaker “no allocation after startup” form. The amount of memory a TigerBeetle process uses is not hard-coded into the ELF binary. It depends on the runtime command line arguments. However, all allocation happens at startup, and there’s no deallocation. The long-lived event loop goes round and round happily without alloc.
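A minimal Rust sketch of that shape (illustrative names only; TigerBeetle itself is written in Zig): buffer counts and sizes come from runtime configuration, every allocation happens in the constructor, and the steady state only recycles buffers.

```rust
// Hypothetical config, as if parsed from command line arguments.
struct Config {
    message_size_max: usize,
    messages_max: usize,
}

struct MessagePool {
    buffers: Vec<Vec<u8>>, // allocated once at startup, reused forever
    free: Vec<usize>,      // indices of currently free buffers
}

impl MessagePool {
    fn new(config: &Config) -> MessagePool {
        // The only allocations happen here, before the event loop starts.
        let buffers = (0..config.messages_max)
            .map(|_| vec![0u8; config.message_size_max])
            .collect();
        let free = (0..config.messages_max).collect();
        MessagePool { buffers, free }
    }

    // Acquiring a buffer is a pop from the free list: no allocation.
    fn acquire(&mut self) -> Option<usize> {
        self.free.pop()
    }

    fn release(&mut self, index: usize) {
        self.free.push(index);
    }
}
```

After `new` returns, the event loop can run indefinitely: `acquire` and `release` only shuffle indices.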

I’ve wondered for years if a similar technique is applicable to compilers. It seemed impossible, but today I’ve managed to extract something actionable from this idea?

Static allocation depends on the physics of the underlying problem. And distributed databases have surprisingly simple physics, at least in the case of TigerBeetle.

The only inputs and outputs of the system are messages. Each message is finite in size (1MiB). The actual data of the system is stored on disk and can be arbitrarily large. But the diff applied by a single message is finite. And, if your input is finite, and your output is finite, it’s actually quite hard to need to allocate extra memory!

This is worth emphasizing — it might seem like doing static allocation is tough and requires constant vigilance and manual accounting for resources. In practice, I learned that it is surprisingly compositional. As long as inputs and outputs of a system are finite, non-allocating processing is easy. And you can put two such systems together without much trouble. routing.zig is a good example of such an isolated subsystem.

The only issue here is that there isn’t a physical limit on how many messages can arrive at the same time. Obviously, you can’t process arbitrarily many messages simultaneously. But in the context of a distributed system over an unreliable network, a safe move is to drop a message on the floor if the required processing resources are not available.
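That load-shedding move can be sketched as a bounded inbox (a toy illustration, not TigerBeetle’s actual code): when no slot is free, the message is rejected rather than the queue growing.

```rust
// A bounded inbox that sheds load instead of allocating. Over an
// unreliable network the sender must be prepared to retry anyway,
// so dropping a message when no slot is free is safe, not a bug.
struct Inbox {
    slots: Vec<Option<Vec<u8>>>, // capacity fixed at startup
    len: usize,
}

impl Inbox {
    fn new(capacity: usize) -> Inbox {
        Inbox { slots: vec![None; capacity], len: 0 }
    }

    // Returns false (message dropped) rather than growing the queue.
    fn try_push(&mut self, message: Vec<u8>) -> bool {
        match self.slots.iter_mut().find(|s| s.is_none()) {
            Some(slot) => {
                *slot = Some(message);
                self.len += 1;
                true
            }
            None => false, // drop on the floor; the peer will retry
        }
    }
}
```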

Counter-intuitively, not allocating is simpler than allocating, provided that you can pull it off!

Alas, it seems impossible to pull it off for compilers. You could say something like “hey, the largest program will have at most one million functions”, but that will lead to both wasted memory and poor user experience. You could also use a single yolo arena of a fixed size, like I did in Hard Mode Rust, but that isn’t at all similar to “static allocation”. With arenas, the size is fixed explicitly, but you can OOM. With static allocation it is the opposite — no OOM, but you don’t know how much memory you’ll need until startup finishes!
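The arena alternative can be sketched as a fixed-size bump allocator (a toy, byte-granular version): the size is an explicit up-front guess, and exhausting it is a runtime OOM, which is exactly the trade-off described above.

```rust
// A fixed-size bump ("yolo") arena: one big allocation at creation,
// then allocation is just advancing a cursor. Exceeding the fixed
// size is a runtime failure, unlike true static allocation.
struct Arena {
    storage: Vec<u8>,
    used: usize,
}

impl Arena {
    fn new(size: usize) -> Arena {
        Arena { storage: vec![0; size], used: 0 }
    }

    // Returns None on exhaustion: the OOM the text talks about.
    fn alloc(&mut self, n: usize) -> Option<&mut [u8]> {
        if self.used + n > self.storage.len() {
            return None;
        }
        let start = self.used;
        self.used += n;
        Some(&mut self.storage[start..start + n])
    }
}
```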

The “problem size” for a compiler isn’t fixed — both the input (source code) and the output (executable) can be arbitrarily large. But that is also the case for TigerBeetle — the size of the database is not fixed, it’s just that TigerBeetle gets to cheat and store it on disk, rather than in RAM. And TigerBeetle doesn’t do “static allocation” on disk, it can fail with ENOSPC at runtime, and it includes a dynamic block allocator to avoid that as long as possible by re-using no longer relevant sectors.

So what we could say is that a compiler consumes arbitrarily large input, and produces arbitrarily large output, but those “do not count” for the purpose of static memory allocation. At the start, we set aside an “output arena” for storing finished, immutable results of the compiler’s work. We then say that this output is accumulated after processing a sequence of chunks, where chunk size is strictly finite. While limiting the total size of the code-base is unreasonable, limiting a single file to, say, 4 MiB (runtime-overridable) is fine. Compiling then essentially becomes a “stream processing” problem, where both inputs and outputs are arbitrarily large, but the filter program itself must execute in O(1) memory.
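A sketch of this stream-processing shape (hypothetical names, with real compilation replaced by a stand-in transformation): per-chunk scratch memory is bounded and reused, while only finished results accumulate in the output arena.

```rust
// Per-file chunk limit; runtime-overridable in spirit, a constant here.
const CHUNK_MAX: usize = 4 * 1024 * 1024; // 4 MiB

struct Compiler {
    scratch: Vec<u8>, // O(1): reused for every chunk, bounded by CHUNK_MAX
    output: Vec<u8>,  // O(N): only grows with finished, immutable results
}

impl Compiler {
    fn new() -> Compiler {
        Compiler {
            scratch: Vec::with_capacity(CHUNK_MAX),
            output: Vec::new(),
        }
    }

    // Process one finite chunk (one file) into the output arena.
    // "Compilation" here is a stand-in: uppercase the input bytes.
    fn compile_chunk(&mut self, source: &[u8]) -> Result<(), &'static str> {
        if source.len() > CHUNK_MAX {
            return Err("file exceeds chunk limit");
        }
        self.scratch.clear(); // reuse the buffer, don't reallocate
        self.scratch
            .extend(source.iter().map(|b| b.to_ascii_uppercase()));
        self.output.extend_from_slice(&self.scratch);
        Ok(())
    }
}
```

The invariant worth noticing: `scratch` never grows past `CHUNK_MAX` no matter how many files are fed in, so intermediate processing stays O(1) while `output` alone is O(N).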

With this setup, it is natural to use indexes rather than pointers for “output data”, which then makes it easy to persist it to disk between changes. And it’s also natural to think about “chunks of changes” not only spatially (compiler sees a new file), but also temporally (compiler sees a new version of an old file).
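A sketch of index-based output data (illustrative names): a `u32` index into a flat arena remains meaningful after the arena is written to disk and read back, unlike a pointer, which is what makes persistence between changes easy.

```rust
// A typed index into the output arena. Indexes survive serialization;
// pointers do not.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeIndex(u32);

struct Node {
    // Nodes reference each other by index, not by pointer.
    children: Vec<NodeIndex>,
}

struct OutputArena {
    nodes: Vec<Node>,
}

impl OutputArena {
    fn new() -> OutputArena {
        OutputArena { nodes: Vec::new() }
    }

    // Appending a finished node returns its stable index.
    fn push(&mut self, node: Node) -> NodeIndex {
        let index = NodeIndex(self.nodes.len() as u32);
        self.nodes.push(node);
        index
    }

    fn get(&self, index: NodeIndex) -> &Node {
        &self.nodes[index.0 as usize]
    }
}
```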

Are there any practical benefits here? I don’t know! But it seems worth playing around with! I feel that a strict separation between O(N) compiler output and O(1) intermediate processing artifacts can clarify a compiler’s architecture, and I won’t be too surprised if O(1) processing in compilers leads to simpler code the same way it does for databases?
