I built an ARM64 assembly lexer (well, I generated one from my own parser generator, but this post is not about that) that processes Dart code 2x faster than the official scanner, a result I verified with statistical methods that can reliably measure small performance differences. Then I benchmarked it on 104,000 files and discovered that my lexer was not the bottleneck. I/O was. This is the story of how I accidentally learned why pub.dev stores packages as tar.gz files.
The setup
I wanted to benchmark my lexer against the official Dart scanner. The pub cache on my machine had 104,000 Dart files totaling 1.13 GB, a perfect test corpus. I wrote a benchmark that:
- Reads each file from disk
- Lexes it
- Measures time separately for I/O and lexing
Simple enough.
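In Dart, the benchmark loop looked roughly like this sketch; the lex call is a stand-in for the real entry point into the assembly lexer (and, in the second run, the official scanner), and the cache path is illustrative:

```dart
import 'dart:io';
import 'dart:typed_data';

// Stand-in for the real lexer call; the actual benchmark dispatches to the
// ARM64 lexer (or the official Dart scanner).
int lex(Uint8List bytes) => bytes.length;

Future<void> main() async {
  final ioTime = Stopwatch();
  final lexTime = Stopwatch();
  var totalBytes = 0;

  final cache = Directory('${Platform.environment['HOME']}/.pub-cache');
  await for (final entry in cache.list(recursive: true, followLinks: false)) {
    if (entry is! File || !entry.path.endsWith('.dart')) continue;

    ioTime.start();
    final bytes = await entry.readAsBytes(); // open + read + close per file
    ioTime.stop();

    lexTime.start();
    lex(bytes);
    lexTime.stop();

    totalBytes += bytes.length;
  }

  print('I/O: ${ioTime.elapsedMilliseconds} ms, '
      'lex: ${lexTime.elapsedMilliseconds} ms, '
      'data: ${(totalBytes / (1 << 20)).toStringAsFixed(0)} MB');
}
```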
The first surprise: lexing is fast
Here are the results:
| Metric | ASM Lexer | Official Dart |
|---|---|---|
| Lex time | 2,807 ms | 6,087 ms |
| Lex throughput | 402 MB/s | 185 MB/s |
My lexer was 2.17x faster. Success! But wait:
| Metric | ASM Lexer | Official Dart |
|---|---|---|
| I/O time | 14,126 ms | 14,606 ms |
| Total time | 16,933 ms | 20,693 ms |
| Total speedup | 1.22x | - |
The total speedup was only 1.22x. My 2.17x lexer improvement was being swallowed by I/O. Reading files took 5x longer than lexing them.
The second surprise: the SSD is not the bottleneck
My MacBook has an NVMe SSD that can read at 5-7 GB/s. I was getting about 80 MB/s, roughly 1.5% of the theoretical maximum.
The problem was not the disk. It was the syscalls.
For 104,000 files, the operating system had to execute:
- 104,000 open() calls
- 104,000 read() calls
- 104,000 close() calls
That is over 300,000 syscalls. Each syscall involves:
- A context switch from user space to kernel space
- Kernel bookkeeping and permission checks
- A context switch back to user space
Each syscall costs roughly 1-5 microseconds. Multiply that by 300,000 and you get 0.3-1.5 seconds of pure overhead before any actual disk I/O happens. Add filesystem metadata lookups and directory traversal on top, and you can see where the time goes.
I tried a few things that did not help much. Memory-mapping the files made things worse due to the per-file mmap/munmap overhead. Replacing Dart's file reading with direct FFI syscalls (open/read/close) gave only a 5% improvement. The problem was not Dart's I/O layer; it was the sheer number of syscalls.
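For reference, here is a minimal sketch of what the direct-syscall variant can look like, assuming package:ffi for the string and buffer allocation; it barely moves the needle because it still pays for the same three syscalls per file:

```dart
import 'dart:ffi';
import 'dart:io';
import 'dart:typed_data';
import 'package:ffi/ffi.dart';

final _libc = DynamicLibrary.process();
final _open = _libc.lookupFunction<Int32 Function(Pointer<Utf8>, Int32),
    int Function(Pointer<Utf8>, int)>('open');
final _read = _libc.lookupFunction<
    IntPtr Function(Int32, Pointer<Uint8>, IntPtr),
    int Function(int, Pointer<Uint8>, int)>('read');
final _close =
    _libc.lookupFunction<Int32 Function(Int32), int Function(int)>('close');

// Still three syscalls per file -- the overhead does not go away, it just
// bypasses Dart's own I/O plumbing.
Uint8List readFileRaw(String path, {int bufferSize = 1 << 20}) {
  final cPath = path.toNativeUtf8();
  final fd = _open(cPath, 0); // O_RDONLY == 0 on macOS and Linux
  malloc.free(cPath);
  if (fd < 0) throw FileSystemException('open failed', path);

  final buffer = malloc<Uint8>(bufferSize);
  final builder = BytesBuilder(copy: true);
  try {
    while (true) {
      final n = _read(fd, buffer, bufferSize);
      if (n <= 0) break;
      builder.add(buffer.asTypedList(n));
    }
  } finally {
    malloc.free(buffer);
    _close(fd);
  }
  return builder.takeBytes();
}
```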
The hypothesis
I have mirrored pub.dev several times in the past and noticed that all packages are stored as tar.gz archives. I never really understood why, but this problem reminded me of that fact. If syscalls are the problem, the solution is fewer syscalls. What if instead of 104,000 files, I had 1,351 files (one per package)?
I wrote a script to bundle each cached package into a tar.gz archive:
- 104,000 individual files -> 1,351 tar.gz archives
- 1.13 GB uncompressed -> 169 MB compressed (6.66x ratio)
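The script itself is nothing special. A minimal version that shells out to tar could look like this (the cache path and output directory are illustrative):

```dart
import 'dart:io';

Future<void> main() async {
  // One archive per package directory in the hosted pub cache.
  final cache = Directory(
      '${Platform.environment['HOME']}/.pub-cache/hosted/pub.dev');
  final out = Directory('archives')..createSync(recursive: true);

  await for (final entry in cache.list(followLinks: false)) {
    if (entry is! Directory) continue;
    final name = entry.uri.pathSegments.lastWhere((s) => s.isNotEmpty);

    // -C makes paths inside the archive relative to the package directory.
    final result = await Process.run(
        'tar', ['-czf', '${out.path}/$name.tar.gz', '-C', entry.path, '.']);
    if (result.exitCode != 0) {
      stderr.writeln('tar failed for $name: ${result.stderr}');
    }
  }
}
```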
The results
| Metric | Individual Files | tar.gz Archives |
|---|---|---|
| Files/Archives | 104,000 | 1,351 |
| Data on disk | 1.13 GB | 169 MB |
| I/O time | 14,525 ms | 339 ms |
| Decompress time | - | 4,507 ms |
| Lex time | 2,968 ms | 2,867 ms |
| Total time | 17,493 ms | 7,713 ms |
The I/O speedup was 42.85x. Reading 1,351 archives sequentially instead of 104,000 scattered files reduced I/O from 14.5 seconds to 339 milliseconds.
The total speedup was 2.27x. Even with decompression overhead, the archive approach was more than twice as fast.
Breaking down the numbers
I/O: 14,525 ms to 339 ms
This is the syscall overhead in action. Going from 300,000+ syscalls to roughly 4,000 syscalls (open/read/close for 1,351 archives) eliminated most of the overhead.
Additionally, reading 1,351 files sequentially is far more cache-friendly than reading 104,000 files scattered across the filesystem. The OS can prefetch effectively, the SSD can batch operations, and the page cache stays warm.
Decompression: 4,507 ms
gzip decompression ran at about 250 MB/s using the archive package from pub.dev. This is now the new bottleneck. I did not put much effort into optimizing decompression; an FFI-based solution using native zlib could be significantly faster. Modern alternatives like lz4 or zstd might also help.
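For context, each archive goes through roughly this pipeline in the benchmark; exact accessor names (content vs. readBytes) differ slightly between versions of package:archive:

```dart
import 'dart:io';
import 'package:archive/archive.dart';

// One read per package, then everything else happens in memory.
void processArchive(String path, void Function(String, List<int>) lexFile) {
  final compressed = File(path).readAsBytesSync();        // one open/read/close
  final tarBytes = GZipDecoder().decodeBytes(compressed); // pure-Dart inflate
  final archive = TarDecoder().decodeBytes(tarBytes);
  for (final file in archive.files) {
    if (file.isFile && file.name.endsWith('.dart')) {
      lexFile(file.name, file.content as List<int>);
    }
  }
}
```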
Compression ratio: 6.66x
Source code compresses well. The 1.13 GB of Dart code compressed to 169 MB. This means less data to read from disk, which helps even on fast SSDs.
Why pub.dev uses tar.gz
This experiment accidentally explains the pub.dev package format. When you run dart pub get, you download tar.gz archives, not individual files. The reasons are now obvious:
- Fewer HTTP requests. One request per package instead of hundreds.
- Bandwidth savings. 6-7x smaller downloads.
- Faster extraction. Sequential writes beat random writes.
- Reduced syscall overhead. Both on the server (fewer files to serve) and the client (fewer files to write).
- Atomicity. A package is either fully downloaded or not. No partial states.
The same principles apply to npm (tar.gz), Maven (JAR/ZIP), PyPI (wheel/tar.gz), and virtually every package manager.
The broader lesson
Modern storage is fast. NVMe SSDs can sustain gigabytes per second. But that speed is only accessible for sequential access to large files. The moment you introduce thousands of small files, syscall overhead dominates.
This matters for:
- Build systems. Compiling a project with 10,000 source files? The filesystem overhead might exceed the compilation time.
- Log processing. Millions of small log files? Concatenate them. Claude uses JSONL for this reason.
- Backup systems. This is why rsync and tar exist.
What I would do differently
If I were optimizing this further:
- Use zstd instead of gzip. 4-5x faster decompression with similar compression ratios.
- Use uncompressed tar for local caching. Skip decompression entirely, still get the syscall reduction.
- Parallelize with isolates. Multiple cores decompressing multiple archives simultaneously.
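The isolate idea is probably the lowest-hanging fruit. A sketch using Isolate.run, with lexArchive standing in for the real per-archive work:

```dart
import 'dart:io';
import 'dart:isolate';

// Stand-in for the real work: read, decompress, and lex one archive.
int lexArchive(String path) => File(path).readAsBytesSync().length;

Future<void> main() async {
  final archives = Directory('archives')
      .listSync()
      .whereType<File>()
      .map((f) => f.path)
      .where((p) => p.endsWith('.tar.gz'))
      .toList();

  // Round-robin the archives across a fixed number of isolates.
  const workers = 8;
  final chunks = [
    for (var i = 0; i < workers; i++)
      [for (var j = i; j < archives.length; j += workers) archives[j]]
  ];

  final totals = await Future.wait([
    for (final chunk in chunks)
      Isolate.run(() => chunk.map(lexArchive).fold<int>(0, (a, b) => a + b)),
  ]);
  print('Processed ${totals.fold<int>(0, (a, b) => a + b)} bytes');
}
```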
Conclusion
I set out to benchmark a lexer and ended up learning about syscall overhead. The lexer was 2x faster. The I/O optimization was 43x faster.
Addendum: Reader Suggestions
Linux-Specific Optimizations
servermeta_net pointed out two Linux-specific approaches: disabling speculative execution mitigations (which could improve performance in syscall-heavy scenarios) and using io_uring for asynchronous I/O. I ran these benchmarks on macOS, which does not support io_uring, but these Linux capabilities are intriguing. A follow-up post exploring how I/O performance can be optimized on Linux may be in order.
macOS-Specific Optimizations
tsanderdev pointed out that macOS's kqueue could potentially improve performance for this workload. While kqueue is not equivalent to Linux's io_uring (it lacks the same syscall batching through a shared ring buffer), it may still offer some improvement over synchronous I/O. I have not benchmarked this yet.
SQLite as an Alternative
tsanderdev mentioned that this is also why SQLite can be much faster than a directory with lots of small files. I had completely forgotten about SQLite as an option. Storing file contents in a SQLite database would eliminate the syscall overhead while providing random access to individual files, something tar.gz does not offer.
This also explains something I have heard multiple times: Apple uses SQLite extensively for its applications, essentially simulating a filesystem within SQLite databases. If reading 100,000 files on a modern Mac with NVMe storage takes 14 seconds, imagine what it was like on older, slower machines. The syscall overhead would have been even more punishing. Using SQLite instead of the filesystem lets them avoid those syscalls entirely. This is worth exploring.
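As a rough sketch of what that could look like with package:sqlite3 (the schema and paths are mine for illustration, not anything pub or Apple actually uses):

```dart
import 'dart:io';
import 'package:sqlite3/sqlite3.dart';

void main() {
  final db = sqlite3.open('corpus.db');
  db.execute('''
    CREATE TABLE IF NOT EXISTS files (
      path    TEXT PRIMARY KEY,
      content BLOB NOT NULL
    )
  ''');

  // Ingest once: every .dart file in the pub cache becomes a row.
  final cache = Directory('${Platform.environment['HOME']}/.pub-cache');
  final insert = db.prepare('INSERT OR REPLACE INTO files VALUES (?, ?)');
  db.execute('BEGIN');
  for (final entry in cache.listSync(recursive: true, followLinks: false)) {
    if (entry is File && entry.path.endsWith('.dart')) {
      insert.execute([entry.path, entry.readAsBytesSync()]);
    }
  }
  db.execute('COMMIT');
  insert.dispose();

  // Read back: one sequential scan, no per-file open/read/close.
  var total = 0;
  for (final row in db.select('SELECT content FROM files')) {
    total += (row['content'] as List<int>).length;
  }
  print('Read $total bytes from one database file');
  db.dispose();
}
```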
Skip the Cleanup Syscalls
matthieum suggested a common trick used by batch compilers: never call free, close, or munmap, and instead let the OS reap all resources when the process ends. For a one-shot batch process like a compiler (or a lexer benchmark), there is no point in carefully releasing resources that the OS will reclaim anyway.
GabrielDosReis added a caveat: depending on the workload, you might actually need to call close, or you could run out of file descriptors. On macOS, you can check your limits with:
```
$ launchctl limit maxfiles
maxfiles 256 unlimited
$ sysctl kern.maxfilesperproc
kern.maxfilesperproc: 61440
```
The first number (256) is the soft limit per process; the second is the hard limit. kern.maxfilesperproc shows the kernel's per-process maximum. With 104,000 files, skipping close calls would exhaust even the maximum limit.
There is even a further optimization: use a wrapper process. The wrapper launches a worker process that does all the work. When the worker signals completion (via stdout or a pipe), the wrapper terminates immediately without waiting for its detached child. Any script waiting on the wrapper can now proceed, while the OS asynchronously reaps the worker's resources in the background. I had not considered this approach before, but it seems worth trying.
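In Dart terms, the wrapper could be as small as this sketch; worker.dart and the DONE marker are made up for illustration:

```dart
import 'dart:convert';
import 'dart:io';

Future<void> main(List<String> args) async {
  // Start the worker detached but keep its stdio so we can see the signal.
  final worker = await Process.start(
    Platform.resolvedExecutable,
    ['run', 'worker.dart', ...args],
    mode: ProcessStartMode.detachedWithStdio,
  );

  // Wait only for the completion marker, not for the worker to exit.
  await worker.stdout
      .transform(utf8.decoder)
      .transform(const LineSplitter())
      .firstWhere((line) => line == 'DONE');

  // Exit immediately; the OS reaps the worker's resources in the background.
  exit(0);
}
```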
Linker Strategies for Fast Exits
MaskRay added context about how production linkers handle this exact problem. The mold linker uses the wrapper process approach mentioned above, forking a child to do all the work while the parent exits immediately after the child signals completion. This lets build systems proceed without waiting for resource cleanup. The --no-fork flag disables this behavior for debugging. The wild linker follows the same pattern.
lld takes a different approach with two targeted hacks: async unlink to remove old output files in a background thread, and calling _exit instead of exit to skip the C runtime's cleanup routines (unless LLD_IN_TEST is set for testing).
MaskRay notes a tradeoff with the wrapper process approach: when the heavy work runs in a child process, the parent process of the linker (typically a build system) cannot accurately track resource usage of the actual work. This matters for build systems that monitor memory consumption or CPU time.
Why pub.dev Actually Uses tar.gz
Bob Nystrom from the Dart team clarified that my speculation about pub.dev's format choice was partially wrong. Fewer HTTP requests and bandwidth savings definitely factored into the decision, as did reduced storage space on the server. Atomicity is important too, though archives do not fully solve the problem since downloads and extracts can still fail. However, it is unlikely that the I/O performance benefits (faster extraction, reduced syscall overhead) were considered: pub extracts archives immediately after download, the extraction benefit only occurs once during pub get, that single extraction is a tiny fraction of a fairly expensive process, and pub never reads the files again except for the pubspec. The performance benefit I measured only applies when repeatedly reading from archives, which is not how pub works.
This raises an interesting question: what if pub did not extract archives at all? For a clean (non-incremental) compilation of a large project like the Dart Analyzer with hundreds of dependencies, the compiler needs to access thousands of files across many packages. If packages remained in an archive format with random access support (like ZIP), the syscall overhead from opening and closing all those files could potentially be reduced. Instead of thousands of open/read/close syscalls scattered across the filesystem, you would have one open call per package archive, then seeks within each archive. Whether the decompression overhead would outweigh the syscall savings is unclear, but it might be worth exploring for build systems where clean builds of large dependency trees are common.
Use dart:io for gzip Instead of package:archive
Simon Binder pointed out that dart:io already includes gzip support backed by zlib, so there is no need to use package:archive for decompression. Since dart:io does not support tar archives, I used package:archive for everything and did not think of mixing in dart:io's gzip support separately. Using dart:io's GZipCodec for decompression while only relying on package:archive for tar extraction could yield better performance. I will try this approach when I attempt to lex a bigger corpus.
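The change is small: let dart:io's zlib-backed codec do the inflate and keep package:archive only for walking the tar headers. Something like:

```dart
import 'dart:io';
import 'package:archive/archive.dart';

// gzip here is dart:io's GZipCodec (native zlib); TarDecoder only has to
// parse the already-uncompressed tar stream.
Archive readTarGz(String path) {
  final compressed = File(path).readAsBytesSync();
  final tarBytes = gzip.decode(compressed);
  return TarDecoder().decodeBytes(tarBytes);
}
```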
TAR vs ZIP: Sequential vs Random Access
vanderZwan pointed out that ZIP files could provide SQLite-like random access benefits. This highlights a fundamental architectural difference between TAR and ZIP:
TAR was designed in 1979 for sequential tape drives. Each file's metadata is stored in a header immediately before its contents, with no central index. To find a specific file, you must read through the archive sequentially. When compressed as tar.gz, the entire stream is compressed together, so accessing any file requires decompressing everything before it. The format was standardized by POSIX (POSIX.1-1988 for ustar, POSIX.1-2001 for pax), is well-documented, and preserves Unix file attributes fully.
ZIP was designed in 1989 with a central directory stored at the end of the archive. This directory contains offsets to each file's location, enabling random access: read the central directory once, then seek directly to any file. Each file is compressed individually, so you can decompress just the file you need. This is why JAR files, OpenDocument files, and EPUB files all use the ZIP format internally.
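To make the difference concrete, here is a rough sketch of pulling a single entry out of a ZIP with package:archive. findFile works off the central directory, whereas a tar reader would have to scan headers from the start. (decodeBytes still loads the archive bytes into memory, and method and field names may differ across package:archive versions.)

```dart
import 'dart:io';
import 'package:archive/archive.dart';

// Look up one entry by name via the ZIP central directory.
List<int>? readOneEntry(String zipPath, String entryName) {
  final archive = ZipDecoder().decodeBytes(File(zipPath).readAsBytesSync());
  final entry = archive.findFile(entryName); // central-directory lookup
  return entry?.content as List<int>?;
}

void main() {
  // Hypothetical archive and entry name, for illustration only.
  final bytes = readOneEntry('archives/some_package.zip', 'pubspec.yaml');
  print(bytes == null ? 'entry not found' : '${bytes.length} bytes');
}
```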
| Aspect | TAR | ZIP |
|---|---|---|
| Random access | No (sequential only) | Yes (central directory) |
| Standardization | POSIX standard | PKWARE-controlled specification |
| Unix permissions | Fully preserved | Limited support |
| Compression | External (gzip, zstd, etc.) | Built-in, per-file |
There seems to be no widely-adopted Unix-native format that combines random access with proper Unix metadata support. TAR handles sequential access with full Unix semantics. ZIP handles random access but originated from MS-DOS and has inconsistent Unix permission support. What we lack is something like "ZIP for Unix": random access with proper ownership, permissions, extended attributes, and ACLs.
The closest answer is dar (Disk ARchive), designed explicitly as a tar replacement with modern features. It stores a catalogue index at the end of the archive for O(1) file extraction, preserves full Unix metadata including extended attributes and ACLs, supports per-file compression with choice of algorithm, and can isolate the catalogue separately for fast browsing without the full archive. However, dar has not achieved the ubiquity of tar or zip.
For my lexer benchmark, random access would not help since I process all files anyway. But for use cases requiring access to specific files within an archive, this architectural distinction matters.