Zpdf: PDF text extraction in Zig – 5x faster than MuPDF

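## zpdf: a fast PDF text extraction library written in Zig

zpdf is a high-performance PDF text extraction library written in Zig, designed for speed and efficiency. It uses memory-mapped files and streaming extraction to handle large documents without excessive memory allocation. Key features include support for multiple decompression filters (FlateDecode, ASCII85, and others), several font encodings (WinAnsi, MacRoman, ToUnicode), and PDF 1.5+. Compared with MuPDF 1.26, zpdf is **substantially faster**, reaching up to **18x** the extraction speed in parallel mode (peak throughput of 41,000 pages/sec). The library ships a CLI that extracts text to stdout or a file, shows document information, and runs benchmarks. Its design is modular, covering parsing, XRef tables, the page tree, content stream interpretation, and more. zpdf is released under the MIT license.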

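A new PDF text extraction library written in the Zig programming language is reported to be 5x faster than MuPDF, with a peak throughput of about 41,000 pages/sec. Developed by lulzx, the library is roughly 5,000 lines of code and prioritizes speed through memory-mapped I/O (which eliminates read system calls), zero-copy parsing, and SIMD-accelerated string search. It uses Zig's thread pool for parallel page extraction and streams its output to avoid unnecessary memory allocation. The library supports modern PDF features, including XRef tables, incremental updates, common compression filters, and a range of font encodings, including complex CID fonts. Notably, it has no external dependencies and compiles in under 2 seconds. The project is available on GitHub.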

Original text

A PDF text extraction library written in Zig.

- Memory-mapped file reading for efficient large file handling
- Streaming text extraction (no intermediate allocations)
- Multiple decompression filters: FlateDecode, ASCII85, ASCIIHex, LZW, RunLength
- Font encoding support: WinAnsi, MacRoman, ToUnicode CMap
- XRef table and stream parsing (PDF 1.5+)
- Configurable error handling (strict or permissive)
- Multi-threaded parallel page extraction
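To illustrate the memory-mapped reading mentioned in the list above, here is a minimal sketch. It is not taken from zpdf's source. It assumes Zig 0.15 on a POSIX system, and `hasPdfHeader` and the `file.pdf` path are hypothetical names used only for illustration.

```zig
const std = @import("std");

/// Hypothetical helper: true when the buffer starts with the PDF header marker.
fn hasPdfHeader(bytes: []const u8) bool {
    return std.mem.startsWith(u8, bytes, "%PDF-");
}

pub fn main() !void {
    // "file.pdf" is a placeholder path.
    const file = try std.fs.cwd().openFile("file.pdf", .{});
    defer file.close();

    const size = (try file.stat()).size;
    if (size == 0) return error.EmptyFile; // zero-length mappings are rejected
    const len = std.math.cast(usize, size) orelse return error.FileTooBig;

    // Map the whole file read-only. The kernel pages data in on demand,
    // so the program issues no read() calls and allocates no heap buffer.
    const mapped = try std.posix.mmap(
        null,
        len,
        std.posix.PROT.READ,
        .{ .TYPE = .PRIVATE },
        file.handle,
        0,
    );
    defer std.posix.munmap(mapped);

    std.debug.print("{d} bytes mapped, PDF header: {}\n", .{ mapped.len, hasPdfHeader(mapped) });
}
```

Because the mapping is an ordinary byte slice, a parser can hand out sub-slices of it instead of copying data, which is one way to get the zero-copy behavior described earlier.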

Text extraction performance vs MuPDF 1.26 (mutool convert -F text):

Parallel (multi-threaded)

| Document | Pages | Size | zpdf | MuPDF | Speedup |
|---|---|---|---|---|---|
| Adobe Acrobat Reference | 651 | 19 MB | 60 ms | 512 ms | 8.5x |
| C++ Standard Draft | 2,134 | 8 MB | 142 ms | 1,020 ms | 7.2x |
| Pandas Documentation | 3,743 | 15 MB | 233 ms | 1,204 ms | 5.2x |
| Intel SDM | 5,252 | 25 MB | 127 ms | 2,260 ms | 18x |

Peak throughput: 41,000 pages/sec (Intel SDM, parallel)
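As a quick check, the peak figure and the largest speedup both follow from the Intel SDM measurements reported above (5,252 pages, 127 ms for zpdf, 2,260 ms for MuPDF):

```latex
\[
\frac{5252\ \text{pages}}{0.127\ \text{s}} \approx 41{,}354\ \text{pages/s},
\qquad
\frac{2260\ \text{ms}}{127\ \text{ms}} \approx 17.8 \approx 18\times
\]
```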

Build with zig build -Doptimize=ReleaseFast for these results.

Note: MuPDF's threading (-T flag) is for rendering/rasterization only. Text extraction via mutool convert -F text is single-threaded by design. zpdf parallelizes text extraction across pages.
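The page-level parallelism described in the note can be sketched as follows. This is an illustration under assumptions, not zpdf's implementation: the project summary mentions a thread pool, while this sketch uses plain `std.Thread.spawn` for brevity. `extractRange` and `extractAllParallel` are hypothetical names, and the per-page "work" is a stand-in that only records the page index.

```zig
const std = @import("std");

/// Hypothetical stand-in for per-page extraction: records each page index.
/// A real worker would extract the text of page `first_page + i` here.
fn extractRange(results: []usize, first_page: usize) void {
    for (results, 0..) |*slot, i| {
        slot.* = first_page + i;
    }
}

/// Splits the pages into contiguous chunks and gives each chunk its own thread.
/// `worker_count` must be greater than zero.
fn extractAllParallel(results: []usize, comptime worker_count: usize) !void {
    var threads: [worker_count]std.Thread = undefined;
    var spawned: usize = 0;
    // Join every thread that was started, even if a later spawn fails.
    defer {
        for (threads[0..spawned]) |t| t.join();
    }

    if (results.len == 0) return;
    const chunk = (results.len + worker_count - 1) / worker_count; // ceiling division
    var start: usize = 0;
    while (start < results.len) : (start += chunk) {
        const end = @min(start + chunk, results.len);
        threads[spawned] = try std.Thread.spawn(.{}, extractRange, .{ results[start..end], start });
        spawned += 1;
    }
}

pub fn main() !void {
    var results = [_]usize{0} ** 10;
    try extractAllParallel(&results, 4);
    std.debug.print("processed {d} pages\n", .{results.len});
}
```

Each thread writes only to its own slice of `results`, so no locking is needed. Pages are independent units of work, which is why text extraction parallelizes cleanly at this level.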

zig build              # Build library and CLI
zig build test         # Run tests

const std = @import("std");
const zpdf = @import("zpdf");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const doc = try zpdf.Document.open(allocator, "file.pdf");
    defer doc.close();

    var buf: [4096]u8 = undefined;
    var writer = std.fs.File.stdout().writer(&buf);
    defer writer.interface.flush() catch {};

    for (0..doc.pages.items.len) |page_num| {
        try doc.extractText(page_num, &writer.interface);
    }
}

zpdf extract document.pdf           # Extract all pages to stdout
zpdf extract -p 1-10 document.pdf   # Extract pages 1-10
zpdf extract -o out.txt document.pdf # Output to file
zpdf info document.pdf              # Show document info
zpdf bench document.pdf             # Run benchmark

src/
├── root.zig         # Document API and core types
├── parser.zig       # PDF object parser
├── xref.zig         # XRef table/stream parsing
├── pagetree.zig     # Page tree resolution
├── decompress.zig   # Stream decompression filters
├── encoding.zig     # Font encoding and CMap parsing
├── interpreter.zig  # Content stream interpreter
├── simd.zig         # SIMD string operations
└── main.zig         # CLI
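The layout above lists a `simd.zig` module for SIMD string operations. The source does not show its contents, so the following is only a sketch of the general technique using Zig's `@Vector` type. The function name `indexOfByte` and the 16-lane width are assumptions.

```zig
const std = @import("std");

const lanes = 16;
const Chunk = @Vector(lanes, u8);

/// Hypothetical example: finds the first occurrence of `needle` in `haystack`.
/// Compares 16 bytes per iteration, then uses a scalar scan to locate the
/// exact index inside the matching chunk or to cover the tail.
fn indexOfByte(haystack: []const u8, needle: u8) ?usize {
    const needle_vec: Chunk = @splat(needle);
    var i: usize = 0;
    while (i + lanes <= haystack.len) : (i += lanes) {
        const chunk: Chunk = haystack[i..][0..lanes].*;
        // One vector comparison tests all 16 bytes at once.
        if (@reduce(.Or, chunk == needle_vec)) break;
    }
    // Scalar scan from the first chunk that matched, or over the remaining tail.
    while (i < haystack.len) : (i += 1) {
        if (haystack[i] == needle) return i;
    }
    return null;
}

pub fn main() void {
    const data = "1 0 obj << /Type /Page >> endobj";
    if (indexOfByte(data, '/')) |pos| {
        std.debug.print("first '/' at index {d}\n", .{pos});
    }
}
```

Searching for delimiters and keywords is a large part of PDF parsing, so skipping non-matching chunks 16 bytes at a time can reduce the per-byte cost considerably.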

Implemented:

- XRef table and stream parsing
- Incremental PDF updates (follows /Prev chain for modified documents)
- Object parser
- Page tree resolution
- Content stream interpretation (Tj, TJ, Tm, Td, etc.)
- Font encoding (WinAnsi, MacRoman, ToUnicode CMap)
- CID font handling (Type0 composite fonts, Identity-H/V encoding, UTF-16BE)
- Stream decompression (FlateDecode, ASCII85, ASCIIHex, LZW, RunLength)
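Of the decompression filters listed above, RunLength is the simplest to show in full. The sketch below follows the RunLengthDecode rules from the PDF specification: a length byte of 0 to 127 copies the next length + 1 bytes, 129 to 255 repeats the next byte 257 - length times, and 128 marks end of data. It is not zpdf's code, and `runLengthDecode` with its caller-provided output buffer is a hypothetical interface.

```zig
const std = @import("std");

/// Hypothetical example: decodes RunLengthDecode data into `out`
/// and returns the decoded portion of `out`.
fn runLengthDecode(input: []const u8, out: []u8) ![]u8 {
    var in_pos: usize = 0;
    var out_pos: usize = 0;
    while (in_pos < input.len) {
        const length = input[in_pos];
        in_pos += 1;
        if (length == 128) break; // end-of-data marker
        if (length < 128) {
            // Literal run: copy the next length + 1 bytes unchanged.
            const n: usize = @as(usize, length) + 1;
            if (in_pos + n > input.len) return error.TruncatedInput;
            if (out_pos + n > out.len) return error.OutputTooSmall;
            @memcpy(out[out_pos..][0..n], input[in_pos..][0..n]);
            in_pos += n;
            out_pos += n;
        } else {
            // Repeat run: emit the next byte 257 - length times.
            const n: usize = 257 - @as(usize, length);
            if (in_pos >= input.len) return error.TruncatedInput;
            if (out_pos + n > out.len) return error.OutputTooSmall;
            @memset(out[out_pos..][0..n], input[in_pos]);
            in_pos += 1;
            out_pos += n;
        }
    }
    return out[0..out_pos];
}

pub fn main() !void {
    // Literal run "abc" (length byte 2), then 'x' repeated 4 times (257 - 253), then EOD.
    const encoded = [_]u8{ 2, 'a', 'b', 'c', 253, 'x', 128 };
    var buf: [64]u8 = undefined;
    const decoded = try runLengthDecode(&encoded, &buf);
    std.debug.print("{s}\n", .{decoded});
}
```

Writing into a caller-provided buffer keeps the sketch allocation-free. A streaming decoder would instead push bytes to a writer as they are produced.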

MIT
