Improving the Performance of the WebAssembly Text Format (WAT) Parser by 350%

Original link: https://blog.gplane.win/posts/improve-wat-parser-perf.html

## WAT Parser Performance Improvements

The WebAssembly Text Format (WAT) parser in wasm-language-tools (v0.5 and earlier) underwent a major performance overhaul, with benchmarks showing a **350%** speedup. The improvement came from a complete rewrite of the parser with a focus on optimization techniques. The main changes include dropping parser combinators (such as `winnow`) in favor of a **hand-written parser** for greater control and speed. Further optimizations involve **pre-allocating and cloning frequently used tokens and nodes** (via `Arc` and `LazyLock`) to avoid repeated creation. Lexing was improved by **matching keywords against byte prefixes** of the source instead of comparing captured strings, with a safeguard against partial matches. The parser also uses `get_unchecked` to create ASCII tokens, bypassing unnecessary UTF-8 checks, and a custom `Token` type to reduce overhead. A key optimization is building the syntax tree through a **single shared `Vec`**, avoiding the many allocations and deallocations that come with creating a temporary vector per node. This approach, inspired by `rowan::GreenNodeBuilder`, manages node children efficiently with a stack-like index-tracking scheme. Together, these optimizations deliver a substantial speedup, noticeably improving parsing of complex WAT modules.

## WAT Parser Performance Improvements: Discussion Summary

The discussion drew on the rust-analyzer team's experience with rowan (the tree library the WAT parser uses), where a doubly linked list structure was identified as a key bottleneck. While that structure is designed for mutation, it is slower than creating and replacing the entire parse tree, which makes better use of L1/L2 caches. The discussion also highlighted the recurring cycle of parser rewrites driven by hardware advances. The core challenge of parser performance lies in efficiently consuming tokens, recognizing the grammar, and producing an abstract syntax tree (AST), all of which depend heavily on implementation-specific data types. Alternatives such as parser generators and combinator libraries (chumsky, LALRPOP, grmtools, antlr4rust) were mentioned, but hand-written parsers are generally faster, at the cost of maintainability. Using a shared vector to store nodes improved performance significantly by reducing allocation overhead. The current parsing speed is 115 MiB/s, which some considered still relatively slow for a Rust program, possibly due to grammar complexity. Finally, the discussion touched on the growing use of WebAssembly (WASM) in performance-critical web applications (such as Figma, game engines, and complex tools), as well as its use for sandboxing server-side code.

## Original Article

The WAT (WebAssembly Text Format) parser in wasm-language-tools v0.5 and earlier was not fast enough. Recently I rewrote the parser from scratch, and performance improved by 350% in the benchmark.
Let me share how I optimized it.

### Use a hand-written parser

The old parser was written with `winnow`, a parser combinator library.
While parser combinators make it easy to build a parser, they are generally slower than a hand-written parser,
so the first step was to write the parser by hand. A hand-written parser is not only faster but also leaves room for further optimizations.

### Clone well-known green tokens and green nodes

There’re many parentheses and keywords in WAT. For these tokens and nodes, they shouldn’t be created again and again when parsing.
Looking into the implementation of rowan::GreenToken and rowan::GreenNode, there’s a Arc inside,
so we can prepare these well-known tokens and nodes in advance, then put them into LazyLock one by one, and clone them when needed.
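For example, a well-known token like `(` can be built once and cloned afterwards. Here is a minimal sketch against rowan's API, not the actual wasm-language-tools code; the `SyntaxKind` value is a placeholder:

```rust
use std::sync::LazyLock;

use rowan::{GreenToken, SyntaxKind};

// Placeholder raw kind for the "(" token; the real project defines its own kinds.
const L_PAREN_KIND: SyntaxKind = SyntaxKind(0);

// Built once on first access; `GreenToken` wraps an `Arc`,
// so cloning it only bumps a reference count.
static L_PAREN: LazyLock<GreenToken> =
    LazyLock::new(|| GreenToken::new(L_PAREN_KIND, "("));

fn l_paren_token() -> GreenToken {
    // Cheap Arc clone instead of a fresh token allocation.
    L_PAREN.clone()
}
```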

### Keyword matching

There’re many keywords in WAT such as module, func, param, result, etc.
When recognizing keywords in lexer, we don’t capture a word and then check it by comparing strings.

Instead, we check the prefix of source code in bytes:

```rust
self.input.as_bytes().starts_with(keyword.as_bytes())
```

However, there may be a word like `function` that starts with `func` but isn't a keyword, so we must also check that the character right after the keyword is not an identifier character.
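As a rough sketch of this check (the helpers `eat_keyword` and `is_id_char` are illustrative names, not the project's actual code):

```rust
/// Try to consume `keyword` at the start of `input`; reject partial
/// matches such as "function" when looking for "func".
fn eat_keyword<'s>(input: &'s str, keyword: &str) -> Option<&'s str> {
    let bytes = input.as_bytes();
    if !bytes.starts_with(keyword.as_bytes()) {
        return None;
    }
    // The byte right after the keyword must not continue an identifier.
    match bytes.get(keyword.len()) {
        Some(&b) if is_id_char(b) => None,
        _ => Some(&input[..keyword.len()]),
    }
}

// Approximation of WAT's identifier character set (it is ASCII-only).
fn is_id_char(b: u8) -> bool {
    b.is_ascii_alphanumeric() || matches!(b, b'_' | b'.' | b'$' | b'-')
}
```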

### Use `get_unchecked` to create tokens

Except for strings and comments, all other kinds of tokens are pure ASCII.
For these tokens, we can use `get_unchecked` to avoid the UTF-8 boundary checks that `get` (and normal slicing) would perform.
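A minimal sketch of slicing out such a token, assuming the lexer has already verified that the first `len` bytes are ASCII (the function name is illustrative):

```rust
/// Slice the first `len` bytes of `input` without a UTF-8 boundary check.
fn ascii_token(input: &str, len: usize) -> &str {
    debug_assert!(input.as_bytes()[..len].is_ascii());
    // SAFETY: the lexer only calls this for ASCII-only tokens, so `len`
    // is in bounds and always falls on a UTF-8 character boundary.
    unsafe { input.get_unchecked(..len) }
}
```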

### Use our own `Token` type

The lexer produces tokens of our own `Token` type instead of `rowan::GreenToken`,
because creating a `rowan::GreenToken` is much more expensive, and we should create one only when needed.

The `Token` type is as simple as this:

```rust
struct Token<'s> {
    kind: SyntaxKind,
    text: &'s str,
}
```

For convenience, I added an `impl From<Token<'_>> for rowan::NodeOrToken<rowan::GreenNode, rowan::GreenToken>`.
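That conversion might look roughly like this (a sketch against rowan's types; the real impl may differ in detail):

```rust
use rowan::{GreenNode, GreenToken, NodeOrToken, SyntaxKind};

struct Token<'s> {
    kind: SyntaxKind,
    text: &'s str,
}

impl From<Token<'_>> for NodeOrToken<GreenNode, GreenToken> {
    fn from(token: Token) -> Self {
        // The expensive GreenToken is only allocated here, at the point
        // the token actually enters the green tree.
        NodeOrToken::Token(GreenToken::new(token.kind, token.text))
    }
}
```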

### Use a single shared `Vec` to avoid allocations

Creating a `rowan::GreenNode` requires an iterator whose items are `rowan::NodeOrToken<rowan::GreenNode, rowan::GreenToken>`s.

At first, we might create a new `Vec` for each node. However, since a syntax tree contains many nodes, this creates many temporary `Vec`s, which causes many allocations and deallocations.

Instead, inspired by `rowan::GreenNodeBuilder`, we create a single shared `Vec` in the parser.
While parsing, we push child tokens and nodes into that `Vec`;
when finishing a node, we call `Vec`'s `drain` method, which removes a range of children and returns them as an iterator, and that iterator can be used to create a `rowan::GreenNode`.

But how do we know which range to drain? The end of the range is simply the current length of the `Vec`.
For the start, again inspired by `rowan::GreenNodeBuilder`, we record a `usize` holding the length of the `Vec` at the moment the node starts.

These start indices naturally behave like a stack, yet we never need to materialize an explicit one.
This is where we differ from how `rowan::GreenNodeBuilder` is implemented: we avoid another `Vec` and the `unwrap` calls that come with it.

There is nothing like `rowan::GreenNodeBuilder::start_node_at` in our implementation;
instead, we pass the start index to other functions manually, but this is cheap since it's just a `usize`.
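Putting it together, here is a minimal sketch of the scheme (assuming rowan's green tree types; `Parser`, `start_node`, and `finish_node` are illustrative names):

```rust
use rowan::{GreenNode, GreenToken, NodeOrToken, SyntaxKind};

type Child = NodeOrToken<GreenNode, GreenToken>;

struct Parser {
    // One Vec shared by all nodes, instead of a temporary Vec per node.
    children: Vec<Child>,
}

impl Parser {
    // Starting a node just records where its children will begin.
    fn start_node(&self) -> usize {
        self.children.len()
    }

    // Finishing a node drains everything pushed since `start` into a
    // new GreenNode, then pushes that node as a child of its parent.
    fn finish_node(&mut self, start: usize, kind: SyntaxKind) {
        let node = GreenNode::new(kind, self.children.drain(start..));
        self.children.push(NodeOrToken::Node(node));
    }
}
```

Because each node drains only its own suffix of the `Vec` and then pushes itself back as a single child, nesting works out naturally without an explicit stack.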

### Result

The code snippet below is used in the benchmark:

```wasm
(module
    (func $f1 (param $p1 i32) (param $p2 i32) (result i32)
        (i32.add (local.get $p1) (local.get $p2))
    )
    (global $g1 f64 (f64.const 0))
    (func $f2 (result f64)
        (global.get $g1)
    )
    (type $t (func (result f64)))
    (func $f3 (type $t)
        (call $f2)
    )
    (func (export "f32.min_positive") (result i32) (i32.reinterpret_f32 (f32.const 0x1p-149)))
    (func (export "f32.min_normal") (result i32) (i32.reinterpret_f32 (f32.const 0x1p-126)))

    (rec (type $r (sub $t (struct (field (ref $r))))))
    (global  (mut f32) (f32.const -13))
    (rec
        (type $t1 (sub (func (param i32 (ref $t3)))))
        (type $t2 (sub $t1 (func (param i32 (ref $t2)))))
    )
    (global  (mut f64) (f64.const -14))

    (func (export "f32.max_finite") (result i32) (i32.reinterpret_f32 (f32.const 0x1.fffffep+127)))
    (func (export "f32.max_subnormal") (result i32) (i32.reinterpret_f32 (f32.const 0x1.fffffcp-127)))
)
```

Here are the benchmark results after all optimizations: roughly 59.6 µs for the old parser versus 13.1 µs for the new one, about a 4.5× speedup.

```
parser/old  time:   [59.473 µs 59.559 µs 59.648 µs]
            change: [-2.2351% -1.9312% -1.6268%] (p = 0.00 < 0.05)

parser/new  time:   [13.004 µs 13.120 µs 13.299 µs]
            change: [-1.6774% +0.4516% +4.8905%] (p = 0.82 > 0.05)
```