Parsing Advances

Original link: https://matklad.github.io/2025/12/28/parsing-advances.html

## Toy Parsers and Debugging Infinite Loops

Over a Christmas break, inspired by the Resilient LL Parsing Tutorial, the author revisits writing a toy parser. The goal is not error-recovery resilience for its own sake, but rather producing a syntax tree and a collection of diagnostics instead of failing on the first error. A key pitfall is the potential for infinite loops/recursion, particularly when a parsing function may not consume a token on an error path. Traditionally this is handled with "fuel" (a token-consumption counter) and by carefully tracking which functions *always* consume a token. A more effective solution emerged, however: assert that the parser advances wherever it is expected to. This adds `advance_push()` and `advance_pop()` methods to the `Parser` class to track token consumption; the assertion in `advance_pop()` then verifies that progress was actually made. The approach not only yields immediate error reports but also *materializes* the map of advancing functions directly in the code, making debugging easier and lowering the risk of infinite loops. The author demonstrates this with a corrected Pratt parsing function, showing a clear error message that points right at the problem.


## Original Text

I find myself writing yet another toy parser, as one does during a Christmas break. It roughly follows the Resilient LL Parsing Tutorial. Not because I need resilience, but mostly because I find producing a syntax tree and a collection of diagnostics a more natural fit for the problem than bailing out on the first error.

One practical pitfall with the approach is infinite loops/recursion. Resilience sometimes means not consuming a token, and, if you do that in a loop or a Pratt recursive call, you’ll get yourself an annoying-to-debug error:

running 1 test from ./src/corpus_test.ts
corpus ...Task test deno test --allow-read=./src/corpus --allow-write=./src/corpus "--" "--update"
Check src/corpus_test.ts

<--- Last few GCs --->
[26641:0x9d1574000]     7390 ms: Mark-Compact (reduce) 3924.9 (3927.3) -> 3924.9 (3926.3) MB, pooled: 0.0 MB, 1224.00 / 0.00 ms (+ 0.3 ms in 1 steps since start of marking, biggest step 0.3 ms, walltime since start of marking 1232 ms) (average mu = 0.200,
[26641:0x9d1574000]     8804 ms: Mark-Compact (reduce) 4009.9 (4011.3) -> 4009.9 (4011.3) MB, pooled: 0.0 MB, 1294.67 / 0.00 ms (+ 0.2 ms in 1 steps since start of marking, biggest step 0.2 ms, walltime since start of marking 1302 ms) (average mu = 0.141,


#
# Fatal JavaScript out of memory: Ineffective mark-compacts near heap limit
#
==== C stack trace ===============================

    0   deno                                0x0000000102ce8404 v8::base::debug::StackTrace::StackTrace() + 24
    1   deno                                0x0000000102ceeb9c v8::platform::(anonymous namespace)::PrintStackTrace() + 24
    2   deno                                0x0000000102ce4094 v8::base::FatalOOM(v8::base::OOMType, char const*) + 68
    3   deno                                0x0000000102d3a7a8 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) + 296
    4   deno                                0x0000000102f37378 v8::internal::Heap::stack() + 0
    5   deno                                0x0000000102f3581c v8::internal::Heap::CheckMemoryPressure() + 0
    6   deno                                0x0000000102ead4f8 v8::internal::StackGuard::HandleInterrupts(v8::internal::StackGuard::InterruptLevel) + 504
    7   deno                                0x000000010335fe44 v8::internal::Runtime_HandleNoHeapWritesInterrupts(int, unsigned long*, v8::internal::Isolate*) + 304
    8   deno                                0x00000001043887b4 Builtins_CEntry_Return1_ArgvOnStack_NoBuiltinExit + 84
    9   ???                                 0x0000000126997874 0x0 + 4942559348
    10  ???                                 0x000000012698a758 0x0 + 4942505816
...

For a concrete example, you might parse a function argument list using code like this:

const result: ast.Expression[] = [];
p.expect("(");
while (!p.eof() && !p.at(")")) {
  result.push(expression(p));
  if (!p.at(")")) p.expect(",");
}
p.expect(")");
return result;

The implicit contract here is that expression consumes at least one token, even if there are errors in the source code. If there’s some token that makes expression bail without consuming anything, the code loops forever, and you’ll need a debugger to get at the stack trace!
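To make the failure mode concrete, here is a hypothetical sketch (not the article's code) of an expression that violates the contract; the "}" check and the "ExpressionError" node are assumptions for illustration only:

function expression(p: Parser): ast.Expression {
  const token = p.token();
  // Suppose `}` can never start an expression:
  if (token.tag === "}") {
    // Resilient path: a diagnostic would be recorded here, but no token is
    // consumed, so the argument-list loop above spins on this `}` forever.
    return { tag: "ExpressionError", location: token.location } as ast.Expression;
  }
  return expression_delimited(p); // normal path, consumes at least one token
}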

The way I solved this issue traditionally is via a combination of two techniques:

Fuel: the parser has a fuel: Cell<u32> field, which is decremented even by “readonly” lookahead methods, and topped up every time the parser consumes a token. Fuel is useful to make your parser crash somewhat cleanly, though the crash is typically still removed from the problematic function by several stack frames.

The second technique is to maintain a mental map of functions which always consume at least one token of input, and functions which might bail without consuming anything. And, whenever you write a loop or a recursive call, consult this map to be sure to call at least one token-consuming function. Hard and error-prone!
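To make the first technique concrete, here is a minimal sketch of fuel in TypeScript (class name, fuel limit, and method bodies are mine; the article describes the field as a Rust Cell<u32>):

class FueledParser {
  private tokens: ast.Token[] = [];
  private index: number = 0;
  private fuel: number = 256;

  at(tag: ast.TokenTag): boolean {
    // Even "readonly" lookahead burns fuel; running dry means some loop is
    // not making progress, so crash cleanly instead of looping until OOM.
    this.fuel -= 1;
    if (this.fuel <= 0) {
      throw new Error("parser is stuck: 256 lookaheads without consuming a token");
    }
    return this.tokens[this.index]?.tag === tag;
  }

  bump(): ast.Token {
    this.fuel = 256; // consuming a token always counts as progress
    const token = this.tokens[this.index];
    this.index += 1;
    return token;
  }
}

As noted above, the resulting crash usually fires a few frames away from the loop that is actually stuck.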

Well, I think I’ve figured something better today! You can assert that the parser did advance when you expect it to. The smaller benefit here is that if the parser didn’t advance, you get an immediate error. The bigger benefit is that these asserts materialize the mental map of advancing functions in the source code, so it doesn’t have to be mental anymore!

This seems like an obvious idea in retrospect, but, well, took me more than one parser to figure it out!

Concretely, I came up with the following base parser API:

class Parser {
  private tokens: ast.Token[];
  private index: number = 0;
  // Stack of positions recorded at points where the parser must make progress.
  private advances: number[] = [];

  // Remember the current position; pair with advance_pop() or advance_drop().
  advance_push() {
    this.advances.push(this.index);
  }
  // Assert that at least one token was consumed since the matching push.
  advance_pop() {
    const advance = this.advances.pop();
    assert(advance !== undefined);
    assert(advance < this.index);
  }
  // Discard the recorded position without checking progress (for early exits).
  advance_drop() {
    const advance = this.advances.pop();
    assert(advance !== undefined);
  }
}
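The examples also rely on a few lookahead/consumption helpers (eof, at, token, bump, expect) that aren't shown above. Here is a plausible sketch; the bodies are my assumptions, and the only property that matters is that expect stays resilient, recording a diagnostic and consuming nothing on a mismatch:

class Parser {
  // ...fields and advance_* methods as above; diagnostics added for the sketch...
  private tokens: ast.Token[] = [];
  private index: number = 0;
  private diagnostics: string[] = [];

  eof(): boolean {
    return this.index >= this.tokens.length;
  }
  token(): ast.Token {
    // A real parser would likely keep an EOF sentinel token at the end.
    return this.tokens[this.index];
  }
  at(tag: ast.TokenTag): boolean {
    return !this.eof() && this.token().tag === tag;
  }
  bump(): ast.Token {
    const token = this.token();
    this.index += 1;
    return token;
  }
  expect(tag: ast.TokenTag): void {
    if (this.at(tag)) {
      this.bump();
    } else {
      this.diagnostics.push(`expected ${tag}`); // report, but do not consume
    }
  }
}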

And here is the buggy function that led to the error at the start of the article:

function expression_pratt(
  p: Parser,
  left: ast.TokenTag,
): ast.Expression {
  let lhs: ast.Expression = expression_delimited(p);

  while (p.at("(")) {
    lhs = expression_call(p, lhs);
  }

  while (true) {
    const right = p.token();
    if (expression_pratt_right_binds_tighter(left, right.tag)) {
      const rhs = expression_pratt(p, right.tag);
      lhs = {
        tag: "ExpressionBinary",
        location: right.location,
        operator: right.tag as ast.BinaryOperation,
        lhs,
        rhs,
      };
    } else {
      return lhs;
    }
  }
}

The same function, but with the advance assertions:

function expression_pratt(
  p: Parser,
  left: ast.TokenTag,
): ast.Expression {
  let lhs: ast.Expression = expression_delimited(p);

  while (p.at("(")) {
    lhs = expression_call(p, lhs);
  }

  while (true) {
    p.advance_push();
    const right = p.token();
    if (expression_pratt_right_binds_tighter(left, right.tag)) {
      const rhs = expression_pratt(p, right.tag);
      lhs = {
        tag: "ExpressionBinary",
        location: right.location,
        operator: right.tag as ast.BinaryOperation,
        lhs,
        rhs,
      };
    } else {
      p.advance_drop();
      return lhs;
    }
    p.advance_pop();
  }
}

The new error message:

running 1 test from ./src/corpus_test.ts
corpus ... FAILED (11ms)

 ERRORS

corpus => ./src/corpus_test.ts:47:6
error: Error: assertion failed
  if (!condition) throw new Error("assertion failed");
                        ^
    at assert (./src/stdx.ts:2:25)
    at Parser.advance_pop (./src/parse.ts:132:5)
    at expression_pratt (./src/parse_grammar.ts:169:7)
    at expression (./src/parse_grammar.ts:143:10)
    at expression_block (./src/parse_grammar.ts:305:21)
    at declaration_fun (./src/parse_grammar.ts:73:7)
    at declaration (./src/parse_grammar.ts:25:12)
    at Module.file (./src/parse_grammar.ts:10:15)
    at Module.parse (./src/parse.ts:13:18)
    at ast_dump (./src/corpus_test.ts:85:22)

and the fix:

  while (true) {
    p.advance_push();
    const right = p.token();
    if (expression_pratt_right_binds_tighter(left, right.tag)) {
      p.bump(); // the fix: consume the operator token before recursing
      const rhs = expression_pratt(p, right.tag);
      lhs = {
        tag: "ExpressionBinary",
        location: right.location,
        operator: right.tag as ast.BinaryOperation,
        lhs,
        rhs,
      };
    } else {
      p.advance_drop();
      return lhs;
    }
    p.advance_pop();
  }
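As a closing usage note, the same guard also covers the argument-list loop from the start of the article; a sketch under the same assumptions:

const result: ast.Expression[] = [];
p.expect("(");
while (!p.eof() && !p.at(")")) {
  // Each iteration must consume at least one token; if it doesn't, the
  // assert in advance_pop() fails right here instead of deep inside an
  // out-of-memory stack trace.
  p.advance_push();
  result.push(expression(p));
  if (!p.at(")")) p.expect(",");
  p.advance_pop();
}
p.expect(")");
return result;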