我重写了 PostHog 的 SQL 解析器，速度提升了 70 倍，而且几乎没怎么看代码。

我重写了 PostHog 的 SQL 解析器，速度提升了 70 倍，而且几乎没怎么看代码。
I rewrote PostHog's SQL parser, 70x faster, while barely looking at the code

原始链接: https://posthog.com/blog/sql-parser

PostHog 最近对其 SQL 解析器进行了彻底改造，用 Rust 编写的自定义“手写”版本取代了原先由 ANTLR 生成的 C++ 解析器。借助人工智能辅助开发，此次重写在保持实际查询功能完全一致的同时，实现了约 70 倍的速度提升（在生产环境中最高可达 454 倍）。作者利用多个并行的 Claude Code 会话生成了 1.6 万行的代码库，并以原始的 ANTLR 解析器作为验证的“预言机”。一套完善的开发循环是成功的关键： * **基于属性的测试：** 使用 `Hypothesis` 生成海量的 SQL 排列组合，自动发现边缘情况。 * **自动化反馈：** 将失败的测试持续集成到共享的回归测试套件中，确保修复方案稳健可靠。 * **上下文管理：** 在实现修复前，提示 AI 读取参考语法和源代码，避免了上下文窗口退化问题。 * **影子测试：** 新解析器在生产流量中运行，在全面部署前验证了数百万条查询，确保结果零偏差。该项目表明，人工智能驱动的开发结合严格的自动化测试，使得高性能、复杂基础设施的维护——过去仅限于专业领域专家——变得更加易于实现且高效。

抱歉。

After the success of using agents to improve query performance through autoresearch, I wanted to try something more ambitious.

I rewrote PostHog's SQL parser using multiple long-running Claude Code sessions in parallel. The result was 16K lines of "hand"-rolled parser code, 5K lines of tooling, a few more K of tests, and a ~70x speed up.

The new parser is equivalent to the previous one for all realistic queries, only differing for a tiny subset of queries written by an evil trickster deity (there’s a test for SELECT SELECT FROM FROM WHERE WHERE AND AND which is completely valid SQL).

Here's how I did it and what I learned along the way.

Why does PostHog even have an SQL parser?

PostHog lets you access your data directly with SQL. We transpile your SQL to raw ClickHouse SQL because:

We want to present a logical view of your data which is independent of the physical layout in the database.
This lets us change things at the database layer without breaking existing queries.
We can also add a bunch of performance optimizations and access controls.

The majority of PostHog tools (e.g. product analytics, session replay, error tracking) have queries written in SQL and they go through the exact same transpilation process. But before we can do this transpilation, we need to use a parser to turn the SQL into an AST (Abstract Syntax Tree) that then gets transpiled into ClickHouse SQL.

The parser is the first thing that touches a query, meaning it operates on untrusted input. Everything downstream, like access controls and optimizations, operate on the tree it produces.

We didn't write this parser by hand because, at least pre-AI-coding, parsers were extremely difficult to maintain. Writing one without AI would have taken months and likely not been worth it, even if it had dramatically improved our p95 response time.

Instead, we use ANTLR, a state-of-the-art, open source parser generator. You provide your grammar declaratively in a .g4 file and ANTLR generates most of the parser code for you. We use the C++ version, so it’s already in a “fast” language. Unlike our flags rewrite, the speedup wouldn't just come from moving to Rust.

ANTLR is extremely powerful and flexible, but the trade-off is that it does a lot more work for each token that it visits. It compiles your grammar into an ATN (essentially an NFA-with-a-stack) and has a generic interpreter walk a graph at runtime. There’s no hand-written parseExpression(); everything happens through an additional layer of abstraction and indirection.

Additionally, ANTLR supports arbitrary dynamic lookahead, so if there are multiple possible alternatives it has to simulate every interpretation in lockstep until only one interpretation is valid. It’s extremely well optimized but a graph-walking interpreter can never be as fast as a hand-rolled recursive-descent parser.

With AI, it is much more possible to write and maintain a hand-rolled parser. Sadly, it's not as easy as telling Claude to "write a new parser in Rust, make no mistakes." It did, in fact, make a lot of mistakes, kept doubting whether such a rewrite was even possible, and wanted to call it a day after each round of coding. To be honest, I didn’t really know if it was possible either.

I tested two approaches in parallel:

One focused on performance. I knew that, if it worked, the fastest possible parser would be recursive-descent with a Pratt expression loop, adding lookahead and backtracking only where necessary.
The other focused on an approach most likely to result in a successful parser. It followed ANTLR’s behavior as closely as possible, but implemented the transitions in explicit code rather than as generic graph traversal.

In the end, both of those approaches worked about as well as each other, but I wouldn’t know this until I’d been working on it for a couple of days.

My goal was complete agreement with the oracle (i.e. the existing C++ parser) for all realistic queries and to get as close as possible for contrived ones. Having an oracle was critical for how I developed the new parser, because I could essentially do test-driven-development by finding some SQL that the parsers disagreed on, fixing the new parser to agree, and repeating.

Generating disagreements, or test cases, was pretty easy to start with, because we already had many regression tests written while developing the original parser. Once those were all passing, that’s where things started to get interesting.

Property-based testing

I had previously found bugs in our SQL transpiler using Hypothesis, a PBT (property-based testing) library. You define some property of your code plus the inputs it takes and it will try to generate inputs where that property does not hold.

To give a specific example, the property of my new parser is that it agrees with the oracle. The input is an SQL query. This means that Hypothesis is going to try to find an SQL query where my new parser does not agree with the oracle.

I had to tell Hypothesis how to generate interesting SQL so I (with Claude) wrote a tool to codegen an SQL generator based on the ANTLR grammar file. I have to admit that I chuckled a bit when writing a new SQL parser led to writing a new parser for .g4 files too. Later on, I also added a step to add extra permutations to the generated SQL like swapping tokens or adding parentheses.

Prompt engineering against brittle fixes

PBT could reliably generate new test cases, and my development loop was working well, but Claude kept making brittle fixes. For example, it would fix one case by adding a one-token lookahead and later realize that it needed a two-token lookahead instead. I was regularly hitting a maxed context window and compacting, so I suspect it had just “forgotten” what the actual grammar or reference parser looked like.

This could be solved by some basic prompt engineering. I simply told it to load both the grammar file and the relevant C++ source code into context immediately before writing any code to fix a particular divergence. This took me longer than I’d like to admit to figure out.

Maxing out and thinking hard

At this point, I wanted to keep my CPU maxed on PBT and my Claude inference maxed writing the parser, so I wrote some tooling to have the PBT run constantly in the background, writing new failing test cases to a file rather only surfacing them. Claude could fetch them when it had nothing else to work on.

I had a few other ways of generating failing test cases such as pulling anonymized queries from our production query log. Hilariously one of the most effective was to tell Claude to “think really hard about edge cases" in a background agent.

The two parallel parser approaches shared their regression suites, so any failing test case found in one session was shared with the other.

Hypothesis will also "reduce" test cases for you, turning them into a minimal reproduction, but I couldn’t use that with SQL from other sources. For those I used ShrinkRay instead.

Later on, I added code-coverage-guided test case generation, which gives a better distribution of generated SQL. With coverage feedback, the generator can tell which constructs it hasn't exercised yet and bias towards those. This wasn't necessary to hit 100% accuracy on a production corpus, but it did help me find some very subtle test cases.

The final iteration of my loop looked something like this:

⁠Generate new test failures from PBT, real corpus, regression tests, and "think really hard about edge cases"
Add a shrunk version of the failures to an expanding list of regression tests
Think hard about the best way to fix this, prefer general solutions if possible, read the grammar and C++ source for how the reference parser handles it
Make the fix and print a one-paragraph summary for the human operator to read
Run the regression suite to make sure everything passes
Re-run the loop autonomously

Due to the new parser being so much faster, I could run this loop in "shadow mode" with our existing C++ parser in production and report if there are any divergences.

When comparing with the production query log, I only ever tested ~50K queries. In shadow mode, I was able to test millions of parses quickly and there were zero divergences. I’d planned to leave it running for a few days, but that was such a strong result that I switched over production traffic (with a 0.1% “reverse shadow”) after a couple of hours.

It now produced identical output (AST + source position) to the C++ ANTLR-based parser, and the performance results (in yellow) almost don't look real:

Benchmark results showing the new parser