Show HN: How SQL Parsers Work

Original link: https://nishchith.com/sql-parsers/

## SQL Parsers: A Deep Dive

Even amid the hype around probabilistic systems, heuristic SQL parsing remains essential. This exploration draws on Atlan's experience with SQL parsing for query engines and lineage tracking, distilling lessons from evaluating many parsers (SQLGlot, sqlparser-rs, Apache Calcite, and others).

All SQL parsers follow one pipeline: a **lexer** breaks SQL into tokens, a **parser** builds an abstract syntax tree (AST) from the grammar, and **semantic analysis** adds meaning using schema information. Parsing is hard because SQL dialects vary widely: one "standard" plus many proprietary extensions.

Parsers handle dialects in different ways: through flags, parameterized grammars, or a separate grammar per database. Crucially, a parser is *not* a query engine; it focuses on understanding SQL structure, not executing it.

Understanding what a given parser can do matters. Some excel at lineage extraction or transpilation, while others lay the foundation for building query engines. Choosing the right tool depends on the need: a data catalog needs a parser, not a full engine. Ultimately, the proliferation of SQL parsers reflects the complexity and fragmentation of the SQL landscape.

A recent Hacker News discussion centered on a blog post about understanding SQL parsers (nishchith.com). While the article set out to compare SQL parsing across database systems (Postgres, MySQL, SQL Server, BigQuery, Snowflake), commenters found it unexpectedly shallow. Many expressed disappointment at the lack of technical depth on *how* parsers work at the code level (lexing, parsing, ASTs). One commenter questioned whether the article was actually written by a person or largely generated with an LLM like Claude, citing inconsistencies and surface-level analysis. Interestingly, the discussion also surfaced a related effort, "Vibe SQL," in which participants are trying to *use* agentic tools to build a SQL implementation from scratch; a parser is already available on crates.io. The conversation raised questions about the trade-off between the real productivity gains of AI assistance and deep understanding.

Original Article

It’s rare to talk about heuristic systems when there’s so much hype around probabilistic ones. This post is boring, but stay with me.

My work at Atlan has touched SQL parsing since the beginning. I made early contributions to the query engine with policy-based authorization powering Insights, and we generate SQL lineage by parsing queries to power column-level lineage. Along the way, my colleagues and I have evaluated a lot of SQL parsers, both open source and commercial: SQLGlot, sqlparser-rs, sqloxide, Apache Calcite, Gudusoft GSP, JSqlParser, and others. Each with different tradeoffs.

This post is my attempt to distill what I’ve learned. I’m not an expert, just someone curious enough to ask: why do so many SQL parsers exist? And what are they actually doing under the hood?

The idea for this writeup came during a drive back from SFO after dropping a friend off at the airport. That turned into about 10 hours of conversation with Claude to pull it all together.




What is a SQL Parser?

A SQL parser reads SQL text and converts it into a structured representation, usually a tree, that computers can work with. It’s the “understanding” step.

INPUT:  "CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'"
                                            ↓
                                       [SQL PARSER]
                                            ↓
OUTPUT: A tree structure representing the query's meaning

Think of it like how your brain parses a sentence to extract meaning. The parser does the same for SQL.


The Full Pipeline

Every SQL parser follows the same fundamental pipeline. This isn’t a design choice. It’s a consequence of how language processing works.

Lexer: Breaks the SQL string into tokens. Keywords, identifiers, operators, literals. Like recognizing words in a sentence.

Parser: Takes tokens and builds a tree based on grammar rules. Like understanding “subject-verb-object” structure.

AST: The Abstract Syntax Tree. A clean, navigable representation of the query’s structure.

Semantic Analysis: Adds meaning. Does this table exist? What type is this column? This is where you need schema information.


Lexical Analysis (The Lexer)

The lexer converts a stream of characters into meaningful chunks called tokens.

Input:

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Output:

Token(CREATE)
Token(TABLE)
Token(IDENTIFIER, "active_users")
Token(AS)
Token(SELECT)
Token(IDENTIFIER, "id")
Token(COMMA)
Token(IDENTIFIER, "name")
Token(COMMA)
Token(IDENTIFIER, "email")
Token(FROM)
Token(IDENTIFIER, "users")
Token(WHERE)
Token(IDENTIFIER, "status")
Token(EQUALS)
Token(STRING, "active")

The lexer handles dialect-specific decisions early:

| Decision | Standard SQL | MySQL | SQL Server | PostgreSQL |
|---|---|---|---|---|
| Identifier quote | "name" | `name` | [name] | "name" |
| String quote | 'text' | 'text' or "text" | 'text' | 'text' |
| Line comment | -- | -- or # | -- | -- |
| Case sensitivity | Insensitive | Insensitive | Insensitive | Insensitive (keywords) |

Lexers are simple. They look at one character (or a few) at a time, don't need to understand nesting or structure, and can be implemented with state machines or regular expressions. This is why SQLGlot's optional Rust tokenizer gives a roughly 30% speedup: tokenization is pure CPU-bound character scanning.
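As a sketch of the idea (not any real library's tokenizer), a minimal regex-based lexer for statements like the one above might look like this; the token names and keyword list are illustrative:

```python
import re

# Token patterns, tried in order. Keywords are matched as identifiers
# first and reclassified, since they share the same character shape.
TOKEN_SPEC = [
    ("STRING",     r"'[^']*'"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("EQUALS",     r"="),
    ("COMMA",      r","),
    ("WS",         r"\s+"),
]
KEYWORDS = {"CREATE", "TABLE", "AS", "SELECT", "FROM", "WHERE"}
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(sql):
    tokens = []
    for m in MASTER.finditer(sql):
        kind, text = m.lastgroup, m.group()
        if kind == "WS":
            continue                                   # skip whitespace
        if kind == "IDENTIFIER" and text.upper() in KEYWORDS:
            tokens.append((text.upper(), text))        # reclassify keyword
        elif kind == "STRING":
            tokens.append(("STRING", text[1:-1]))      # strip the quotes
        else:
            tokens.append((kind, text))
    return tokens

sql = "CREATE TABLE active_users AS SELECT id, name FROM users WHERE status = 'active'"
print(tokenize(sql)[:4])  # first four tokens
```

Reclassifying keywords after matching them as identifiers is a common trick: it keeps the regex table small and makes it easy to treat keywords as identifiers in dialects that allow that.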


Syntactic Analysis (The Parser)

The parser reads tokens and builds a tree structure based on grammar rules.

Input tokens:

[CREATE, TABLE, active_users, AS, SELECT, id, COMMA, name, COMMA, email, FROM, users, WHERE, status, =, 'active']

Grammar rule:

ctas_stmt    := CREATE TABLE table_name AS select_stmt
select_stmt  := SELECT column_list FROM table_name [WHERE condition]

Parser thinks:

  • “I see CREATE TABLE… this must be a ctas_stmt”
  • “Next I need a table_name… I see ‘active_users’”
  • “Next I need AS… got it”
  • “Now I need a select_stmt…”
  • “I see SELECT… parsing the inner query”
  • “Column list: id, name, email”
  • “FROM users”
  • “WHERE status = ‘active’”

The parser is essentially a state machine following grammar rules, consuming tokens and building tree nodes.
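The "parser thinks" walkthrough above can be sketched as a tiny recursive-descent parser over a pre-tokenized list. This toy handles only this one CTAS shape; the token and node names are made up for illustration:

```python
def parse_ctas(tokens):
    """Parse: CREATE TABLE name AS SELECT cols FROM name [WHERE col = literal]."""
    pos = 0

    def expect(kind):
        nonlocal pos
        tok_kind, text = tokens[pos]
        if tok_kind != kind:
            raise SyntaxError(f"expected {kind}, got {tok_kind} at token {pos}")
        pos += 1
        return text

    expect("CREATE"); expect("TABLE")
    table = expect("IDENT")
    expect("AS"); expect("SELECT")
    columns = [expect("IDENT")]
    while pos < len(tokens) and tokens[pos][0] == "COMMA":   # column list
        expect("COMMA")
        columns.append(expect("IDENT"))
    expect("FROM")
    source = expect("IDENT")
    where = None
    if pos < len(tokens) and tokens[pos][0] == "WHERE":      # optional filter
        expect("WHERE")
        left = expect("IDENT"); expect("EQ"); right = expect("STRING")
        where = {"op": "=", "left": left, "right": right}
    return {"type": "ctas", "table": table,
            "query": {"columns": columns, "from": source, "where": where}}

tokens = [("CREATE", "CREATE"), ("TABLE", "TABLE"), ("IDENT", "active_users"),
          ("AS", "AS"), ("SELECT", "SELECT"), ("IDENT", "id"), ("COMMA", ","),
          ("IDENT", "name"), ("FROM", "FROM"), ("IDENT", "users"),
          ("WHERE", "WHERE"), ("IDENT", "status"), ("EQ", "="), ("STRING", "active")]
print(parse_ctas(tokens))
```

Each grammar rule becomes a stretch of code that consumes the tokens it expects; real parsers are the same idea with one function per rule and far more rules.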

Why Parsing is Harder Than Lexing

Lexing is pattern matching. Parsing is about structure.

SELECT * FROM (SELECT * FROM (SELECT * FROM t))

A lexer sees: (, (, (, ), ), )

A parser must match each ( with its corresponding ).

This is why regex alone can't parse SQL: regular expressions can't count arbitrarily nested balanced parentheses. That limitation is mathematically proven (regular languages vs. context-free languages).
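As a tiny illustration of the difference, even checking balance alone requires a counter, which is exactly what a regular expression lacks:

```python
def parens_balanced(sql):
    """True iff every '(' has a matching ')'. Needs a counter, not a regex."""
    depth = 0
    for ch in sql:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
    return depth == 0

print(parens_balanced("SELECT * FROM (SELECT * FROM (SELECT * FROM t))"))  # True
```

A full parser generalizes this: instead of one counter, it keeps a stack of partially built tree nodes.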


The Abstract Syntax Tree

The AST represents the query’s structure in a clean, navigable tree.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
CreateTableAsSelect
├── table_name: "active_users"
└── query:
    └── SelectStatement
        ├── columns:
        │   ├── Column { name: "id" }
        │   ├── Column { name: "name" }
        │   └── Column { name: "email" }
        ├── from:
        │   └── Table { name: "users" }
        └── where:
            └── BinaryOp
                ├── left: Column { name: "status" }
                ├── op: Equals
                └── right: Literal { value: "active" }

The AST is the central data structure. Everything downstream (analysis, transformation, code generation) operates on it.

What You Can Do With an AST

| Operation | Description |
|---|---|
| Traversal | Walk the tree, collect information |
| Transform | Rewrite nodes (add filters, change structure) |
| Generate | Convert back to SQL string (round-trip) |
| Transpile | Generate SQL for a different dialect |
| Lineage | Trace where data comes from |
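As a sketch of the Traversal and Lineage rows (using a toy dict-shaped AST, not any library's node types), walking the tree to collect every referenced table is the core of table-level lineage:

```python
def walk(node):
    """Yield every dict node in a nested dict/list AST, depth-first."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)

def referenced_tables(ast):
    """Collect the names of all table nodes, wherever they appear."""
    return sorted({n["name"] for n in walk(ast) if n.get("type") == "table"})

ast = {"type": "select",
       "columns": [{"type": "column", "name": "id"}],
       "from": {"type": "table", "name": "users"},
       "joins": [{"type": "join", "source": {"type": "table", "name": "orders"}}]}
print(referenced_tables(ast))  # ['orders', 'users']
```

Transform, generate, and transpile are the same traversal with a different action at each node: rewrite it, print it, or print it in another dialect's syntax.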

Syntax vs Semantics

This is where things get interesting.

Syntactic analysis (parsing): Is this valid SQL grammar?

Semantic analysis: Does this SQL make sense given the database schema?

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'
| Syntactic (Parser answers) | Semantic (Analyzer answers) |
|---|---|
| ✓ Valid CTAS structure | Does 'users' table exist? |
| ✓ Valid SELECT clause | Does 'users' have 'id', 'name', 'email' columns? |
| ✓ Valid WHERE condition | Is 'status' a valid column? |
| ✓ Correct keyword order | Is comparing status to a string valid? |

Key insight: A parser with no schema information can only do syntactic analysis. Semantic analysis requires external knowledge about the database.

What Syntactic Analysis Catches

SELECT FROM users        -- Missing column list
FROM users SELECT *      -- Wrong keyword order
SELECT (a + b            -- Unbalanced parentheses

What Syntactic Analysis Cannot Catch

SELECT * FROM nonexistent    -- Parser doesn't know if table exists
SELECT foo FROM users        -- Parser doesn't know columns
WHERE name > 5               -- Parser doesn't know types
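A minimal sketch of what the semantic step adds, assuming a schema represented as a plain dict (the function and names are illustrative, not any analyzer's API):

```python
# Toy schema: table name -> set of column names.
SCHEMA = {"users": {"id", "name", "email", "status"}}

def check_semantics(table, columns, schema):
    """Return a list of semantic errors for a simple single-table SELECT."""
    errors = []
    if table not in schema:
        errors.append(f"unknown table: {table}")
        return errors                       # can't check columns without the table
    for col in columns:
        if col not in schema[table]:
            errors.append(f"unknown column: {table}.{col}")
    return errors

print(check_semantics("users", ["id", "nickname"], SCHEMA))
```

The parser alone could never produce these errors: the schema is exactly the external knowledge that separates semantic from syntactic analysis.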

Column-Level Lineage

Lineage traces where data comes from and where it goes.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Table-level lineage (easy, no schema needed):

READ:  [users]
WRITE: [active_users]

Column-level lineage (needs schema):

active_users.id    ← users.id (direct)
active_users.name  ← users.name (direct)
active_users.email ← users.email (direct)

Data Flow vs Control Flow

This is where lineage gets nuanced.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Question: Does status contribute to the lineage of active_users?

| Perspective | status in lineage? | Reasoning |
|---|---|---|
| Data flow | ❌ No | status doesn't appear in the output columns |
| Control flow | ✅ Yes | status affects which rows are included |

Most "lineage" discussions mean data flow. But impact analysis needs both: changing the status column could break this query even though it's not in the output.
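Sticking with a toy dict representation (field names invented for illustration), separating the two kinds of lineage for a single-table CTAS might look like:

```python
def lineage(ctas):
    """Split column lineage into data flow (output columns) and
    control flow (columns that only filter rows)."""
    src = ctas["from"]
    data_flow = {f"{ctas['table']}.{col}": f"{src}.{col}"
                 for col in ctas["columns"]}
    control_flow = [f"{src}.{col}" for col in ctas.get("filter_columns", [])
                    if col not in ctas["columns"]]    # filter-only columns
    return data_flow, control_flow

ctas = {"table": "active_users", "from": "users",
        "columns": ["id", "name", "email"], "filter_columns": ["status"]}
data, control = lineage(ctas)
print(data)     # direct column mappings
print(control)  # columns that shape the result without appearing in it
```

Keeping the two sets separate lets a catalog answer both questions: "where did this column's values come from?" (data flow) and "what could break this query?" (data flow plus control flow).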


SQL Dialects

SQL is a “standard” that nobody fully implements.

The Dialect Landscape

Every database vendor implements a subset of the SQL standard, adds proprietary extensions, and has different syntax for the same operations.

                         ┌─────────────────┐
                         │  SQL Standard   │
                         │  (SQL-92, 99,   │
                         │   2003, 2016)   │
                         └────────┬────────┘
                                  │
         ┌──────────┬─────────┬───┴───┬─────────┬──────────┐
         ▼          ▼         ▼       ▼         ▼          ▼
   ┌──────────┐┌──────────┐┌──────┐┌──────┐┌──────────┐┌──────────┐
   │PostgreSQL││  MySQL   ││Oracle││SQL   ││Snowflake ││ BigQuery │
   │          ││          ││      ││Server││          ││          │
   │ +ARRAY   ││ +LIMIT   ││+ROWNUM│ +TOP ││ +FLATTEN ││ +STRUCT  │
   │ +JSONB   ││ +BACKTICK││+DUAL ││ +[]  ││ +VARIANT ││ +UNNEST  │
   │ +::cast  ││ +AUTO_INC││+PLSQL││+T-SQL││ +$$      ││ +SAFE_   │
   └──────────┘└──────────┘└──────┘└──────┘└──────────┘└──────────┘

   Each dialect: ~80% common SQL + ~20% proprietary extensions

This fragmentation is why so many SQL parsers exist. A parser built for PostgreSQL won’t understand MySQL’s backtick identifiers. A parser built for standard SQL won’t handle Snowflake’s FLATTEN function.

Identifier Quoting

| Dialect | Quote Style | Example |
|---|---|---|
| Standard SQL | "double quotes" | SELECT "Column" FROM "Table" |
| MySQL | `backticks` | SELECT `Column` FROM `Table` |
| SQL Server | [brackets] | SELECT [Column] FROM [Table] |
| BigQuery | `backticks` | SELECT `Column` FROM `Table` |

Row Limiting
-- MySQL, PostgreSQL, SQLite
SELECT * FROM users LIMIT 10 OFFSET 5

-- SQL Server
SELECT TOP 10 * FROM users
-- Or (SQL Server 2012+)
SELECT * FROM users ORDER BY id OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY

-- Oracle (traditional)
SELECT * FROM users WHERE ROWNUM <= 10
-- Oracle 12c+
SELECT * FROM users FETCH FIRST 10 ROWS ONLY
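To make the differences concrete, here is a toy generator (not a real transpiler) that emits "first n rows" syntax for a few of the dialects above:

```python
def limit_query(table, n, dialect):
    """Emit a 'first n rows' query per dialect (illustrative only)."""
    if dialect in ("mysql", "postgres", "sqlite"):
        return f"SELECT * FROM {table} LIMIT {n}"
    if dialect == "sqlserver":
        return f"SELECT TOP {n} * FROM {table}"
    if dialect == "oracle":                       # Oracle 12c+ syntax
        return f"SELECT * FROM {table} FETCH FIRST {n} ROWS ONLY"
    raise ValueError(f"unsupported dialect: {dialect}")

print(limit_query("users", 10, "sqlserver"))  # SELECT TOP 10 * FROM users
```

A real transpiler does the same thing in reverse as well: it parses any of these forms into one AST node (say, a Limit node) and then prints that node back out in the target dialect's syntax.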

Type Casting

-- Standard SQL
CAST(x AS INTEGER)

-- PostgreSQL
x::INTEGER

-- MySQL
CAST(x AS SIGNED)  -- No INTEGER, use SIGNED/UNSIGNED

-- BigQuery
CAST(x AS INT64)
SAFE_CAST(x AS INT64)  -- Returns NULL instead of error

Function Name Differences

The same operation, different names across dialects:

| Operation | PostgreSQL | MySQL | SQL Server | Snowflake |
|---|---|---|---|---|
| Current timestamp | NOW() | NOW() | GETDATE() | CURRENT_TIMESTAMP() |
| String length | LENGTH() | LENGTH() | LEN() | LENGTH() |
| If null | COALESCE() | IFNULL() | ISNULL() | NVL() |
| Date add | + INTERVAL '1 day' | DATE_ADD() | DATEADD() | DATEADD() |

How Parsers Handle Dialects

Dialect flags (sqlparser-rs): Single parser with ~50 boolean flags like supports_filter_during_aggregation, supports_group_by_expr. Simple but can’t handle major syntax differences.

Parameterized grammar (SQLGlot): Base parser with overridable methods per dialect. Each dialect class inherits and overrides specific parsing methods. Flexible but has complex inheritance.

Separate grammars (Gudusoft): One complete grammar file per database. Complete accuracy but high maintenance burden. When a database releases a new version, you update that grammar file.
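The parameterized-grammar approach can be sketched as a base class with dialect subclasses overriding one decision at a time; this is a toy version of the pattern, not SQLGlot's actual class hierarchy:

```python
class Dialect:
    """Base dialect: standard SQL defaults."""
    identifier_quotes = ('"', '"')      # (open, close)

    def quote_identifier(self, name):
        open_q, close_q = self.identifier_quotes
        return f"{open_q}{name}{close_q}"

class MySQL(Dialect):
    identifier_quotes = ("`", "`")      # backticks

class SQLServer(Dialect):
    identifier_quotes = ("[", "]")      # brackets

print(SQLServer().quote_identifier("order"))  # [order]
```

The appeal of the pattern is that each subclass only states where it deviates from the standard; everything else is inherited. The cost, as noted above, is that the inheritance chains get deep when dialects deviate in overlapping ways.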


Parsers vs Query Engines

This distinction trips people up. A parser and a query engine are not the same thing.

┌─────────────────────────────┐    ┌─────────────────────────────────┐
│         PARSER              │    │         QUERY ENGINE            │
│                             │    │                                 │
│  • Lexical analysis         │    │  • Query planning               │
│  • Syntactic analysis       │    │  • Query optimization           │
│  • AST construction         │    │  • Physical execution           │
│  • (Optional) Semantic      │    │  • Data access                  │
│    analysis                 │    │  • Join algorithms              │
│                             │    │  • Aggregation                  │
│  INPUT: SQL string          │    │  • Sorting                      │
│  OUTPUT: AST or errors      │    │  • Result materialization       │
│                             │    │                                 │
│  NO data access             │    │  READS/WRITES data              │
│  NO execution               │    │  EXECUTES query                 │
└─────────────────────────────┘    └─────────────────────────────────┘

Examples:                          Examples:
• SQLGlot                          • PostgreSQL
• sqlparser-rs                     • DuckDB
• JSqlParser                       • Apache Spark
                                   • Presto/Trino

The Full Query Processing Pipeline

When you run a query in a database, it goes through many stages. Parsing is just the first.

┌─────────────────┐
│    SQL String   │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│     PARSER      │  ◄── SQLGlot, sqlparser-rs stop here
└────────┬────────┘
         │ AST
         ▼
┌─────────────────┐
│    ANALYZER     │  ◄── Semantic analysis (name resolution, types)
└────────┬────────┘
         │ Analyzed AST
         ▼
┌─────────────────┐
│    PLANNER      │  ◄── Logical plan (relational algebra)
└────────┬────────┘
         │ Logical Plan
         ▼
┌─────────────────┐
│   OPTIMIZER     │  ◄── Rule-based and cost-based optimization
└────────┬────────┘
         │ Optimized Plan
         ▼
┌─────────────────┐
│   EXECUTOR      │  ◄── Physical operators, data access
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    RESULTS      │
└─────────────────┘

What Parsers Do

| Capability | Parser Does |
|---|---|
| Syntax validation | ✅ Detect SELECT FROM (missing columns) |
| AST construction | ✅ Build tree structure |
| Transpilation | ✅ Convert MySQL → PostgreSQL |
| Lineage extraction | ✅ Find table/column dependencies |
| Query formatting | ✅ Pretty-print SQL |

What Parsers Don’t Do

| Capability | Parser | Engine |
|---|---|---|
| Execute query | ❌ | ✅ |
| Return results | ❌ | ✅ |
| Optimize execution | ❌ | ✅ |
| Choose join order | ❌ | ✅ |
| Manage transactions | ❌ | ✅ |

Why This Matters

If you’re building a data catalog and need lineage, you need a parser. You don’t need a query engine.

If you’re building an IDE with autocomplete, you need a parser. You don’t need to execute anything.

If you’re building a transpiler to migrate queries from Snowflake to Databricks, you need a parser with good dialect support. You don’t need to run those queries.

The tools are different because the problems are different.


At a Glance

| Parser | Language | License | Dialects | One-liner |
|---|---|---|---|---|
| SQLGlot | Python | MIT | 31 | Most feature-complete open-source |
| sqlparser-rs | Rust | Apache 2.0 | ~15 | Fast, minimal, foundation for Rust engines |
| Apache Calcite | Java | Apache 2.0 | ~10 | Full query planning framework |
| Gudusoft GSP | Java/C# | Commercial | 25+ | Enterprise, stored procedure support |
| JSqlParser | Java | Apache/LGPL | ~6 | Mature, simple Java parser |

Layer Support

| Parser | Lexer | Parser | AST | Semantic | Lineage |
|---|---|---|---|---|---|
| SQLGlot | ✅ | ✅ | ✅ | ✅ | ✅ |
| sqlparser-rs | ✅ | ✅ | ✅ | ❌ | ❌ |
| Calcite | ✅ | ✅ | ✅ | ✅ | ⚠️ |
| Gudusoft GSP | ✅ | ✅ | ✅ | ✅ | ✅ |
| JSqlParser | ✅ | ✅ | ✅ | ❌ | ❌ |

Features

| Parser | Transpile | Format | Lineage | Schema | Round-trip |
|---|---|---|---|---|---|
| SQLGlot | ✅ | ✅ | ✅ | ✅ | ✅ |
| sqlparser-rs | ❌ | ⚠️ | ❌ | ❌ | ⚠️ |
| Calcite | ⚠️ | ⚠️ | ❌ | ✅ | ❌ |
| Gudusoft GSP | ✅ | ✅ | ✅ | ✅ | ⚠️ |
| JSqlParser | ❌ | ⚠️ | ❌ | ❌ | ✅ |

Quick Decision Matrix

| Your Situation | Recommended |
|---|---|
| Python, need transpilation | SQLGlot |
| Python, need lineage | SQLGlot |
| Rust query engine | sqlparser-rs |
| Browser-based SQL tool | sqlparser-rs (WASM) |
| Java, basic parsing | JSqlParser |
| Java, query planning | Apache Calcite |
| Enterprise, stored procs | Gudusoft GSP |

Further Reading

Still interested? There's more to read; hopefully I've piqued your interest. For any comments or feedback, please reach out and I'll address them.

Parsing Algorithms

Parser Libraries
