展示 HN：SQL 解析器的工作原理

原文

It’s rare to talk about heuristic systems when there’s so much hype around probabilistic ones. This post is boring, but stay with me.

My work at Atlan has touched SQL parsing since the beginning. I made early contributions to the query engine with policy-based authorization powering Insights, and we generate SQL lineage by parsing queries to power column-level lineage. Along the way, my colleagues and I have evaluated a lot of SQL parsers, both open source and commercial: SQLGlot, sqlparser-rs, sqloxide, Apache Calcite, Gudusoft GSP, JSqlParser, and others. Each with different tradeoffs.

This post is my attempt to distill what I’ve learned. I’m not an expert, just someone curious enough to ask: why do so many SQL parsers exist? And what are they actually doing under the hood?

The idea for this writeup came during a drive back from SFO after dropping a friend off at the airport. That turned into about 10 hours of conversation with Claude to pull it all together.

What is a SQL Parser?

A SQL parser reads SQL text and converts it into a structured representation, usually a tree, that computers can work with. It’s the “understanding” step.

INPUT:  "CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'"
                                            ↓
                                       [SQL PARSER]
                                            ↓
OUTPUT: A tree structure representing the query's meaning

Think of it like how your brain parses a sentence to extract meaning. The parser does the same for SQL.

The Full Pipeline

Every SQL parser follows the same fundamental pipeline. This isn’t a design choice. It’s a consequence of how language processing works.

Lexer: Breaks the SQL string into tokens. Keywords, identifiers, operators, literals. Like recognizing words in a sentence.

Parser: Takes tokens and builds a tree based on grammar rules. Like understanding “subject-verb-object” structure.

AST: The Abstract Syntax Tree. A clean, navigable representation of the query’s structure.

Semantic Analysis: Adds meaning. Does this table exist? What type is this column? This is where you need schema information.

Lexical Analysis (The Lexer)

The lexer converts a stream of characters into meaningful chunks called tokens.

Input:

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Output:

Token(CREATE)
Token(TABLE)
Token(IDENTIFIER, "active_users")
Token(AS)
Token(SELECT)
Token(IDENTIFIER, "id")
Token(COMMA)
Token(IDENTIFIER, "name")
Token(COMMA)
Token(IDENTIFIER, "email")
Token(FROM)
Token(IDENTIFIER, "users")
Token(WHERE)
Token(IDENTIFIER, "status")
Token(EQUALS)
Token(STRING, "active")

The lexer handles dialect-specific decisions early:

Decision	Standard SQL	MySQL	SQL Server	PostgreSQL
Identifier quote	`"name"`	`name`	`[name]`	`"name"`
String quote	`'text'`	`'text'` or `"text"`	`'text'`	`'text'`
Line comment	`--`	`--` or `#`	`--`	`--`
Case sensitivity	Insensitive	Insensitive	Insensitive	Insensitive (keywords)

Lexers are simple. They look at one character (or a few) at a time, don’t need to understand nesting or structure, and can be implemented with state machines or regex. This is why SQLGlot’s optional Rust tokenizer gives ~30% speedup^{. Tokenization is pure CPU-bound character scanning.}

Syntactic Analysis (The Parser)

The parser reads tokens and builds a tree structure based on grammar rules.

Input tokens:

[CREATE, TABLE, active_users, AS, SELECT, id, COMMA, name, COMMA, email, FROM, users, WHERE, status, =, 'active']

Grammar rule:

ctas_stmt    := CREATE TABLE table_name AS select_stmt
select_stmt  := SELECT column_list FROM table_name [WHERE condition]

Parser thinks:

“I see CREATE TABLE… this must be a ctas_stmt”
“Next I need a table_name… I see ‘active_users’”
“Next I need AS… got it”
“Now I need a select_stmt…”
“I see SELECT… parsing the inner query”
“Column list: id, name, email”
“FROM users”
“WHERE status = ‘active’”

The parser is essentially a state machine following grammar rules, consuming tokens and building tree nodes.

Why Parsing is Harder Than Lexing

Lexing is pattern matching. Parsing is about structure.

SELECT * FROM (SELECT * FROM (SELECT * FROM t))

A lexer sees: (, (, (, ), ), )

A parser must match each ( with its corresponding ).

This is why regex can’t parse SQL. Regex can’t count balanced parentheses. It’s mathematically proven (regular languages vs context-free languages).

The Abstract Syntax Tree

The AST represents the query’s structure in a clean, navigable tree.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

CreateTableAsSelect
├── table_name: "active_users"
└── query:
    └── SelectStatement
        ├── columns:
        │   ├── Column { name: "id" }
        │   ├── Column { name: "name" }
        │   └── Column { name: "email" }
        ├── from:
        │   └── Table { name: "users" }
        └── where:
            └── BinaryOp
                ├── left: Column { name: "status" }
                ├── op: Equals
                └── right: Literal { value: "active" }

The AST is the central data structure. Everything downstream (analysis, transformation, code generation) operates on it.

What You Can Do With an AST

Operation	Description
Traversal	Walk the tree, collect information
Transform	Rewrite nodes (add filters, change structure)
Generate	Convert back to SQL string (round-trip)
Transpile	Generate SQL for a different dialect
Lineage	Trace where data comes from

Syntax vs Semantics

This is where things get interesting.

Syntactic analysis (parsing): Is this valid SQL grammar?

Semantic analysis: Does this SQL make sense given the database schema?

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Syntactic (Parser answers)	Semantic (Analyzer answers)
✓ Valid CTAS structure	Does ‘users’ table exist?
✓ Valid SELECT clause	Does ‘users’ have ‘id’, ‘name’, ‘email’ columns?
✓ Valid WHERE condition	Is ‘status’ a valid column?
✓ Correct keyword order	Is comparing status to string valid?

Key insight: A parser with no schema information can only do syntactic analysis. Semantic analysis requires external knowledge about the database.

What Syntactic Analysis Catches

SELECT FROM users        -- Missing column list
FROM users SELECT *      -- Wrong keyword order
SELECT (a + b            -- Unbalanced parentheses

What Syntactic Analysis Cannot Catch

SELECT * FROM nonexistent    -- Parser doesn't know if table exists
SELECT foo FROM users        -- Parser doesn't know columns
WHERE name > 5               -- Parser doesn't know types

Column-Level Lineage

Lineage traces where data comes from and where it goes.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Table-level lineage (easy, no schema needed):

READ:  [users]
WRITE: [active_users]

Column-level lineage (needs schema):

active_users.id    ← users.id (direct)
active_users.name  ← users.name (direct)
active_users.email ← users.email (direct)

Data Flow vs Control Flow

This is where lineage gets nuanced.

CREATE TABLE active_users AS SELECT id, name, email FROM users WHERE status = 'active'

Question: Does status contribute to the lineage of active_users?

Perspective	`status` in lineage?	Reasoning
Data Flow	❌ No	`status` doesn’t appear in output columns
Control Flow	✅ Yes	`status` affects which rows are included

Most “lineage” discussions mean data flow. But impact analysis needs both. Changing status column could break this query even though it’s not in the output.

SQL Dialects

SQL is a “standard” that nobody fully implements.

The Dialect Landscape

Every database vendor implements a subset of the SQL standard, adds proprietary extensions, and has different syntax for the same operations.

                         ┌─────────────────┐
                         │  SQL Standard   │
                         │  (SQL-92, 99,   │
                         │   2003, 2016)   │
                         └────────┬────────┘
                                  │
         ┌──────────┬─────────┬───┴───┬─────────┬──────────┐
         ▼          ▼         ▼       ▼         ▼          ▼
   ┌──────────┐┌──────────┐┌──────┐┌──────┐┌──────────┐┌──────────┐
   │PostgreSQL││  MySQL   ││Oracle││SQL   ││Snowflake ││ BigQuery │
   │          ││          ││      ││Server││          ││          │
   │ +ARRAY   ││ +LIMIT   ││+ROWNUM│ +TOP ││ +FLATTEN ││ +STRUCT  │
   │ +JSONB   ││ +BACKTICK││+DUAL ││ +[]  ││ +VARIANT ││ +UNNEST  │
   │ +::cast  ││ +AUTO_INC││+PLSQL││+T-SQL││ +$$      ││ +SAFE_   │
   └──────────┘└──────────┘└──────┘└──────┘└──────────┘└──────────┘

   Each dialect: ~80% common SQL + ~20% proprietary extensions

This fragmentation is why so many SQL parsers exist. A parser built for PostgreSQL won’t understand MySQL’s backtick identifiers. A parser built for standard SQL won’t handle Snowflake’s FLATTEN function.

Identifier Quoting

Dialect	Quote Style	Example
Standard SQL	`"double quotes"`	`SELECT "Column" FROM "Table"`
MySQL	`backticks`	SELECT `Column` FROM `Table`
SQL Server	`[brackets]`	`SELECT [Column] FROM [Table]`
BigQuery	`backticks`	SELECT `Column` FROM `Table`

-- MySQL, PostgreSQL, SQLite
SELECT * FROM users LIMIT 10 OFFSET 5

-- SQL Server
SELECT TOP 10 * FROM users
-- Or (SQL Server 2012+)
SELECT * FROM users ORDER BY id OFFSET 5 ROWS FETCH NEXT 10 ROWS ONLY

-- Oracle (traditional)
SELECT * FROM users WHERE ROWNUM <= 10
-- Oracle 12c+
SELECT * FROM users FETCH FIRST 10 ROWS ONLY

Type Casting

-- Standard SQL
CAST(x AS INTEGER)

-- PostgreSQL
x::INTEGER

-- MySQL
CAST(x AS SIGNED)  -- No INTEGER, use SIGNED/UNSIGNED

-- BigQuery
CAST(x AS INT64)
SAFE_CAST(x AS INT64)  -- Returns NULL instead of error

Function Name Differences

The same operation, different names across dialects:

Operation	PostgreSQL	MySQL	SQL Server	Snowflake
Current timestamp	`NOW()`	`NOW()`	`GETDATE()`	`CURRENT_TIMESTAMP()`
String length	`LENGTH()`	`LENGTH()`	`LEN()`	`LENGTH()`
If null	`COALESCE()`	`IFNULL()`	`ISNULL()`	`NVL()`
Date add	`+ INTERVAL '1 day'`	`DATE_ADD()`	`DATEADD()`	`DATEADD()`

How Parsers Handle Dialects

Dialect flags (sqlparser-rs): Single parser with ~50 boolean flags like supports_filter_during_aggregation, supports_group_by_expr. Simple but can’t handle major syntax differences.

Parameterized grammar (SQLGlot): Base parser with overridable methods per dialect. Each dialect class inherits and overrides specific parsing methods. Flexible but has complex inheritance.

Separate grammars (Gudusoft): One complete grammar file per database. Complete accuracy but high maintenance burden. When a database releases a new version, you update that grammar file.

Parsers vs Query Engines

This distinction trips people up. A parser and a query engine are not the same thing.

┌─────────────────────────────┐    ┌─────────────────────────────────┐
│         PARSER              │    │         QUERY ENGINE            │
│                             │    │                                 │
│  • Lexical analysis         │    │  • Query planning               │
│  • Syntactic analysis       │    │  • Query optimization           │
│  • AST construction         │    │  • Physical execution           │
│  • (Optional) Semantic      │    │  • Data access                  │
│    analysis                 │    │  • Join algorithms              │
│                             │    │  • Aggregation                  │
│  INPUT: SQL string          │    │  • Sorting                      │
│  OUTPUT: AST or errors      │    │  • Result materialization       │
│                             │    │                                 │
│  NO data access             │    │  READS/WRITES data              │
│  NO execution               │    │  EXECUTES query                 │
└─────────────────────────────┘    └─────────────────────────────────┘

Examples:                          Examples:
• SQLGlot                          • PostgreSQL
• sqlparser-rs                     • DuckDB
• JSqlParser                       • Apache Spark
                                   • Presto/Trino

The Full Query Processing Pipeline

When you run a query in a database, it goes through many stages. Parsing is just the first.

┌─────────────────┐
│    SQL String   │
└─────────────────┘
         │
         ▼
┌─────────────────┐
│     PARSER      │  ◄── SQLGlot, sqlparser-rs stop here
└────────┬────────┘
         │ AST
         ▼
┌─────────────────┐
│    ANALYZER     │  ◄── Semantic analysis (name resolution, types)
└────────┬────────┘
         │ Analyzed AST
         ▼
┌─────────────────┐
│    PLANNER      │  ◄── Logical plan (relational algebra)
└────────┬────────┘
         │ Logical Plan
         ▼
┌─────────────────┐
│   OPTIMIZER     │  ◄── Rule-based and cost-based optimization
└────────┬────────┘
         │ Optimized Plan
         ▼
┌─────────────────┐
│   EXECUTOR      │  ◄── Physical operators, data access
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│    RESULTS      │
└─────────────────┘

What Parsers Do

Capability	Parser Does
Syntax validation	✅ Detect `SELECT FROM` (missing columns)
AST construction	✅ Build tree structure
Transpilation	✅ Convert MySQL → PostgreSQL
Lineage extraction	✅ Find table/column dependencies
Query formatting	✅ Pretty-print SQL

What Parsers Don’t Do

Capability	Parser	Engine
Execute query	❌	✅
Return results	❌	✅
Optimize execution	❌	✅
Choose join order	❌	✅
Manage transactions	❌	✅

Why This Matters

If you’re building a data catalog and need lineage, you need a parser. You don’t need a query engine.

If you’re building an IDE with autocomplete, you need a parser. You don’t need to execute anything.

If you’re building a transpiler to migrate queries from Snowflake to Databricks, you need a parser with good dialect support. You don’t need to run those queries.

The tools are different because the problems are different.

At a Glance

Parser	Language	License	Dialects	One-liner
SQLGlot	Python	MIT	31	Most feature-complete open-source
sqlparser-rs	Rust	Apache 2.0	~15	Fast, minimal, foundation for Rust engines
Apache Calcite	Java	Apache 2.0	~10	Full query planning framework
Gudusoft GSP	Java/C#	Commercial	25+	Enterprise, stored procedure support
JSqlParser	Java	Apache/LGPL	~6	Mature, simple Java parser

Layer Support

Parser	Lexer	Parser	AST	Semantic	Lineage
SQLGlot	✅	✅	✅	✅	✅
sqlparser-rs	✅	✅	✅	❌	❌
Calcite	✅	✅	✅	✅	⚠️
Gudusoft GSP	✅	✅	✅	✅	✅
JSqlParser	✅	✅	✅	❌	❌

Features

Parser	Transpile	Format	Lineage	Schema	Round-trip
SQLGlot	✅	✅	✅	✅	✅
sqlparser-rs	❌	⚠️	❌	❌	⚠️
Calcite	⚠️	❌	⚠️	✅	✅
Gudusoft GSP	⚠️	✅	✅	✅	✅
JSqlParser	❌	⚠️	❌	❌	✅

Quick Decision Matrix

Your Situation	Recommended
Python, need transpilation	SQLGlot
Python, need lineage	SQLGlot
Rust query engine	sqlparser-rs
Browser-based SQL tool	sqlparser-rs (WASM)
Java, basic parsing	JSqlParser
Java, query planning	Apache Calcite
Enterprise, stored procs	Gudusoft GSP