重温《让我们构建一个编译器》

重温《让我们构建一个编译器》
Revisiting "Let's Build a Compiler"

原始链接: https://eli.thegreenplace.net/2025/revisiting-lets-build-a-compiler/

## 经典编译器教程的现代演绎 Jack Crenshaw 的“让我们构建一个编译器”系列（1988-1995），最初用 Pascal 编写并以 68000 汇编为目标，至今仍对编译器爱好者具有惊人的现实意义。为了理解这一点，作者将该教程的编译器一丝不苟地翻译成了 Python，并输出 WebAssembly (WASM)。该项目的目标是通过提供可运行的代码与原始课程并存，使教程对现代程序员来说更易于理解。一个关键示例是将一个简单的 KISS 语言程序编译成 WASM，展示过程调用和参数传递。虽然生成的 WASM 故意没有进行优化，但它展示了核心概念。该教程持久的吸引力源于其循序渐进的实践方法：直接构建递归下降解析器，而不是依赖于解析器生成器，并立即关注代码生成。这与传统编译器课程形成了鲜明对比，传统课程在达到代码生成阶段*之前*， heavily 强调了语法分析和语义分析。然而，该教程的语法制导翻译方法，虽然非常适合初学者，但在处理类型检查等复杂问题时会暴露出局限性。最终，该项目为基础教程与现代开发实践之间提供了一座宝贵的桥梁。代码和与原始教程的详细映射可在 GitHub 上找到。

黑客新闻新 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交登录重温“让我们构建一个编译器” (thegreenplace.net) 10 分，由 cui 1小时前 | 隐藏 | 过去 | 收藏 | 讨论指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请YC | 联系搜索：

原文

There's an old compiler-building tutorial that has become part of the field's lore: the Let's Build a Compiler series by Jack Crenshaw (published between 1988 and 1995).

I ran into it in 2003 and was very impressed, but it's now 2025 and this tutorial is still being mentioned quite often in Hacker News threads. Why is that? Why does a tutorial from 35 years ago, built in Pascal and emitting Motorola 68000 assembly - technologies that are virtually unknown for the new generation of programmers - hold sway over compiler enthusiasts? I've decided to find out.

The tutorial is easily available and readable online, but just re-reading it seemed insufficient. So I've decided on meticulously translating the compilers built in it to Python and emit a more modern target - WebAssembly. It was an enjoyable process and I want to share the outcome and some insights gained along the way.

The result is this code repository. Of particular interest is the TUTORIAL.md file, which describes how each part in the original tutorial is mapped to my code. So if you want to read the original tutorial but play with code you can actually easily try on your own, feel free to follow my path.

A sample

To get a taste of the input language being compiled and the output my compiler generates, here's a sample program in the KISS language designed by Jack Crenshaw:

var X=0

 { sum from 0 to n-1 inclusive, and add to result }
 procedure addseq(n, ref result)
     var i, sum  { 0 initialized }
     while i < n
         sum = sum + i
         i = i + 1
     end
     result = result + sum
 end

 program testprog
 begin
     addseq(11, X)
 end
 .

It's from part 13 of the tutorial, so it showcases procedures along with control constructs like the while loop, and passing parameters both by value and by reference. Here's the WASM text generated by my compiler for part 13:

(module
  (memory 8)
  ;; Linear stack pointer. Used to pass parameters by ref.
  ;; Grows downwards (towards lower addresses).
  (global $__sp (mut i32) (i32.const 65536))

  (global $X (mut i32) (i32.const 0))

  (func $ADDSEQ (param $N i32) (param $RESULT i32)
    (local $I i32)
    (local $SUM i32)
    loop $loop1
      block $breakloop1
        local.get $I
        local.get $N
        i32.lt_s
        i32.eqz
        br_if $breakloop1
        local.get $SUM
        local.get $I
        i32.add
        local.set $SUM
        local.get $I
        i32.const 1
        i32.add
        local.set $I
        br $loop1
      end
    end
    local.get $RESULT
    local.get $RESULT
    i32.load
    local.get $SUM
    i32.add
    i32.store
  )

  (func $main (export "main") (result i32)
    i32.const 11
    global.get $__sp      ;; make space on stack
    i32.const 4
    i32.sub
    global.set $__sp
    global.get $__sp
    global.get $X
    i32.store
    global.get $__sp    ;; push address as parameter
    call $ADDSEQ
    ;; restore parameter X by ref
    global.get $__sp
    i32.load offset=0
    global.set $X
    ;; clean up stack for ref parameters
    global.get $__sp
    i32.const 4
    i32.add
    global.set $__sp
    global.get $X
  )
)

You'll notice that there is some trickiness in the emitted code w.r.t. handling the by-reference parameter (my previous post deals with this issue in more detail). In general, though, the emitted code is inefficient - there is close to 0 optimization applied.

Also, if you're very diligent you'll notice something odd about the global variable X - it seems to be implicitly returned by the generated main function. This is just a testing facility that makes my compiler easy to test. All the compilers are extensively tested - usually by running the generated WASM code and verifying expected results.

Insights - what makes this tutorial so special?

While reading the original tutorial again, I had on opportunity to reminisce on what makes it so effective. Other than the very fluent and conversational writing style of Jack Crenshaw, I think it's a combination of two key factors:

The tutorial builds a recursive-descent parser step by step, rather than giving a long preface on automata and table-based parser generators. When I first encountered it (in 2003), it was taken for granted that if you want to write a parser then lex + yacc are the way to go . Following the development of a simple and clean hand-written parser was a revelation that wholly changed my approach to the subject; subsequently, hand-written recursive-descent parsers have been my go-to approach for almost 20 years now.
Rather than getting stuck in front-end minutiae, the tutorial goes straight to generating working assembly code, from very early on. This was also a breath of fresh air for engineers who grew up with more traditional courses where you spend 90% of the time on parsing, type checking and other semantic analysis and often run entirely out of steam by the time code generation is taught.

To be honest, I don't think either of these are a big problem with modern resources, but back in the day the tutorial clearly hit the right nerve with many people.

What else does it teach us?

Jack Crenshaw's tutorial takes the syntax-directed translation approach, where code is emitted while parsing, without having to divide the compiler into explicit phases with IRs. As I said above, this is a fantastic approach for getting started, but in the latter parts of the tutorial it starts showing its limitations. Especially once we get to types, it becomes painfully obvious that it would be very nice if we knew the types of expressions before we generate code for them.

I don't know if this is implicated in Jack Crenshaw's abandoning the tutorial at some point after part 14, but it may very well be. He keeps writing how the emitted code is clearly sub-optimal and can be improved, but IMHO it's just not that easy to improve using the syntax-directed translation strategy. With perfect hindsight vision, I would probably use Part 14 (types) as a turning point - emitting some kind of AST from the parser and then doing simple type checking and analysis on that AST prior to generating code from it.

Conclusion

All in all, the original tutorial remains a wonderfully readable introduction to building compilers. This post and the GitHub repository it describes are a modest contribution that aims to improve the experience of folks reading the original tutorial today and not willing to use obsolete technologies. As always, let me know if you run into any issues or have questions!