想写一个编译器？只需要阅读这两篇论文 (2008)

想写一个编译器？只需要阅读这两篇论文 (2008)
Want to write a compiler? Just read these two papers (2008)

学习编写编译器常常被现有的教科书弄得不必要地困难，那些书过于全面和理论化，就像用一本臭名昭著的入门书来学习高级编程一样。Jack Crenshaw的《让我们构建一个编译器！》提供了一种令人耳目一新的替代方案——一个实用的、易于理解的教程，即使是初学者也能胜任，最初是用Pascal编写的（也有C和Forth版本）。然而，Crenshaw的方法跳过了创建抽象语法树，这是编译器灵活性的关键组成部分，因为Pascal的限制。像Python或Lisp这样的现代语言可以轻松地进行树操作。 Sarkar、Waddell和Dybvig的论文概述了一种更近期的“纳米通道”框架：将编译分解为程序内部表示形式的许多简单转换。这强调了模块化和清晰度。核心信息是？首先从实际实现入手，*然后*根据需要深入复杂的理论。你可能会发现，你可以在不需要那本著名的、晦涩难懂的“龙书”的情况下构建功能齐全的编译器。

原文

Imagine you don't know anything about programming, and you want learn how to do it. You take a look at Amazon.com, and there's a highly recommended set of books by Knute or something with a promising title, The Art of Computer Programming, so you buy them. Now imagine that it's more than just a poor choice, but that all the books on programming are at written at that level.

That's the situation with books about writing compilers.

It's not that they're bad books, they're just too broadly scoped, and the authors present so much information that it's hard to know where to begin. Some books are better than others, but there are still the thick chapters about converting regular expressions into executable state machines and different types of grammars and so on. After slogging through it all you will have undoubtedly expanded your knowledge, but you're no closer to actually writing a working compiler.

Not surprisingly, the opaqueness of these books has led to the myth that compilers are hard to write.

The best source for breaking this myth is Jack Crenshaw's series, Let's Build a Compiler!, which started in 1988. This is one of those gems of technical writing where what's assumed to be a complex topic ends up being suitable for a first year programming class. He focuses on compilers of the Turbo Pascal class: single pass, parsing and code generation are intermingled, and only the most basic of optimizations are applied to the resulting code. The original tutorials used Pascal as the implementation language, but there's a C version out there, too. If you're truly adventurous, Marcel Hendrix has done a Forth translation (and as Forth is an interactive language, it's easier to experiment with and understand than the C or Pascal sources).

As good as it is, Crenshaw's series has one major omission: there's no internal representation of the program at all. That is, no abstract syntax tree. It is indeed possible to bypass this step if you're willing to give up flexibility, but the main reason it's not in the tutorials is because manipulating trees in Pascal is out of sync with the simplicity of the rest of the code he presents. If you're working in a higher level language--Python, Ruby, Erlang, Haskell, Lisp--then this worry goes away. It's trivially easy to create and manipulate tree-like representations of data. Indeed, this is what Lisp, Erlang, and Haskell were designed for.

That brings me to A Nanopass Framework for Compiler Education [PDF] by Sarkar, Waddell, and Dybvig. The details of this paper aren't quite as important as the general concept: a compiler is nothing more than a series of transformations of the internal representation of a program. The authors promote using dozens or hundreds of compiler passes, each being as simple as possible. Don't combine transformations; keep them separate. The framework mentioned in the title is a way of specifying the inputs and outputs for each pass. The code is in Scheme, which is dynamically typed, so data is validated at runtime.

After writing a compiler or two, then go ahead and plunk down the cash for the infamous Dragon Book or one of the alternatives. Maybe. Or you might not need them at all.

permalink June 29, 2008

想写一个编译器？只需要阅读这两篇论文 (2008) Want to write a compiler? Just read these two papers (2008)

想写一个编译器？只需要阅读这两篇论文 (2008)
Want to write a compiler? Just read these two papers (2008)