我写了一个编译器。

我写了一个编译器。
I Wrote a Compiler

原始链接: https://blog.singleton.io/posts/2021-01-31-i-wrote-a-compiler/

上周末，怀旧之情与雨天相伴，我用Go语言编写了一个玩具BASIC编译器（toybasic），生成Go代码作为输出。编译器分为三个阶段：词法分析器、语法分析器和代码生成器。我利用nex（一个词法分析器生成器）和goyacc（Go内置的yacc变体）分别处理词法分析和语法分析阶段。语法分析器基于简化的TinyBASIC语法，从词元流构建抽象语法树（AST）。AST表示程序的结构，便于遍历和代码生成。代码生成器然后遍历这个AST，每个节点类型都有一个`Generate`函数来生成等效的Go代码。最终的toybasic编译器虽然简单，但支持`PRINT`、`LET`、`IF`、`GOTO`和`END`等核心BASIC结构。测试包括编写一个包含所有语言特性的BASIC程序以确保正确性。整个过程，尽管事先在概念上有所理解，但证明是一次令人满意且实际的体验。

Hacker News 最新 | 往期 | 评论 | 提问 | 展示 | 招聘 | 提交登录我编写了一个编译器 (singleton.io) ingve 2小时前 8 分 | 隐藏 | 往期 | 收藏 | 1 评论 TMWNN 2分钟前 [–] 我认为编译器，没有任何形容词或限制条件，应该将高级语言转换为机器语言。这篇文章描述的是将 BASIC 转换为 Go，更准确的描述不应该是“伪编译器”或“Go 编译器”之类的吗？我知道 Emacs 总是被称为拥有一个处理 Elisp 代码的“字节码编译器”，而不是“编译器”本身。我错了吗？回复考虑申请 YC 2025 秋季批次！申请截止日期为 8 月 4 日指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系我们搜索：

原文

I have a Computer Science degree. I attended a whole course of lectures on compilers (and have a certain fondness for “the red dragon book” as a result). However, I had never actually written a compiler from start to finish until a rainy day last weekend. Yes, this is what I do for fun.

I wanted to make a compiler for a real language, but a simple one so I could complete the project in a few hours. I’ve always had a bit of a soft spot for BASIC - it’s the first progamming language I learned as a kid. Luckily there’s a practical variant called TinyBASIC (thank you, wikipedia, for that tip). My version is even simpler than TinyBASIC (no INPUT statement). Hence I call it toybasic. The code is all available on GitHub.

It’s written in Go and it generates Go code from BASIC.

Demo

Example program example/hello.bas

10 PRINT "Hello, world."
20 LET x = (3 * 2) + 3
30 LET x = x + 1
40 IF x == 10 THEN PRINT "Ten!"
45 PRINT x
50 IF x >= 15 THEN GOTO 70
60 GOTO 30
70 END

Example output

$ ./toybasic <example/hello.bas
$ go run out.go
Hello, world.
Ten!
10
11
12
13
14
15

Structure

My compiler has three stages:

Lexer - which converts the sequence of characters in your source code into a sequence of tokens with meaning which can be passed to the…
Parser - builds a tree from the tokens. This gives us a structural representation of the program which is easy to traverse in the next stage. It also checks that the structure of the program provided is syntactically correct and generates helpful error messages if not.
Compiler - walks the parsed syntax tree and writes out Go code equivalent to the original BASIC.

It’s possible to write the lexer and parser entirely by hand, but there are great off the shelf tools available that make this much simpler. The venerable lex and yacc have been helping solve this problem in C since 1975. The original authors of yacc were Mike Lesk and Eric Schmidt - yes that Eric Schmidt.

There are a bunch of lex clones available for Go. I decided to use nex to help write my lexer. Nex has an intuitive awk like syntax and a very easy to follow README.

Go has a variant of yacc called goyacc built right in to its default tools, so that was the obvious choice to work with for the parser.

The parser

It’s the second step in processing your code, but the parser is the best place to start writing or explaining the compiler because it contains the grammar for toybasic. My grammar was inspired by Bogdan Kravtsov’s tinybasic, but I modified it to include strings and removed INPUT. The full grammar is here. Let’s take a look at a little excerpt:

statement:
    PRINT expr_list                                 { $$ = opr(PRINT, 1, $2); }
    | IF expression relop expression THEN statement { $$ = opr(IF, 4, $2, $3, $4, $6); }
    | GOTO expression                               { $$ = opr(GOTO, 1, $2); }
    | LET v '=' expression                          { $$ = opr(LET, 2, $2, $4); }
    | END                                           { $$ = opr(END, 0);  }
    ;

We define the grammar of the language in a special syntax and give goyacc a little snippet of Go code to run when it parses the relevant symbol in order to generate the parsed tree. Here you can see the part of the grammar which parses all the statements my toy BASIC supports. That opr function is defined in my compiler code to make the right kind and shape of objects for our syntax tree.

The net result is that a line of BASIC like:

10 PRINT "Look!", (2 + 3) * 3

Is parsed out to a tree of objects like:

PrintOp
    |
  ListOp --------.
    |            |
  StringOp      *Op --------.
                 |          |
              GroupOp      INTEGER
                 |      
              InfixOp
              |  |  |
        INTEGER  +  INTEGER

There’s also a pretty important little table defined alongside the grammar where you define the attributes that need to be populated for every node in the tree:

%union {
    v string    /* Variable */
    s string    /* String */
    num int     /* Integer constant */
    dec float64 /* Decimal constant */
    node Node   /* Node in the AST */
};

The lexer

nex generates the Go code necessary to tokenize a language using a series of Deterministic Finite Automata. Here’s a tiny snippet of the more than 1000 lines of code it generated for toybasic - it’s a relief not to have to write this by hand.

...
var dfas = []dfa{
	// LET
	{[]bool{false, false, false, true}, []func(rune) int{ // Transitions
		func(r rune) int {
			switch r {
			case 69:
				return -1
			case 76:
				return 1
			case 84:
				return -1
			}
			return -1
		},
...
	}, []int{ /* Start-of-input transitions */ -1, -1, -1, -1}, []int{ /* End-of-input transitions */ -1, -1, -1, -1}, nil},
...

Instead, nex takes a config file (here’s mine) where you specify a set of regular expresions which capture all the keywords and identifiers for your language. Here are a few examples:

/PRINT/ {return PRINT}
/==/ {return EQ}
/[0-9]+/                {lval.num, _ = strconv.Atoi(yylex.Text()); return INTEGER}
/[0-9]+\.[0-9]*/        {lval.dec, _ = strconv.ParseFloat(yylex.Text(), 64);return DECIMAL}
/"[^"]*"/               {lval.s = yylex.Text(); return STRING}

Those are regular expressions on the left between / chars. You provide them in precedence order. This code ends up being very tightly coupled with the yacc grammar. Each regex match runs the Go code in curly braces on the right which must return a token type and fill out the per-type properties for any leaf nodes in the tree.

Compiler

This is the only piece I needed to write directly in Go. It defines types for each possible node in the syntax tree and each Node type has a Generate function for each of them which knows how to transform that node into Go code (also calling Generate on any relevant children).

type Node interface {
	Generate()
}
...
type PrintOp struct {
	expression Node
}

func (op PrintOp) Generate() {
	fmt.Fprint(writer, "fmt.Println(")
	op.expression.Generate()
	fmt.Fprintln(writer, ")")
}

This is what PrintOp looks like - it simply knows how to write fmt.Println([whatever code my children generate]) in Go!

Putting it all together

So there we have it - a toy BASIC to Go compiler. To test it out I wrote a little BASIC program that uses every single construct in the language and made sure it works and generates the expected output. While I had a good idea how every step in the process works in the abstract it was elucidating and a tremendous amount of fun to do it for real. Definitely a good way to spend a rainy Saturday afternoon.

I can now verify that the very first BASIC program I ever wrote still works as expected.

10 PRINT "David is cool"
20 GOTO 10