用五个项目构建一个编译器
Build a Compiler in Five Projects

原始链接: https://kmicinski.com/functional-programming/2025/11/23/build-a-language/

## CIS531:从零开始构建编译器 CIS531是一门面向硕士级别的编译器设计课程,侧重于实际实现。学生将为一种不断发展的语言构建编译器,从简单的算术开始,逐步扩展到包含函数、递归和堆分配——最终生成x86-64汇编代码。 该课程使用Racket编程语言(易于学习,并提供相关资源),并参考Jeremy Siek的《编译原理基础》(可选购买)。项目涉及增量开发,每个阶段都建立在上一阶段的基础上,并具有全面的测试。 **课程的主要特点:** * **五个项目:** 逐步构建一个编译器,从栈解释器到支持函数和lambda表达式的语言。 * **强调测试:** 每个项目都包含一个强大的测试套件,以确保正确性。 * **清晰的结构:** 明确的项目结构,提供用于passes、IR定义、解释器和测试的代码。 * **独特的方案:** 专注于一种实用且函数式的方案,牺牲一些典型的编译器特性(如内存安全和寄存器分配),以实现快速开发和清晰度。 该课程旨在提供有益的体验,让学生能够构建一个功能齐全的编译器,并有可能通过添加类型检查或更高级的优化等功能来扩展它。资源和项目详情请访问[https://kmicinski.com/cis531-f25](https://kmicinski.com/cis531-f25)。

Hacker News 新闻 | 过去 | 评论 | 提问 | 展示 | 招聘 | 提交 登录 用五个项目构建一个编译器 (kmicinski.com) 26 分,作者 azhenley 2 小时前 | 隐藏 | 过去 | 收藏 | 2 评论 AdityaSanthosh 10 分钟前 [–] 你好,看起来是个有趣的课程。我本科期间没有学过编译器(我是电子专业的学生),但我一直是一名程序员,学过 C 和一些底层语言。这门课程需要任何先修的编译器知识吗?回复 ktimespi 7 分钟前 | 父评论 [–] 唯一的先决条件可能是 Racket,以便跟随本书。 指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系 搜索:
相关文章

原文

Class website here: https://kmicinski.com/cis531-f25

Are you interested in building a compiler? Learning how functional languages are implemented? Gaining a bit of practical experience with x86-64 assembly language? If so, I invite you to try your hand at the projects in my class, CIS531. CIS531 is a masters-level class on compiler design which assumes that (a) you know how to program, (b) you’ve had some exposure to C (know about stack allocation, malloc, etc.), and (c) have seen some assembly code. My class projects are in the Racket programming language, but if you don’t know Racket, it is quite easy to learn: I have a set of YouTube video lectures that teach Racket quickly! If you’ve never heard of Racket before, or you’re skeptical of functional programming, indulge me for a bit: there’s no hardcore FP theory or math in this course, and Racket is genuinely the best language to use for this specific setup.

My class follows Prof. Jeremy Siek’s excellent book, “Essentials of Compilation.” While I highly recommend buying the book and supporting Prof. Siek, I will also note that there are free online preliminary editions floating around; in my class, I followed the free version and suggested that students buy the book if doing so fit their goals. However, along with the book, I also have a set of class slides along with sporadic course videos, both available on the class website.

This class builds up to a compiler with the following features:

  • Variables and assignment via let
  • Integer arithmetic via + and -
  • Reading inputs / printing output
  • Booleans, conjunctions/disjunctions (and/or)
  • Branching via if, integer comparisons (<, etc.)
  • Heap-allocated vectors
  • Assignment / mutation (set!)
  • While loops
  • Fixed-arity functions and function application
  • Lambdas (closures at runtime)

The unique combination of features lets us tour an interesting cross-section of programming languages, exploring both imperative programming with loops and mutation but also functional programming with lists and recursion.

To be specific, I challenge you to complete five projects, each including a comprehensive test suite that will seriously stress the correctness of your implementation. p1 is a warmup project (you should skip if you already know Racket), but p2-5 build a compiler for a set of increasingly-complex languages to x86-64. The languages nest inside of each other, with p2 giving us straight-line arithmetic, p3 giving us decision trees, p4 giving us loops and mutation, and p5 giving us functions, recursion, and lambdas.

  1. p1 – Stack interpreter. This is a warmup project, if you know Racket and have some PL background, feel free to skip.

  2. p2 – Straight-line arithmetic / variables → x86-64 assembly language

  3. p3 – Booleans and branching (if, and, or) → x86-64 assembly language

  4. p4 – Vectors, heap allocation, set!, and loops → x86-64 assembly language

  5. p5 – Functions, lambdas, and closure conversion → x86-64 assembly language

The projects are designed with one key principle in mind: get us to the most expressive/fun language possible, as fast as possible. In doing this, we sacrifice a lot that might be typically covered:

  • Our languages aren’t type/memory safe, we assume the programmer is correct

  • No register allocation (possible to add, not too hard)

  • No garbage collection of any kind: we just use malloc. We could trivially support the Boehm GC (I have done that in the past), but it was another static library to link in and I really wanted to make this self contained.

  • We support a very limited set of builtins (but it is trivial to add more)

So even after project 5, getting to a “real” compiler would take a bit of effort. The most important (in my opinion) are (a) memory safety (the language needs to be safe, period) via dynamic type tagging and (b) slightly more builtins, and (c) register allocation. That would get us to a respectable compiler. After that, we could add more language features, or optimize the ones we have, e.g., by using abstract interpretation.

An Example Program

Our language will include functions, loops, branching, assignment, and even heap-allocated vectors. As an example of the power, here’s a Sudoku solver written in the language

(program
 ;; =========================
 ;; List primitives
 ;; Empty list is (void)
 ;; =========================
 (define (is_nil x) (eq? x (void)))

 ;; cons cell as 2-element vector: [0] = head, [1] = tail
 (define (cons h t)
   (let ([c (make-vector 2)])
     (let ([_ (vector-set! c 0 h)])
       (let ([_ (vector-set! c 1 t)])
         c))))

 (define (head c) (vector-ref c 0))
 (define (tail c) (vector-ref c 1))

 ;; =========================
 ;; Cell representation
 ;; cell = (row col val) as nested cons
 ;; =========================
 (define (make_cell r c v)
   (cons r (cons c (cons v (void)))))

 (define (cell_row cell)
   (head cell))

 (define (cell_col cell)
   (head (tail cell)))

 (define (cell_val cell)
   (head (tail (tail cell))))

 ;; =========================
 ;; Block indexing (0,1,2) for rows/cols
 ;; =========================
 (define (block_index3 x)
   (if (< x 3)
       0
       (if (< x 6)
           1
           2)))

 (define (same_block? r1 c1 r2 c2)
   (if (eq? (block_index3 r1) (block_index3 r2))
       (eq? (block_index3 c1) (block_index3 c2))
       #f))

 ;; =========================
 ;; Lookup current value at (row, col) in board
 ;; board is a list of cells
 ;; Return 0 if not assigned
 ;; =========================
 (define (lookup board row col)
   (if (is_nil board)
       0
       (let ([cell (head board)])
         (let ([r (cell_row cell)])
           (let ([c (cell_col cell)])
             (if (and (eq? r row) (eq? c col))
                 (cell_val cell)
                 (lookup (tail board) row col)))))))

 ;; =========================
 ;; Conflict check:
 ;; #t if some cell in board has:
 ;;   - same value, and
 ;;   - same row OR same col OR same 3x3 block
 ;; =========================
 (define (conflicts? board row col val)
   (if (is_nil board)
       #f
       (let ([cell (head board)])
         (let ([r (cell_row cell)])
           (let ([c (cell_col cell)])
             (let ([v (cell_val cell)])
               (if (and (eq? v val)
                        (or (eq? r row)
                            (or (eq? c col)
                                (same_block? r c row col))))
                   #t
                   (conflicts? (tail board) row col val))))))))

 ;; =========================
 ;; Recursive backtracking solver over (row, col)
 ;; board: list of assignments
 ;; rows, cols = 0..8
 ;; =========================
 (define (solve_cell row col board)
   (if (eq? row 9)
       ;; All rows done: solved
       board
       (if (eq? col 9)
           ;; End of row: go to next row
           (solve_cell (+ row 1) 0 board)
           ;; Otherwise, try this cell
           (let ([existing (lookup board row col)])
             (if (eq? existing 0)
                 ;; Empty cell: try values 1..9
                 (let ([candidate 1])
                   (let ([solution (void)])
                     (begin
                       (while (and (< candidate 10)
                                   (eq? solution (void)))
                              (begin
				(if (conflicts? board row col candidate)
                                    ;; conflict, skip
                                    (set! solution solution)
                                    ;; no conflict, extend board and recurse
                                    (let ([s (solve_cell row
                                                         (+ col 1)
                                                         (cons (make_cell row col candidate)
                                                               board))])
                                      (if (eq? s (void))
                                          (set! solution solution)
                                          (set! solution s))))
				(set! candidate (+ candidate 1))))
                       solution)))
                 ;; Pre-filled cell: just move on
                 (solve_cell row (+ col 1) board))))))

 ;; =========================
 ;; Read initial board from input:
 ;; 81 integers, row-major, 0 = empty, 1..9 = given
 ;; Returns list of cells
 ;; =========================
 (define (read_board)
   (let ([board (void)])
     (let ([i 0])
       (begin
         (while (< i 9)
		(begin
                  (let ([j 0])
                    (while (< j 9)
			   (begin
			     (let ([v (read)])
                               (if (eq? v 0)
				   (set! board board)
				   (set! board (cons (make_cell i j v) board))))
			     (set! j (+ j 1)))))
                  (set! i (+ i 1))))
         board))))

 ;; =========================
 ;; Entry: read board, solve from (0,0), return solution
 ;; Solution is a list of (row col val) cells
 ;; =========================
 (let* ([board (read_board)]
        [solution (solve_cell 0 0 board)])
   (lookup solution 8 8)))

The Full Language

The final language you’ll implement will be this one. In comments, I’ve also highlighted the sublanguages: for example, project 2 includes only numbers, input (read), binary plus, unary minus, variable references and let binding. It grows to all of R5.

(define (R5-exp? e)
  (match e
    ;; Project 2
    [(? fixnum?) #t]
    ['(read) #t]
    [`(+ ,(? R5-exp? e0) ,(? R5-exp? e1)) #t]
    [`(- ,(? R5-exp? e)) #t]
    [(? symbol?) #t]
    [`(let ([,(? symbol? x) ,(? R5-exp? e)]) ,(? R5-exp? eb)) #t]
	;; Project 3
    [#t #t]
    [#f #t]
    ['(void) #t]
    [`(- ,(? R5-exp? e0) ,(? R5-exp? e1)) #t]
    [`(and ,(? R5-exp? e0) ,(? R5-exp? e1)) #t]
    [`(or  ,(? R5-exp? e0) ,(? R5-exp? e1)) #t]
    [`(not ,(? R5-exp? e1)) #t]
    [`(,(? cmp? c) ,(? R5-exp? e0) ,(? R5-exp? e1)) #t]
    [`(if ,(? R5-exp? e-g) ,(? R5-exp? e-t) ,(? R5-exp? e-f)) #t]
    ;; Project 4
    [`(let* ([,(? symbol? xs) ,(? R5-exp? es)] ...) ,(? R5-exp? eb)) #t]
    [`(begin ,(? R5-exp?) ... ,(? R5-exp? ret)) #t]
    [`(while ,(? R5-exp? e-g) ,(? R5-exp? es) ...) #t]
    [`(make-vector ,(? R5-exp? len)) #t]
    [`(vector-ref ,(? R5-exp? v) ,(? fixnum? i)) #t]
    [`(vector-set! ,(? R5-exp? v) ,(? fixnum? i) ,(? R5-exp? e-v)) #t]
    [`(set! ,(? symbol? x) ,(? R5-exp? e)) #t]
    ;; Project 5
    [`(,(? R5-exp? e-f) ,(? R5-exp? a-args) ...) #t]
    [`(lambda (,(? symbol? xs) ...) ,(? R5-exp? e-body)) #t]
	[_ #f]))

(define (R5-defn? defn)
  (match defn
    ;; Project 5 adds multiple function definitions
    [`(define (,(? symbol? f) ,(? symbol? formals) ...)  ,(? R5-exp? e-b)) #t]
    [_ #f]))

(define (R5? p)
  (match p
    [`(program ,(? R5-defn? defns) ... ,(? R5-exp?)) #t]
    [_ #f]))

The Compiler’s Structure

To get you booted up fast as possible, every single project is designed the same way:

  • compile.rkt – Your pass implementations. You will edit functions provided here. -> This is the only file you will edit! The rest are read-only
  • irs.rkt – IR definitions and predicates like anf-program?, c1-program?, etc. (see also typed/shrunk variants)
  • interpreters.rkt – Reference interpreters for several IRs (used by tests and for your own debugging).
  • system.rkt – System/ABI configuration, pass names, runtime filenames, output paths, etc.
  • main.rkt – Driver that runs all passes, can build a binary, and can launch a debug server.
  • test.rkt – Test harness. Runs isolation tests or end-to-end native tests depending on -m mode.
  • runtime.c – Minimal runtime (read_int64, print_int64, etc.).
  • test-programs/ – Example programs (.scm).
  • input-files/ – Input streams for programs (lines of integers).
  • goldens/ – Instructor goldens (IR snapshots, interpreter outputs, and stdout baselines).

You write your code in compile.rkt, which consists of a set of passes. Each pass transforms an input language into an output language, and these intermediate languages (IRs) are codified via predicates in irs.rkt. To define the meaning of each IR, we give an interpreter for each in interpreters.rkt. For the compiler to be correct, it needs to be the case that–for all input streams–the compiler produces the same output stream across all intermediate IRs. There is some system-specific stuff in system.rkt, which takes care of things like Linux vs. Mac ABI issues, specifying register names, etc. The main.rkt file acts as a main compiler entrypoint, and it carefully runs each pass of the compiler, checking predicates before/after each pass and interpreting each IR, checking to ensure consistency. This is a huge win for debugging, in my opinion: you always want to localize errors to the proximate pass which causes misinterpretation, and main.rkt seriously aids debugging in my experience. There is also more comprehensive test infrastructure in test.rkt; this test script is invoked by the Python-based test scripts in test/. These tests check the behavior of the compiler on the programs in the test-programs/ directory, using the files from input-files as inputs and comparing to the outputs in goldens/.

Why Is This Course Unique and Cool?

  • You build a real compiler, all the way to actual x86-64 assembly.

  • Each IR has a corresponding interpreter, which is easy to find/read and written in a familiar style, giving semantic clarity and testable correctness.

  • The project is language scalable, meaning that you can use it as a base for building your own language. Of course, this is thanks to Dr. Siek’s great “incremental” design.

  • It is fully testable across multiple passes, which helps anticipate the thing we all fear most about writing a compiler: seeing a problem that is the ramification of far-away code from higher up in the compilation pipeline.

  • It is written in a simple, pure recursive style. Just plain old pattern matching and recursion here, no need for any complex abstractions.

How Do I Get Started?

  • Familiarize yourself with the course webpage: https://kmicinski.com/cis531-f25

  • If you don’t know Racket, start with project 1: https://kmicinski.com/cis531-f25/projects/1

  • Otherwise, start with project 2: https://kmicinski.com/cis531-f25/projects/2

  • When you finish each project, move on to the next!

  • When you’re done, start building your own language. Consider adding type (checking/inference), classes, more builtins, pattern matching, continuations, exceptions, algebraic effects. The options are myriad, but once you’ve finished projects 2-5, you’ve built a whole compiler for a surprisingly expressive language.

Thank you to the National Science Foundation and Others

If you like this work and live in the United States, please feel commensurately less bad about paying your taxes. I made the whole class free, at least as free as I could given practical constraints. This class work on compilation is partially supported by our NSF PPoSS large, which has already produced many cool major results. In subsequent explorations, I am hoping that I can use this class compiler as a baseline for highly-scalable engines that reason about programs. Given the simple, self-contained nature–and the presence of per-pass interpreters and consistency testing–I see this as an awesome potential baseline for cool extensions.

My course is of course heavily inspired by Prof. Siek’s book and course, along with inspiration from Thomas Gilray at Washington State. Eight years ago, Tom and I took a spontaneous trip to see the eclipse halfway across the country (skipping out on the ICSE ‘17 deadline basically); we discussed compiler design over a steamed seafood buffet in Myrtle Beach after napping in a cheap motel, having been awake for over 24 hours and feeling the eclipse had made it worth it. We sketched out his whole compiler on that roadtrip, and ever since that night eating steamed crabs, I wanted to build my own course compiler. Now that I have, I am not sure it compares to waking up for just four hours of twilight, only to consume copious amounts of butter and shellfish as the brisk ocean air wisps over your face, the closures and continuations softly washing rhythmically through the conversation as you walk along the beach back to your $50 motel room.

In closing, thanks for checking this out, this compiler was a ton of fun to build. Even as someone who has some amount of expertise in compiler design, building it and getting it 100% right (I hope!) was such a rewarding experience. My real sincere hope is that it offers students (and you!) a fun journey. If you end up doing anything this, please get in touch: [email protected]. I’d love to see what you come up with. Best wishes,

Kristopher Micinski – Syracuse, November, 2025

联系我们 contact @ memedata.com