(comments)

Original link: https://news.ycombinator.com/item?id=41227716

Pattern matching can simplify writing tree-manipulation code. A good example appears in an article discussing the difference-list algorithm (written in Haskell). Essentially, it builds a tree structure whose leaves hold lists. By rewriting the tree into a right-leaning form and then concatenating all the leaf lists at once, the operation becomes much more efficient, cutting the computation time. Pattern matching keeps the code compact and simple; without it, the same code would be longer and harder to understand. Notably, Rust has pattern matching, but it is hampered by the language's emphasis on low-level concerns such as pointers and memory management. Trends in compiler design include the growing adoption of static single assignment (SSA) as an intermediate representation (IR) and the increasing importance of compiler IRs. Formal semantics has also gained attention; even weak memory models now come with detailed formal descriptions of how they work. Historically, parser generators were widely used to implement many popular programming languages, but they have fallen out of favor in recent years, and hand-written parsers have become the norm. Improved compiler optimizations cover broader analyses, moving beyond traditional use-def chains and loop-specific optimizations toward whole-function analysis. Recent compiler development also relies on inferring analysis results from simpler, lower-level representations, as illustrated by LLVM's scalar replacement of aggregates (SROA), which analyzes code based on loads and stores rather than the explicitly declared structure of the data. Finally, there is growing demand for ancillary tooling integrated with programming languages: support for querying source code, language servers, automatic formatting, and automated refactoring. In short, advances in compiler technology have brought better efficiency, more sophisticated compilation techniques, and better integration with accompanying tooling, improving productivity. Pattern matching plays an important role in making such code convenient to write, especially in tree-manipulation scenarios.


Original text


Have read the first few chapters and it expects that you either read the accompanying source code or implement your own and pass the tests. The pseudo code presented in the book often looks like function calls with the inner details left out of the book. Furthermore, as already pointed out in another comment, the available implementation is in OCaml, which is probably not something many C programmers have experience with.

Nevertheless, I think I'm learning more from this book than most other books I've tried before that are far more theoretical or abstract. I'm still eager to reach the chapter on implementing C types. I think it's a good book, but it requires more effort than something like Crafting Interpreters or Writing a Compiler/Interpreter in Go, while also covering topics not in those books.



Nand2Tetris is also like that - they provide an outline and tests, but you have to do the work. And, having the implementation language be different from the target language reduces confusion.

Plus, you get to become proficient in OCaml, which is a pretty good language.



that's a good point—it was pretty confusing when i wrote ur-scheme in scheme, or stoneknifeforth in stoneknifeforth, because i kept getting confused about which level of the language i was changing things in



I thought this book looked neat but closed the tab before reading the comments here, and after this one decided to go ahead and buy it. Sounds really fun!



I’ve been working through this book implementing the compiler in Ada. So far, I’m really enjoying it. The book doesn’t make too many assumptions about implementation details, leaving you free to experiment and fill in the blanks yourself.

It feels like a more advanced version of Crafting Interpreters.

I haven’t looked at the OCaml implementation at all. The text and unit tests are all you need.

Discussion on the Ada Forum: https://forum.ada-lang.io/t/writing-a-c-compiler/1024



I see many comments saying that the book implements the C compiler in OCaml. In the introduction the author states that the book actually uses pseudo code, so you are free to implement it in any language. The only recommendation is to use a language with pattern matching, because the pseudo code makes heavy use of it. The reference implementation is in OCaml.



Thanks, can you please let me know which part uses pattern matching? I'd assume mostly the lexer, but the parser should just be something that consumes the tokens and spits out the AST. Unless of course it combines the two.



Thanks. At first I thought it was something like regex, but then I found it's something from functional programming. I need to read a few chapters of the book before buying it because I'm not interested in learning FP at the moment.



Question for HN, pattern matching is defined as “runtime type/value checking”, is that correct?

Is duck typing the pseudo-unsafe alternative? (Not unsafe as in accessing unsafe memory, but as in throwing exceptions if the duck-typed function doesn’t exist on the current type)

Can C handle both?

Coming from a static type system like Rust and C#, I'm doing a lot of "if this is a duck, duck.quack()" and I'm looking for faster alternatives and less verbosity if possible



One thing is that pattern matching can make writing tree manipulation code succinct and easier to read. For example, take this article[0] that describes the difference list algorithm (in Haskell). Basically, it's kind of like a rope, but for lists. It's a tree with lists at the leaves, and when you want to convert it into a list, you rewrite the tree to be right-leaning and then concatenate all the lists at once. This turns repeated concatenation at the end of lists from a quadratic-time operation into a linear-time one (repeated strcpy in C is a classic example of the quadratic version [1]). The code can be written like this:
  data Tree a = Leaf a | Branch (Tree a) (Tree a)

  fromList :: [a] -> Tree [a]
  fromList = Leaf

  toList :: Tree [a] -> [a]
  toList (Leaf x) = x
  toList (Branch (Leaf x) r) = x ++ toList r
  toList (Branch (Branch l1 l2) r)
               = toList (Branch l1 (Branch l2 r))

  append :: Tree [a] -> Tree [a] -> Tree [a]
  append = Branch
In a language that doesn't have tree pattern matching, the code wouldn't be this short and easy to understand, and I don't think it could be replicated just by having duck typing. Rust has pattern matching, but because it's primarily focused on lower-level concerns like pointers and memory ownership, pattern matching isn't this nice, because you have to pattern match on the pointer first.

Since a compiler is all about tree manipulation, support for tree pattern matching should be a boon.

[0]: http://h2.jaguarpaw.co.uk/posts/demystifying-dlist/

[1]: https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Pai...
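
For comparison, here's a rough OCaml sketch of the same structure (my own translation, not from the article or the book); it bakes the leaf list into the constructor, but the shape of the match is the same:

  (* Same right-leaning rewrite, expressed with an ordinary variant type. *)
  type 'a tree =
    | Leaf of 'a list
    | Branch of 'a tree * 'a tree

  let from_list xs = Leaf xs

  (* Rotate left-nested branches to the right, then concatenate leaf lists. *)
  let rec to_list = function
    | Leaf xs -> xs
    | Branch (Leaf xs, r) -> xs @ to_list r
    | Branch (Branch (l1, l2), r) -> to_list (Branch (l1, Branch (l2, r)))

  let append l r = Branch (l, r)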



So what's different about writing a compiler in 2024 than say 10, 20, or 30 years ago? When I started writing compilers in the 80's and 90's lex/flex and yacc/bison were popular. ANTLR came out but I never had a chance to use it. Everything after lexing and parsing was always hand rolled.



There are actually quite a few changes!

The most obvious change you'll see is the use of SSA, which became the dominant form of IR starting 25-30 years ago.
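
To make the SSA idea concrete, here's a minimal sketch (my own illustration, not tied to any particular compiler) of what an SSA-style IR can look like as a datatype: every temporary is defined exactly once, and values that depend on control flow are merged through phi nodes.

  (* Minimal SSA-flavoured IR sketch (illustrative only). *)
  type operand =
    | Const of int
    | Temp of string                                 (* each temp has exactly one definition *)

  type instr =
    | Binop of string * string * operand * operand   (* t3 = add t1, t2 *)
    | Copy of string * operand                       (* t4 = t3 *)
    | Phi of string * (string * operand) list        (* t5 = phi [bb1: t3; bb2: t4] *)
    | Ret of operand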

There's also been an increase in the importance of compiler IRs, and especially the concept of code passing through multiple IRs before reaching machine code.

Formal semantics has become more of a thing in the past decade or so. It's now routine that even weak memory models have a detailed formal model of how they work. In LLVM, it's now a requirement that you demonstrate formal correctness of new InstCombine transformations (which are essentially peephole optimizations).

The use of parser generators has really fallen into disrepute; everything has transitioned to handrolled parsers these days. Language standards themselves are starting to rely on context-sensitive keywords, which are hard to implement in a generator-based lexer/parser setup.
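
For a sense of what "handrolled" means in practice, here's a toy recursive-descent sketch in OCaml (my own illustration, not from any particular compiler); because it's ordinary code, special-casing a context-sensitive keyword or producing a precise error message is a local change:

  (* Toy hand-rolled recursive-descent parser (illustrative only).
     Grammar: expr := term ['+' expr]; term := factor ['*' term];
              factor := INT | '(' expr ')' *)
  type token = INT of int | PLUS | STAR | LPAREN | RPAREN

  type expr = Num of int | Add of expr * expr | Mul of expr * expr

  let rec parse_expr toks =
    let lhs, rest = parse_term toks in
    match rest with
    | PLUS :: rest' -> let rhs, rest'' = parse_expr rest' in Add (lhs, rhs), rest''
    | _ -> lhs, rest

  and parse_term toks =
    let lhs, rest = parse_factor toks in
    match rest with
    | STAR :: rest' -> let rhs, rest'' = parse_term rest' in Mul (lhs, rhs), rest''
    | _ -> lhs, rest

  and parse_factor = function
    | INT n :: rest -> Num n, rest
    | LPAREN :: rest ->
        (match parse_expr rest with
         | e, RPAREN :: rest' -> e, rest'
         | _ -> failwith "expected ')'")
    | _ -> failwith "expected an integer or '('"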

Optimizations have generally broadened in scope; we're now seeing whole-function level of optimization being the default way to look at stuff for analysis, and there's been a variety of techniques introduced to make whole-program optimization (aka LTO) much more tractable.

Another subtle but major shift is that compilers are increasingly reliant on inferred analysis from dumber representations over the original declared intent of the code. For example, SROA in LLVM (which breaks up structs so that individual fields can be independently allocated in registers) relies not on looking at the way the struct is declared but the offsets within the struct that various load and store operations use.

A final major shift is the trend towards support for ancillary tooling in the programming language space, so that the compiler isn't merely a tool that goes from source to object code, but is something that can be actively queried for information about the source code. Things like the language server, or smart formatting, or automatic refactoring tooling.



You can use the "classic" tool set and parse many programming languages. The real trick in writing compilers is taking advantage of the hardware (disclaimer: I designed and wrote middle-pass portions for IBM 360/370 processors (and clones), supercomputers from Cray and ETA Systems, and Sun 4 workstations among others, including some RISC systems that just disappeared around the bursting of the dot-com bubble). I tried and failed to optimize MPP systems from all of the "name" players in the 1990s. It kind of broke my heart...

That "middle-pass" approach that will let you address many targets is still valid; the trick is finding a sufficiently robust and flexible internal representation at the right level. You also have to be able to out-guess the chip vendors where before you could go to the architect or a complete "System" book and get the real scoop, including things you shouldn't do. Oddly enough, there is simultaneously useful and completely worthless documentation scattered about the internet.

You might want to take a look at Muchnick and Jones' _Program_Flow_Analysis_ (yes, it's from 1981) but chapters 4-6 can be applied at code-generation time. How that fits modern Intel processors (for example) is unknown. Idealizing your processor as a RISC-V might be a reasonable way to proceed but in the end, you'll have to translate the code for the target -- it will be reasonably straight-forward if you drive it all from tables but it's not trivial.



The book doesn't have 2024 in the title. I suspect they put it there because last time a post about this book was made, I noted that it was from 2022, not realizing that the book has now been released in 2024.

https://news.ycombinator.com/item?id=40940799

> So what's different about writing a compiler in 2024 than say 10, 20, or 30 years ago?

As far as I can tell, the main difference is that static single assignment (SSA) as an intermediate form was not the norm 30 years ago, but it is nowadays. Also, in newer books, it's more common to go over global register allocation now, whether that's graph coloring or linear scan register allocation. If you read old compiler books, the main optimizations they talk about are use-def chains, moving computations out of loops, and using the local and tree-based Sethi-Ullman register allocation algorithm.



ANTLR generates a visitor system from a BNF grammar, and you implement the logic in your chosen language (which I believe is similar to C++'s std::visit). lex and yacc generate state machines from a BNF grammar, and you implement the logic in the tools themselves.



Parser generators were always academic projects that had little relevance to making real-world programming languages -- where parsing is very easy to write by hand, and writing it by hand brings real benefits (i.e., you can get better error handling, etc.).

Today most languages are front-ends for LLVM IR, but LLVM is very slow and takes a long time to optimize. Many new languages target x86/arm directly with their own weakly optimized backends, and output an LLVM IR for "release builds".



yacc/bison was used for lex, bc, pcc, gcc, original awk, the bsd pascal compiler, eqn, m4 (!), and many other languages. it's still used for pcc, oawk, mawk, pari/gp, and units. that's just what i have sitting around in my downloads directory

and, while we're talking about ocaml, ocaml does use ocamllex and ocamlyacc for its own parser

so, while you can certainly do without parser generators, they have very commonly been used for making real-world programming languages. almost every programming language anyone here has ever heard of was first implemented with a parser generator. the main exceptions are probably fortran, cobol, algol, lisps, c, and pascal



I guess I wasn't very clear. I didn't mean to say that, as a historical matter, they were irrelevant.

I meant to say the idea of a parser generator is a solution to a problem that real-world langs don't really have. When writing a programming language, your issue isn't how much time the parser is going to take to write, or how complex it's going to be. The parser is a relatively trivial part of the problem.

Because language designers are often taught to develop languages in this fashion, many have relied on these tools. But given the modern view of compilers as "programming language UIs" and the focus on DX, I'd argue it's actively pathological to use a parser generator.

Much academic work has, until recently, focused on these areas -- whereas today, the bulk of the difficulty is in understanding SSA/LLVM/ARM/basic optimization/etc. details, which are "boring, circumstantial" etc. and not really research projects. I was just pointing this out since a lot of people, myself included, go down the EBNF parser-generator rabbit hole and think inventing a language is some formal exercise -- when the reality is the opposite: it's standard programming-engineering work.



oh, i agree that parsing is not the hard part of writing a compiler, and that compilers classes overemphasize it

but no language starts out as a 'real world lang'; every language is initially a toy language, and only becomes a 'real world lang' in the unlikely case that it turns out to be useful. and parser generators are very useful for creating toy languages. that's why virtually every real world lang you've ever used was implemented first using a parser generator, even if the parser you're using for it now is handwritten

having a formally defined grammar is also very helpful for dx features like syntax highlighting and automated refactoring



The one thing parser generators do is that they ensure that the language you implement actually matches the grammar you wrote. That’s still an important assurance to have.



Indeed, that's a defunct profile where everything should be private anyway. The repos there are 13/14 years old: these were experiments with using RPython to create languages, I'd guess when I was ~20. The point of those was to profile RPython. I have created real front-ends and compiler backends in C for non-trivial languages.

I will soon likely create a probabilistic programming language and compiler.



The parser defines the grammar. This is quite common in mainstream languages -- iirc, only after some years did Python get a formal description of its grammar.



This still isn't quite correct. Back in the day parsing was a much larger portion of the complexity of a compiler: performance was much more of a concern, as was memory usage. Parser generators were an attempt at helping with that, by allowing programmers to produce more efficient (e.g. table-driven) parsers than what they could have otherwise written by hand. They only really went out of fashion because A) computers got faster and bigger faster than programs got longer, so parsing became less and less of a percentage of total utilization, and B) you can get much better error messages out of recursive-descent parsers.



Fortran compilers did go through a phase when table-driven parsers were used, but it had the disadvantage of needing complicated lexers and statement classifiers that rely on semantic information. Fortran’s a hard language to parse, given its lack of reserved words, optional spaces, and many ambiguities.

The f18 compiler’s parser uses parser combinations to construct a backtracking recursive descent parser that builds a parse tree for the whole source file before doing any semantic analysis. This approach allows good error recovery without having to revert any updates to the symbol table.



that's an interesting approach! though probably not applicable to c and c++

i assume that by 'parser combinations' you mean parser combinators

what i meant about fortran is that the first fortran compiler didn't use a parser generator



When your computer was anemic, and could barely do the tasks required for it, eking out a few percent — or a 2x! — from an optimizer was important.

Now-a-days, the difference between "big compiler optimized" and "little compiler not optimized" can be quite dramatic; but, is probably no more than 4x — certainly within range of the distinction between "systems programming language" and "high tuned JITted scripting language". I think most people are perfectly fine with the performance of highly-tuned scripting languages. The result is that all of the overhead of "big compiler" is just ... immaterial; overhead. This is especially true for the case of extremely well-tuned code, where the algorithm and — last resort — assembly, will easily beat out the best optimizer by at least an order-of-magnitude, or more.



Wouldn’t just a simple case of bad code generation render little compiler into a toy one?

Repeating others, today’s compilers are really just “optimizing compilers”, there is no room for toying in production environments.



>just a simple case of bad code generation render little compiler into a toy one

If you find some time to go through the gcc Bugzilla you'll find shockingly simple snippets of code that get miscompiled (often by optimization passes), with fixes never backported to older versions that production environments like RHEL are using.



I realized with all the rhel systems I’m using, we are never using default toolchains on them. Just use those old systems to run stuff, even newer toolchains.



Ok, but that’s sw engineering issue.

I still insist that a production-grade compiler can't leave performance on the table. Which is where the current battlefield is.



I think a production grade compiler not only can, but must, leave performance on the table when the cost is correctness (unless the performance gain is incredibly high and the correctness loss is minimal). Correctness is not all important, but it is the most important thing. Unfortunately, compiler writers do not agree and they do silly things like "let's assume UB cannot ever happen and optimize based on that".



I do not agree in the general case. There are very useful DSL compilers which do not consider performance at all, but just compile to a target which does the optimization for them (JVM, LLVM IR or even just C)



if you aren't running on the gpu you're leaving 80+% of your computer's performance on the table. no optimizing compiler is going to make your legacy c or lisp or rust code run efficiently on the gpu, or even in most cases on a multicore cpu. nor, as thechao points out, can it compete with assembly-language programmers for simd vectorization on the cpu

in summary, optimizing compilers for c or pascal or zig or rust or whatever can only be used for code where considerations like compatibility, ease of programming, security, and predictability are more important than performance

probably the vast majority of production code is already in python and javascript, which don't even have reasonable compilers at all



No.

Main post is about a C compiler, post I responded is saying

> When your computer was anemic, and could barely do the tasks required for it

which I could only interpret as a machine that can run a single process at a time, so not really about gpus or what.



gcc was first released in 01987, but it didn't replace its bison parser for c until gcc 4.1 https://gcc.gnu.org/gcc-4.1/changes.html which was in 02006 https://gcc.gnu.org/releases.html, only 18 years ago, and 19 years after it was released. joseph myers first proposed doing this in 02004 https://gcc.gnu.org/legacy-ml/gcc-patches/2004-10/msg01969.h...

so gcc has literally been using a parser-generator-generated parser for c for more than half its existence, at which point it had already become the most popular c compiler across the unix world and basically the only one used for linux, which had already mostly annihilated the proprietary unix market. it was also imposingly dominant in the embedded space

and i think that kind of development path is fairly typical; getting a parser up and running is easier with a parser generator, but it can be tricky to get it to do strange things when that's what you want (especially with lalr, less so with peg)



Correct me if I'm wrong, but I think both Zig and Jai use LLVM as their default backend...at least, that's what I have seen via live streaming for Jai, and from Zig's repo.



I don't think it's moving away, because if you read the entire thread, the gaming community reacts negatively, to say the least, and they assure everyone that LLVM is not going anywhere any time soon.



I uh misread the title and thought someone built a C compiler in Scratch.

On topic, though: wouldn't a simpler language (maybe even a pseudo language) be a better target for a first learning compiler? I understand they don't build a full C compiler, but still. It looks to me like there's a lot of complexity added from choosing such a lofty target.



What do you think would make a better target? C maps pretty closely to assembly, so it seems like it would be the simplest. Maybe Pascal or BASIC, but most people these days don’t have experience with Pascal, and BASIC would probably be too simple for a full-length book.

For writing an interpreter or transpiler, there are probably better options, but for a true compiler I can’t think of a better choice than C (or at least a subset of C).



I took a compilers course in university and the course culminated in having a compiler for C Minus (a subset of C). The professor noted how each year the line count of the compilers was dropping as students found libraries or languages that made it easier. I think the evolution was Java -> Antlr -> Python. I used OCaml and emitted LLVM and blew that metric out of the water.



Far fewer, to the point of another student asking me what I even did for the project because I didn't have to implement any of the algorithms.



chibicc[0] complements this book nicely: in addition to a basic compiler, it guides you through writing the preprocessor and driver, which, although not addressed much in the literature, are the missing link between the compiler built from the book and real C projects.

[0] https://github.com/rui314/chibicc



ML (short for "meta-language") was originally designed for use in programming language research, and really shines for that purpose. And OCaml is probably the most pragmatic dialect for the purpose.

SML is very dated and the standard library and ecosystem lack many things that are considered table stakes for a viable programming language nowadays. And F# and Scala are fine as enterprise languages, but being tied to .NET and Java respectively makes them less desirable for implementing a language that won't itself be coupled to one of those runtimes.



Tree processing is best done in a language with decent algebraic datatypes and pattern matching. I would’ve preferred Standard ML, but, well, pot-ay-to, pot-ah-to. Haskell is another choice but the techniques you need to use there (while undeniably gaining you some possibilities) don’t really generalize to other languages, so you’re now writing a book about compiler construction in Haskell rather than just compiler construction. Ditto for Rust. Kotlin has deliberately anemic pattern matching. C# or F# leave you depending on Microsoft’s benevolence (sic). Metalua and Sweet.js both have decent ADT support but both are pretty much dead. Racket exists, I guess, and there are some pattern-matching libraries for normal Scheme as well, but the charisma malus of the parenthesis is real even if I don’t understand what causes it.

So OCaml was probably the most mainstream choice among the languages with appropriate tools, as funny as that sounds. And honestly, once you get over the syntax, it doesn’t actually have anything outrageous.



This looks cool, been interested in learning more about compilers since I did the basics in college. Lots of things seem to focus on making interpreters and never make it to the code generation part, so it's nice to see that this features information about that.



This book covers compiling to assembly whereas Crafting Interpreters only has a bytecode VM implementation. We'll see how good this book is when it drops, but I think that's a worthwhile feature that Crafting Interpreters punted on.



I remember seeing this a while back. That typesetting is beautiful. Thank you for bringing it up here, I might have to pick that one up.

I’ve been bored with building line-of-business applications, despite designing for complex requirements in high-volume distributed systems.

In fact I took a break from CS learning entirely 9 months ago. Even reading HN. I’ve been studying electronics and analog signal processing instead.

But now that I’ve built about 50 guitar pedals of increasing complexity, I feel ready to switch back to CS studies again.



What approach does this book take to error recovery?

Several "compiler light" style articles and books kind of walk over that part, and it can be non-trivial to do properly, especially with modern expectations.

I remember way back in the day, one of the early C compilers for the PDP, and, honestly, it could almost be argued that ed(1) had better error messages than what that thing produced.

A lot of simple compilers hit an error and just give up.

So, just curious what the approach was in this book.



Wirth's book does not implement a "real" programming language. Whatever your thoughts on Oberon and Pascal-like SHOUTCASE languages, it's largely irrelevant. Oberon is arguably a "real" language (and operating system), but Wirth's book does not cover the implementation of Oberon. It covers the implementation of Oberon0, an inarguably toy subset of Oberon. (Actually, "subset" is not even correct.) The example code has also diverged from the book, with Wirth abandoning the strategy described in the book for avoiding redundant initialization of the module static base, among other things.

Aside from that, I encourage everyone who cites Compiler Construction to actually work through the first 10% of the book and then count the number of errata.



The book is a very hands-on tutorial whereas Wirth's is basic literature for the general case.

While they teach similar content, they have a different approach.

There are literally thousands of compiler design books out there; I don't really see anything particularly comparable between this book and Wirth's.



Similar to studying OS concepts using Silberschatz's Operating System Concepts and Tanenbaum's Operating Systems Design and Implementation. The former only explains the theoretical ideas, while the latter is the documentation of an implementation.



Weird that this is about building a C compiler[0] in OCaml. I expected the implementation language to also be C, both for consistency and because I'm willing to bet that there are more people who can read C than OCaml.

[0] actually from the readme in the github repo[1] it seems to be a C subset, not all of C

[1] https://github.com/nlsandler/nqcc2



ocaml makes writing a compiler enormously more accessible, and learning to read ocaml, while it can be somewhat intimidating at first, is much easier than learning to write a compiler

(imagine a medieval accountant trying to learn to do long division in roman numerals. he'll be much better off learning the western arabic numerals fibonacci is so excited about)



I really really really want to get more into Ocaml but as far as I know there is no good support for a debugger to use in an IDE like VSCode or even vim. Everyone I talk to says they just do print debugging. It's easier to 'reason about your code' in FP, but I do NOT want to go back to a time where I coded without breakpoints. I use F# because of its first party tooling support



it does have a debugger with breakpoints, which even supports time-travel debugging (except on windows obviously), but i've never used it. it even has first-party ide integration: https://ocaml.org/manual/5.2/debugger.html#s:inf-debugger

i use debuggers a lot when i'm programming in assembly, and from time to time when i'm programming in c or c++, but in ocaml i've never needed one. it's not that i've never had bugs in ocaml, but they tend to be of a different flavor, a flavor for which breakpoints are of little value

it sounds like your f# experience is different; what kinds of bugs have you recently found the debugger valuable for in f#?

debug logging is usually a superior option to breakpoint debugging for problems to which both are applicable, because debug logging shows you the whole history of your program's execution rather than just a single point in time. breakpoint debugging requires a lot of manual labor, painstakingly operating the machine to navigate the execution to the state that has the problem. it's like grinding in a video game. i'd rather program the computer to do that labor for me, at which point i no longer need the debugger

(except in c and assembly)



> it does have a debugger with breakpoints, which even supports time-travel debugging (except on windows obviously), but i've never used it. it even has first-party ide integration: https://ocaml.org/manual/5.2/debugger.html#s:inf-debugger

1. I am developing on windows so that's an issue for me and

2. I don't use emacs, I use VScode and I've not been able to get the experimental debugger working for the VScode plugin.

It's not that the debugger is purely for fixing bugs; I use it as an active part of development. I want to be able to freeze the program and inspect values as they are running. I may know what the types or state of the program are just by viewing the code, but I want to be able to inspect the actual data for myself as the program is running.

> debug logging shows you the whole history of your program's execution rather than just a single point in time

breakpoints also provide stack traces, so they provide a kind of history as well. I'd rather inspect and interact with a program than dig through thousands of lines of logs

> breakpoint debugging requires a lot of manual labor, painstakingly operating the machine to navigate the execution to the state that has the problem

I see things the opposite as you: print debugging is tedious and requires restarting the program from the beginning whenever you make a change to the source, unless you are constructing your program piecemeal with the repl, which I consider to be extremely tedious as well. To me, a debugger with breakpoints is a far more efficient way to code than print debugging.

I think there is a cultural difference in software engineering between people who use debuggers and people who don't. John Carmack once pointed out that people who come from the game dev and Windows/PC world use debuggers while people from the linux and web dev world tend not to. It seems to be a matter of preference/taste, and I think FP programmers seem to have a distaste for debuggers and graphical debugging/development environments



"... I think FP programmers seem to have a distaste for debuggers ..." I'm unsure that I'd make this assertion so broadly. For Haskell, I'm usually left using Debug.Trace because I've found the traditional symbolic-step debugger is less than ergonomic in the face of lazy evaluation. In Common Lisp, the debugger is my best friend.

"... and graphical debugging/development environments" The loud ones might be saying they use Arch btw (on ThinkPads no less) and lean heavily on using neovim inside the current popular Rust rewrite of tmux, but I personally don't care for it.

VS Code is neat. I mostly still use Emacs because that's where the tooling investment has historically been, but my ideal is definitely much closer to Smalltalk-meets-Oberon.



mine too! except, not so much of a closed-world system? a lot of what i like about emacs is that it has some of that same live-malleability that smalltalk and oberon have

if you've tried godot i'm interested to hear what you think about it



Unfortunately, I have no idea what godot is. I would like to know more though if you're so inclined.

(Surely not the game engine? That's the only thing my disambiguation machinery can come up with.)



> John Carmack once pointed out that people who come from the game dev and Windows/PC world use debuggers while people from the linux and web dev world tend not to. It seems to be a matter of preference/taste, and I think FP programmers seem to have a distaste for debuggers and graphical debugging/development environments

i think that's true! but i don't think it's purely a matter of preference; it's also a matter of what the ecosystems support well and what you're trying to achieve. lisp is a huge exception to the rule about fp programmers; lisp systems, even including emacs, have extremely strong debugging support. but non-lisp fp languages tend to heavily emphasize static typing, thinking things through ahead of time, and correctness by construction, which reduce the need for debugging. but those are more valuable for writing compilers than for writing games or uis in general, where it's hard to define what counts as 'correct' ahead of time but easy to recognize it when you test it interactively

webdev of course has the problem that you can't stop your http response handler while the browser is waiting for a response. and, often, it's sadly not very concerned with ux. and it's often concerned with operating systems in a way where you have to debug problems after they occur, in part because of the scale of the problems

automated testing is another ecosystem thing that reduces the need for debuggers; to a significant extent it competes with static typing

one of the things i really appreciate about godot is being able to continuously adjust parameters and observe variables in my games as they're running. godot is motherfucking awesome, man. i definitely don't have a distaste for graphical debugging and development environments!

> breakpoints also provide stack traces, so they provide a kind of history as well. I'd rather inspect and interact with a program than dig through thousands of lines of logs (...) print debugging is tedious and requires restarting the program from the beginning whenever you make a change to the source

oh, see, i have programs like grep and emacs that dig through thousands of lines of logs for me. often when i'm debugging from logs i don't run the program at all; i just look at the logs and the source code. sometimes the program is running someplace i can't interact with it—memorably, on some occasions, on a satellite out of range of the groundstation. and usually exceptions on python or java give me a pretty decent stack trace, though other log messages unfortunately don't

there's another ecosystem support issue here—although i've sometimes developed on systems that supported fix-and-continue (cmucl, gforth, squeak, basic-80, godot), python and gcc support it very poorly or not at all. so for me i have to restart the program from the beginning whenever i make a change to the source in any case, whether i'm doing printf debugging or not. godot, again, is a very nice exception to this rule, and incidentally lets me add print debugging to the game while it's running

one of the great things about a breakpoint debugger, from my point of view, is that it makes it possible to add logging after the fact to a running program without editing the source or restarting it

i really appreciate you sharing your experience!



If you're literally "from the Linux world", you're probably younger and less experienced, and likely from a hobby background rather than CS or engineering.

Developers in Unix shops before the Linux era used debuggers.

Obviously, the GNU project developed a debugger for an audience. GNU wouldn't be a complete replacement for proprietary systems like Unix with only compilers, but no debugger.



There are many disadvantages to writing a compiler in c (and I have done it). For me, the biggest is simply that it is very verbose for the type of things that you do commonly in a compiler. I have written a self-hosting compiler that transpiles to c and the generated code is about 5x as long as the original code. I could go through and clean it up and probably get it down to 2-3x, but there are certain concepts that cannot easily be expressed compactly in c (except possibly with preprocessor abuse that I don't have patience for). A simple example is that if you want to represent a union type you have to manually define 2 structs and an enum. Matching with switch is also verbose as you have to both match and manually assign the value in the match case. Again, you can use macros to do this though at that point you arguably aren't even using c but rather a hybrid language. A language like ocaml, makes it much more straightforward to define and match on unions as well as gives you other nice high level things like gc so you can focus on the compiler itself and not low level details.
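
As a rough illustration of that point (my own sketch, not from the book): the OCaml below is the complete definition and dispatch for a small expression type, where the C equivalent needs an enum for the tag, a union for the payload, a wrapper struct, and a switch that reads the right union member by hand.

  (* Sketch: a tiny AST as a variant type, plus an evaluator by pattern match. *)
  type expr =
    | Int of int
    | Neg of expr
    | Add of expr * expr

  let rec eval = function
    | Int n -> n
    | Neg e -> -(eval e)
    | Add (l, r) -> eval l + eval r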



For hobbyist compiler implementations, right? Compilers for the most popular languages are either written in C/C++, or self-hosted.

You can write compilers in almost any language. I fail to see how C, C++, or even Java or Python aren’t the right tool for the job here. I like pattern matching too, but given that hundreds of successful production compilers have been written without pattern matching, it’s surely just a personal preference.



I’ve worked on multiple compilers in industry that are written in Ocaml. A number of industrial static analyzers are written in Ocaml too (eg, Infer from Facebook/Meta). Yes, LLVM and GCC are the big ones written in the C/C++ family but they don’t represent everything.



And the Go, Java, Ruby, JavaScript, C#, Typescript, PHP, Kotlin, R compilers, and so on.

But even for hobby projects, it’s just a matter of personal preference. OCaml is great for implementing compilers. So are Go, C++, and Java.



I mean, we _are_ talking about a book which invites you to build your own toy C compiler ^^

Nevertheless, OCaml is very strong in compiler design. For example Rust and Hack were written in OCaml initially.

That said, you are not wrong that compilers needing the very last bit of performance, like the JVM and LLVM, tend to be written in C++.

But the dividing line tends to be high performance versus very high performance, not toy versus production.

Java and Python are suitable for implementing a toy compiler, and the author invites you to use any language you like. It's just that the reference implementation uses OCaml.

I would however argue that using C++ is quite advanced since it does not have pattern matching, and using C is just masochism. You will be fighting against the language to do even trivial things instead of fighting the actual problem at hand.



I totally agree that OCaml is a great language to write a compiler. I’ve used Rust and Haskell, and loved them both.

I was more so pushing back on the implication that if it's not OCaml, it's not the right tool for the job.

Like, I honestly can't think of a mainstream language in which it would be hard to implement a C compiler.



I don't really need to know how to build a compiler, and I've got enough other "don't need but am doing out of curiosity" things going on that I don't need any more of those, but if it wasn't $70 I'd probably get it anyway. It would be interesting to compare it to the last building-a-compiler book I read and see how things have changed. Based on the comments here a lot has changed.

That last book was Allen Holub's "Compiler Design in C", which is from 1990. Here's how the blurb on the back describes it:

> Allen I. Holub's Compiler Design in C offers a comprehensive, new approach to compilers that proves to be more accessible to computer science students than the other strictly mathematical books.

> With this method in mind, the book features three major aspects:

> (1) The author develops fully functional versions of lex and yacc (tools available in the UNIX® operating system to write compilers), (2) he uses lex and yacc to develop a complete C compiler that includes parts of C that are normally left out of compiler design books (eg., the complete C "type" system, and structures), and (3) the version of yacc developed here improves on the UNIX version of yacc in two ways (error recovery and the parser, which automatically produces a window-oriented debugging environment in which the parse and value stacks are visible).

It's out of print, but the author has made a searchable PDF available on his website [1]. I found it quite useful.

Holub seems to like the "learn by doing" approach. He's got another book, "Holub on Patterns" that teaches all the design patterns from the gang of four book organically by developing two programs that together use all of those patterns. The two programs are an embedded SQL interpreter and a GUI application for Conway's Game of Life.

PS: Ooh. It occurred to me that No Starch Press books are often available on O'Reilly Learning. I checked and this one is there. So I guess it is going on my "don't need but am doing out of curiosity" pile after all.

[1] https://holub.com/compiler/



I’m working through this book now and really enjoying it!

Each chapter of the book includes a test suite to run against the code you’ve written.

In some ways, the tests in this book feel very similar to the labs in the book Computer Systems: A Programmer's Perspective — which is high praise!



The book is not yet published but has been in early access for a couple of years.

Was featured here a couple of times.

Unfortunately the timing of the release is quite awkward with regard to the summer holidays. Will take a look at it next year.



Many compiler-related books take inspiration from the "Dragon Book" (Compilers: Principles, Techniques, and Tools), so there are likely lots of books with similar-looking covers.



Point was more that there are a bunch of compiler-related books featuring a dragon (and a knight or other character). The original also has a different-looking dragon-and-knight themed cover for every edition.

Like, if I saw the book on a shelf I would instantly guess it's related to compilers. And I bet that's a completely intentional homage to the original.



I would love to see a book that talks about going all the way to generate machine code, i.e., not stopping at generation of assembly.

Alternatively, I would like to learn about not just how to make a compiler, but also simultaneously a debugger, hot-reloading, etc.



Writing a simple assembler is trivial. Even macro assemblers are very easy.

However, it's also boring.

Nevertheless the contents of the book cover all the techniques required to write an assembler, if you'd really like to



I understand that an assembly file can be parsed in the same way. However, I want to learn about the machine instructions down to the level of bits, and likewise the layouts of binary files. Unless I am able to go all the way to machine code loaded in memory, I would not know where in memory to add a breakpoint instruction when a developer wants one on a line of code.

If there is some library that can help create machine code from assembly instructions on a line by line basis (at least as opposed to invoking a separate program that generates the entire binary collectively from the assembly code), that could also work.

In my case, I already know enough of the lexer, parser, etc., parts. What's missing is going all the way to making a debugger, profiler, etc.
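
As a tiny sketch of what "down to the bits" can look like (my own example; the encodings are written from memory of the x86-64 tables, so double-check them against a real assembler), here are a few instructions emitted as raw bytes, including the int3 byte a debugger patches in for a software breakpoint:

  (* Illustrative only: hand-encoding a few x86-64 instructions as bytes.
     A debugger sets a software breakpoint by overwriting the first byte of
     the target instruction with int3 (0xCC) and restoring it on the trap. *)
  let int3 = Bytes.of_string "\xCC"
  let ret  = Bytes.of_string "\xC3"

  (* mov rax, imm64 = REX.W prefix 0x48, opcode 0xB8 + reg (0 for rax),
     followed by the 64-bit immediate in little-endian byte order. *)
  let mov_rax_imm64 (n : int64) =
    let b = Bytes.create 10 in
    Bytes.set b 0 '\x48';
    Bytes.set b 1 '\xB8';
    Bytes.set_int64_le b 2 n;
    b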



asmjit is great; I used it for building a primitive (but quite fast) query engine for in-memory graphs. It made it simple to go from query AST to native instructions, most of which was "call this other compiled function".



Building a debugger and profiler is quite an advanced task compared to building an assembler though ^^

Also much of that work is heavily dependent on the used operating system.

Nevertheless, I'm wishing you all the best on your journey!



There can be weird interactions unless there are strong enough limits on what kind of expressions the assembler allows. Especially if it supports conditional assembly and loops in the macros. One ugly way around it -- which causes its own headaches -- is to introduce pass-sensitive conditional assembly (as in "if in pass 1/2/...").

It's also "fun" if some instructions come in different sizes... and you may need stronger restrictions on allowed expressions in that case.



cool, remember some tutorials online i think from the same author (not 100% sure) doing stuff around c compilation in python. shame it's not in a language i want to learn. the other book on compilers i got is almost too heavy to lift! :D



Somewhat unrelated: Is there a book that walks you through building a database system from storage to queries, optimizer, execution, indexing, transactions, etc?



transaction processing by gray (rip) and reuter was pretty close back in the 90s. i don't think it covered query optimization because it's really about tp monitors rather than databases, but, perhaps surprisingly, it does cover the other topics you're asking about



My experience (and I admit I may be too biased given years of prior C/C++ experience) is that Rust's syntax is a necessity, since no other mainstream languages besides C/C++ are as low-level as Rust.

Most mainstream languages have a GC, and don't support distinguishing between values on the stack or references, don't need to deal with lifetimes or don't provide the safety you get with them, etc.

I'm curious though, could you give an example of syntax you consider convoluted, and how you would do it instead?



Oh, I absolutely see it.

My point was that it's necessary. How would you implement the same features Rust and C++ have, without garbage collection, but with simpler syntax?



So you are very careful not to depend on any of their toxic complex features. In the end, better not use them at all.

Even plain and simple C99 compilers can be reasonably written by a solo dev, so a motivated small team...



I do depend on their features, and I see the value in doing that.

For example, at $WORK I had to reimplement Apache's mod_rewrite.c in Rust, and I made it 100x faster for our particular use-case.

I could've done that in C as well, sure, but the simplicity of the C language just moves the complexity to the code itself; whereas, with Rust, I whipped out a prototype in just 3 days, and was able to freely pass around pointers to data, with zero allocations, zero copying, and every time it compiled I knew it was guaranteed to be safe.

You can't get that safety in C. You can't get that speed in Java.

You can't do this in any other language that does not have these features. I absolutely agree that the syntax is horrible... but I see no other way to achieve this.



I severely disagree, this is quite short sighted.

The obscene complexity of the Rust language (like C++) puts building an alternative toolchain beyond anything reasonable: that reason alone is sufficient to avoid it.

You can have as many "features" as you want; the anti-feature of absurd/grotesque syntax complexity _alone_ buries it.

There is nothing more to say or argue about. This is dead simple.

联系我们 contact @ memedata.com