No Semicolons Needed

Original link: https://terts.dev/blog/no-semicolons-needed/

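## Roto Statement Separation: A Language Design Exploration

The author is designing a new scripting language, Roto, and is tackling the problem of optional semicolons, a common goal for improving readability. The post details a survey of how 11 different languages handle statement separation without explicit terminators.

The approaches vary widely. **Python** uses indentation and explicit line joining. **Go** inserts semicolons in the lexer, relying on simple rules and error checking. **Kotlin** bakes newline handling into its grammar, which leads to complex rules. **Swift, Gleam and Lua** largely ignore newlines and parse as far as they can. **Ruby, R and Julia** split statements by line and allow continuation when a statement is incomplete. **Odin** blends the approaches of Go and Python. **JavaScript**'s automatic semicolon insertion is known for being complex and is often discouraged.

The main takeaways are trade-offs between simplicity, ambiguity and developer experience. Some languages prioritize clear rules (Python, Gleam), while others rely on tooling to catch mistakes (Go, Swift). The author proposes guidelines for Roto: prefer clear, simple rules, lean toward newline-based separation, and provide strong tooling to prevent ambiguity. Ultimately, the best approach depends on the language's overall design and syntax.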

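## Hacker News Discussion Summary: Language Design and Semicolons

An article titled "No Semicolons Needed" sparked a Hacker News discussion about common pitfalls in language design and the benefits of explicit syntax. A key theme was avoiding features that need to be retrofitted later, such as adding `const`/`mut` qualifiers or a boolean type, because the initial omission creates long-term complexity.

Many commenters argued for explicit statement separators (such as semicolons) for clarity and reduced ambiguity, especially when parsing complex expressions. They pointed out that relying on indentation (as in Python) can lead to errors, particularly when copy-pasting or when IDE formatting is imperfect. Some felt that modern IDEs can handle semicolon insertion, which lightens the load on developers. Others defended indentation-based languages as readable and as enabling visually clean code. Functional languages such as Haskell and Lisp were noted as exceptions because their syntax does not depend on traditional statements.

The discussion also touched on the trade-off between flexibility and strictness in language design, with some advocating languages that enforce consistent formatting. Ultimately, the debate highlights the tension between minimizing typing effort and maximizing code clarity and maintainability.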

I'm making a scripting language called Roto. Like so many programming languages before it, it has the goal of being easy to use and read. To that end, many languages make the semicolons that delimit or terminate statements optional. I want that too!

This sounds simple, but how do they implement that? How do they decide where a statement ends without an explicit terminator? To illustrate the problem, we can take an expression and format it a bit weirdly. We can start with an example in Rust:

fn foo(x: u32) -> u32 {
    let y = 2 * x
          - 3;
    y
}

In Rust, that is perfectly unambiguous. Now let's do the same in Python:

def foo(x):
    y = 2 * x
      - 3
    return y

We get an "unexpected indent" error! Since Python doesn't require semicolons, it gets confused. As it turns out, many languages have different solutions to this problem. Here's Gleam, for instance:

fn foo(x) {
    let y = 2 * x
          - 3
    y
}

That's allowed! And if we echo foo(4) we get 5 just like in Rust. So, how does Gleam determine that the expression continues on the second line?

I think these differences are important, especially when you're interested in programming language design. The syntax of the language needs to be intuitive and clear for the programmer, so that they can confidently explain how an expression gets parsed.

Usually, those syntactic rules are obvious; in many languages, function arguments are delimited by (). The rules for newline-separated statements are often vaguer and differ from language to language. Users are often told either to defensively put semicolons in their code or not to worry about it. Both seem like (minor) failures of language design to me.

How then do I find an approach for Roto that doesn't have these problems? I decided that my best course of action was to look at what 11 (!) languages are doing and how their approaches stack up. This post is that exploration. I don't really have an answer on what's best, but I hope this is still an informative overview.

NOTE: I'm not fluent in all the languages below; in fact, some of them I've barely used. I've tried to cite sources where possible but I might still have gotten some details wrong. Let me know if you find any mistakes!

This is a long post, so I'd understand if you just want to jump to your favorite language. The sections below cover, in order: Python, Go, Kotlin, Swift, JavaScript, Gleam, Lua, R, Ruby, Julia and Odin.

Python

Let's start with the language most famous for whitespace sensitivity ([citation needed]). Python assumes that one line is one statement. In its grammar, it describes what it calls logical lines. These are constructed from one or more physical lines, i.e. the lines you see in your editor.

There are 2 ways that physical lines can be joined:

either explicitly with the \ token,
or implicitly while the end of the line is enclosed between delimiters such as (), [], {} or triple quotes.

The reference gives these examples:


if 1900 < year < 2100 and 1 <= month <= 12 \
   and 1 <= day <= 31 and 0 <= hour < 24 \
   and 0 <= minute < 60 and 0 <= second < 60:
        return 1


month_names = ['Januari', 'Februari', 'Maart',
               'April',   'Mei',      'Juni',
               'Juli',    'Augustus', 'September',
               'Oktober', 'November', 'December']
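To see the two joining rules in action, here is a minimal sketch that uses the standard library's tokenize module. It relies on the fact that the tokenizer emits a NEWLINE token at the end of each logical line (and a separate NL token for line breaks that do not end one). The helper name logical_line_count and the sample strings are made up for this illustration.

```python
import io
import tokenize


def logical_line_count(source: str) -> int:
    """Count logical lines by counting the NEWLINE tokens in the source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(1 for tok in tokens if tok.type == tokenize.NEWLINE)


explicit = "total = 1 + \\\n    2\n"          # two physical lines joined with a backslash
implicit = "names = ['a',\n         'b']\n"   # two physical lines joined inside brackets
separate = "x = 1\ny = 2\n"                   # two ordinary lines

print(logical_line_count(explicit))  # 1
print(logical_line_count(implicit))  # 1
print(logical_line_count(separate))  # 2
```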

Having only these rules would be fairly error prone. You can see this by considering the example from the introduction again:

y = 2 * x
  - 3

If you forget to put a backslash at the end of the first line, Python would simply treat that as two statements. Luckily, Python has a solution: it strictly enforces correct indentation. Since the - 3 is on a new line, it must have the same indentation as the line before.
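This is easy to check without running anything, because the built-in compile function only parses the code. The sketch below is an illustration with a made-up helper, try_compile, that reports whether a snippet is accepted.

```python
def try_compile(source: str) -> str:
    """Parse the source without executing it and report the outcome."""
    try:
        compile(source, "<example>", "exec")
        return "ok"
    except IndentationError as err:
        return f"IndentationError: {err.msg}"


indented = "y = 2 * x\n  - 3\n"    # continuation attempt without a backslash
flush = "y = 2 * x\n- 3\n"         # same tokens, no indentation
wrapped = "y = (2 * x\n  - 3)\n"   # implicit joining inside parentheses

print(try_compile(indented))  # IndentationError: unexpected indent
print(try_compile(flush))     # ok
print(try_compile(wrapped))   # ok
```

Note that the flush variant is accepted: it is two statements, and the second one (- 3) is an expression whose value is silently discarded. The indentation check only catches the mistake when the second line is actually indented.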

Now let's consider the consequences of Python's approach. It is quite principled and strict about its statement separation. It is also very unambiguous. It's easy to keep the rule of "one line, one statement" in your head while programming with the two exceptions being quite explicit.

A somewhat ironic consequence for an indentation-based language, however, is that Python's rules have encouraged the community to embrace explicit delimiters. For example, the ubiquitous code formatters black and ruff both prefer parentheses over backslashes.


y = long_function_name(1, x) \
  + long_function_name(2, x) \
  + long_function_name(3, x) \
  + long_function_name(4, x)


y = (
    long_function_name(1, x)
    + long_function_name(2, x)
    + long_function_name(3, x)
    + long_function_name(4, x)
)

I think Python's system is pretty good! It's simple, it's clear and the indentation rules are likely to catch any mistakes. From my time writing Python, I don't remember this getting in my way much, except for sometimes having to wrap expressions in (). I was never really surprised by this behavior.


Go

Go's approach is very different from Python's. Go's official book states:

Like C, Go's formal grammar uses semicolons to terminate statements, but unlike in C, those semicolons do not appear in the source. Instead, the lexer uses a simple rule to insert semicolons automatically as it scans, so the input text is mostly free of them.

The first thing that I dislike about this is that it encourages thinking of semicolons being inserted instead of statements being terminated. I find that to be a roundabout way of thinking about the problem. But alas, this is what we're dealing with. I want to highlight something in that text: the semicolons are inserted by the lexer. The reasoning behind this is that it keeps the rules for automatic semicolon insertion very simple.

Go's lexer inserts a semicolon after the following tokens if they appear just before a newline or a }:

an identifier,
a basic literal,
or one of break, continue, fallthrough, return, ++, --, ) or }.

Simple enough! Let's go to our introductory example:

x := 4
y := 2 * x
   - 3

Those lines end with numbers so the lexer inserts semicolons:

x := 4;
y := 2 * x;
   - 3;

Just like Python, that seems error prone! But as we run this, Go has a nice surprise in the form of an error (not just a warning!):

-3 (untyped int constant) is not used

So, it has some guardrails in place to prevent mistakes. Even when I replace -3 with more complex expressions it usually errors on unused values that might occur by accident. That's good!

This post gives us an example that doesn't error and where the newline changes behavior. It first requires a bit of setup:

func g() int {
    return 1
}

func f() func(int) {
    return func(n int) {
        fmt.Println("Inner func called")
    }
}

And then these snippets have a different meaning:

f()
(g())

f()(g())

I'm not too worried about this to be honest; it looks like this requires pretty convoluted code to be considered ambiguous.

Now remember that this semicolon insertion is done entirely by the lexer. That means that semicolons sometimes get inserted at unexpected places:

if x
{
  ...
}

foo(
  x
)

Both of these result in parse errors. The fix is to adhere to Go's mandatory formatting style:

if x {
  ...
}

foo(x)

foo(
  x,
)

That's fair, even if it seems a little pedantic. I like these formatting choices, but I'd prefer if the "wrong style" was still syntactically valid and a formatter would be able to fix it. As it stands with Go, its formatter also errors on these invalid snippets. This strictness also seems to lead to confusion for newcomers every once in a while, particularly if they come from languages like Java, where braces are often put on a separate line.
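Because the rule only looks at the last token of each line, it can be sketched in a few lines. The following is a rough illustration in Python, not Go's actual lexer: the tokenizer is a simplified regular expression that ignores comments, raw strings and rune literals, it applies only the newline part of the rule, and the names TOKEN, triggers_semicolon and insert_semicolons are invented for this example.

```python
import re

GO_KEYWORDS = {
    "break", "case", "chan", "const", "continue", "default", "defer", "else",
    "fallthrough", "for", "func", "go", "goto", "if", "import", "interface",
    "map", "package", "range", "return", "select", "struct", "switch", "type",
    "var",
}
# The tokens from the rule described in the text.
TERMINATING_KEYWORDS = {"break", "continue", "fallthrough", "return"}
TERMINATING_PUNCTUATION = {"++", "--", ")", "}"}

TOKEN = re.compile(r'''
    [A-Za-z_]\w*          # identifiers and keywords
  | \d+(?:\.\d+)?         # simple number literals
  | "(?:[^"\\]|\\.)*"     # simple string literals
  | \+\+ | -- | :=        # a few multi-character operators
  | \S                    # any other single character
''', re.VERBOSE)


def triggers_semicolon(token: str) -> bool:
    """Return True if a newline after this token becomes a semicolon."""
    if token in TERMINATING_KEYWORDS or token in TERMINATING_PUNCTUATION:
        return True
    if token in GO_KEYWORDS:
        return False
    # Whatever is left and starts like a name, number or string is an
    # identifier or a basic literal.
    return token[0].isalnum() or token[0] in '_"'


def insert_semicolons(source: str) -> str:
    """Append ';' to every line whose last token triggers the rule."""
    result = []
    for line in source.splitlines():
        tokens = TOKEN.findall(line)
        if tokens and triggers_semicolon(tokens[-1]):
            line = line.rstrip() + ";"
        result.append(line)
    return "\n".join(result)


print(insert_semicolons("x := 4\ny := 2 * x\n   - 3"))
print(insert_semicolons("foo(\n  x\n)"))
```

The first call reproduces the three semicolon-terminated lines from the introductory example. The second call yields a semicolon directly after x, inside the argument list, which is why the multi-line call without a trailing comma fails to parse.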

So, Go's approach is simple, but in my opinion not very friendly. It is saved by disallowing some unused values, but I'm not competent enough with writing Go to evaluate whether that covers all ambiguous cases.


Kotlin

As far as I can tell, Kotlin does not have simple "rules" for when a newline separates two statements, like Python and Go have. Instead, it makes newlines an explicit part of the grammar. So, for each construct where a newline is allowed, it opts into it explicitly. I'll spare you the BNF-like notation, but it seems to boil down to this:

Statements are separated by one or more newlines or ;.
If a construct is unambiguously incomplete at the end of a line, it is allowed to continue on the next line.
Delimited constructs (like function calls) allow newlines within them.
Newlines are not allowed before (, [ or {.
Binary operators seem to fall into two camps:
- &&, ||, ?:, as, as?, . and ?. allow newlines on both sides of the operator,
- the rest of the operators only allow a newline after the operator.
Prefix unary operators allow a newline after themselves.

This approach of baking newline handling into the grammar gives the language designers a lot of control, but this comes at the cost of simplicity and transparency. This approach is like the opposite of Go's. It can get pretty nuanced and I can't find a clear explanation of it.

My best attempt at summarizing this approach: an expression is allowed to continue on the next line if that is unambiguous in the grammar.

After that theory, we try our example:

val x = 4
val y = 2 * x
      - 3
print(y)

This gives us 8 with an unused value warning. That makes sense because -, + and many other infix operators only allow newlines after the operator. However, the logical operators && and || allow newlines on both sides.


val y = false
      || true


val y = 1
      + 2

Another case where the "continue if unambiguous" approach gets into trouble is when very similar operators have different rules. Kotlin has the :: and . operators to respectively access a method and a field of a class. Of these two, . allows newlines on both sides, but :: doesn't. That is because :: is also a valid start of a callable reference expression.

val x = foo
  .bar      

val y = baz
  ::quux
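One way to picture the difference between these operators is as a lookup: does the next line start with an operator that is allowed to follow a newline? The sketch below is only a mental model written in Python, not how the Kotlin parser is implemented. The operator list comes from the summary earlier in this section (with the safe-call operator written as ?.), and the names LEADING_CONTINUATION and continues_previous_line are made up.

```python
import re

# Operators that, per the summary above, may appear at the start of a line
# and still continue the previous expression.
LEADING_CONTINUATION = re.compile(
    r"\s*(?:&&|\|\||\?:|\?\.|\.(?!\.)|as\??(?!\w))"
)


def continues_previous_line(line: str) -> bool:
    """Return True if the line starts with a newline-tolerant operator."""
    return LEADING_CONTINUATION.match(line) is not None


print(continues_previous_line("      || true"))  # True
print(continues_previous_line("      + 2"))      # False
print(continues_previous_line("  .bar"))         # True
print(continues_previous_line("  ::quux"))       # False
```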

Since newlines are part of the grammar explicitly and therefore plainly disallowed in some places, I expected that this would give me an error because + only allows newlines after the operator in the grammar:

val y = (
    1
    + 2
)

But it works! I think it makes sense that they added this behavior, but I cannot find traces of this behavior in the specification. If somebody could show me where this is documented, I'd love to see it!

The vibe that I get from this implementation is that Kotlin's designers try really hard to make the behavior intuitive, regardless of how many rules and exceptions they need. I guess that if people never run into problems with it, then they don't need to understand it fully either. I'm not sure I agree with this fully, but it's a somewhat reasonable position.

This Stack Overflow answer echoes that sentiment:

The rule is: Don't worry about this and don't use semicolons at all [...]. The compiler will tell you when you get it wrong, guaranteed. Even if you accidentally add an extra semicolon the syntax highlighting will show you it is unnecessary with a warning of "redundant semicolon".

We could characterize this approach as "don't worry, your IDE will fix it" and I guess that's fair when the company behind the language creates IDEs. Although if that is truly the consensus in the community, they've done a pretty good job!

Another potential problem might be that all these complex rules might make it difficult to write custom parsers for Kotlin. I wouldn't want to be the person responsible for maintaining its tree-sitter grammar for instance.


Swift

There is a somewhat obvious approach that hasn't come up yet: just parse as far as you can ignoring newlines. Swift takes that approach and it's not hard to see why:

let x = 4
let y = 2 * x
      - 3
print(y)

That prints 5 as we would expect. The downside is that this prints 5 too:

let x = 4
let y = 2 * x
- 3
print(y)

But that's not too bad if you just have the rule that the language does not have significant whitespace. That's a rule people should be able to remember. Interestingly, Swift does have some significant whitespace to prevent mistakes. For example, it is not allowed to put multiple statements on a single line:

var y = 0
let x = 4 y = 4

They seem to have decided to ignore that in their grammar specification, but it is part of the compiler.

With this approach, the most confusing examples I can find are around symbols that can be both unary and binary operators (our eternal nemesis). This snippet prints 8:

let x = 4
let y = 2 * x
      -3
print(y)

Why? Because Swift has some special rules for parsing operators. If an operator has whitespace on both sides or on neither side, it's parsed as an infix operator. If it only has whitespace on the left, it's a prefix operator. And finally, if it only has whitespace on the right, it's a postfix operator. This means that this also parses as two statements:

let y = 2
      -foo()

Swift's designers seem to be aware of this problem (obviously) and therefore emit a warning on unused values, which would trigger on the example above. That should catch most erroneous cases.
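The whitespace rule for operators is compact enough to write down as a function. This is a simplified sketch in Python that only handles single-character operators and treats the start and end of the input as whitespace; Swift's real rules have more special cases. The function name classify_operator is hypothetical.

```python
def classify_operator(source: str, index: int) -> str:
    """Classify the operator character at source[index] by its surroundings."""
    left_space = index == 0 or source[index - 1].isspace()
    right_space = index + 1 >= len(source) or source[index + 1].isspace()
    if left_space == right_space:
        return "infix"  # whitespace on both sides or on neither side
    return "prefix" if left_space else "postfix"


spaced = "let y = 2 * x\n      - 3"
tight = "let y = 2 * x\n      -3"

print(classify_operator(spaced, spaced.index("-")))  # infix
print(classify_operator(tight, tight.index("-")))    # prefix
```

Since a newline counts as whitespace, the - in the second snippet has whitespace only on its left, so it becomes a prefix operator and starts a new statement.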

Another tweak they made is that the opening parenthesis of a function call must be on the same line as the name of the function. If it isn't, then the expression will end after the first line. For example, the snippet below is parsed as two lines. They check whether the ( is at the start of the line and do not continue parsing if that's the case. The same is also done for [. This is a pretty good rule! You can check the JavaScript section to see how a language can get this wrong.

let y = x
  (1)

It's also worth discussing error reporting for syntax errors. Swift cannot easily guess where a statement is supposed to end if the syntax isn't correct.

let x = 4
let y = ( 2 * x
print(y)

This snippet is obviously wrong, because there's a missing ), but Swift instead complains about a missing , and a circular reference because we're using y before we declare it. It does that because it doesn't know that the statement was supposed to end. Now, to be fair, I found only 1 comment complaining about that, so it might not be a big deal. I haven't written enough Swift code to judge.

I like this approach a lot. It seems intuitive yet simple to understand and debug. The error messages might take a bit of a hit compared to languages with explicit semicolons, but you also get no "missing semicolon" errors so that's a bit of a trade-off.


JavaScript

JavaScript seems to be the language that has given Automatic Semicolon Insertion a bad reputation. Its rules are pretty complex, but luckily there's an excellent MDN article about it.

There are three important cases where a semicolon is inserted:

1. If a token is encountered that is not allowed by the grammar and that either
   a. is separated from the previous token by at least one newline, or
   b. is }.
2. If the end of the input is reached and that is not allowed by the grammar.
3. If a newline is encountered in certain expressions such as after return, break or continue.

Note that this is not everything! There are many exceptions to these rules, such as that no semicolons are inserted in the for statement's head and that no semicolon is inserted in places where it would create an empty statement.

All in all, this means that our example is parsed as one line:

const y = 2 * x
        - 3

const y = 2 * x
        - 3;

The complexity of these rules is kind of a problem in itself as these rules are hard to remember. The worst part of this feature is that the first rule only triggers on invalid syntax. The MDN article is full of examples where this goes wrong, such as these snippets, which are both parsed as a single line:

const a = 1
(1).toString()

const b = 1
[1, 2, 3].forEach(console.log)

If you want to code without semicolons in JS, you have to think about whether consecutive lines would be valid syntax if they were joined. Or you have to learn a whole lot of rules such as:

Never put the operand of return, break, etc. on a separate line.
If a line starts with one of (, [, `, +, -, /, prefix it with a semicolon, or end the previous line with a semicolon.
And more!
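The second rule of thumb is mechanical enough to turn into a crude lint. The sketch below, written in Python, flags lines that start with one of the listed characters when the previous non-empty line does not end with a semicolon. It is a heuristic on raw text, not a JavaScript parser, so it can over-report (for example right after an opening brace); the names HAZARDOUS_STARTS and find_asi_hazards are invented for the example.

```python
HAZARDOUS_STARTS = ("(", "[", "`", "+", "-", "/")


def find_asi_hazards(source: str) -> list[int]:
    """Return 1-based numbers of lines that may be joined to the line before."""
    hazards = []
    previous = ""
    for number, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and // comments, which also start with "/".
        if not stripped or stripped.startswith("//"):
            continue
        if previous and not previous.endswith(";") and stripped.startswith(HAZARDOUS_STARTS):
            hazards.append(number)
        previous = stripped
    return hazards


print(find_asi_hazards("const a = 1\n(1).toString()"))                   # [2]
print(find_asi_hazards("const b = 1;\n[1, 2, 3].forEach(console.log)"))  # []
```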

No wonder that many people just opt to write the semicolons in JS. Take for instance this quote from JavaScript: The Good Parts:

JavaScript has a mechanism that tries to correct faulty programs by automatically inserting semicolons. Do not depend on this. It can mask more serious errors.

In conclusion, you could write JS without semicolons, but the fact that many people recommend you always add semicolons is quite damning. I haven't seen that sentiment with the other languages in this post and it means that the feature does more harm than good. This feature is too complex and doesn't even manage to be robust. Quite honestly, this feature is a disaster.


Gleam

Gleam's approach is very similar to Swift's: it also just parses the expressions until they naturally end. Swift had a few exceptions to this though, so let's investigate what Gleam does.

First, we can look at our recurring example:

let y = 2 * x
      - 3

As we might expect, that's parsed as one expression. However, we can remove one space to change that:

let y = 2 * x
      -3

Kind of like Swift, Gleam seems to parse the -3 as a single token if it is preceded by whitespace and as a binary operator otherwise. I couldn't find a source for this so the details might be off here.

Gleam's approach of parsing everything regardless of whitespace does have some strange consequences. For example, this is accepted and parses as 2 expressions:

pub fn main() {
  1 + 1 1 + 1
}

I would personally require a newline there if I was designing Gleam, but this is technically unambiguous. Gleam's formatter will also put the expressions on separate lines and Gleam will warn you about an unused value, so you'll notice that something's off soon enough.

This is parsed as one expression, i.e. a function call:

pub fn main() {
  foo
  (1 + 1)
}

Now if you've written any Gleam, you might be yelling at your screen: "That isn't ambiguous!" And you'd be right; it can only be a function call, because Gleam uses {} for grouping expressions. So, if we use {} it's not a function call anymore:

pub fn main() {
  foo
  { 1 + 1 }
}

In another stroke of genius ambiguity prevention, Gleam doesn't have list indexing with []. So this is also parsed as two expressions:

pub fn main() {
  foo
  [ 1 + 1 ]
}

It's interesting that Gleam doesn't have the same guardrails that Swift has. It gets away with that by having a very unambiguous grammar. This is very impressive language design. Its rules are also pretty easy to grasp, so it looks like a pretty good implementation to me.


Lua

Speaking of languages that just parse the thing as far as they can, Lua does that too! The book says:

A semicolon may optionally follow any statement. Usually, I use semicolons only to separate two or more statements written in the same line, but this is just a convention. Line breaks play no role in Lua's syntax[.]

This means that it basically works like Gleam! What sets it apart is that it does have indexing with [] and groups expressions with (). Here's an example that requires a semicolon to prevent it being parsed as a single statement:

(function() end)(); 
(function() end)()

There might be even more problematic cases, but I'm not experienced enough with Lua to find them.


R

We've seen before that some languages insert semicolons when reading further would be invalid. R sort of takes the opposite approach: it inserts a semicolon when the grammar allows it. Here's the official explanation from the R Language Definition:

Newlines have a function which is a combination of token separator and expression terminator. If an expression can terminate at the end of the line the parser will assume it does so, otherwise the newline is treated as whitespace.

There's one exception to this rule, which is that the else keyword can appear on a separate line.

That approach is somewhat reminiscent of Python's. However, R allows expressions to continue to the next line if they are incomplete. Our recurring example would parse as two expressions because the grammar allows the expression to end after the x:

y = 2 * x
  - 3

But with a slight modification it parses as one expression:

y = 2 * x -
    3

The result is that you'd almost never have to worry about the next expression being parsed as part of the former. They are only joined explicitly, for example with parentheses or trailing operators. On the downside, I would generally prefer to write the operator at the start of the next line, which we can only do if we wrap the expression in parentheses (just like with Python).
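A toy version of this "end the statement if it can end here" rule fits in a few lines. The sketch below, in Python, approximates "incomplete" as "has an unclosed bracket or ends with an operator"; the real R parser decides this from its grammar, and brackets inside strings are not handled. The names TRAILING_OPERATORS, is_incomplete and split_statements are assumptions for illustration.

```python
TRAILING_OPERATORS = ("+", "-", "*", "/", "=", "<-", ",", "&&", "||")


def is_incomplete(text: str) -> bool:
    """Approximate whether an expression cannot end at this point."""
    opened = sum(text.count(ch) for ch in "([{")
    closed = sum(text.count(ch) for ch in ")]}")
    return opened > closed or text.rstrip().endswith(TRAILING_OPERATORS)


def split_statements(source: str) -> list[str]:
    """Split on newlines, joining lines only while the statement is incomplete."""
    statements = []
    current = ""
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        current = f"{current} {stripped}" if current else stripped
        if not is_incomplete(current):
            statements.append(current)
            current = ""
    if current:
        statements.append(current)
    return statements


print(split_statements("y = 2 * x\n  - 3"))     # ['y = 2 * x', '- 3']
print(split_statements("y = 2 * x -\n    3"))   # ['y = 2 * x - 3']
print(split_statements("y = (2 * x\n  - 3)"))   # ['y = (2 * x - 3)']
```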

It looks like a pretty good approach. I like that the newline has some semantic meaning and it doesn't feel confusing.


Ruby

Another famously semicolonless language is of course Ruby. It has a very similar approach to R, but — as is becoming a bit of a theme — not quite the same. Like R, it splits statements by lines, but allows the expression to continue if it's incomplete. So we can basically copy our examples for R verbatim:


y = 2 * x
  - 3


y = 2 * x -
    3

But Ruby has a few more tricks up its sleeve. First, you can end a line with \ to explicitly continue the expression on the next line, kind of like Python. Second, it has a special rule that lines starting with a ., && or || are a continuation of the line before. It does that to allow method chaining and logical chains.

File.read('test.txt')
    .strip("\n")
    .split("\t")
    .sort

File.empty?('test.txt')
  || File.size('test.txt') < 10
  || File.read('test.txt').strip.empty?

I find this slightly confusing, because it's strange that some operators can start the next statement but not all of them. I guess it's not too bad to remember 3 exceptions. So, it looks pretty good!


Julia

Documentation on how Julia's syntax works was a bit hard to find, so I looked at their parsing code. This means I have to guess a little bit at what the intention is.

Here are some things I tried:

b = 3
  - 4


c = 3 -
  4


d = ( 3
  - 4)

It seems to be dependent on the kind of expression whether a newline continues a statement. But in general, they seem to prefer splitting into multiple lines if that is legal. The newline is really treated as a separator in the parser. In that sense, it matches other languages with a lot of use in the scientific community, such as Python and R.

If anybody knows where to find documentation on Julia's syntax, let me know!


Odin

While I was working on this post, Odin's creator GingerBill released a blog post that contained an explanation of Odin's approach. What I found particularly interesting are the reasons he cites for making semicolons optional:

There were two reasons I made them optional:

To make the grammar consistent, coherent, and simpler

To honestly shut up these kinds of bizarre people

It looks like he didn't care much for that feature himself. What's nice about this post is that he lays out some reasoning for Odin's approach. He describes it as a mix of Python and Go, where semicolon insertion is done by the lexer, but not within (), {} and [].

Another thing he lays out is that Odin has a few exceptions to allow braces to start on the next line:

a_type :: proc()

a_procedure_declaration :: proc() {

}

another_procedure_declaration :: proc()
{

}

another_type :: proc()

{

}

In a way, this looks like the opposite of Go, where instead of enforcing a certain coding style, they go out of their way to allow other coding styles than their own. This rule seems to be a sign that their grammar might be a bit too "overloaded", using very similar syntax for different concepts. But hey, they probably had good reasons to do so.


A Different Idea

Here is an idea I haven't seen being used and I wonder whether it makes sense.

The only language that seems to consider indentation at all is Python, but only to restrict mistakes. I would love to see a language try to implement a rule where only an indented line is considered part of the previous expression.

x = 3
- 3

x = 3
  - 3

This feels quite intuitive to me. I could see this being a replacement for Python's line joining with \. A problem, of course, is that now the indentation always needs to be correct and many developers (myself included) like to just have their formatter deal with the indentation. In any case, it might be an interesting lint to consider for a language with optional semicolons.
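As a rough sketch of what such a rule could look like, the Python function below treats any line that is indented deeper than the first line of the current statement as a continuation. It deliberately ignores indented blocks, which a real language would have to distinguish from continuations, and the name split_by_indentation is made up.

```python
def split_by_indentation(source: str) -> list[str]:
    """Join a line to the previous statement only if it is indented deeper."""
    statements: list[str] = []
    base_indent = 0
    for line in source.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip())
        if statements and indent > base_indent:
            statements[-1] += " " + line.strip()
        else:
            statements.append(line.strip())
            base_indent = indent
    return statements


print(split_by_indentation("x = 3\n- 3"))    # ['x = 3', '- 3']
print(split_by_indentation("x = 3\n  - 3"))  # ['x = 3 - 3']
```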

We made it to the end! I think the best way to summarize this document is by grouping the languages:

Split statements on newlines, with exceptions
- Python
- Ruby
- R
- Julia
- Odin
- Kotlin
Continue statements on the next line, unless that's invalid
Let the lexer insert semicolons
Do not consider whitespace while parsing

Note: These categories are not perfect; for some languages, you could make the argument that they fit in multiple categories.

You could make some other categories as well. For example, you could call Python, Ruby, R, Julia, and Odin conservative in their parsing: they usually stop parsing at a newline. Lua, Gleam and Swift, on the other hand, are more greedy: they usually keep parsing across newlines as far as they can.

Another distinction to make is how it is implemented. JavaScript, Go and Odin have at least some part of the semicolon insertion implemented in the lexer, while many other languages make it part of the parser.

A final interesting category is the languages that are entirely insensitive to whitespace, such as Lua and Gleam. Even though Swift gets close to this category, it turned out to have some whitespace-sensitive rules.

This turned out to be a much more complicated topic than I expected! While there are approaches I like better than others, not all languages should use the same solution, because there might be other ways that the syntax differs that should be taken into account.

Nevertheless, I guess this is the part where I have to give my opinion about all of this, so here are some guidelines I would use (which you may very well disagree with):

Prefer defining clear rules over baking it into your parser.
Keep those rules as simple as possible (looking at you, JS).
Use a parser that splits on newlines in most cases, instead of continuing expressions greedily onto the next line.
Think about the rest of your language's syntax and what problems might arise with your chosen approach.
Add tooling to help catch mistakes (such as warnings on unused values) to prevent the most ambiguous cases.

Do you agree? What do you think is best? Have I missed any important languages? Do you have cool ideas for better implementations? Let me know by responding to this post on Mastodon!

Thanks to Thijs Vromen, waffle and Anne Stijns for proofreading drafts of this post. Any mistakes are my own. You can send corrections to [email protected] or on Mastodon.

No LLMs were used while writing this piece, neither for gathering information nor for writing.