Grace Hopper's Revenge

Original link: https://www.thefuriousopposites.com/p/grace-hoppers-revenge

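## The future of code: simplicity and verification

Kernighan's Law (debugging is twice as hard as writing the code in the first place) underscores the importance of simple code. With the rise of LLMs, however, the point extends beyond complexity into *language design*. LLMs struggle with popular languages such as Python and JavaScript but excel at languages such as Elixir, Kotlin, and C#. This is not about the volume of training data but about *structure*. Languages with clear, explicit rules (functional paradigms, immutability, pattern matching) make it easier for LLMs to understand and generate code. These languages prioritize making program logic visible, which helps human verification. The bottleneck is not code *creation* (which machines are now better at) but *verification*: making sure the code does what it is meant to do. The article compares this to Tesla's bet on vision systems, which works because roads were built by and for humans. Likewise, software should be optimized for human understanding: clear specifications, audit logs, and testable properties. In the future, LLMs will handle code generation and debugging while humans focus on defining requirements and verifying outcomes. This shift calls for languages optimized for machine *generation* and human *verification*, pushing us toward simpler, more structured designs, which is the vision Grace Hopper anticipated decades ago.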

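## Hacker News discussion summary: Grace Hopper's Revenge

A Hacker News discussion centers on a recent article arguing that functional programming languages do well at LLM code generation not because they are inherently superior, but because their constraints shrink the solution space the AI has to search. The article argues that LLMs now excel at writing code, which shifts the human focus from creating code to ensuring it is correct. Commenters debate the article's claims and question its data and methodology. Many point out that the strong conventions and ecosystem constraints of languages such as Elixir, C#, and Kotlin seem more influential than a purely functional paradigm. Some suggest that data quality (the large amount of poor code written in languages such as Python and JavaScript) may skew the results. A key point is that immutability and strong conventions matter more than whether a language is functional or object-oriented. Others stress that AI could make existing problems worse by producing code that humans do not fully understand and find hard to maintain. There is also discussion of whether LLMs really improve code *comprehensibility* or simply hand the reasoning over to the AI. Ultimately, the discussion highlights a shift in how software development is approached in the AI era, from pure code creation toward verification, and the importance of languages that promote reliable, auditable code.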

Original text

The world of software has lots of rules and laws. One of the most hilarious is Kernighan’s Law:

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

I’ve always understood Kernighan’s Law to be about complexity—about keeping the code you write as simple as possible to reason about.

Now, with LLMs, I’m learning it has a lot to do with language design too.

I’m still seeing a decent number of people on Twitter complain occasionally that they’ve tried AI-driven coding workflows and the output is crap and they can move faster by themselves. There are fewer of these people in the world of Opus 4.5 and Gemini 3 now, but they’re still there. Every time I see this I want to know what they’re working on and what languages and libraries they’re using.

The big benchmarks for software engineers right now are SWEBench for coding and TerminalBench for computer tasks. Benchmarks are supposed to represent all coding tasks, so it’s critical to note here that SWEBench is focused on Python. TerminalBench involves more varied computer tasks, but when the agents need to write code, they write Python.

These are effectively Python benchmarks.

So what about other languages? Fortunately there’s AutoCodeBench, which doesn’t just test different models— it tests across 20 different programming languages. How’s that look? It looks like this:

Figure: AutoCodeBench language benchmarks across models. Elixir, Kotlin, Racket, and C# at the top. PHP, JavaScript, Python, and Perl at the bottom.

Now, what we’ve been told about models is that they’re only as good as their training data. And so languages with gargantuan amounts of training data ought to fare best, right?

Turns out that models kind of universally suck at Python and JavaScript.

The top-performing languages (independent of model) are C#, Racket, Kotlin, and, standing at #1, Elixir.

I’ve been using Elixir as my primary language for a few years now, so I have obvious bias here. But this points to something critical about language design and where the nature of programming computers is going in the future. The amount of training data doesn’t matter as much as we thought. Functional paradigms transfer well. Structure beats volume. JavaScript has the training data but fights the architecture. Elixir has less data but flows with it.

So let’s talk about Tesla.

Tesla bet on vision when everyone else was bolting LIDAR to their roofs. It seems naive—human eyes are cheap sensors fooled by glare and rain and darkness. But Tesla bet on it because roads aren’t chaotic in random ways. They’re chaotic in specifically human ways, because humans built them. Color-coded lights. Painted lines. Signs shaped like their meanings. A century of visual grammar, internalized by every driver, encoded in every intersection.

Tesla and Figure are making the same bet with robots. Humanoid form, human-shaped hands, human-scale movement. Not because it’s elegant—it’s actually harder to engineer than wheels and grippers. But humans built a world for humans. Doors are human-width. Stairs are human-height. Tools have handles shaped for palms. Build a robot that moves like we move, and a thousand years of infrastructure comes free.

This isn’t about optimizing for humans. It’s about infrastructure: optimizing for the load-bearing interfaces. For cars and robots, that’s vision and hands—because we built our physical world for eyes and fingers.

For software, the load-bearing interface isn’t actually code. Code is implementation. The load-bearing interface is ideas written in English: requirements docs, bug reports, interface specs, audit logs. Humans specify intent and verify outcomes in language. The code is just what happens in between.

Abelson and Sussman famously said “Programs must be written for people to read, and only incidentally for machines to execute.” But we’ve spent fifty years optimizing programming languages for human writing. We built objects with identity and state because that’s how we experience reality—babies develop object permanence at eight months. It felt natural. But the bottleneck was never creation. It was always verification.

When Grace Hopper originally imagined and wrote the first compiler, she envisioned the translation layer moving directly from English to machine code. 75 years later, we’re finally able to work with her original vision.

“Programs must be written for people to verify, and only incidentally for machines to execute.”

That’s a statement about accountability. Humans own the specification. Humans own the verification. Everything in between is implementation.

We humans are not very good at writing code. The machines are better and they’re the worst they’ll ever be. How good? So good that even Anthropic — which I feel quite confident in saying has some of the best coders anywhere — says that Opus 4.5 now beats all of their incoming hires on their coding tests.

The machines are better. The gulf is going to grow. But let’s consider why.

Humans remember in episodes and narratives. Not data points but scenes — sequences with befores and afters. We evolved to track an animal as it moved behind a rock, to hold in mind that the berry bush was there even when we couldn’t see it, and to construct little movies of cause and effect. No wonder we love if-then statements. No wonder we built programming languages that fake the real world and read like plots: first do this, then do that, now check if it worked. We wrote code that matched the movies in our heads.

LLMs have no movies.

Complexity exists in the codebase, yes, but the programmer also needs to reason about the runtime state of the program. What does `this` refer to in this part of the function, and how is `var foo` bound inside it?
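As a loose Python analogue of that binding question (the essay's own example is JavaScript, and the names `make_callbacks_late` and `make_callbacks_bound` are hypothetical), here is a minimal sketch. The two functions look almost identical on the page, yet what each callback returns depends on when the loop variable is looked up, which is runtime state the reader has to reconstruct.

```python
def make_callbacks_late():
    """Each lambda looks up `i` when it is called, after the loop has finished."""
    return [lambda: i for i in range(3)]


def make_callbacks_bound():
    """Each lambda captures the current value of `i` as a default argument."""
    return [lambda i=i: i for i in range(3)]


late = [callback() for callback in make_callbacks_late()]
bound = [callback() for callback in make_callbacks_bound()]

print(late)   # [2, 2, 2]
print(bound)  # [0, 1, 2]
```

Nothing in the first function's source text says "2"; the answer only appears once you replay the loop in your head.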

Should this be where the complexity lives? Or should it be bound up in English, in the messy, inherently narrative world of product requirements and design docs and bug reports that surround the code?

Grace Hopper would have an answer.

Now consider the differentiated skills of an LLM. They’re extraordinary pattern matchers. They find idioms across huge corpora of text, handle declared structure well, reason locally within constrained context.

And they are bad at physical space and time. This is why they needed such an insane amount of data and fine-tuning to get hands right in image and video. They are bad at narrative consistency. They are bad at maintaining large amounts of state.

What does JavaScript offer? Three hours into debugging a React component and we’re five layers deep in a stack trace that tells us nothing. The error is in `useEffect`. Which `useEffect`? The one that fires on mount, or the one that fires on update, or the one that fires whenever state changes except when it doesn’t because the dependency array is lying? We have to reconstruct the entire lifecycle in our heads (what ran, in what order, which promise landed, and with what bound to `this`) just to understand why a button isn’t toggling. We end up performing archaeology on our own work from Tuesday.

On the other hand, in Elixir functions are pure. You have an input, and you get an output. A function takes this shape and returns that shape. That’s it. That’s the whole story. There’s nowhere for the complexity to hide. There is no mutating state. All data is immutable. Pattern matching means that the “shape” of the data is always defined explicitly in the parameters. Combine this with multiple function heads, which feels weird at first to a human, and you get very explicit local context: this function does this thing and the data always looks the same.

To a human, object oriented programming feels natural and functional programming feels weird. Functions need translation. There’s no mug that moves; there’s a function that takes “mug at location A” and returns “mug at location B.” Deeply unintuitive to beings who evolved tracking objects through space.
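The mug example can be sketched in a few lines of Python. This is only an illustration: the `Mug` class and `move` function are hypothetical names, and a frozen dataclass stands in for the immutable data that Elixir provides by default.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Mug:
    location: str


def move(mug: Mug, destination: str) -> Mug:
    """Return a new Mug at `destination`; the input mug is left untouched."""
    return replace(mug, location=destination)


mug_at_a = Mug(location="A")
mug_at_b = move(mug_at_a, "B")

print(mug_at_a)  # Mug(location='A')
print(mug_at_b)  # Mug(location='B')
```

There is no single mug whose history has to be tracked. There are two values, and the function is fully described by what goes in and what comes out.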

Programs built from objects and state are easier to write. Programs built from functions and immutable data are easier to verify. This is the difference between easy and simple. When we keep adding “easy” things, we make systems overcomplex.

Code is best written with simple tools, simple primitives, and repeatable structures. It’s not just the Elixir language design that’s remarkable, it’s the entire ecosystem. In Elixir, there is one build system. One format option. There’s one library for Enums and Collections. Naming is predictable. Simplicity is a clear value and it all works. An LLM trained on Elixir sees the same patterns repeatedly. An LLM trained on JavaScript sees a thousand variations.

Elixir maximizes the amount of program meaning visible in local context. And LLMs are context machines.

Which brings us back to Kernighan’s Law. When we’re encouraged to write complicated state machines, we struggle to debug them because we so easily reach the edges of our cognition.

And now we have these LLMs and everyone is worried that machines will write code that humans can’t read. I don’t worry about that. But I do think about language design. LLMs can write passable Python and JS, but they write brilliant Elixir and Racket. The amount of training data doesn’t matter.

What seems to matter is locality: can the LLM see everything it needs without reconstructing state from elsewhere? Pattern matching makes data shapes explicit. Immutability means no hidden mutations. Pipes and composition mean predictable flow. One way to do things means patterns repeat.
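A small sketch of "pipes and composition mean predictable flow", written in Python under stated assumptions: `pipe` is a hypothetical helper that imitates Elixir's `|>` operator, and the three step functions are invented for the example. Each step takes a list and returns a new one, so the data flow reads top to bottom and the input is never modified.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass `value` through each function in `steps`, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


def strip_blank(lines):
    """Drop lines that are empty or only whitespace."""
    return [line for line in lines if line.strip()]


def normalize(lines):
    """Trim surrounding whitespace and lowercase every line."""
    return [line.strip().lower() for line in lines]


def dedupe(lines):
    """Remove repeats; dict.fromkeys keeps first-seen order."""
    return list(dict.fromkeys(lines))


raw = ["  Elixir ", "", "racket", "ELIXIR", "Kotlin  "]
cleaned = pipe(raw, strip_blank, normalize, dedupe)

print(cleaned)  # ['elixir', 'racket', 'kotlin']
```

Because no step mutates its argument, `raw` is the same after the call as before it, and any single step can be checked in isolation.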

These are the exact same features that help humans audit code. They’re what make code reviewable, debuggable, provable.

Functional languages with explicit semantics are optimized for machine generation and human verification. This is the sweet spot. And where we’re evolving.

In the past, humans have written, read, and debugged the code.

Now LLMs write code, humans read and debug. (And LLMs write voluminous mediocre code in verbose languages.)

Humans will do less and less. LLMs will write code, debug, and manage edge cases. LLMs will verify against human specifications, human audits, human requirements. And humans will only intervene when things are misaligned. Which they can see because they have easy verification mechanisms.

To do that well, we need clear contracts. Explicit effects. Testable properties. Auditable logic. Composable pieces.
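One way to read "clear contracts" and "testable properties" is sketched below in Python. Everything here is an assumption made for illustration: the `transfer` function, its contract, and the hand-rolled random checker (a stand-in for a property-testing library). The point is that a human states the properties once, and any implementation, machine-written or not, can be checked against them.

```python
import random


def transfer(balances, src, dst, amount):
    """Return new balances with `amount` moved from `src` to `dst`.

    Contract: `amount` must be non-negative and covered by the source balance;
    the input mapping is never modified.
    """
    if amount < 0 or balances[src] < amount:
        raise ValueError("invalid transfer")
    updated = dict(balances)
    updated[src] -= amount
    updated[dst] += amount
    return updated


def check_transfer_properties(trials=200, seed=0):
    """Check the stated properties on randomly generated inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        balances = {"a": rng.randint(0, 100), "b": rng.randint(0, 100)}
        before = dict(balances)
        amount = rng.randint(0, 120)
        try:
            after = transfer(balances, "a", "b", amount)
        except ValueError:
            after = None
        assert balances == before  # input is never mutated
        if after is None:
            assert amount > before["a"]  # only uncovered transfers are rejected
            continue
        assert sum(after.values()) == sum(before.values())  # money is conserved
        assert min(after.values()) >= 0  # no overdraft
    return True


print(check_transfer_properties())  # True
```

The reviewer does not need to trace the body of `transfer` line by line; the four assertions are the specification that has to hold.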

These are functional programming virtues. And what makes formal verification possible. And what LLMs handle best.

I’m running Claude Code daily now, churning through millions of tokens and outputting more code than I could even type in a day. It’s all Elixir, all the time, with rigorous planning documents to understand the complexity. The structure is simple and the interfaces are clear right down to the function level. If I were writing React, I’d be worried about the libraries, the component structures, how it interacted with the chosen build tool... I’d be terrified of the spaghetti soup I’d be living in.

I couldn’t go back to writing my own code at this point. Not just because it would be so much slower, but because this is so clearly better. Readable. Verifiable. It would be harder for me to write, but it’s easy to grok.

LLMs have arrived and shown us which languages are actually well-designed. The AutoCodeBench tests are a message: the “hard” languages were never hard. They were just waiting for a mind that didn’t need movies.

The future of software engineering is still dependent on humans, but we’re not writing the code anymore. Leave that to the machines, just give them good tools for the work.

The nerds who insisted on Haskell and Erlang weren’t wrong about language design. They were just early. McCarthy was the earliest: Lisp in 1958.

NVIDIA named the AI chip architecture powering all of our progress after Grace Hopper. The machines named after her finally let us see what she saw.
