Bun Has Been Converted to Rust. Now What?


On May 14, PR #30412 merged into Bun's main branch: a little over a million lines of Rust, 6,755 commits, generated almost entirely by Claude Code agents over nine days. Anthropic, which acquired Bun in December, supplied the agents. The Zig implementation that powered Bun is gone. Jarred Sumner's own words - "we haven't been typing code ourselves for many months now" - are the part everyone quoted, and the part that turned a routine merge into a 685-point Hacker News thread with the PR itself split almost evenly between thumbs-up and thumbs-down.

The Rust rewrite passed 99.8% of the existing test suite. That number is enormous and significant, but let's be precise about what it actually says: it says that the new implementation behaves like the old one at the runtime's public interface. That's it. It does not say that the new implementation is safe, or better, or even good. Those are different claims.

The benchmarks are neutral-to-faster, and the binary shrank by a few megabytes (from a starting point of around 93 MB on Linux x64, for example). If that were the whole story there would be nothing to add: it would be a "good merge," I suppose, because getting smaller and faster without losing test coverage is a good outcome. But neither size nor speed was the stated reason for the rewrite, which suggests there is more here than people are thinking about, blinded by fascination with Rust - a fascination I share - and by curiosity about whether an LLM can accomplish something like this at all.

The stated reason was safety

Sumner has been consistent that the motivation was not performance. The Zig codebase had cost the team years of debugging memory bugs - use-after-free, double-free, the usual parade of potential errors - and the pitch for Rust was compiler-assisted memory safety. Catch the class of bug at compile time that Zig, like C and C++, leaves to the programmer. That is a coherent and respectable reason to move. In fact, it seems to be the common reason most large systems projects cite when they reach for Rust.

But: the rewrite ships with more than ten thousand unsafe blocks across more than 700 files. For scale: uv, a Rust project of broadly comparable size from the same general corner of the ecosystem, contains 73. That is not a rounding difference. That's two orders of magnitude, and it is the direct consequence of the porting strategy.

The team published a porting guide instructing the agents to translate the Zig faithfully - same architecture, same data structures, file by file. A faithful port of manual memory management does not become memory-safe in transit. It becomes manual memory management wearing a mask made of Rust. Every place the Zig code did something the borrow checker would reject, the translation reached for unsafe, and the borrow checker stands down exactly where it was supposed to do the most good.

Todd Smith commented to me that they didn't even have to fail this way; they could have used guardrails to say "you cannot use unsafe" and added a git pre-commit hook to forbid it for real. Given such restrictions, a good LLM would work with them, adding memory safety as it went, but even this requires review: this is, as Todd said, a good reason on its face not to attempt such ports.
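The guardrail Todd describes corresponds to a lint Rust already ships: `unsafe_code`, which can be raised to `forbid` so that no nested `allow` can turn it back off. What follows is a minimal sketch, not anything from Bun's repository; the module name `ported` and the helper `first_byte` are hypothetical. The crate-wide form, `#![forbid(unsafe_code)]` at the top of the crate root, is the more usual way to write it, and a pre-commit hook would sit on top of that as a second check.

```rust
// Hypothetical module standing in for a freshly ported file.
// `forbid` is stricter than `deny`: an inner `#[allow(unsafe_code)]`
// cannot re-enable unsafe inside this module.
#[forbid(unsafe_code)]
pub mod ported {
    /// Returns the first byte of `data`, or `None` if it is empty.
    pub fn first_byte(data: &[u8]) -> Option<u8> {
        // A faithful translation might have written
        // `unsafe { *data.as_ptr() }` here. Under the lint above that
        // line is a compile error, so the port has to find a checked form.
        data.first().copied()
    }
}
```

With the lint in place, an agent that reaches for `unsafe` gets a build failure rather than a merged file, which is the feedback loop the guardrail depends on.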

So the 99.8% test pass rate and the 10,000+ unsafe blocks are not in tension at all. They are the same fact viewed twice. The test suite passes because the port is faithful. The unsafe count is high because the port is faithful. Faithfulness was the goal, and faithfulness was achieved. What was not achieved - what cannot be achieved by faithful translation - is the safety property that was supposed to justify the whole exercise.

You can have a faithful port or you can have idiomatic safe Rust. The first is what an LLM translating file-by-file produces. The second is what the memory-safety argument promised. They are not the same artifact, and the test suite cannot tell them apart, because behavioral equivalence at the public interface is blind to whether the bytes underneath are sound.
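To make the distinction concrete, here is a small illustrative sketch; both function names are hypothetical, and neither is taken from the Bun codebase. The first function carries pointer arithmetic over the way a file-by-file translation might; the second expresses the same behavior in checked Rust.

```rust
/// Faithful-style port: the pointer walk from the original is kept as-is.
pub fn sum_faithful(data: &[u32]) -> u64 {
    let mut total: u64 = 0;
    let ptr = data.as_ptr();
    let mut i = 0;
    while i < data.len() {
        // SAFETY: `i < data.len()`, so `ptr.add(i)` stays inside the slice.
        // The compiler does not check this claim; a reviewer has to.
        let value = unsafe { *ptr.add(i) };
        total += u64::from(value);
        i += 1;
    }
    total
}

/// Idiomatic port: the same result, with bounds handled by the iterator.
pub fn sum_idiomatic(data: &[u32]) -> u64 {
    data.iter().map(|&x| u64::from(x)).sum()
}
```

Any behavioral test gets the same answer from both. The difference only appears if someone later changes the loop bound in the first version: the compiler will not object, and a test will fail only if it happens to hit the bad input.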

Why this is not a cleanup problem

The natural defense is that this is early. It is canary-only; follow-up PRs are coming; the unsafe count will come down as the team refactors toward idiomatic Rust. Maybe so! But this is where it is worth being honest about what "refactor the unsafe away" actually entails, because it is not a chore - it is an open research problem.

Verifying that unsafe Rust is sound is hard enough that Amazon convened a Rust Foundation–hosted community effort specifically to verify the unsafe code in the standard library - a far smaller, far more scrutinized, human-authored body of code than a million lines of agent-generated runtime. That effort exists because unsafe code re-opens the door to undefined behavior and a single mistake in an unsafe block voids the type-system guarantees of everything around it, a point Todd Smith made to me a long time ago and one well worth heeding when using Rust.
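A small sketch of why the audit surface is larger than the unsafe blocks themselves. The `Prefix` type and its methods are hypothetical, written only to illustrate the point. The single unsafe block is sound only while an invariant holds, and that invariant is maintained entirely by safe code elsewhere in the type.

```rust
/// A view over the first `len` bytes of `data`.
/// Invariant (not checked by the compiler): `len <= data.len()`.
pub struct Prefix {
    data: Vec<u8>,
    len: usize,
}

impl Prefix {
    pub fn new(data: Vec<u8>) -> Self {
        let len = data.len();
        Self { data, len }
    }

    /// Entirely safe code, yet the unsafe block in `last` depends on it.
    /// Replacing the body with `self.len = new_len;` would still compile
    /// and would make `last` undefined behavior for large `new_len`.
    pub fn truncate(&mut self, new_len: usize) {
        self.len = new_len.min(self.len);
    }

    /// Returns the last byte of the prefix, if any.
    pub fn last(&self) -> Option<u8> {
        if self.len == 0 {
            return None;
        }
        // SAFETY: relies on `len <= data.len()`, upheld by `new` and `truncate`.
        Some(unsafe { *self.data.get_unchecked(self.len - 1) })
    }
}
```

Reviewing `last` on its own is not enough; every method that touches `len` is part of the proof. Multiply that by ten thousand blocks and the size of the auditing job becomes clearer.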

The standard library alone has produced over twenty CVEs traceable to unsafe code, despite more than a decade of expert eyes. The academic state of the art for verifying unsafe Rust is semi-automated tools and proof-of-concept verifiers that need human-written specifications. There is no push-button "make this unsafe block sound" pass, and there is no near-term prospect of one.

Todd's recommendation here, based on a lot of research into similar issues: "Don't port memory-unsafe code automatically at all. Produce detailed specifications of the macro-observable surface of your product and then tell the agent to use the existing code as guidance only, for detail filling, while taking primary direction from the specification." Of course, this means you have to have a good, complete specification...

Which means the path from more than 10,000 unsafe blocks to something defensible is not a follow-up PR. It is a multi-year auditing effort against a target that was produced faster than any human can read it. Generation scales; verification does not. That asymmetry is the actual news, and it is not specific to Bun - Bun may just be the largest, most public instance of it so far.

Who audits this?

The question that the loudest HN commenters converged on was not "Rust or Zig" and not "should AI write runtimes." It was narrower and harder to wave off: who actually reviews a million lines an agent shipped in nine days? The honest answer appears to be that nobody read it the way code at this blast radius is normally read, because reading it at the rate it was written is not a thing humans can do. The team's own confidence rests on the test suite - which brings us back to the top, because the test suite was never measuring the whole point of the conversion.

There is a small, ironic coda. The follow-up PR to delete the 600,000 or so lines of leftover Zig was titled, by Sumner, "ai slop." GitHub's automated anti-slop detection - built to flag exactly the kind of AI-generated mass change this was - caught it and auto-closed it. The author named his own cleanup slop, and the platform's tooling agreed. It's also the clearest one-line statement of where the verification layer currently sits relative to the generation layer: the machines that write the code are now well ahead of the machines that are supposed to check it, and the humans are downstream of both.

Edited: This is less ironic than it appeared. Someone on Hacker News pointed out that Sumner added the hook and the comment himself, so this last paragraph isn't accurate; it's the perception I, as the author of this post, had, and thus is a failure of research. My apologies.

What this means for you, specifically

None of this is a prediction that Bun will break. It may run beautifully for years; faithful ports of working code usually work, that being the entire point of fidelity. The 0.2% that does not pass is made of edge cases and platform-specific behavior, and undefined behavior does not announce itself as a failing test - it shows up as a CVE on the one libc nobody runs in CI, eighteen months out. (Or maybe that one wonky musl that a router chose...) The risk profile did not get worse than the Zig version in any obvious way. It just did not get better in the specific way the rewrite was sold as delivering, and a large pile of unsafe is now load-bearing infrastructure under a tool that, by Anthropic's own description, ships inside Claude Code to millions of users.

The practical takeaway is not "AI bad" or "Rust bad." (Rust is actually good. So is AI, as long as it's used responsibly, like any other tool.)

This is a measurement discipline: when someone offers you a test pass rate as evidence of a safety property, check whether the test suite measures that property. Behavioral equivalence and memory soundness are different axes. A green test suite tells you the new thing acts like the old thing. If the old thing was a body of manual memory management and the new thing is a faithful translation of it, then green tells you the translation is good - and tells you nothing whatsoever about whether the thing is safe. The number that would actually answer the question is the one nobody can produce yet, because producing it is, for now, an unsolved problem.

That, and not the merge, is the story.