我用 Rust 构建的 PHP 引擎通过了 17% 的 PHP 源码测试，并已成功渲染 WordPress。

我用 Rust 构建的 PHP 引擎通过了 17% 的 PHP 源码测试，并已成功渲染 WordPress。
My AI-built PHP engine in Rust passes 17% of PHP-src tests, renders WordPress

原始链接: https://ekinertac.com/blog/i-dont-know-rust-my-ai-is-rewriting-php-in-it/

作者在没有 Rust 或编译器设计相关背景的情况下，通过完全由 AI 代劳的方式，成功构建了一个名为“Phargo”的 PHP 引擎。作者并未试图去理解或审计代码，而是扮演项目经理的角色，指挥 AI 根据客观的第三方基准（即包含 22,000 个文件的官方 PHP 测试套件）来修复识别出的故障。该实验依赖于“彻底的诚实”原则，即拒绝让 AI 给自己的工作评分。通过运行已有数十年历史且不可篡改的上游测试套件，该项目迫使 AI 必须处理 PHP 语言中那些“令人抓狂”的边缘情况。事实证明，这种方法比人工代码审查更有效，它揭露了许多本会被忽视的“波特金功能”（即语法解析通过但实际上并未实现任何功能的代码）。尽管作者缺乏技术专长，但 Phargo 目前已能成功运行 WordPress。这证明了只要拥有严谨且不可腐蚀的测试工具，通过 AI 驱动的迭代，完全可以开发出复杂的软件。作者认为，在现代软件开发中，开发者的角色已经从编写代码转变为“瞄准”——即确定目标，并根据毫不妥协的客观标准来验证输出结果。

开发者 ekinertac 分享了一个由 Rust 构建的 PHP 引擎实验，该项目主要由 AI 开发。目前，该项目已通过了约 17% 的原始 `php-src` 测试套件，并能成功运行一个全新的 WordPress 站点。作者并非专业的 Rust 或 PHP 内部机制开发者，其实验目的是测试 AI 是否能通过针对现有的测试套件有效地复刻一个项目。虽然目前的实现速度远慢于标准 PHP，但作者正在开发一个字节码虚拟机以提升性能。作者指出，17% 的通过率客观反映了项目的范围，因为其余的许多测试涉及复杂的 C 扩展，这超出了该项目的目标。 Hacker News 社区对此反应不一：一些用户认为在项目完成度更高之前其价值有限；另一些用户则认为这种依赖自动化反馈循环而非人工代码审查的“直觉编程”（vibe-coding）实验，是对 AI 驱动开发能力的一次令人印象深刻的展示。

原文

A few nights ago I watched my terminal print out a 26 KB WordPress front page — <title>Phargo Test Site</title>, the block-library CSS, “Hello world!” pulled from a SQLite database, and a clean </html> at the bottom. Completely unremarkable output, except for one detail: the PHP engine that served it contains zero lines of PHP’s actual source code. It’s a from-scratch interpreter written in Rust.

Here’s the part I need you to sit with: I don’t know Rust. I have never written a lexer. I could not explain to you what a “tree-walking evaluator” is without reading the Wikipedia article in another tab. If you cornered me at a party and asked how PHP’s garbage collector works, I would fake a phone call.

The engine is called Phargo, and my contribution to it is, roughly, aiming. An AI writes the code. I point it at a target, read what comes back like a medieval king reviewing naval charts — solemn nod, zero comprehension — and type the most powerful phrase in modern software development:

“looks good, continue.”

The Experiment: Radical Honesty as a Build System

Everyone and their houseplant has an AI-built project right now, and every single one comes with the same unfalsifiable claim: “it works!” Works according to whom? The AI that wrote it? The demo that was recorded on the fourth take?

So the whole experiment rests on one idea, borrowed from watching the Bun team drive their JavaScript runtime against real-world test suites: don’t let the AI grade its own homework.

PHP ships with its own test suite — about 22,000 .phpt files written by the PHP internals team over three decades. Tests I didn’t write. Tests the AI didn’t write. Tests that encode every cursed corner of the language, from DateTime daylight-saving math to what exactly var_dump() prints for a float. That suite is the oracle. The scoreboard runs all of it, and the pass rate is auto-generated into the repo after every run.

The number cannot be flattered, negotiated with, or prompted into a better mood. Either bug40261.phpt passes or it doesn’t.

Current score: 3,844 of 22,037 — 17.4% of the entire upstream PHP test suite. And before you snort-laugh at 17%: the realistic ceiling is around 40–45%, because the rest of the suite tests C extensions (GD, curl, SOAP, intl, MySQL drivers…) that are explicitly out of scope. Within the actual playing field, the climb is very real — it started at zero.

My loop as the human is almost embarrassingly thin:

The AI runs a failure histogram over the whole corpus to find the biggest cluster of failing tests it can actually fix
It implements the thing
It runs the ~22,000-test scoreboard (about 7 minutes of fan noise)
If the number went up: commit, push, repeat
If the number went down: I get to say my other line, “hmm, that regressed, look again”

That’s it. That’s the job. I’ve achieved peak delegation and I’m not even sorry.

The Oracle Cannot Be Bribed. The Harness, However, Lied to My Face.

Early on, the pass rate plateaued in a way that felt wrong. Whole categories of tests — obviously simple ones — kept failing with diffs that looked identical to the expected output. I stared at those diffs like a man staring at two identical photos in a spot-the-difference puzzle, finding nothing.

The difference was invisible because it was literally invisible: carriage returns. The test corpus had been checked out on Windows with CRLF line endings, and our scoreboard compared output byte-for-byte. PHP’s own test runner normalizes line endings before comparing. Ours didn’t. Which means the harness was silently failing essentially every multi-line test in the corpus on line endings alone, and had been for weeks.

One line of normalization code. Hundreds of tests flipped to green instantly.

The lesson tattooed itself onto the project: measure your measurement. Your oracle is only as honest as the plumbing that connects you to it. We now normalize exactly the way run-tests.php does, and every suspicious plateau since has triggered the same question first — is the engine wrong, or is the scoreboard lying?

PHP’s Test Suite Is a Minefield With a Readme

Here’s something nobody tells you about running somebody else’s 22,000-file test corpus: some of those files are bombs. Not malicious ones — accidental ones. Regression tests for ancient memory bugs that allocate absurd structures, generator tests that expand into infinity, tests that were only ever meant to run inside PHP’s own carefully-fenced CI.

I found this out the way all great discoveries are made: my development machine hard-restarted. Not “the program crashed.” Not “the terminal froze.” The entire computer went black and rebooted, because a generator test convinced our engine to eat every byte of RAM in the house like a rocket-powered shopping cart with no brakes.

The aftermath turned the engine paranoid, and honestly it wears paranoia well:

A capped global allocator — the engine physically cannot allocate more than 6 GiB, no matter how creative the test gets
A step limit, so infinite loops die with an error instead of a space heater
Caps on string sizes, array nodes, output length, generator expansion
A breadcrumb file the scoreboard updates with the current test name, so when something hangs, we know exactly which file to glare at

None of this is glamorous language-implementation work. All of it is the difference between “research project” and “thing that can safely chew through 22,000 hostile files unattended while I make coffee.”

The Features That Were Quietly Lying

My favorite genre of bug — and the corpus finds them relentlessly — is the feature that exists, parses, runs without error, and does absolutely nothing. The Potemkin builtin. Over the months the suite has exposed, among others:

clone — parsed fine, evaluated to NULL. Engine-wide. Every DateTimeImmutable in every test had been silently broken forever, because immutable date math is made of clone
unset($arr[$key]) — a total no-op. The key simply… stayed
trim($str, $charlist) — ignored the charlist argument since the beginning of time and trimmed whitespace anyway
$$variableVariables — didn’t exist
static function variables — didn’t exist
spl_autoload_register() — accepted your autoloader with a warm smile and never called it
catch (\Throwable) — matched nothing, which is a very funny property for a catch-all to have

Each of these would survive a demo. Each would survive a code review by someone (me) who doesn’t read Rust. None of them survived the corpus. That’s the entire thesis of the experiment in one bullet list: I can’t audit the code, so the 22,000 tests audit it for me, with a thoroughness no human reviewer could sustain past lunch.

And Then It Served a WordPress Page

The north star was always WordPress — it’s the final boss of PHP compatibility, a codebase old enough to contain sedimentary layers of every PHP idiom since 2003.

Getting wp-load.php to even bootstrap burned through a chain of blockers that reads like a language-lawyer’s fever dream: goto (yes, WordPress’s HTML parser uses goto), str_replace’s by-reference $count parameter, \xNN escapes inside regex character classes, function_exists() being blind to half our builtins. The installer then corrupted its own database because preg_split was ignoring the PREG_SPLIT_DELIM_CAPTURE flag inside wpdb::prepare — a bug four layers below anything I could have diagnosed, found and fixed by the AI while I supervised with the confidence of a man watching open-heart surgery through frosted glass.

And then one evening: wp_install() completes. Admin user created, options table populated, three posts in SQLite. The front page renders — real theme, real posts, real permalinks.

Full disclosure, because radical honesty is the whole bit:

✅ Fresh install runs, front page renders from the database
✅ /wp-admin/ renders too — the actual dashboard, without any issues, which frankly surprised me more than the front page did
⚠️ The REST API is unexplored territory
⚠️ It’s currently ~55x slower than real PHP on that page (7.1 s vs 126 ms) — though the shiny new bytecode VM already runs micro-benchmarks at 1–3x of PHP 8.5 and is coming for that number next

What This Is Actually About

I went in expecting to learn whether an AI could write a language engine. The surprising answer is that this was never really the question. Of course it can write a lexer — it’s read every lexer ever published. The real question was always: how does a person who can’t verify the code keep the project honest? And the answer turned out to be old-fashioned, boring, and beautiful: tests somebody else wrote, a number that only moves when reality moves, and every result pushed to a public repo whether it flatters us or not.

I still don’t know Rust. The engine is now ~24,000 lines of it. The scoreboard goes up a few dozen tests at a time, the dev log collects war stories, and somewhere out there is a version of this where WordPress runs in your browser on a Rust engine compiled to WASM.

Watch the number climb: github.com/ekinertac/Phargo. Or don’t, and just enjoy knowing that somewhere, a man who cannot write a parser is shipping one.

Oh, and obviously: this post was drafted with an LLM too, then edited by me until it sounded like me. The machine writes, I aim. It’s the whole point.