建造一个完美的1993年游戏《SimTower》克隆版
Reverse Engineering SimTower

原始链接: https://phulin.me/blog/simtower

## Towers.world:用LLM重现失落的游戏 作者怀着怀旧之情,利用大型语言模型(LLM)成功重现了他们最喜欢的童年游戏《SimTower》,现在可在towers.world体验。最初的目标是将原始二进制代码逆向工程成一个干净的重新实现规范。然而,由于游戏的复杂性以及人工智能在细节、上下文保留和将实现从设计中抽象出来方面的挣扎,LLM的纯静态分析被证明不足。 项目转变为动态分析方法。一个模拟器使用LLM(特别是Claude Code)构建,以模拟原始游戏环境。这使得状态匹配成为可能——通过游戏过程中的快照,比较重现代码和原始代码的行为。 通过LLM引导的迭代改进,自主修复差异,重新实现达到了与原始代码的行为对等。作者强调需要一个“闭环”——动态验证——对于LLM处理复杂任务,因为仅靠静态分析的抽象设计是不可靠的。虽然在token使用方面成本较高,但这种方法展示了复活和修改废弃机器代码的潜力,为可持续的软件复用打开了可能性。

帕特里克·胡林(Patrick Hulin)精细地逆向工程了1993年的经典游戏《摩天大楼》(*SimTower*),从而创建了[https://towers.world](https://towers.world),一个非常准确的重现版本。他花费数周时间剖析原始可执行文件,记录了模拟的内部运作——从人口动态到电梯逻辑和星级评定系统——并将他详细的规范发布在GitHub上 ([https://github.com/phulin/tower-together/tree/main/specs](https://github.com/phulin/tower-together/tree/main/specs))。 这个项目的独特之处在于其协作多人游戏特性。玩家可以同时实时地建造和管理同一座大楼,只要有人连接,模拟就会持续存在。胡林还通过诸如Shift+点击网格建造等功能改进了用户体验。整个项目使用Cloudflare Durable Objects构建,是开源的,并且在GitHub上可用 ([https://github.com/phulin/tower-together](https://github.com/phulin/tower-together))。
相关文章

原文

I woke up the other day and asked: could an LLM reverse engineer a modern clone of my favorite childhood video game? So I did it. towers.world is live today and allows collaborative, coop play on a perfect clone of the original game.

The Clone

The clone

The Original

The original

Thanks to this retrospective for the image.

At its core, the game is an elevator simulator. It starts getting hard when your tower grows and your sims contend with each other on getting where they want to go. For example, offices grow your tower the fastest, but they come with 6 sims on a synchronized 9-5 schedule that heavily loads your elevators. As a result, you have to manage very tightly what floors your elevators go to and when.

Unfortunately, the game is now basically impossible to play without an emulator. You could reimplement it if the engine were fully described somewhere—but it isn’t. My original idea was to have the LLMs reverse engineer the binary into a complete spec, and then use that spec to reimplement the game. I was inspired by Simon Willison’s posts on translating programs from one language to another—after all, assembly language is a programming language like any other, with well-defined semantics. Then I could add features like collaboration and better UI (primarily a build-a-grid-of-rooms feature).

So I did it. I started with the general binary reverse-engineering framework I’ve built, reaper. It’s a simple harness around a coding agent for static analysis: what can you learn about the binary from reading its disassembled code?

The Static Analysis Era

reaper uses a set of prompt templates and a skill to enable the AI of your choice to connect to Ghidra. Over multiple prompting sessions, the AI builds up understanding of the binary, beginning with low-level, easy-to-identify functions and data structures that compose into the higher-level logic. This process has worked for me on smaller programs, and I was hoping to use the output as the foundation of the clean-room spec.

That was my theory, at least. I had dramatically underestimated the complexity of a full specification. Each room’s sims have 10-15 possible states and a complex transition function requiring precise tracking of information. And due to the optimizations necessary at the time, the game aggressively caches and stores any computation it can, often in packed bitfields or other obscure structures. The results? I’d say they approached a functioning simulation. But the simulation never got to a point that was playable, and certainly never got to behavioral parity.

Static analysis did teach me three specific failures of the current AIs.

First, the AI makes premature conclusions about subsystems, records them, and then struggles to figure out when to abandon its earlier guesses. Worse, it struggles with what level of specificity to use. It would frequently choose bizarre abstract terms like “runtime entity” to describe a sim or “queued-car continuation” to describe a sim getting in line at the elevator. In the first pass, it found the multi-floor lobby code and called it the “lower-atrium band,” a term I haven’t yet been able to fully excise from the Ghidra database and notes (because it’s everywhere).

It also gets lazy about recording its conclusions, often neglecting to name parameters and local variables as instructed. My prompts are very clear that it should analyze function-by-function, and make sure every local variable and every parameter is named, but it just doesn’t do it. The models have been getting better at this every iteration, but they aren’t there yet.

When constructing the spec, it had trouble splitting apart the conceptual design of the simulation from the specific implementation details of the binary. I was looking for a clean-room design spec, not a description of data structure layout—but if I asked it not to include binary details, it would move to a conceptual level that was not detailed enough to reimplement.

Many of these challenges are driven by a lack of tools to manage the context window. Disassembly and decompiler output are just too verbose for an LLM starting from scratch. While I do think there are ways to handle this by making the exploration more LLM-native, using a single Ghidra database is a difficult environment for the LLM to work in. Over and over, I watched the LLM compact and compact, forgetting the details of what it just explored. Unlike text-processing tasks, to reverse engineer you need to have all the precise details in working memory. Compacted summaries don’t cut it.

Still, I persevered. I spent about a week guiding the LLM to refine the spec, discard its incorrect guesses and build tests for the rewritten simulation. I repeatedly asked it to review and improve the spec, pointed to specific deficiencies that needed to be filled in. But I never got there.

The Dynamic Analysis Era

So what next? A friend pointed me to this February post about SimCity, which described engineer Christopher Ehrlich’s efforts to compare a function-for-function SimCity port to the original via input/output comparisons. Ehrlich’s original tweets in turn cited a user named banteg’s post on the game Crimsonland.

I didn’t want to do a function-by-function port. First, APIs may be copyrightable - and copying a binary that closely might implicate copyright more than an approach closer to clean-room design. But it was clear that I needed some level of feedback from the ground-truth binary in order to provide a hill for the LLM to climb on the reimplementation.

An immediate problem: the whole justification for this project is that we can’t run SimTower on modern hardware. Unlike banteg’s project, which used WinDbg, we need some kind of emulator. Off-the-shelf whole-system emulators are not the best option, as you often end up sifting for the program behavior you’re interested in through millions of instructions of operating system cruft. A quick search reminded me of the existence of unicorn, an emulation framework that uses the JIT code-generation logic from QEMU to allow arbitrary emulation pipelines.

I had Claude Code build a Unicorn emulator with a custom mock for each of the 195 Windows 3.1 API functions called by the game. My prompt to Claude was barely more detailed than this sentence. The work took all of about a half an hour and was 99% correct, except for one nasty bug in the loader Claude built that did cause some downstream issues (of course, Claude found the bug later and fixed it). I did almost no work to make this happen. Honestly, I found the resulting success (which took basically zero guidance) to be yet another shocking LLM progress moment.

I then had Claude specify a bunch of small tower configurations, call functions inside the emulated binary to build the rooms, and record snapshots of tower state every few game ticks to enable comparison with the reimplementation. Once I had the state traces, I could direct the AI to autonomously fix any issues that were causing divergence from the emulated binary behavior.

Of course, this meant I needed to make sure every detail matched precisely: most irritatingly, the RNG and the ordering of processing sims each step had to match. RNG parity means that every function that uses the RNG needs to run in the same sequence, or e.g. the wrong sim will be dispatched and lead to a cascading chain of mismatched state. The original binary allocates sims using something like a slab allocator, and some rooms “claim” space in then table and then release it. To get RNG parity, you need to replicate this pattern somewhere so that RNG calls triggered by sim iteration happen in the same order.

From there, it was just repeated improvements to the trace test coverage and letting Claude Code run for hours and hours (my longest run without any prompting was about 8 hours, and it autonomously fixed 5 parity bugs and committed its changes during that time).

As I worked further the game converged on a closer and closer reproduction of the original binary’s structure, violating my original not-function-by-function constraint. But I think that’s OK. Once we have a perfect reproduction, with comments and named functions and variables in a high-level language, we can use the AI to translate again if we need to.

The cool thing is that emulation and state-matching provide an excellent verification target. The LLM can hill-climb piece by piece as more of the trace matches, and it knows enough about code to autonomously debug and solve issues. The big lesson—one I keep relearning—is that the LLMs need to be in a closed loop when doing complex tasks. They are just not capable (yet) of executing an abstract task like clean-room design from static analysis reliably. Repeating “make it better” does not solve the problem: you need the dynamic-analysis verification on the backend.

There are also a number of specific harness patterns I found helpful. The best way I’ve found to get current LLMs to operate autonomously is to have a high-level coordination thread that runs for a long time and low-level subagents for specific tasks.

This project used an absolutely ridiculous number of tokens. I had to upgrade to the Claude Code $200/month plan and carefully avoid usage in the 8 AM-2 PM 2x peak window. While autonomous hill-climbing is a great pattern for LLMs, it is not something you can do cheaply. They want to experiment and fill up their context window with information they might need to know from your project. You can very carefully engineer your way around some of this, but lots of unavoidable. There are some interesting approaches for disassembly in the academic literature, e.g. custom formats to reduce the verbosity, but none address the fundamental problem.

In conclusion: I did this, and you can do it too! You can translate code from machine code to a modern language. There are terabytes of abandoned machine code out there. For the first time, we have a tool that lets us economically modify and reuse that software in a sustainable way. That’s an incredible advance.

towers.world is available now. Play it!

Postscript

I did all this work while a participant at the Recurse Center, a retreat for programmers in NYC and one of my favorite places in the world. If you’re wondering whether to apply, you should!

联系我们 contact @ memedata.com