The time the x86 emulator team found code so bad they fixed it during emulation

Original link: https://devblogs.microsoft.com/oldnewthing/20260615-00/?p=112419

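A colleague shared a story from the days when Windows used binary translation to emulate x86-32 code on other processors. One program needed to clear 64KB of stack memory, and its compiler, in the name of "optimization," abandoned the standard loop structure: instead of a tight loop, it unrolled the operation into 65,536 individual write instructions. The result was 256KB of code just to initialize 64KB of data. The emulator team found this bloat intolerable, so they added a special case to the translator that recognizes the pattern and replaces the huge instruction stream with a correct, efficient loop.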

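This Hacker News thread discusses an article from Microsoft's "Old New Thing" blog about an x86 emulator team that, on encountering extremely bad software code, implemented a fix inside the emulator itself. Commenters clarified the technical details, pointing out that the team applied the fix in the emulation loop so that the software would run properly, rather than "fixing" the code as it ran. Because the developers had no access to the original source code, they had to compensate for the program's inefficiency at the architecture level; for example, an optimizer had replaced a short loop with 64,000 individual instructions. The discussion also drew an interesting parallel with modern software: users noted that compatibility layers such as Wine and Proton now apply similar "hot fixes" for games on Linux, which often gives users running under emulation better performance than users running the software on its native platform.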

During an exchange of war stories, a colleague of mine told one from back in the days when Windows included a processor emulator for x86-32 on systems that natively ran some other processor. (This has happened many times. And no, I don’t know which processor this particular story applied to.)

This particular emulator employed binary translation, generating native code to perform the equivalent operations of the original x86-32 code. This offered a significant performance improvement over emulation via interpreter. You can imagine that x86-32 is just a bytecode, and the emulator is a JIT compiler.
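
To make the bytecode analogy concrete, here is a minimal sketch that contrasts the two approaches. Everything in it is made up for illustration: the two-opcode "guest" instruction set, the `interpret` and `translate` names, and the use of generated Python source as a stand-in for native code. A real binary translator emits machine code for the host processor, and nothing here reflects how the Windows emulator was actually built.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "guest" instruction: (opcode, operand).
Op = Tuple[str, int]

SYMBOLS: Dict[str, str] = {"add": "+", "mul": "*"}


def interpret(program: List[Op], acc: int) -> int:
    """Interpreter: decode and dispatch every instruction on every run."""
    for opcode, operand in program:
        if opcode == "add":
            acc += operand
        elif opcode == "mul":
            acc *= operand
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return acc


def translate(program: List[Op]) -> Callable[[int], int]:
    """Translator: decode once, emit host code, then reuse the host code."""
    lines = ["def translated(acc):"]
    for opcode, operand in program:
        if opcode not in SYMBOLS:
            raise ValueError(f"unknown opcode {opcode!r}")
        # One line of host code per guest instruction.
        lines.append(f"    acc = acc {SYMBOLS[opcode]} {operand}")
    lines.append("    return acc")
    namespace: dict = {}
    exec("\n".join(lines), namespace)  # Python source stands in for native code
    return namespace["translated"]


program = [("add", 2), ("mul", 10), ("add", -1)]
run = translate(program)
print(interpret(program, 1), run(1))
```

Both paths compute the same result. The difference is that the translated function pays the decoding cost once, while the interpreter pays it on every instruction of every run.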

Anyway, my colleague found that there was one program that needed to allocate around 64KB of memory on the stack and initialize it. The standard way of doing this is to perform a stack probe to ensure that 64KB of memory is available, then subtract 65536 from the stack pointer, and then initialize the memory in a small, tight loop.
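
The three steps can be modeled in a few lines. This is only a toy model under stated assumptions: `ToyStack` is a made-up class and not any real emulator or compiler API, the probe is reduced to a simple bounds check, and the fill value of zero is a guess, since the story only says the memory was initialized.

```python
BUFFER_SIZE = 65536  # the 64KB buffer from the story


class ToyStack:
    """Made-up model of a downward-growing stack, for illustration only."""

    def __init__(self, size: int) -> None:
        self.memory = bytearray(b"\xCC" * size)  # 0xCC marks untouched bytes
        self.sp = size                           # stack pointer starts at the top

    def probe(self, nbytes: int) -> None:
        """Step 1: check that nbytes are available below the stack pointer."""
        if self.sp - nbytes < 0:
            raise MemoryError("stack overflow")

    def allocate_and_fill(self, nbytes: int, value: int = 0) -> int:
        self.probe(nbytes)
        self.sp -= nbytes                  # Step 2: subtract from the stack pointer
        for i in range(nbytes):            # Step 3: a small, tight loop
            self.memory[self.sp + i] = value
        return self.sp


stack = ToyStack(128 * 1024)
base = stack.allocate_and_fill(BUFFER_SIZE)
print(base, stack.memory[base], stack.memory[base - 1])
```

The point is that the loop body stays the same size no matter how large the buffer is.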

But using a loop to initialize the memory was too mundane for whatever compiler was used to compile this code. Instead of generating a loop to initialize each byte of the buffer, the compiler “optimized” the code by unrolling the loop into 65,536 individual “write byte to memory” instructions, each 4 bytes long.

All in all, it took this program 256 kilobytes of code to initialize 64 kilobytes of data.
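
The arithmetic is easy to check. The helper name `unrolled_code_size` is made up for this sketch; the two inputs come straight from the story.

```python
def unrolled_code_size(buffer_bytes: int, bytes_per_instruction: int) -> int:
    """Code size when every byte of the buffer gets its own write instruction."""
    return buffer_bytes * bytes_per_instruction


code_bytes = unrolled_code_size(65536, 4)
print(code_bytes, code_bytes // 1024)  # 262144 bytes, which is 256 KB
```

In other words, the code that initializes the buffer is four times the size of the buffer itself.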

This offended the team so much that they added special code to the translator to detect this horrible function and replace it with the equivalent tight loop.
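
The story does not say how the detection worked, so the following is purely a guess at the general shape of such a check, not the team's actual code. It runs over a made-up, already-decoded instruction list rather than real x86 bytes. The `StoreByte`, `FillLoop`, and `Other` types, the `MIN_RUN` threshold, and the assumption that the writes go to ascending addresses with the same value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class StoreByte:
    """Hypothetical decoded instruction: write `value` to [base + offset]."""
    offset: int
    value: int


@dataclass(frozen=True)
class FillLoop:
    """Hypothetical replacement: write `value` to `count` bytes from `start`."""
    start: int
    count: int
    value: int


@dataclass(frozen=True)
class Other:
    """Any instruction this toy detector does not care about."""
    name: str


Insn = Union[StoreByte, FillLoop, Other]

MIN_RUN = 16  # arbitrary threshold: shorter runs are left alone


def collapse_store_runs(code: List[Insn]) -> List[Insn]:
    """Replace long runs of consecutive one-byte stores with a single FillLoop."""
    out: List[Insn] = []
    i = 0
    while i < len(code):
        first = code[i]
        j = i + 1
        if isinstance(first, StoreByte):
            # Extend the run while each store writes the same value
            # to the very next byte.
            while (j < len(code)
                   and isinstance(code[j], StoreByte)
                   and code[j].value == first.value
                   and code[j].offset == first.offset + (j - i)):
                j += 1
            if j - i >= MIN_RUN:
                out.append(FillLoop(first.offset, j - i, first.value))
                i = j
                continue
        out.extend(code[i:j])  # keep everything else unchanged
        i = j
    return out


unrolled = ([Other("prologue")]
            + [StoreByte(k, 0) for k in range(65536)]
            + [Other("ret")])
print(len(unrolled), len(collapse_store_runs(unrolled)))
```

On this toy input, the 65,536 stores collapse into one `FillLoop` entry, which a translator could then emit as a tight native loop.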
