WATaBoy: JIT-ing Game Boy Instructions to WASM Beats a Native Interpreter

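Because Apple restricts JIT compilation on iOS, high-performance emulators such as Dolphin are generally unavailable there. This blog post explores an alternative: "JIT-to-Wasm". The technique sidesteps the restriction by generating WebAssembly (Wasm) bytecode at runtime, which the browser engine then compiles to native machine code. The author built a Game Boy emulator called "WATaBoy" as a proof of concept to compare the performance of this approach against a traditional interpreter. The implementation relies on `wasm-encoder` to generate bytecode and uses the C ABI to pass data between Rust and JavaScript. Because Wasm is a Harvard architecture, the author uses a "late-linking" flow: the browser instantiates the new Wasm module, its function is added to the main instance's indirect function table, and it is executed via `call_indirect`. Although the JIT-to-Wasm approach outperforms the interpreter, the author notes that it still lacks the low-level optimisations used by mature emulators (such as hardware fastmem). The lack of robust, ergonomic tooling for generating Wasm at runtime also remains the main obstacle to wider adoption. Future work will focus on improving PPU emulation and exploring deeper optimisations to get more out of this JIT strategy.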

Background

This text assumes the reader is familiar with the concept of just-in-time compilation.

Dolphin isn’t on iOS, because you can’t do JIT compilation on iOS. That’s a quick summary of OatmealDome’s blog post “Why Dolphin Isn’t Coming to the App Store”. Ever since reading that, I’ve wondered what it would take to get a CPU-bound emulator like Dolphin working on iOS. Do we just... have to wait a few years for iPhone CPUs to get fast enough to run Dolphin with an interpreter?

Well, Apple has one exception to its JIT restrictions: web browsers. JavaScriptCore, WebKit’s JS engine, uses JIT compilation for its higher-performance tiers. So, if a JS function is called enough times, eventually it’ll be optimised and compiled into native machine code. The same is true for WebAssembly.

So, what if we just piggyback off of this? Instead of generating native machine code directly, we could just generate Wasm bytecode, which will eventually be compiled to native machine code by the web browser. After reading Andy Wingo's blog post "just-in-time code generation within webassembly", I knew such a thing would be possible. In fact, a handful of projects already use this technique, namely The Jiterpreter and v86, but at the time of writing, no emulators for game consoles have used it, and nobody has compared the performance to an interpreter running natively to see if it's faster.

So, for my undergraduate final-year project, I decided I’d build a Game Boy emulator, first using an interpreter, and then using a JIT-to-Wasm. This project primarily serves as a proof of concept and benchmark to compare the performance of each approach. For the rest of this blog post, I'll call this a “JIT-to-Wasm” instead of a “Wasm JIT” to avoid confusion with what the JS engine itself does (recompile Wasm to machine code).

Screenshot of WATaBoy, a Game Boy emulator that compiles SM83 to Wasm

Anyone reading this who knows a bit about emulation just rolled their eyes, because how the hell is a Game Boy emulator going to benefit from JIT compilation? Luckily, GameRoy’s blog post describes exactly how it’s possible while remaining cycle-accurate:

predict when interrupts are going to occur
whenever a JIT block might be interrupted, fall back to an interpreter
lazily evaluate any non-CPU Game Boy components accessed via MMIO

GameRoy’s JIT only targets x86, but nearly all of its optimisation techniques still apply to our JIT-to-Wasm. Definitely check it out if you’re interested in the nitty-gritty details of the Game Boy emulation side of things; it was a huge inspiration.
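To make the first two points concrete, here is a minimal sketch of the decision a dispatcher could make before running a block. The names (`ExecMode`, `choose_exec_mode`) and the cycle-count inputs are assumptions for illustration; they are not taken from GameRoy or WATaBoy.

```rust
/// How a block of guest code should be run. Names are illustrative only.
#[derive(Debug, PartialEq)]
enum ExecMode {
	/// The whole block finishes at or before the next predicted interrupt.
	Jit,
	/// An interrupt could land mid-block, so step one instruction at a time.
	Interpreter,
}

/// Pick an execution mode for a block that takes `block_cycles` cycles,
/// given the predicted number of cycles until the next interrupt.
/// `None` means no interrupt is currently predicted.
fn choose_exec_mode(block_cycles: u32, cycles_until_interrupt: Option<u32>) -> ExecMode {
	match cycles_until_interrupt {
		// The block would still be running when the interrupt fires.
		Some(remaining) if block_cycles > remaining => ExecMode::Interpreter,
		// Either nothing is predicted, or the block fits in the window.
		_ => ExecMode::Jit,
	}
}
```

The better the prediction, the less often the slow path is taken, which is why unpredicted interrupts translate directly into lost performance.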

Still, a Game Boy emulator doesn't benefit from JIT compilation as much as, say, a sixth-gen console. But it was much faster to make, and actually fit within the scope of my final-year project.

Implementation

Now, to narrow the scope of this blog post, I’ll take you through the most broadly applicable part of WATaBoy that I couldn't find a guide for anywhere else: Wasm codegen and late-linking from within Rust. Plenty about WATaBoy is interesting from a Game Boy emulation perspective (e.g., SIMD tile rendering), but those implementation details deserve separate write-ups (you can also just read WATaBoy’s source, of course). If you aren’t interested, skip to the results.

Normally we'd reach for tools like wasm-bindgen and wasm-pack to generate glue code between Rust and JavaScript. But those tools cause some ergonomics issues when working with Wasm at a low level. Instead, I use an approach similar to the one described in “Rust to WebAssembly the hard way”. This just means we'll pass data across the Rust-JS boundary via the C ABI, using pointers and buffer lengths instead of JavaScript objects.
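As a rough illustration of what "pointers and buffer lengths" means in practice, here is a small sketch of an exported function that receives a buffer as two plain integers. `sum_bytes` is a hypothetical export made up for this example; it is not part of WATaBoy.

```rust
/// Sum `len` bytes starting at `ptr`. The caller on the other side of the
/// boundary only ever passes two numbers: an address and a length.
///
/// Hypothetical export for illustration; not part of WATaBoy.
#[unsafe(no_mangle)]
pub extern "C" fn sum_bytes(ptr: *const u8, len: usize) -> u32 {
	if ptr.is_null() || len == 0 {
		return 0;
	}
	// SAFETY: the caller promises that `ptr` points at `len` readable bytes.
	let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
	bytes
		.iter()
		.fold(0u32, |acc, &byte| acc.wrapping_add(u32::from(byte)))
}
```

On the JavaScript side, the matching move would be to write the bytes into the instance's linear memory and pass their offset and length, which is the mirror image of what linkNewModule does later in this post.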

Just a heads up, you’ll need Nightly Rust, because we'll use a tiny bit of inline Wasm later. So run:

rustup default nightly

To switch back, just run this again but swap ‘nightly’ for ‘stable’.

Create a new library:

cargo new --lib jit-to-wasm

Hey look, we've already got some code here:

pub fn add(left: u64, right: u64) -> u64 {
	left + right
}

For our simple example, let’s try producing some Wasm bytecode at runtime that does the same thing.

Wasm code generation

The wasm-encoder crate will be our only dependency. With it, we can emit the bytes for Wasm instructions using a sort of builder pattern. It wasn’t designed for our JIT use case, so there are some ergonomics issues and a tiny bit of boilerplate, but it definitely beats writing an array of raw bytes by hand. :)

[package]
name = "jit-to-wasm"
version = "0.1.0"
edition = "2024"

[lib]
# Required to produce a .wasm file.
crate-type = ["cdylib"]

[dependencies]
wasm-encoder = "0.252.0"

Now, let’s use it to produce the bytecode for a Wasm module containing an ‘add’ function. Here comes that boilerplate I mentioned:

use wasm_encoder::*;

fn make_add_module() -> Vec<u8> {
	let mut module = Module::new();

	// Encode the type section for the add function.
	// Parameters: 32-bit int left, 32-bit int right.
	// Returns: 32-bit result.
	let mut types = TypeSection::new();
	let params = vec![ValType::I32, ValType::I32];
	let results = vec![ValType::I32];
	types.ty().function(params, results);
	module.section(&types);

	// Encode the function section.
	let mut functions = FunctionSection::new();
	let type_index = 0;
	functions.function(type_index);
	module.section(&functions);

	// Encode the export section.
	let mut exports = ExportSection::new();
	exports.export("my_add_func", ExportKind::Func, 0);
	module.section(&exports);

	// Encode the code section.
	let mut codes = CodeSection::new();
	let locals = vec![];
	let mut my_add_func = Function::new(locals);
	my_add_func
		.instructions()
		// Get the first 32-bit int onto the stack (left).
		.local_get(0)
		// Get the second 32-bit int onto the stack (right).
		.local_get(1)
		// Add the two ints together.
		.i32_add()
		.end();
	codes.function(&my_add_func);
	module.section(&codes);

	// Extract the encoded Wasm bytes for this module.
	module.finish()
}

This example is almost exactly the same as the one from wasm_encoder’s documentation.
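For comparison, this is roughly what the same module looks like when the bytes are assembled by hand, following the Wasm binary format. It is a sketch that only handles sections shorter than 128 bytes (so each size fits in one LEB128 byte), and `section` and `make_add_module_by_hand` are names invented here. It should describe an equivalent module, though I haven't checked that it matches wasm-encoder's output byte for byte.

```rust
/// Wrap `payload` in a section header: one id byte, then the payload size.
/// Sizes are LEB128 in real Wasm; every section here is under 128 bytes,
/// so a single byte is a valid encoding.
fn section(id: u8, payload: &[u8]) -> Vec<u8> {
	assert!(payload.len() < 128, "sketch only handles one-byte sizes");
	let mut out = vec![id, payload.len() as u8];
	out.extend_from_slice(payload);
	out
}

/// Hand-assemble a module exporting `my_add_func: (i32, i32) -> i32`.
fn make_add_module_by_hand() -> Vec<u8> {
	let name = b"my_add_func";
	// Magic number "\0asm" followed by version 1.
	let mut module = vec![0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00];

	// Type section (id 1): one func type (0x60), two i32 params, one i32 result.
	module.extend(section(1, &[0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f]));
	// Function section (id 3): one function, using type index 0.
	module.extend(section(3, &[0x01, 0x00]));
	// Export section (id 7): one export: name, kind 0x00 (func), func index 0.
	let mut exports = vec![0x01, name.len() as u8];
	exports.extend_from_slice(name);
	exports.extend_from_slice(&[0x00, 0x00]);
	module.extend(section(7, &exports));
	// Code section (id 10): one 7-byte body: no locals,
	// local.get 0, local.get 1, i32.add, end.
	module.extend(section(10, &[0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b]));

	module
}
```

Even for a two-instruction function, the bookkeeping (section ids, sizes, counts) dominates, which is the part wasm-encoder takes off our hands.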

Alright, now how do we actually execute this bytecode?

#[unsafe(no_mangle)]
pub extern "C" fn make_and_execute_add(left: i32, right: i32) -> i32 {
	let add_bytecode = make_add_module();

	// Execute add ...somehow???
}

Compiling and linking

Harkening back to Wingo’s blog post, Wasm is a Harvard architecture rather than a von Neumann architecture. Practically speaking, this means we can’t directly execute the bytecode generated by our program. For WebAssembly specifically, we have to reach out to the embedder (typically JavaScript) to compile, instantiate and link in our new Wasm bytecode. The jit-interface proposal may provide a way to do this directly in Wasm with a func.new instruction, but for now, we gotta talk to JavaScript.

First, we use the synchronous compilation interface to compile and instantiate our bytecode. (Compile & Instantiate)
Then, we add the function from our generated module to our main module’s indirect function table, and keep track of its index in the table so we can invoke it later. (Link)
Finally, we can actually execute the function using the call_indirect instruction, which calls the nth function in our indirect function table. (Dispatch)
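If the link and dispatch steps feel abstract, a native analogy may help: the indirect function table behaves much like a growable list of function pointers that all share one signature. The sketch below is only that analogy in plain Rust; `FunctionTable`, `BinaryFn` and `native_add` are invented names, and nothing here touches Wasm.

```rust
/// Signature shared by every entry, like the `(i32, i32) -> (i32)` type
/// that `call_indirect` checks against.
type BinaryFn = fn(i32, i32) -> i32;

/// A stand-in for the main module's indirect function table.
struct FunctionTable {
	entries: Vec<BinaryFn>,
}

impl FunctionTable {
	fn new() -> Self {
		Self { entries: Vec::new() }
	}

	/// Link: append a function and return its index in the table.
	fn link(&mut self, func: BinaryFn) -> usize {
		self.entries.push(func);
		self.entries.len() - 1
	}

	/// Dispatch: call the function at `index`. An out-of-range index
	/// returns `None` here; real Wasm would trap instead.
	fn dispatch(&self, index: usize, left: i32, right: i32) -> Option<i32> {
		self.entries.get(index).map(|func| func(left, right))
	}
}

/// Plays the role of the function exported by the generated module.
fn native_add(left: i32, right: i32) -> i32 {
	left.wrapping_add(right)
}
```

The difference in the real thing is where the function comes from: instead of a function compiled ahead of time, the entry is an export of a module the embedder compiled moments ago.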

Let’s imagine we’re already importing a function called "linkNewModule" that compiles, instantiates, and links a buffer of bytecode; we’ll implement the real thing in JavaScript later.

#[link(wasm_import_module = "env")]
unsafe extern "C" {
	// Returns the new function's index in the table.
	#[link_name = "linkNewModule"]
	fn link_new_module(buffer: *const u8, len: usize) -> i32;
}

Next, we implement our dispatch function to call the nth function in our indirect function table. All we really need to do is execute the call_indirect Wasm instruction. Normally when you want to do something like this, you'd reach for an intrinsic function in std::arch, but there isn't one for call_indirect. So we're going to have to use a tiny bit of inline WebAssembly.

This is an unstable feature, so you'll have to put this at the top of lib.rs:

#![feature(asm_experimental_arch)]

use std::arch::asm;

// Indirectly call the function at `index` in this module's function table.
fn dispatch(index: i32, left: i32, right: i32) -> i32 {
	let mut result: i32;
	unsafe {
		asm!(
			// Push arguments in parameter order; the table index goes last.
			"local.get {left}",
			"local.get {right}",
			"local.get {index}",
			"call_indirect (i32, i32) -> (i32)",
			"local.set {result}",
			index = in(local) index,
			left = in(local) left,
			right = in(local) right,
			result = lateout(local) result,
		);
	}
	result
}

Putting it all together, this is what we have:

#[unsafe(no_mangle)]
pub extern "C" fn make_and_execute_add(left: i32, right: i32) -> i32 {
	let add_bytecode = make_add_module();

	let func_idx = unsafe {
		link_new_module(add_bytecode.as_ptr(), add_bytecode.len())
	};

	dispatch(func_idx, left, right)
}

And one last thing: we have to pass a couple of flags to LLD using a build.rs file in the crate root. The first one, --export-table, exports our main Wasm module's indirect function table, so we can access it from the embedder (JS). The second one, --growable-table, lets us grow the table so we can append our JIT-compiled functions. This flag is totally undocumented, but it works, and there's a test for it, so...

fn main() {
	println!("cargo:rustc-link-arg=--export-table");
	println!("cargo:rustc-link-arg=--growable-table");
}

Alright, that's the Rust side of things done. Let's build our main Wasm module:

cargo build --release --target wasm32-unknown-unknown

The embedder (JavaScript) side of things

Now, let's try to call our make_and_execute_add function from the embedder:

// Instantiate the main Wasm module for the JIT itself.
const source = fetch(
	"target/wasm32-unknown-unknown/release/jit_to_wasm.wasm"
);
const {instance} = await WebAssembly.instantiateStreaming(source);

// Generate an add function at runtime and use it to add 2 and 3 together.
const result = instance.exports.make_and_execute_add(2, 3);
console.log(result);

Console output:

TypeError: import env:linkNewModule must be an object

Ah right, we haven't implemented that linking function yet. Let's do that now:

const linkNewModule = (bufferPtr, bufferLen) => {
	// Read the Wasm bytecode from the main instance's memory.
	const bytecode = new Uint8Array(
		instance.exports.memory.buffer,
		bufferPtr,
		bufferLen
	);
	
	// Compile and instantiate the bytecode into a new instance.
	const newModule = new WebAssembly.Module(bytecode);
	const newInstance = new WebAssembly.Instance(newModule);
	
	// Add the new instance's "my_add_func" function to our main instance's
	// indirect function table.
	instance.exports.__indirect_function_table.grow(
		1,
		newInstance.exports.my_add_func
	);
	
	// Return the index of the function we've just linked in.
	return instance.exports.__indirect_function_table.length - 1;
}

const importObj = {env: {linkNewModule}};

// Instantiate the main Wasm module for the JIT itself.
const source = fetch(
	"target/wasm32-unknown-unknown/release/jit_to_wasm.wasm"
);
const {instance} = await WebAssembly.instantiateStreaming(
	source,
	importObj
);

// Generate an add function at runtime and use it to add 2 and 3 together.
const result = instance.exports.make_and_execute_add(2, 3);
console.log(result);

Here's the console output:

5

And that’s the basis of WATaBoy’s codegen, linking, and dispatch. I'm sure you can guess how you might modify the function's signature and instructions in make_add_module to generate more useful Wasm modules at runtime. In WATaBoy, our JIT recompiles and appends each non-branching Game Boy instruction to create a basic block (a Wasm module with a single execute_block function) that we can cache and re-execute later. If you're curious, check out how part of the Game Boy's instruction set is recompiled.
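The "cache and re-execute" part can be sketched as a map from a guest address to the table index returned by the linking step. This is a simplified illustration, not WATaBoy's actual cache: `BlockCache` and the `compile_and_link` callback are hypothetical, and the callback stands in for codegen plus the linkNewModule call.

```rust
use std::collections::HashMap;

/// Maps the guest address where a block starts to the index of its
/// compiled function in the indirect function table.
struct BlockCache {
	table_index_by_pc: HashMap<u16, i32>,
}

impl BlockCache {
	fn new() -> Self {
		Self { table_index_by_pc: HashMap::new() }
	}

	/// Return the table index for the block starting at `pc`, compiling
	/// and linking it only on a cache miss. `compile_and_link` is a
	/// hypothetical callback standing in for codegen + linking.
	fn get_or_compile(
		&mut self,
		pc: u16,
		compile_and_link: impl FnOnce(u16) -> i32,
	) -> i32 {
		*self
			.table_index_by_pc
			.entry(pc)
			.or_insert_with(|| compile_and_link(pc))
	}
}
```

Keying on the program counter alone is a simplification: it ignores ROM bank switching and code that gets overwritten in RAM, both of which a real cache would need to account for.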

Further work

WATaBoy

Audio and GBC support are the most prominent missing features.

In terms of performance, profiling shows that emulating the PPU still takes up most of WATaBoy's runtime, because there are still a few PPU interrupts that I haven't implemented prediction for. This causes the JIT to fall back to the interpreter more often than it actually needs to, so it'll be my main priority before optimising the JIT compiler any further.

Our JIT-to-Wasm clearly beats out our interpreter running natively, and these results possibly apply to other emulators as well, especially those which are heavily CPU-bound. But looking at the results critically, we have only shown that our basic-block JIT compiler beats our basic fetch-decode-execute interpreter.

The interpreter is fast, and a lot of time was spent optimising it, but there are still niche optimisation techniques (e.g., a cached interpreter) that might help it catch up with our basic-block JIT compiler.
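For readers who haven't met the term, a cached interpreter decodes each instruction once, stores the decoded form, and reuses it on later visits instead of decoding again. Below is a minimal sketch of the idea on a made-up two-opcode machine; it is not the SM83, and none of these names come from WATaBoy.

```rust
use std::collections::HashMap;

/// A decoded instruction for a made-up machine (not the SM83).
#[derive(Clone, Copy)]
enum Op {
	/// Opcode 0x01, followed by one immediate byte.
	AddImm(u8),
	/// Any other opcode.
	Halt,
}

/// Decode the instruction at `pc`, returning it with its length in bytes.
fn decode(memory: &[u8], pc: usize) -> (Op, usize) {
	match memory[pc] {
		0x01 => (Op::AddImm(memory[pc + 1]), 2),
		_ => (Op::Halt, 1),
	}
}

/// Run from address 0 until a Halt, decoding each address at most once
/// across all runs that share `cache`. Returns the accumulator and the
/// number of decodes this run performed. Panics if the program runs off
/// the end of `memory` without halting.
fn run_cached(memory: &[u8], cache: &mut HashMap<usize, (Op, usize)>) -> (u8, usize) {
	let (mut acc, mut pc, mut decodes) = (0u8, 0usize, 0usize);
	loop {
		let (op, len) = *cache.entry(pc).or_insert_with(|| {
			decodes += 1;
			decode(memory, pc)
		});
		match op {
			Op::AddImm(value) => acc = acc.wrapping_add(value),
			Op::Halt => return (acc, decodes),
		}
		pc += len;
	}
}
```

A second run over the same program with the same cache performs no decodes at all, which is where the saving comes from; execution itself is still interpreted.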

The same goes for optimising our JIT compiler as well. For example, recompiling branching instructions would mean we’d stay executing JIT blocks for longer and spend less time falling back to the interpreter and dispatching between blocks.

I think it would be interesting to compare their relative performance with further optimisations, and I plan to continue working on this project as a hobby until I’m pushing the limits of both approaches. And if you know about cycle-accurate Game Boy emulation and you’d like to contribute, or if you're just curious, check out the project on GitHub.

JIT-to-Wasm in general

I'd argue that right now, the main pain point with JIT-ing to Wasm is codegen. Every project I've seen so far uses its own bespoke tooling for generating Wasm bytecode, and none of them is as ergonomic or robust as tools like DynASM or Cranelift. For this technique to see more widespread adoption, emulator developers will probably want some way to write strings of human-readable WAT that get translated into bytecode at compile time, in the same way that DynASM translates ARM/x86 assembly into machine code.

It’s also worth acknowledging another limitation to this approach. There’s no way to do a few of the lower-level optimisations Dolphin relies on. For example, Dolphin's hardware fastmem wouldn't work since any invalid memory accesses are irrecoverable within the Wasm runtime.