Zero-Copy GPU Inference from WebAssembly on Apple Silicon

Original link: https://abacusnoir.com/2026/04/18/zero-copy-gpu-inference-from-webassembly-on-apple-silicon/

## Apple Silicon and WebAssembly: Zero-Copy GPU Access for AI

Apple Silicon's Unified Memory Architecture (UMA) lets WebAssembly (Wasm) modules share memory directly with the GPU, eliminating the costly data copies traditionally required between a sandboxed runtime and an accelerator. The CPU and GPU read and write the *same* physical memory, which turns Wasm into the control plane and the GPU into the compute plane with near-zero overhead between them.

The author is building "Driftwood", a runtime for stateful AI inference that exploits this capability. The key is a three-part validation: `mmap` for aligned memory, Metal's ability to access that memory without copying, and a custom Wasmtime allocator that hands the same memory to the Wasm module. A matrix-multiply test confirmed zero-copy operation with performance on par with the conventional approach but a markedly smaller memory footprint, which is critical for large models such as those used in transformer inference.

Initial tests running Llama 3.2 1B on an M1 MacBook Pro show negligible overhead at the Wasm-GPU boundary. More importantly, the author successfully serialized and restored the GPU's key-value cache, enabling portable AI conversation state. This lays the groundwork for "stateful actor migration": freezing, moving, and resuming AI conversations across machines. Driftwood aims to build actor snapshots, checkpoint portability, and multi-model support on this foundation.

A new technique enables "zero-copy" GPU inference from WebAssembly (Wasm) on Apple Silicon, eliminating the data copying and serialization otherwise required between Wasm modules and the GPU, which means faster, more efficient machine learning inference. So far the capability is confirmed to work in Wasmtime (a Wasm runtime) but not necessarily in web browsers, prompting discussion of potential browser compatibility. The news sparked debate in the comments: some questioned whether the approach is necessary and lamented the reliance on AI-generated ideas, while others noted that AI-generated content often lacks deep understanding and worried about the effect on practical skills such as coding interviews. In rebuttal, some argued that traditional whiteboard interviews may soon be obsolete.

Original article

tl;dr: on Apple Silicon, a WebAssembly module's linear memory can be shared directly with the GPU: no copies, no serialization, no intermediate buffers. The CPU and GPU read and write the same physical bytes. End-to-end, it works: a Wasm guest fills a matrix in its linear memory, the GPU reads it, computes, writes back, and the guest sees the result through the same pointer, same memory, zero copies.

Normally Wasm and GPUs are separated by an expensive serialization boundary: on most hardware, getting data from a VM sandbox to an accelerator means copying across a bus. Apple Silicon's Unified Memory Architecture erases that boundary (no bus, same physical memory), and what falls out is a runtime where Wasm is the control plane and the GPU is the compute plane, with near-zero overhead between them.

I'm building something called Driftwood that exploits this for stateful AI inference ... and this post is about the foundation (how the zero-copy chain works, what I measured, what it opens up). Still early, still poking at it.


Why this is normally hard

Quick background, for anyone who doesn't live in this stack: WebAssembly gives you a sandbox. Your module gets a flat byte array (linear memory) and that's the universe ... everything outside is mediated by "host" function calls. The whole point is isolation, portability, determinism.

GPUs also want a flat byte array, but a specific kind: page-aligned, pinned, accessible to the DMA engine. On a discrete GPU (think NVIDIA, or AMD), that memory sits across a PCIe bus from the CPU, so getting data from a Wasm module's linear memory to the GPU means: copy out of the sandbox into host memory, then copy across the bus into GPU memory. Two copies, two latency hits, and an awkward impedance mismatch between "isolated VM" and "hardware accelerator."

Apple Silicon changes the physics. The CPU and GPU share the same physical memory (Apple's Unified Memory Architecture) ... no bus! A pointer the CPU can read, the GPU can also read, from the same DRAM. The real question: can you thread that pointer through the layers of abstraction (the Wasm runtime, the GPU API) without anyone making a defensive copy along the way?

Turns out ... you can!

Three links. I validated each one on its own before trying to compose them: it's the kind of thing where if you skip the isolation step and the whole pipeline breaks, you have no idea "which joint is leaking".

Link 1: mmap gives you page-aligned memory. On ARM64 macOS, mmap with MAP_ANON | MAP_PRIVATE returns 16 KB-aligned addresses. This isn't a lucky accident, it happens to be the ARM64 page size, and mmap aligns by contract. The alignment matters because Metal requires it.

Link 2: Metal accepts that pointer without copying. MTLDevice.makeBuffer(bytesNoCopy:length:) wraps an existing pointer as a Metal buffer. On Apple Silicon, this is the zero-copy path, i.e. the GPU accesses the same physical memory the CPU does. I verified pointer identity: the MTLBuffer.contents() pointer equals the original mmap pointer. I verified no hidden copies: RSS delta was 0.03 MB (measurement noise), compared to 16.78 MB for the explicit-copy path. And, same compute latency either way.

Link 3: Wasmtime lets you bring your own allocator. Wasmtime's MemoryCreator trait lets you control how linear memory is allocated. Instead of letting Wasmtime call mmap internally, you provide the backing memory yourself. I implement MemoryCreator to return our own mmap region, and Wasmtime's memory.data_ptr() returns exactly the pointer I handed it. The Wasm module reads and writes through Wasmtime's memory API; the GPU reads and writes through the Metal buffer; both are operating on the same bytes.

The composition: allocate an mmap region, hand it to both Wasmtime (as the actor's linear memory) and Metal (as a GPU buffer). The Wasm module writes data at known offsets, the GPU computes on it in place, and the results appear in the module's linear memory with no copies and no explicit data transfer.

I tested the full chain with a 128×128 matrix multiply: the Wasm module fills matrices A and B, the GPU runs a GEMM shader, the module reads result C back. Zero errors across 16,384 elements. Small test, but it's the kind of thing where either it all lines up or you get garbage, so zero errors is the signal I wanted.

What I measured

Three things I cared about: pointer identity (is it actually zero-copy?), memory overhead (any hidden copies sneaking in?), and correctness (does the GPU see what Wasm wrote?).

  Measurement                     Zero-copy path     Copy path
  ─────────────────────────────────────────────────────────────
  Pointer identity                mmap == MTLBuffer   different addrs
  RSS delta (16 MB region)        0.03 MB             16.78 MB
  GEMM latency (128×128)          ~6.75 ms            ~6.75 ms
  Correctness (16K elements)      0 errors            0 errors

The latency equivalence makes sense: on UMA, the compute itself is identical either way. The memory picture is where it shows up: the zero-copy path has essentially no overhead for making data GPU-accessible, and the copy path doubles your memory footprint.

At small tensor sizes, nobody cares. At the scale of KV caches in transformer inference (hundreds of megabytes per conversation) it's the difference between fitting four actors in memory or two. That's the regime I actually want to operate in, so the memory part matters.

From zero-copy to inference

So now I've got a primitive: Wasm and the GPU share memory with no overhead. What do you do with it?

I plugged the chain into Apple's MLX framework and ran Llama 3.2 1B Instruct from a Wasm actor: a full transformer decoder written in Rust, compiled to Wasm and run on a native host runtime, driving inference on the Apple Silicon GPU through host function calls. (I was too lazy to wire up a custom kernel path from scratch, and ... MLX was there)

Measured latencies, running Llama 3.2 1B (4-bit quantized, 695 MB) on a 2021 M1 MacBook Pro (old personal laptop, I'll re-evaluate on a proper Mac Studio someday when I can get my hands on it 😄):

  Operation                    Latency
  ──────────────────────────────────────
  Model load (safetensors)     229 ms      (one-time)
  Prefill (5 tokens)           106 ms
  Per-token generation          ~9 ms
  Host function boundary        negligible

The host function boundary (the Wasm-to-GPU dispatch) isn't measurable against the inference cost. Anyone who's worked with sandboxed runtimes has probably winced at the thought of crossing that boundary per dispatch. On this hardware, it's not a thing.

KV cache portability

Transformers maintain a key-value cache that accumulates context across conversation turns, which is normally ephemeral (kill the process, lose the cache, start over). If you've tried running local inference, you know the feeling.

Because the cache lives in GPU-accessible memory that I control, I can serialize it. So, I dump the KV cache to safetensors format (standard ML tensor serialization, nothing exotic) and restore it later, on the same machine, or a different machine, or potentially against a different model on a different machine! That last one I haven't tested across meaningfully different architectures yet ... we'll see.

  Operation                    Latency      Size
  ───────────────────────────────────────────────────
  Serialize (24 tokens)        1.1 ms       1.58 MB (~66 KB/token)
  Restore from disk            1.4 ms
  Re-prefill from scratch      67.7 ms      (the alternative)
  ───────────────────────────────────────────────────
  Speedup from restore:        5.45×
  Round-trip fidelity:         bit-identical (10/10 tokens match)

5.45× at 24 tokens, and the ratio improves with context length: restore time is nearly constant, re-prefill scales linearly. At 4,096 tokens, restore would be around 100× faster than recomputation (I haven't actually pushed it to 4,096 yet; that's napkin math extrapolating from the constant-vs-linear shape).

This is the basis for stateful actor mobility: freezing a conversation mid-exchange, moving it somewhere else, thawing it with full context intact. The Wasm module's linear memory captures the actor's logical state; the KV cache captures the inference engine's accumulated context. Together: a portable snapshot of a running AI conversation (or, at least, that's the plan 😅).

What's being built

Driftwood is a runtime for stateful Wasm actors with GPU inference. The zero-copy chain is the foundation: on top of it I'm going to add actor snapshots (freeze and resume any conversation), checkpoint portability (move inference state across machines), and multi-model support (the snapshot format is model-agnostic, so in theory the actor's identity survives model swaps ... which might work, will revisit once I test it).

This is all early, still stitching things together. But the "physics" works: Wasm and the GPU can share memory on Apple Silicon with zero overhead, the KV cache is portable, and a full transformer runs from a sandboxed actor at native speed. The next things I want to poke at: whether the snapshot really survives a model swap, whether the chain holds up on larger models, and whether I'm missing some obvious reason this will fall over at scale. Slow and steady ...


More on the actor model and snapshot architecture in a future post, once I've actually shipped something past the "physics works" stage.
