nullprogram.com/blog/2026/01/01/
Software above some complexity level tends to sport an extension language, becoming a kind of software platform itself. Lua fills this role well, and of course there’s JavaScript for web technologies. WebAssembly generalizes this, and any Wasm-targeting programming language can extend a Wasm-hosting application. It has more friction than supplying a script in a text file, but extension authors can write in their language of choice, and use more polished development tools (debugging, testing, etc.) than are typically available for an extension language. Python is traditionally extended through native code behind a C interface, but it’s recently become practical to extend Python with Wasm. That is, we can ship an architecture-independent Wasm blob inside a Python library, and use it without requiring a native toolchain on the host system. Let’s discuss two different use cases and their pitfalls.
Normally we’d extend Python in order to access an external interface that Python cannot access on its own. Wasm runs in a sandbox with no access to the outside world whatsoever, so it obviously isn’t useful for that case. Extensions may also grant Python more speed, which is one of Wasm’s main selling points. We can also use Wasm to access embeddable capabilities written in a different programming language which do not require external access.
My preferred non-WASI Wasm runtime is Volodymyr Shymanskyy’s wasm3. It’s plain old C and very friendly to embedding in the same way as, say, SQLite. Performance is middling, though a C program running on wasm3 is still quite a bit faster than an equivalent Python program. It has Python bindings, pywasm3, but it’s distributed only in source code form. That is, the host machine must have a C toolchain in order to use pywasm3, which defeats my purposes here. If there’s a C toolchain, I might as well just use that instead of going through Wasm.
For the use cases in this article, the best option is wasmtime-py. The distribution includes binaries for Windows, macOS, and Linux on x86-64 and ARM64, which covers nearly all Python installations. Hosts require nothing more than a Python interpreter, no native toolchains. It’s almost as good as having Wasm built into Python itself. In my tests it’s 3x–10x faster than wasm3, so for my first use case the situation is even better. The catch is that it currently weighs ~18MiB (installed), and in the future will likely rival the Python interpreter itself. The API also breaks on a monthly basis, so you’re signing up for the upgrade treadmill lest your own program perishes to bitrot after a couple of years. This article is about version 40.
Usage examples and gotchas
The official examples don’t do anything non-trivial or interesting, and so to figure things out I had to study the documentation, which does not offer many hints. Basic setup looks like this:
import functools
import wasmtime
store = wasmtime.Store()
module = wasmtime.Module.from_file(store.engine, "example.wasm")
instance = wasmtime.Instance(store, module, ())
exports = instance.exports(store)
memory = exports["memory"].get_buffer_ptr(store)
func1 = functools.partial(exports["func1"], store)
func2 = functools.partial(exports["func2"], store)
func3 = functools.partial(exports["func3"], store)
A store is an allocation region from which we allocate all Wasm objects.
It is not possible to free individual objects except to discard the whole
store. Quite sensible, honestly. What’s not sensible is how often I have
to repeat myself, passing the store back into every object in order to use
it. These objects are associated with exactly one store and cannot be used
with different stores. Use the wrong store and it panics: It’s
already keeping track internally! I do not understand why the interface
works this way. So to make things simpler, I use functools.partial to
bind the store parameter and so get the interface I expect.
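The same binding can be applied to every function export in one pass. This is a minimal sketch under assumptions: `bind_exports` is a hypothetical helper name, `FakeStore` and `fake_add` are stand-ins so the example runs without a Wasm module, and it assumes the exports object can be iterated like a mapping, which may need adapting for a real instance.

```python
import functools

def bind_exports(exports, store):
    """Return {name: callable} with the store pre-bound as first argument."""
    return {
        name: functools.partial(obj, store)
        for name, obj in exports.items()
        if callable(obj)  # skip non-function exports such as memories
    }

# Stand-ins so the sketch runs without a Wasm module (not wasmtime types).
class FakeStore:
    pass

def fake_add(store, a, b):
    assert isinstance(store, FakeStore)
    return a + b

fake_exports = {"add": fake_add, "memory": object()}
funcs = bind_exports(fake_exports, FakeStore())
print(funcs["add"](2, 3))
```

Non-callable exports are left out on purpose, since memory still needs the store passed explicitly or its own partial bindings.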
The get_buffer_ptr object is a buffer protocol object, and if you’re
moving anything other than bytes that’s probably what you want to use to
access memory. The usual caveats apply for this object: If you change the
memory size you probably want to grab a fresh buffer object. For
bytes (e.g. buffers and strings) I prefer the read and write methods.
Because multi-value is still in an experimental state in the Wasm ecosystem, you will likely not pass structs with Wasm. Anything more complicated than scalars will require pointers and copying data in and out of Wasm linear memory. This involves the usual trap that catches nearly everyone: Wasm interfaces make no distinction between pointers and integers, and Wasm runtimes generally interpret all integers as signed. What that means is your pointers are signed unless you take action. Addresses start at 0, so this is bad, bad news.
malloc = functools.partial(exports["malloc"], store)
hello = b"hello"
pointer = malloc(len(hello))
assert pointer
exports["memory"].write(store, hello, pointer)  # WRONG!
To make matters worse, wasmtime-py adds its own footgun: The read and
write methods adopt the questionable Python convention of negative
indices acting from the end. If malloc returns a pointer in the upper
half of memory, the negative pointer will pass the bounds check inside
write because negative is valid, then quietly store to the wrong
address! Doh!
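To see how a signed pointer slips past a bounds check, here is a scaled-down model: 8-bit pointers and a 192-byte memory instead of 32-bit pointers and gigabytes. The `write` function is a hypothetical stand-in that normalizes its start index the way Python slices do; it is not wasmtime-py's actual code, and `as_signed8` only mimics a runtime handing back an integer as signed.

```python
def as_signed8(value):
    """Reinterpret an 8-bit value as signed (model of a signed return)."""
    return value - 256 if value >= 128 else value

def write(memory, value, start):
    """Bounds-checked write that accepts negative starts like a slice."""
    size = len(memory)
    resolved = slice(start, None).indices(size)[0]  # negative: from the end
    if resolved + len(value) > size:
        raise IndexError("out of bounds")
    memory[resolved:resolved + len(value)] = value
    return resolved

memory = bytearray(192)             # only 192 of 256 addresses are mapped
real_address = 144                  # valid, but in the upper half
pointer = as_signed8(real_address)  # what the host sees: -112
landed = write(memory, b"hi", pointer)

print(pointer, landed)
```

The bounds check passes, yet the bytes land at 192 - 112 = 80 rather than 144, and nothing reports an error.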
I wondered how common this error is, so I searched online. I could find only one non-trivial wasmtime-py use in the wild, in a sandboxed PDF reader. It falls into the negative pointer trap as I expected. Not only that, it’s a buffer overflow into Python’s memory space:
buf_ptr = malloc(store, len(pdf_data))
mem_data = memory.data_ptr(store)
for i, byte in enumerate(pdf_data):
    mem_data[buf_ptr + i] = byte
The data_ptr method returns a non-bounds-checked raw ctypes pointer,
so this is actually a double mistake. First, it shouldn’t trust pointers
coming out of Wasm if it cares at all about sandboxing. The second is the
potential negative pointer, which in this case would write outside of the
Wasm memory and in Python’s memory, hopefully seg-faulting.
What’s one to do? Every pointer coming out of Wasm must be truncated with a mask:
pointer = malloc(...) & 0xffffffff # correct for wasm32!
This interprets the result as unsigned. 64-bit Wasm needs a 64-bit mask, though in practice you will never get a valid negative pointer from 64-bit Wasm. This rule applies to JavaScript as well, where the idiom is:
let pointer = malloc(...) >>> 0
Wasm runtimes cannot help — they lack the necessary information — and this is perhaps a fundamental flaw in Wasm’s design. Once you know about it you see this mistake happening everywhere.
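A few concrete values show what the mask does. The helper names `u32` and `u64` are illustrative, not part of any library, and the inputs are hand-picked examples of what a runtime might return for a wasm32 pointer.

```python
def u32(value):
    """Reinterpret a possibly-negative integer as an unsigned 32-bit value."""
    return value & 0xffffffff

def u64(value):
    """Same idea for 64-bit Wasm."""
    return value & 0xffffffffffffffff

low = u32(1024)           # lower-half pointers are unchanged
high = u32(-2147483648)   # the most negative i32 is address 0x80000000
last = u32(-1)            # -1 is the very last address, 0xffffffff

print(low, hex(high), hex(last))
```

Because Python integers are arbitrary precision, the bitwise AND with a positive mask always yields a non-negative result, so it is safe to apply to every pointer unconditionally.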
Now that you have a proper address, you can apply it to a buffer protocol
view of memory. If you’re using NumPy there are various ways to interact
with this memory by wrapping it in NumPy types, though only if you’re on a
little endian host. (If you’re on a big endian machine, just give up on
running Wasm anyway.) The first use case I have in mind typically involves
copying plain Python values in and out. The struct package is
quite handy here:
vec2 = malloc(...) & 0xffffffff
memory = exports["memory"].get_buffer_ptr(store)
struct.pack_into("<ii", memory, vec2, x, y)
It fills a similar role to JavaScript DataView. If you’re copying
lots of numbers, with CPython it’s faster to construct a custom format
string rather than use a loop:
nums: list[int] = ...
struct.pack_into(f"<{len(nums)}i", memory, buf, *nums)
To copy structures back out, use struct.unpack_from. If you’re moving
strings, you’ll need to .encode() and .decode() to convert to and from
bytes, which are well-suited to read and write.
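Here is a round trip through `struct`, as a sketch that uses a plain `bytearray` as a stand-in for the Wasm memory buffer so it runs without a Wasm module. The addresses `buf` and `strptr` are hypothetical and would normally come from the guest allocator, already masked.

```python
import struct

memory = bytearray(64)  # stand-in for the buffer protocol view of memory
buf = 16                # hypothetical guest address for the integers
strptr = 40             # hypothetical guest address for the string

# Copy a list of 32-bit integers in with one format string, then back out.
nums = [3, -1, 4, 1, -5]
fmt = f"<{len(nums)}i"
struct.pack_into(fmt, memory, buf, *nums)
out = list(struct.unpack_from(fmt, memory, buf))

# Strings travel as encoded bytes.
text = "héllo".encode()
memory[strptr:strptr + len(text)] = text
decoded = bytes(memory[strptr:strptr + len(text)]).decode()

print(out, decoded)
```

The `<` prefix pins the layout to little endian with no padding, which matches how Wasm lays out integers in linear memory.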
In practice with real Wasm programs you’re going to be interacting with
the “guest” allocator from the outside, to request memory into which you
copy inputs for a function. In my examples I’ve used malloc because it
requires no elaboration, but as usual a bump allocator solves
this so much better, especially because it doesn’t require stuffing a
whole general purpose allocator inside the Wasm program. Have one global
arena (no other threads will be sharing that Wasm instance), rapid fire a
bunch of allocations as needed without any concern for memory management
in the “host”, call the function, which might allocate a result from that
arena, then reset the arena to clean up. In essence a stack for passing
values in and out.
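The allocate, call, reset cycle can be modeled in a few lines. This is a toy model in Python of a guest-side bump allocator, not the guest code itself: the `Arena` class, its 16-byte alignment, and returning 0 on exhaustion are all assumptions made for illustration.

```python
class Arena:
    """Toy model of a bump allocator over the address range [base, high)."""

    def __init__(self, base, high):
        self.base = base  # first usable address
        self.used = base  # next free address
        self.high = high  # one past the last usable address

    def alloc(self, size, align=16):
        """Return an aligned address, or 0 (null) if the arena is exhausted."""
        start = (self.used + align - 1) & -align  # align: power of two
        if size < 0 or start + size > self.high:
            return 0
        self.used = start + size
        return start

    def reset(self):
        """Free everything at once by rewinding to the base."""
        self.used = self.base

arena = Arena(base=1024, high=2048)
a = arena.alloc(5)      # first allocation sits at the base
b = arena.alloc(32)     # next one is rounded up to a 16-byte boundary
c = arena.alloc(4096)   # too large for the arena
arena.reset()
d = arena.alloc(1)      # after reset, allocation starts over at the base

print(a, b, c, d)
```

There is no per-object free, which is the point: the host never tracks individual allocations, it only resets after each call.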
WebAssembly as faster Python
Suppose we noticed a computational hot spot in our Python program in a pure Python function (e.g. not calling out to an extension). Optimizing this function would be wise. Based on my experiments if I re-implement that function in C, compile it to Wasm, then run that bit of Wasm in place of the original function, I can expect around a 10x speed-up. In general C is more like 100x faster than Python, and the overhead of interfacing with Wasm — copying stuff in and out, etc. — can be high, but not so high as to not be profitable. This improves further if I can change the interface, e.g. require callers to use the buffer protocol.
Thanks to wasmtime-py, I could introduce this change without fussing with cross-compilers to build distribution binaries, nor require a toolchain on the target, just a hefty Python package. Might be worth it.
My main experimental benchmark is a variation on my solution to the “Two Sum” problem, which I originally wrote for JavaScript, then extended to pywasm3 and later wasmtime-py. It’s simple, just interesting enough, and representative of the sort of Wasm drop-in I have in mind. It has the same interface, but implements it with Wasm.
# Original Pythonic interface
def twosum(nums: list[int], target: int) -> tuple[int, int] | None:
    ...

# Stateful Wasm interface
class TwoSumWasm():
    def __init__(self):
        store = wasmtime.Store()
        module = wasmtime.Module.from_file(store.engine, ...)
        instance = wasmtime.Instance(store, module, ())
        ...

    def twosum(self, nums, target):
        # ... use wasm instance ...
        ...
There’s some state to it with the Wasm instance in tow. If you hide that by making it global you’ll need to synchronize your threads around it. In a multi-threaded program perhaps these would be lazily-constructed thread locals. I haven’t had to solve this yet.
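One possible shape for lazily-constructed thread locals, as a sketch only: `PerThread` is a hypothetical helper, and `FakeTwoSum` stands in for a class that owns a Wasm store and instance so the example runs without wasmtime.

```python
import threading

class PerThread:
    """Lazily build one instance per thread from a zero-argument factory."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        try:
            return self._local.instance
        except AttributeError:
            # First use on this thread: construct and remember it.
            self._local.instance = self._factory()
            return self._local.instance

class FakeTwoSum:
    """Stand-in for a class that owns a Wasm store and instance."""

solvers = PerThread(FakeTwoSum)
results = {}

def worker(name):
    results[name] = solvers.get()

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_solver = solvers.get()
print(len({id(results["a"]), id(results["b"]), id(main_solver)}))
```

Each thread pays the construction cost once, on first use, and no locking is needed around calls because no instance is ever shared.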
However, the weakness of the wasmtime “store” really shows: Notice how
compilation and instantiation are bound together in one store? I cannot
compile once and then create disposable instances on the fly, e.g. as
required for each run of a WASI program. Every instance permanently
extends the compilation store. In practice we must wastefully re-compile
the Wasm program for each disposable instance. Despite appearances,
compilation and instantiation are not actually distinct steps, as they are
in JavaScript’s Wasm API. wasmtime.Instance accepts a store as its first
argument, suggesting use of a different store for instantiation. That
would solve this problem, but as of this writing it must be the same
store used to compile the module. This is a fatal flaw for certain real
use cases, particularly WASI.
WebAssembly as embedded capabilities
Loup Vaillant’s Monocypher is a wonderful cryptography library. Lean, efficient, and embedding-friendly, so much so it’s distributed in amalgamated form. It requires no libc or runtime, so we can compile it straight to Wasm with almost any Clang toolchain:
$ clang --target=wasm32 -nostdlib -O2 -Wl,--no-entry -Wl,--export-all \
        -o monocypher.wasm monocypher.c
It’s not “Wasm-aware” so I need --export-all to expose the interface.
This is swell because, as a single translation unit, anything with external
linkage is the interface. Though remember what I said about interacting
with the guest allocator? This has no allocator, nor should it. It’s not
so usable in this form because we’d need to manage memory from the
outside. Do-able, but it’s easy to improve by adding a couple more
functions, sticking to a single translation unit:
#include "monocypher.c"

extern char __heap_base[];
static char *heap_used;
static char *heap_high;

void *bump_alloc(ptrdiff_t size)
{
    // ...
}

void bump_reset()
{
    ptrdiff_t len = heap_used - __heap_base;
    __builtin_memset(__heap_base, 0, len);  // wipe keys, etc.
    heap_used = __heap_base;
}
I’ve discussed __heap_base before, which is part of the ABI.
We’ll push keys, inputs, etc. onto this “stack”, run our cryptography
routine, copy out the result, then reset the bump allocator, which wipes
out all sensitive data. Often memset is insufficient — typically it’s
zero-then-free, and compilers see the lifetime about to end — but no
lifetime ends here, and stores to this “heap” memory are externally observable
as far as the abstract machine can tell. (Otherwise we couldn’t reliably
copy out our results!)
There’s a lot to this API, but I’m only going to look at the AEAD interface. We “lock” up some data in an encrypted box, write any unencrypted label we’d like on the outside. Then later we can unlock the box, which will only open for us if neither the contents of the box nor the label were tampered with. That’s some solid API design:
void crypto_aead_lock(uint8_t *cipher_text,
                      uint8_t mac [16],
                      const uint8_t key [32],
                      const uint8_t nonce[24],
                      const uint8_t *ad, size_t ad_size,
                      const uint8_t *plain_text, size_t text_size);

int crypto_aead_unlock(uint8_t *plain_text,
                       const uint8_t mac [16],
                       const uint8_t key [32],
                       const uint8_t nonce[24],
                       const uint8_t *ad, size_t ad_size,
                       const uint8_t *cipher_text, size_t text_size);
By compiling to Wasm we can access this functionality from Python almost like it was pure Python, and interact with other systems using Monocypher.
Since Monocypher does not interact with the outside world on its own, it
relies on callers to use their system’s CSPRNG to create those nonces and
keys, which we’ll do using the standard library’s secrets module:
class Monocypher:
    def __init__(self):
        ...
        self._read = functools.partial(memory.read, store)
        self._write = functools.partial(memory.write, store)
        self.__alloc = functools.partial(exports["bump_alloc"], store)
        self._reset = functools.partial(exports["bump_reset"], store)
        self._lock = functools.partial(exports["crypto_aead_lock"], store)
        self._unlock = functools.partial(exports["crypto_aead_unlock"], store)
        self._csprng = secrets.SystemRandom()

    def _alloc(self, n):
        return self.__alloc(n) & 0xffffffff

    def generate_key(self):
        return self._csprng.randbytes(32)

    def generate_nonce(self):
        return self._csprng.randbytes(24)

    ...
With a solid foundation, all that follows comes easily. A finally
guarantees secrets are always removed from Wasm memory, and the rest is
just about copying bytes around:
    def aead_lock(self, text, key, ad = b""):
        assert len(key) == 32
        try:
            macptr = self._alloc(16)
            keyptr = self._alloc(32)
            nonceptr = self._alloc(24)
            adptr = self._alloc(len(ad))
            textptr = self._alloc(len(text))

            self._write(key, keyptr)
            nonce = self.generate_nonce()
            self._write(nonce, nonceptr)
            self._write(ad, adptr)
            self._write(text, textptr)

            self._lock(
                textptr,
                macptr,
                keyptr,
                nonceptr,
                adptr, len(ad),
                textptr, len(text),
            )
            return (
                self._read(macptr, macptr+16),
                nonce,
                self._read(textptr, textptr+len(text)),
            )
        finally:
            self._reset()
And aead_unlock is basically the same in reverse, but throws if the box
fails to unlock, perhaps due to tampering:
    def aead_unlock(self, text, mac, key, nonce, ad = b""):
        assert len(mac) == 16
        assert len(key) == 32
        assert len(nonce) == 24
        try:
            macptr = self._alloc(16)
            keyptr = self._alloc(32)
            nonceptr = self._alloc(24)
            adptr = self._alloc(len(ad))
            textptr = self._alloc(len(text))

            self._write(mac, macptr)
            self._write(key, keyptr)
            self._write(nonce, nonceptr)
            self._write(ad, adptr)
            self._write(text, textptr)

            if self._unlock(
                textptr,
                macptr,
                keyptr,
                nonceptr,
                adptr, len(ad),
                textptr, len(text),
            ):
                raise ValueError("AEAD mismatch")
            return self._read(textptr, textptr+len(text))
        finally:
            self._reset()
Usage:
mc = Monocypher()
key = mc.generate_key()
message = "Hello, world!"
mac, nonce, encrypted = mc.aead_lock(message.encode(), key)
Transmit mac, nonce, and encrypted to the other party (or your
future self), who already has the key:
decrypted = mc.aead_unlock(encrypted, mac, key, nonce)
Find the complete source in my scratch repository.
While I have a few reservations about wasmtime-py, it fascinates me how well this all works. It’s been my hammer in search of a nail for some time now.