In the previous article we explored how the Go runtime bootstraps itself — how a Go binary goes from the operating system handing it control to your func main() running. During that bootstrap, one of the first things the runtime sets up is the memory allocator. And that’s what we’re going to explore today.
Think of the memory allocator as a warehouse manager. Your program constantly needs boxes of different sizes — sometimes tiny, sometimes huge — and it needs them fast. The allocator’s job is to hand out those boxes as quickly as possible, keep the warehouse organized so nothing goes to waste, and work with the garbage collector to reclaim boxes that nobody is using anymore.
But before we get into the warehouse itself, let’s talk about when things actually end up there.
When Does Memory Allocation Happen?
Not every variable in your program goes through the memory allocator. Go has two places to put data: the stack and the heap.
The stack is the easy one. Each function call gets its own little scratch space on the stack, and when the function returns, that space is automatically gone. It’s fast and simple — no bookkeeping needed.
But sometimes data needs to stick around after the function that created it is done. Maybe you’re returning a pointer to something, or storing a value that other parts of your program will use later. That data can’t live on the stack — it would vanish when the function returns. So it goes on the heap, which is a longer-lived region of memory.
The Go compiler is actually pretty smart about this. It analyzes your code at compile time to decide what can stay on the stack and what needs to go on the heap — this is called escape analysis (we covered it in detail in the IR article).
Every time something ends up on the heap, that’s when the memory allocator comes into play. It’s the system that finds free space on the heap and hands it over. And that’s what the rest of this article is about.
A small simplification: the picture above is not the whole story. In Go, goroutine stacks are actually allocated from the heap — so the memory allocator provides the space where stacks live. But once a stack is allocated, the variables on it are managed very differently from heap objects: they’re just offsets within the stack frame, with no allocator involvement per variable. So while the allocator is responsible for the stack memory, it’s not involved in placing individual variables on the stack. For this article, we’ll focus on the heap side of things.
So the allocator manages heap memory. But where does that memory come from in the first place?
Why Not Just Ask the OS?
When your program needs memory, somebody has to provide it. Ultimately, that somebody is the operating system. The OS manages all the physical RAM on your machine, and any process that wants memory has to ask the OS for it through system calls like mmap on Linux/macOS or VirtualAlloc on Windows.
The problem is that system calls are slow. They involve switching from user space to kernel space, the OS doing its own bookkeeping, and then switching back. If Go made a system call every time you wrote make([]byte, 100) or &MyStruct{}, performance would be terrible — especially in a language designed for high concurrency, where thousands of goroutines might be allocating memory at the same time.
So the Go runtime takes a different approach: it asks the OS for large chunks of memory upfront (we’ll see later that these are 64MB on most 64-bit systems) and then manages the distribution internally. When your code needs 100 bytes, the allocator doesn’t go to the OS — it carves out 100 bytes from memory it already has. It only goes back to the OS when it runs out.
This is the fundamental idea behind the memory allocator. It sits between your program and the operating system, acting as a fast intermediary that makes allocation cheap by avoiding system calls on the hot path. But managing all that memory internally is not trivial — the allocator needs to keep track of what’s in use, what’s free, and do it all without becoming a bottleneck. Let’s see how.
Arenas and Pages
We said the runtime asks the OS for large chunks of memory. Those chunks are called arenas, and on most 64-bit systems each one is 64MB (4MB on Windows and 32-bit systems, 512KB on WebAssembly).
When your program starts and begins allocating, the runtime requests its first arena from the OS. As the program needs more memory, it requests additional arenas. They don’t need to be next to each other in memory — the runtime keeps track of all of them through an internal map.
Does this mean Go grabs 64MB of RAM right away? No. When the runtime “requests” an arena, it first just reserves 64MB of address space — think of it as putting your name on a plot of land without building anything on it yet. No physical memory is used at this point. Then, as the runtime actually needs to use parts of that arena, it tells the OS to make regions usable in chunks of about 4MB. And even then, the OS doesn’t allocate real memory until your program actually writes to those addresses — the physical memory shows up on demand, one OS page at a time, completely transparently. So the real cost is gradual: reserve the space (basically free), commit it in 4MB chunks as needed (one system call each), and let the OS fill in physical memory behind the scenes as it’s used. This is another reason the allocator is so fast — once memory is committed, everything the allocator does happens without talking to the OS at all.
But a 64MB block is way too big to hand out directly. If your program asks for 32 bytes, you don’t want to give it an entire arena. So each arena is divided into pages of 8KB (8192 bytes). This is Go’s own page size — not the same as the OS page size, which is typically 4KB.
Pages are the basic unit the allocator works with internally. When it needs to satisfy an allocation, it works in terms of pages — how many pages to grab, which pages are free, which are in use. An arena of 64MB contains 8192 pages (64MB / 8KB), and the runtime tracks the state of each one.
But 8KB is still too big for most allocations. You don’t need a whole page for a 32-byte struct. That’s where spans come in.
Spans: Where Objects Live
A span is one or more contiguous pages that are dedicated to holding objects of a single size. This is the level where the allocator actually gives memory to your program.
Let’s make this concrete. Say your program needs a bunch of 32-byte objects. The allocator will take one page (8KB), turn it into a span for 32-byte objects, and divide it into 256 slots (8192 / 32 = 256). Each slot can hold exactly one object. When you allocate a 32-byte object, the allocator just finds the next free slot in that span and returns it. When you need another one, it grabs the next free slot. Fast and simple.
This works because every slot in a span is the same size. There’s no need to search for a block that fits, no fragmentation within the span, no merging of adjacent free blocks. Just find a free slot and use it. And to find that free slot, each span keeps a bitmap called allocBits — one bit per slot, where 1 means “in use” and 0 means “free”. Finding the next free slot is just scanning for the next 0 bit. The span also tracks where it starts in memory, how many pages it covers, how many slots it has, and how many are currently allocated. There’s also a second bitmap called gcmarkBits that the garbage collector uses — but we’ll get to that later. All of this metadata is part of the span structure itself — a separate object that the runtime allocates to manage the span — not stored inside the pages that hold your objects. So the full 8KB of a page (or however many pages the span covers) is available for object slots.
Size Classes
Now, if each span only holds one size of object, we need different spans for different sizes. But the allocator can’t create a span for every possible byte count — that would be unmanageable. Instead, Go defines 68 size classes ranging from 8 bytes to 32KB. When you allocate, say, 20 bytes, Go rounds it up to the nearest size class (24 bytes in this case) and uses a span for that class. We lose a few bytes to rounding, but the simplicity and speed are worth it.
Here are some examples:
| Class | Object Size | Span Size (Pages) | Objects per Span |
|---|---|---|---|
| 1 | 8 B | 8 KB (1 page) | 1024 |
| 4 | 32 B | 8 KB (1 page) | 256 |
| 10 | 128 B | 8 KB (1 page) | 64 |
| 32 | 1024 B | 8 KB (1 page) | 8 |
| 41 | 3072 B | 24 KB (3 pages) | 8 |
| 46 | 5376 B | 16 KB (2 pages) | 3 |
| 51 | 8192 B | 8 KB (1 page) | 1 |
| 60 | 18432 B | 72 KB (9 pages) | 4 |
| 65 | 27264 B | 80 KB (10 pages) | 3 |
| 67 | 32768 B | 32 KB (4 pages) | 1 |
If you look at the table, the page counts might seem a bit random — why does an 18KB object need 9 pages, while an 8KB object only needs 1? The core rule is simple: start with 1 page and keep adding pages until the wasted space at the end (the leftover bytes that can’t fit another object) is at most 12.5% of the span. For small objects like 32 bytes, one page fits 256 objects with zero leftover — perfect, no need for more pages. For mid-range sizes like 3KB or 5KB, a single page would leave too much unusable space at the end, so the span grows to 2, 3, or even up to 10 pages to bring the waste down. (The generator that produces this table, mksizeclasses.go, also nudges some object sizes upward so they fill their span exactly — which is why several classes end with zero leftover, and why a few use more pages than the waste rule alone would suggest.)
You might notice that some classes only hold 1 object per span — like class 51 (8KB) or class 67 (32KB). That means after a single allocation, the span is full. Wouldn’t it be better to use more pages so the span holds more objects? Not necessarily. A bigger span means more memory sitting reserved even if you only need one or two objects of that size. For small objects like 32 bytes, where programs typically allocate hundreds at a time, packing 256 into a span makes sense. But for larger objects, most programs only need a few, so keeping the span small avoids wasting memory.
So the size class system handles the common range of 8 bytes to 32KB. But not everything fits neatly into that range.
The Edges: Large Objects and Tiny Objects
You might have noticed something odd: we said there are 68 size classes, but the table goes from 1 to 67 — that’s only 67. Where’s the missing one? That’s size class 0, reserved for objects larger than 32KB. Unlike the other classes, it doesn’t have a fixed object size or span size. Instead, a class 0 span gets exactly the number of pages the object needs, with no slot subdivision — one object, one span.
On the other end, really tiny objects like a bool or an int8 hit a different problem. The smallest size class is 8 bytes, so even a 1-byte value would get an 8-byte slot — a lot of waste for something so small. To deal with this, Go has a tiny allocator that packs multiple tiny objects (smaller than 16 bytes, without pointers) into a single 16-byte block. So several booleans or small integers can share one slot instead of each getting their own. This dramatically reduces waste for programs that allocate lots of small values.
We’ve seen how spans are organized by size — but size isn’t the only thing the allocator cares about.
Span Classes: Size + Pointers
The allocator doesn’t just care about object size — it also cares about whether the objects contain pointers. Why? Because the garbage collector needs to scan objects with pointers to follow references and find live data. Objects without pointers (like a [100]byte or a struct of only integers) can be safely skipped during GC — there’s nothing to follow.
So Go keeps separate spans for each: one for objects that need scanning (scan) and one for objects that don’t (noscan). The combination of a size class and the scan/noscan flag is called a span class. With 68 size classes and 2 variants each, that gives 136 span classes in total.
With all of that, the full picture looks like this:

The allocator asks the OS for arenas, divides them into pages, groups pages into spans, and divides spans into fixed-size slots. Each level makes the next one manageable. But there’s still a big problem we haven’t talked about: Go programs run many goroutines at once, and they all need to allocate memory. How do you keep all of this organized without turning the allocator into a bottleneck?
The Locking Problem
So far we’ve seen how memory is organized — arenas, pages, spans, slots. But there’s a critical question we haven’t addressed: what happens when multiple goroutines try to allocate memory at the same time?
Imagine you have a single global list of spans. Every time any goroutine needs memory, it has to acquire a lock, find a free slot, and release the lock. With thousands of goroutines running concurrently, that lock becomes a bottleneck — goroutines spend more time waiting for each other than actually doing useful work.
This is the locking problem, and solving it is one of the most important aspects of the Go allocator’s design. The solution is a three-level hierarchy where each level has a different scope and different locking behavior:
Level 1: mcache (per-P, no locks)
Remember from the bootstrap article that Go’s scheduler has a fixed number of Ps (processors), typically one per CPU core. Each P has its own mcache — a private collection of spans, one for each span class. When a goroutine running on a P needs to allocate, it grabs a slot from its P’s mcache. Since only one goroutine runs on a P at a time, no lock is needed. This is the fast path, and it handles the vast majority of allocations.
Level 2: mcentral (per span class, brief locks)
When an mcache’s span for a particular span class is full, it needs a new one. That’s where mcentral comes in. There’s one mcentral for each of the 136 span classes, and it holds a shared pool of spans. The mcache returns its full span to mcentral and grabs a new one with free slots. This requires a lock, but it’s brief — just swapping one span for another. And since each span class has its own mcentral, goroutines allocating different sizes or scan/noscan variants don’t compete with each other.
Level 3: mheap (global, expensive locks)
When an mcentral has no more spans to hand out, it asks the mheap for fresh pages to create a new span. The mheap is the global page allocator — there’s only one, and accessing it requires a global lock. This is the slow path. It involves searching for free pages, potentially asking the OS for a new arena, and initializing a new span. But it happens rarely, because the levels above absorb most of the demand.
This is also where the large objects (>32KB) we mentioned earlier end up — they skip mcache and mcentral and go straight to mheap.
The whole design works like a chain of caches:

Each level serves as a cache for the one below it. The fast path is lock-free, the medium path uses fine-grained locks, and the slow path is rare enough that its cost doesn’t matter in practice. This is based on an approach called tcmalloc (Thread-Caching Malloc), originally designed by Google for C/C++ programs, but adapted for Go’s specific needs.
Now that we understand the structure and the hierarchy, let’s walk through what actually happens step by step.
The Allocation Flow
Everything goes through a single function in the runtime: mallocgc() in src/runtime/malloc.go.
The first thing it does is check the size of the allocation. Depending on how big the object is, it takes a very different path. Let’s start with the simplest case.
Zero-Sized Allocations
A fun edge case: what if you allocate something with zero size, like struct{}{}? Go doesn’t bother allocating anything — all zero-byte allocations return a pointer to the same global variable (zerobase). This is safe because you can never actually read or write through a zero-sized object.
Now for the real allocations, starting with the smallest ones.
Tiny Objects
For tiny objects without pointers — things like bool, int8, or small pointer-free structs — the allocator uses the tiny allocator we mentioned earlier. The mcache keeps track of a current tiny block (just a regular 16-byte slot from the size class 2 span — nothing special about it) and an offset that marks how much of that block has been used so far.
When a tiny allocation comes in, the allocator first rounds the size up for proper alignment (to 2, 4, or 8 bytes depending on the object size), then checks if the current tiny block has enough room from the current offset to the end. If it fits, the allocator returns a pointer to block + offset and advances the offset. So a 1-byte bool followed by a 1-byte int8 would be packed next to each other inside the same 16-byte block.
When the current tiny block doesn’t have enough room, the allocator grabs a new 16-byte slot from the mcache’s size class 2 span (using the normal small object path) and places the object at the beginning of it. But here’s a subtle detail: the allocator doesn’t blindly switch to the new block. It compares how much free space the old block has left versus how much the new block has left (which is 16 minus the object size just placed). Whichever block has more remaining space becomes the current tiny block. This minimizes waste — the allocator always prefers the block with the most room for future tiny allocations. Either way, the object you asked for is returned from the new slot; it’s just a question of which block stays “current” for the next tiny allocation.
Because the tiny allocator captures all pointer-free objects under 16 bytes, it ends up absorbing most of what you’d expect to land in the smallest size classes. In practice, the 8-byte size class (class 1) is exclusively used for 8-byte values that do contain pointers — like a *int, a map, or a chan, all of which are a single pointer under the hood. An int64, despite being 8 bytes, goes through the tiny allocator instead.
Once the object is 16 bytes or larger (or contains pointers), the tiny allocator doesn’t apply and we enter the main allocation path.
Small Objects (16B to 32KB)
This is the most common path, and the one the whole hierarchy is optimized for. Here’s how it flows:
- Round up to the nearest size class and determine the span class (size + scan/noscan).
- Check the mcache: look at the span for that span class and find the next free slot using a bitmap. If there’s a free slot, return it. Done — no locks, just some bit manipulation.
- If the span is full, the mcache returns it to the mcentral and asks for a new span with free slots. The mcentral first looks for spans it already has. If it finds one that hasn’t been swept yet by the garbage collector, it sweeps it first and then hands it over.
- If mcentral has nothing, it asks the mheap to allocate fresh pages and create a new span.
- If mheap doesn’t have enough free pages, it requests a new arena from the operating system.
Most allocations stop at step 2. Steps 3-5 happen progressively less often, which is why the system performs well.
And finally, the big ones.
Large Objects (> 32KB)
As we covered earlier, these skip mcache and mcentral entirely and go straight to the mheap, which allocates exactly the pages needed.
We’ve seen how memory gets allocated, but what about the other direction — how does memory get freed?
Garbage Collection Integration
The memory allocator doesn’t work alone — it’s tightly connected to the garbage collector. We’ll explore the garbage collector in detail in a future article, but it’s worth understanding the basics of how they interact, because it affects how the allocator behaves.
The garbage collector’s job is to figure out which objects on the heap are still in use and which are garbage. It does this by walking the object graph — starting from known roots (global variables, stack variables, etc.) and following pointers to find everything that’s reachable. Anything it can’t reach is dead and can be freed.
This is where the two bitmaps in each span come in. Every span has an allocBits bitmap (which slots are allocated) and a gcmarkBits bitmap (which slots the GC found to be live). During a GC cycle, the collector marks live objects in gcmarkBits. When marking is done, the runtime swaps the two bitmaps — so allocBits now reflects only the live objects, and everything that wasn’t marked is effectively freed. The allocator can then reuse those slots.
This also explains something you might have noticed in the allocation flow: when the mcentral hands a span to an mcache, it sometimes needs to sweep it first. Sweeping is the process of looking at a span’s bitmaps and figuring out which slots are free after a GC cycle. The allocator does this lazily — it doesn’t sweep all spans at once, but rather sweeps them on demand as they’re needed for new allocations. This spreads the cost of sweeping across all allocations instead of doing it all in one big pause.
If a span ends up completely empty after sweeping (every object was garbage), its pages are returned to the mheap and can be reused for different span classes.
So the GC figures out what’s dead, and the allocator reclaims those slots. But there’s one more question: does any of that memory ever go back to the operating system?
Memory Freeing and Scavenging
When the garbage collector frees objects, they don’t go back to the operating system. The slots just become available again in their span, ready for the next allocation. The pages stay with the runtime — from the OS’s perspective, your program is still using all that memory.
But what if your program had a big spike of activity, allocated a lot of memory, and now most of it is garbage? You’d have a bunch of free pages sitting in the mheap doing nothing, while the OS thinks your program is still using all of it.
That’s where the scavenger comes in. It’s a background goroutine that periodically looks for free pages that haven’t been used in a while and tells the OS it can reclaim them. The pages stay mapped in your program’s address space (so the runtime can reuse them later without a new system call), but the OS knows it can take back the physical memory behind them. On Linux, this is done with MADV_DONTNEED — a hint that says “I don’t need this memory right now, feel free to use it elsewhere.”
It’s a balancing act. Returning memory too eagerly would hurt performance — if the program needs that memory again soon, it’ll have to fault it back in. But holding onto too much unused memory wastes system resources. The scavenger tries to find the right balance.
Summary
Let’s recap what we’ve covered. At compile time, escape analysis decides which values need to live on the heap. At runtime, the memory allocator is the one that actually manages that heap memory. Instead of asking the OS every time, the runtime grabs large 64MB arenas upfront and subdivides them into 8KB pages. Pages are grouped into spans, where each span holds fixed-size slots for objects of a single size — one of 68 size classes ranging from 8 bytes to 32KB. The scan/noscan distinction doubles that to 136 span classes, so the garbage collector can skip objects without pointers.
To avoid lock contention, the allocator uses a three-level hierarchy: the mcache (per-P, lock-free) handles most allocations, the mcentral (per span class, brief locks) refills mcaches with fresh spans, and the mheap (global) allocates pages when everything else is exhausted. Tiny objects get packed together, large objects bypass the hierarchy entirely.
The allocator works hand-in-hand with the garbage collector through dual bitmaps on each span, and the scavenger makes sure unused memory eventually gets returned to the OS.
If you want to explore the implementation yourself, the runtime source in src/runtime/malloc.go, mheap.go, mcache.go, and mcentral.go is well-commented and surprisingly readable.
In the next article, we’ll look at the scheduler — the part of the runtime that decides which goroutine runs where, and how it multiplexes thousands of goroutines onto a handful of OS threads.