In the previous article we explored the Go scheduler — how goroutines get multiplexed onto OS threads, the GMP model, and all the tricks the runtime uses to keep your cores busy. But there’s a fundamental problem we haven’t addressed yet: all those goroutines allocate memory, and somebody has to clean it up. That’s the garbage collector’s job, and that’s what we’re exploring today.
In this article we’ll be looking at the garbage collector as it works in Go 1.26, which introduced the GreenTea GC. If you’re using an earlier version of Go, don’t worry — the overall structure is the same. The main difference is in the mark phase, and we’ll briefly explain how the older approach differs whenever we reach that point. If you want to hear more about GreenTea, Michael Knyszek gave a great talk about it at GopherCon 2025.
Go’s garbage collector is a non-moving, concurrent, tri-color, mark-and-sweep collector. That’s a lot of adjectives — but what do they all mean? Let’s break them down one by one.
Non-moving means the GC never relocates objects in memory. Once an object is allocated at a particular address, it stays there for its entire lifetime. This is a big deal — it means pointers remain valid, which makes things like unsafe.Pointer and cgo interop much simpler. Some garbage collectors (like Java’s G1 or ZGC) do move objects around to compact memory and reduce fragmentation, but Go takes a different approach: it relies on its size-class-based allocator to minimize fragmentation without needing to move anything.
Concurrent means the GC does most of its work while your program keeps running. It doesn’t stop the world (pause all goroutines so only the GC runs) for the entire collection cycle — only for two very brief pauses. The rest of the time, GC goroutines run alongside your application goroutines, sharing CPU time.
Tri-color refers to the algorithm used during the mark phase. Every object is conceptually colored white, grey, or black — we’ll get into the details shortly.
Mark-and-sweep describes the two main operations: first, mark all the objects that are still reachable (alive); then, sweep through the heap and reclaim anything that wasn’t marked (garbage). Simple in concept, tricky in practice — especially when your program is mutating pointers at the same time the GC is trying to trace them.
Note: This article references concepts from Go’s memory allocator — particularly spans, size classes, and the mcache/mcentral/mheap hierarchy. It’s not strictly required reading, but I’d encourage you to check out The Memory Allocator first.
Now that we know what kind of collector Go uses, let’s see how it actually works. The whole process is organized into a cycle of four phases.
The Four Phases of Garbage Collection
The garbage collector runs in four phases. Let’s walk through a full GC cycle as if we’re watching it happen in real time.
It all starts with sweep termination. As we’ll see later, sweeping is done lazily — objects aren’t all freed at once, but rather reclaimed on demand as the allocator needs memory. This means that by the time a new GC cycle is triggered, there might still be work left over from the previous cycle that hasn’t been swept yet. So the runtime briefly stops the world to finish any remaining sweeps and make sure the heap is in a clean state. Once that’s done, it prepares the data structures for the new mark phase and enables the write barrier — so the GC can track any pointer mutations that happen while it works.
Then comes the mark phase, which is where the real work happens. The world starts again, and the GC runs concurrently with your program. It scans the roots — goroutine stacks, global variables, and other starting points — and traces the entire object graph from there. The GC aims to use about 25% of the available CPU during this phase, running dedicated GC goroutines alongside your application.
When all reachable objects have been traced, the runtime stops the world again for mark termination and sweep preparation. This short pause disables the write barrier, swaps the mark bitmaps, and prepares everything for sweeping.
Finally, the sweep phase kicks in — again concurrently. The GC walks through the heap, freeing any objects that weren’t marked as reachable. This phase actually overlaps with the beginning of normal program execution, and can even overlap with the start of the next GC cycle.
We mentioned that the first thing the GC does after stopping the world is enable the write barrier. But what exactly is a write barrier, and why does the GC need one?
The Write Barrier
Here’s the problem: the GC is tracing the object graph while your program is still running. And critically, the GC only scans each object once — once it has looked inside an object and found all the pointers it contains, it marks that object as “already scanned” and moves on. It will never come back to check it again.
Now imagine you have a User struct with a Team pointer. The GC scans that User, sees it points to Team A, notes Team A as alive, and marks the User as already scanned. But then your program runs and changes the User’s Team field to point to Team B — a team that the GC hasn’t visited yet. Since the GC already scanned the User and won’t look at it again, it has no idea that Team B is now reachable. As far as the GC is concerned, nothing points to Team B, so it would collect it as garbage — even though your User is actively referencing it. That would be catastrophic.
The write barrier is the mechanism that prevents this. Every time your program writes a pointer value to memory during a GC cycle, the write barrier intercepts that write and ensures that the pointers being changed are passed through the marking process just like any other pointer — so the GC won’t miss them.
Go uses a hybrid Yuasa-Dijkstra write barrier (you can find the implementation in src/runtime/mbarrier.go). When a pointer write happens, the barrier marks both the old target (the object being pointed to before the write) and the new target (the object being pointed to after the write) as needing to be scanned. By marking both, the GC ensures it won’t miss any objects regardless of the order in which mutations and scanning happen. This is what allows the GC to run concurrently without missing reachable objects.
The write barrier adds a small overhead to every pointer write during a GC cycle — but it’s the price we pay for concurrent collection. Without it, we’d need to stop the world for the entire mark phase.
With the write barrier in place, the GC can safely start tracing the object graph while your program keeps running. Let’s look at the algorithm it uses.
The Tri-Color Marking Algorithm
The GC uses a tri-color marking algorithm. Every object in the heap is conceptually assigned one of three colors:
- White: Not yet visited — potentially garbage
- Grey: Known to be alive, but not yet scanned for pointers — it’s in the work queue
- Black: Alive and fully scanned — all its references have been traced
The algorithm starts by coloring the roots grey — these are the objects that are definitely alive because something is directly referencing them (we’ll see exactly what these roots are and how the GC finds them in a moment). Then it repeatedly picks a grey object, scans it for pointers to other objects, marks any white targets as grey, and marks the scanned object as black. When no grey objects remain, marking is complete — every object still white is garbage.
The critical invariant is that no black object can ever point to a white object. The write barrier we just discussed is what maintains this invariant while your program runs concurrently with the GC.
Now let’s see how the mark phase actually plays out in practice, starting from the very beginning.
The Mark Phase
The mark phase is the most complex and interesting part of the garbage collector. It runs concurrently with your program, using dedicated GC goroutines that aim to consume about 25% of the available CPU.
So the mark phase has started and the GC is ready to trace the object graph. But where does it begin?
Finding the Roots
Before the GC can trace the object graph, it needs to find the starting points — the objects that are definitely alive. These are the roots, and they come from a few places:
- Goroutine stacks: Every goroutine has a stack, and any pointer on that stack that points outside the stack is a root. The GC iterates over a snapshot of allgs — the global list of all goroutines that we saw in the scheduler article — and scans each alive goroutine’s stack to find all pointer values. This can require briefly pausing each goroutine (one at a time, not all at once) to get a consistent snapshot of its stack — though goroutines that are already stopped (e.g., blocked on a channel or waiting for I/O) can be scanned immediately without pausing them.
- Global variables: Package-level variables that contain pointers are roots. The linker generates bitmaps for the .data and .bss sections of the binary — these are the regions of memory where initialized and uninitialized global variables live, respectively — that tell the GC exactly which words in those sections are pointers, so it can scan globals efficiently without guessing. If you’re curious about how Go binaries are structured, check out this talk about how the Go binary is shaped under the hood.
- Finalizers and cleanups: Go allows you to attach cleanup functions to objects — via runtime.SetFinalizer or the newer runtime.AddCleanup (introduced in Go 1.24). For example, you might use these to close a file descriptor or release a C resource when the object is no longer needed. The GC tracks these registrations and treats the associated objects as roots, keeping them alive until their finalizer or cleanup function has had a chance to run.
Each heap object discovered through these roots gets added to the GC’s work queue, ready to be scanned. From these starting points, the GC will trace through the entire reachable object graph.
But where do all these discovered objects go? The GC needs a place to keep track of the work it still has to do.
The Work Queue
As the GC discovers objects that need scanning — starting with the roots, and then every object those roots point to, and so on — it needs somewhere to keep track of all the pending work. Think of it like a to-do list: every time the GC finds a new object, it adds “scan this object” to the list, and GC workers keep pulling items off until the list is empty.
Each P (processor) gets its own local work queue — the gcWork struct. Having per-P queues means GC workers mostly operate on their own local data without needing locks, which is important when multiple threads are marking concurrently.
There are two kinds of items in the queue. For very large or very small objects, the GC queues individual object pointers — a straightforward “scan this one object” entry.
But for most small objects (the 16-512 byte range that makes up the bulk of typical Go allocations), the GC does something smarter: instead of queuing each object separately, it queues the entire span (a contiguous block of memory holding objects of the same size class) that contains it. When the GC gets around to processing that span, it scans all the marked objects in it at once. Since those objects live next to each other in memory, this is much more CPU cache-friendly than jumping around the heap scanning one object here, another there.
The span queue uses a FIFO (first-in, first-out) policy, and this is a deliberate choice. A LIFO stack would process spans immediately after they’re queued, but FIFO lets spans sit in the queue for a while. Why is that good? Because while a span is waiting, more objects in that same span might get discovered and marked. By the time the GC gets around to scanning it, there might be a whole batch of objects ready to process together — better cache locality, less overhead per object, and the opportunity to use SIMD instructions for really dense spans.
Pre-1.26 difference: Before Go 1.26, the GC only had individual object queues. The span-based queue is the main innovation of the GreenTea garbage collector introduced in 1.26 — it’s what enables batch scanning and all the performance gains that come with it.
Now that we know where work is tracked, let’s look at how the GC decides whether an object is marked or not.
How Mark Bits Are Stored
The GC needs to track two things about each object: has it been discovered (we found a pointer to it), and has it been scanned (we looked inside it for more pointers). It needs somewhere to store this information — and where it puts it matters a lot for performance.
For eligible small objects (roughly in the 16-512 byte range), the GC stores this metadata as inline mark bits — a small structure tucked at the very end of the span itself, right next to the objects it describes. This is great for CPU cache locality: when the GC is scanning objects in a span, the mark metadata is already sitting in the same region of memory, so it’s likely already in the cache.
Pre-1.26 difference: Before Go 1.26, all objects used a separate gcmarkBits bitmap stored elsewhere in the mspan struct. The inline approach described here is one of the main changes introduced in Go 1.26.
The inline mark bits structure (defined in src/runtime/mgcmark_greenteagc.go) contains two bitmaps — one called marks and one called scans. Why two? Because they track different things, and keeping them separate is what makes deferred scanning possible.
When the GC discovers a pointer to an object, it sets that object’s bit in the marks bitmap — “we know this object exists and is alive.” But it doesn’t scan the object right away. Instead, it queues the span (as we saw in the work queue section) and moves on. Later, when the GC gets around to processing that span, it looks at which objects are marked but not yet scanned — that’s just marks minus scans. It scans those objects, then updates the scans bitmap to record that they’ve been processed.
This two-bitmap approach means the GC can accumulate many marked objects in a span before scanning any of them. And if new objects in the same span get discovered while it’s sitting in the queue, they’ll be included in the next batch — no need to re-queue the span.
Objects outside the 16-512 byte range — very small objects and large ones — don’t use inline mark bits. They fall back to a traditional separate bitmap stored in the mspan struct, which works the same way conceptually but without the cache locality benefits.
We know how the GC tracks which objects are marked and scanned. But when it actually looks inside an object, how does it know which bytes are pointers and which are just data?
How the GC Knows Where Pointers Are
When the GC scans an object, it needs to know which words within that object are pointers and which are just regular data (integers, floats, strings of bytes). It can’t just treat every 8-byte word as a pointer — that would produce false positives and keep dead objects alive.
The answer is the pointer bitmap (also called heap bits). The compiler knows the layout of every struct, and it encodes which fields are pointers into the type’s GC metadata at compile time. How that metadata gets used at scan time depends on the object size:
For small objects (16 to 512 bytes), the runtime copies the type’s pointer bitmap into the end of the span when the object is allocated. This means the bitmap is physically right next to the objects it describes — great for CPU cache performance. When the GC scans one of these objects, it reads the bitmap directly from the span.
For larger objects, there is no stored bitmap at all. Instead, the GC reads the type’s GC metadata directly and “tiles” it across the object on the fly — repeating the type’s pointer pattern as many times as needed to cover the full object. This avoids wasting memory on bitmaps for large allocations that might have very repetitive layouts (think a big array of structs).
In both cases, the GC iterates over only the words marked as pointers. For each pointer it finds, it tries to defer the target to span-based scanning.
Now we have all the pieces — work queues, mark bits, pointer bitmaps. Let’s put them together and see what happens end-to-end when the GC encounters a pointer.
Marking and Scanning a Span
What actually happens when the GC finds a pointer to an object? Let’s walk through the process. The key function is tryDeferToSpanScan().
First, the GC checks whether the span uses inline mark bits — a quick bitmap lookup. If it doesn’t (because the objects are too large or too small), the GC falls back to scanning the object individually. If it does, the GC computes which object within the span this pointer refers to (simple arithmetic, since all objects in a span are the same size) and atomically sets its mark bit.
There’s a nice shortcut for objects that contain no pointers at all (like a [256]byte or a struct with only integers). These live in noscan spans — the GC just marks them and moves on, since there’s nothing inside to scan.
For objects that do contain pointers, the GC needs to queue the span for later scanning. But here’s a concurrency challenge: multiple GC goroutines might discover different objects in the same span at the same time. We don’t want the same span queued multiple times — that would be wasteful.
The GC handles this with a simple ownership protocol. Each span tracks whether it’s unowned, has one mark, or has many marks. The first thread to mark an object in an unowned span acquires it and queues it. Any subsequent thread that marks another object in the same span just sets the mark bit and moves on — the span is already in the queue, and when it gets scanned, all the accumulated marks will be picked up together. Once scanning is done, ownership is released, and the cycle can repeat if new objects get discovered.
This ownership state also enables a fast path: if only one object was marked, the GC scans just that one directly. If many objects accumulated, it compares the marks and scans bitmaps to figure out exactly which objects still need scanning.
Pre-1.26 difference: Before Go 1.26, discovering a pointer would queue the individual object pointer into a LIFO work buffer. The span-based approach batches objects that are spatially close together, improving CPU cache locality and enabling SIMD optimizations.
We’ve seen how individual spans get marked and scanned. But who orchestrates all of this? And how does the GC decide what to work on next?
The Work Loop and Scanning Strategy
All of this marking and scanning is driven by a work loop — the gcDrain() function — that each GC goroutine runs continuously until there’s no work left.
The loop checks for work in a specific priority order: first local objects, then local spans (both are fast because they don’t need synchronization), then the global object queue, then global spans. If all of those are empty, it flushes the write barrier buffer (which might produce new work) and retries objects and spans. As a last resort, it steals work from other Ps — just like the scheduler does for goroutines. The GC iterates through other Ps in random order and takes about half of a victim’s available spans, refilling its own local queue.
When a span finally comes off the queue, the GC has to decide how to scan it. If only one object was marked, it just scans that one directly — simple and fast. If many objects accumulated, the GC picks a strategy based on density:
If the span is sparse (less than 12.5% of objects are marked), the GC walks through the marked objects one by one. When most slots are empty, this avoids wasting time on blank memory.
If the span is dense (12.5% or more objects are marked), the GC switches to SIMD-optimized scanning using AVX-512 instructions on x86-64. This processes multiple objects in parallel — entire cache lines at once — which can be 4-8× faster than scanning one object at a time. On platforms without AVX-512 support, the dense path isn’t available — the GC always takes the sparse path, scanning marked objects one by one.
This is exactly why the FIFO queue and deferred scanning pay off. By letting spans accumulate marks before scanning, the GC benefits from better cache locality (even on the sparse path, the objects are close together in memory) and is more likely to hit that density threshold where SIMD kicks in. For each pointer found during scanning, the whole process repeats — mark the target, queue the span, keep going — until no more grey objects remain anywhere.
The dedicated GC goroutines do most of the marking work. But sometimes they can’t keep up — what happens then?
Mark Assist
We said the GC aims to use about 25% of the CPU for its dedicated marking goroutines. But what happens if your program is allocating memory faster than the GC can mark it? If the GC falls behind, the heap would grow without bound before the cycle finishes.
This is where mark assist kicks in. When a goroutine tries to allocate memory during a GC cycle, the runtime checks whether the GC is keeping up. If it’s falling behind, the allocating goroutine is drafted into helping — before it gets its memory, it has to do some marking work first. The amount of work is proportional to how much memory the goroutine is allocating, so heavy allocators contribute more.
This creates a natural backpressure mechanism: the faster your program allocates, the more its goroutines get pulled into mark assist, which slows down allocation and gives the GC time to catch up. From the outside, this shows up as increased latency on allocations during GC — your goroutine might block for a bit doing mark work before getting the memory it asked for.
You can observe mark assist in action using Go’s execution tracer (runtime/trace). If you see your goroutines spending time in runtime.gcAssistAlloc, that’s mark assist — the GC is asking your goroutines for help because they’re allocating faster than the dedicated GC workers can keep up.
That was a lot of detail — let’s zoom out and see how all the pieces fit together.
Mark Phase Recap
Let’s take a step back and see the whole mark phase as one flow:
- The GC identifies roots — scanning goroutine stacks (via allgs), global variables (using linker-generated bitmaps for .data and .bss), and finalizer/cleanup registrations.
- Each root pointer goes into the work queue — either as an individual object or as a span entry.
- The work loop pulls items from the queue. For spans, the GC sets mark bits in the inline mark bitmap and defers scanning until more objects in the same span accumulate.
- When a span is dequeued for scanning, the GC uses the pointer bitmap to find which words are pointers, and picks a scanning strategy — one-by-one for sparse spans, or SIMD-accelerated for dense ones.
- Any new pointers found during scanning go back to step 2, continuing the cycle.
- Meanwhile, the write barrier intercepts pointer mutations from the application, feeding changed pointers back into the marking process so they’re not missed.
- If the GC falls behind, mark assist drafts allocating goroutines into doing marking work, creating backpressure.
- When no work remains across any P’s queues — local, global, and nothing left to steal — the mark phase is complete.
Once all the grey objects have been processed and no more work remains, the mark phase is done. The runtime stops the world one last time to wrap up marking and prepare for sweeping.
Mark Termination and Sweep Preparation
This is all one continuous stop-the-world pause. The runtime first verifies that no marking work remains — it flushes write barrier buffers one last time and confirms that all work queues are truly empty. It also flushes allocation counters from each P’s mcache so the pacer — the component that decides when to trigger the next GC cycle (we’ll talk more about it shortly) — has accurate numbers for the next cycle. Then the write barrier is disabled, since marking is fully complete and pointer writes no longer need to be intercepted. Finally, the runtime sets up the sweep state so the sweeper knows which spans to process, and restarts the world.
With the world running again, it’s time to reclaim the garbage. But the sweep phase doesn’t eagerly sweep the entire heap at once.
The Sweep Phase
The sweep phase runs concurrently with your program. Its job is to walk through the heap and reclaim any objects that weren’t marked as reachable. But what does “reclaim” actually mean at the span level? Let’s dig in.
Every span maintains two bitmaps: allocBits and gcmarkBits. The allocBits bitmap tracks which slots in the span currently hold an allocated object. The gcmarkBits bitmap — which was just populated during the mark phase — tracks which of those objects are still reachable.
When the sweeper processes a span, it first handles a crucial step for spans that used inline mark bits during the mark phase: merging the inline marks into the traditional gcmarkBits bitmap. The sweep phase works with gcmarkBits, not the inline ones, so the sweeper calls moveInlineMarks() to copy the inline marks over with a simple OR operation. There’s a nice safety check during this merge: the GC verifies that the marks and scans bitmaps are identical. If any object was marked but never scanned, something went seriously wrong — the GC missed a reachable object, and the runtime will panic rather than silently collecting live data.
With gcmarkBits up to date, the sweeper compares the two bitmaps. If a slot has its allocBits set (something was allocated there) but its gcmarkBits is not set (the GC didn’t find it reachable), that object is garbage. The sweeper then replaces allocBits with a copy of gcmarkBits — effectively, the set of “what’s allocated” becomes “what’s alive.” All the garbage slots simply disappear from allocBits, and those slots are now free for the allocator to reuse.
This is why Go’s GC is non-moving: the objects that survived don’t go anywhere. They stay exactly where they are, and the freed slots become available for future allocations within the same span. If all objects in a span are garbage — gcmarkBits is entirely zero — the whole span can be returned to the page allocator for reuse by a different size class or even returned to the OS.
Sweeping works lazily — spans are swept on demand, when the allocator needs them. The key guarantee is that a span is always swept before it can be used for allocation. When a goroutine asks for memory, the allocator must sweep any unswept spans it encounters before handing out slots, ensuring allocBits are properly reconciled with gcmarkBits first. To make this work, each P’s mcache is flushed after the world restarts — each P releases its cached spans back to the mcentral before its next allocation. This forces every P to re-acquire spans through the normal path, which guarantees they get swept first. The net effect is that sweeping cost gets spread across normal allocations rather than happening all at once.
Besides the allocator-driven sweeping, the runtime has a dedicated background sweeper goroutine (bgsweep) that is woken up at the end of mark termination. It works through unswept spans in batches, even when no allocation is happening, and parks itself when it’s done. It’s fine if it doesn’t get through everything — goroutines allocating memory will sweep what they need anyway.
Even with background sweeping, some spans might still be unswept by the time the next GC cycle triggers — so sweep termination exists at the beginning of each cycle to finish off any remaining work and ensure the heap is in a clean state before marking starts again.
We’ve seen the full GC cycle from start to finish — but we never addressed a fundamental question: what triggers a new cycle in the first place?
What Triggers a GC Cycle?
There are three ways a garbage collection cycle can start:
The most common one is automatic, driven by how much memory your program is using.
The GC Pacer
The GC pacer is the runtime’s built-in mechanism for deciding when to start the next cycle. Its goal is to start collecting early enough that it finishes before the heap grows too large, but late enough that it doesn’t waste CPU on unnecessary cycles. If you want a deep dive into how the pacer works, Madhav Jivrajani gave an excellent talk about it.
The pacer is controlled by two knobs:
- GOGC (default: 100): Sets the heap growth target as a percentage. A value of 100 means the GC will trigger when the heap has grown to roughly double the size of the live data from the previous cycle. Setting it to 200 allows the heap to triple before collecting, reducing GC frequency at the cost of more memory. Setting it to 50 triggers collection sooner, using less memory but spending more CPU on GC.
- GOMEMLIMIT: Sets an absolute memory limit for the Go heap. When the heap approaches this limit, the GC becomes more aggressive — triggering earlier and using more CPU to keep memory under the cap. This is useful in containerized environments where you have a hard memory budget.
The pacer runs as part of the allocation path, but it doesn’t check on every single object allocation — that would be too expensive. Instead, the trigger check happens when the allocator needs to grab a new span from the mcentral (meaning the current span ran out of slots). For large objects, which each get their own span, the check happens unconditionally on every allocation. In practice, this is frequent enough that the GC starts promptly when needed, while avoiding the overhead of checking on every tiny allocation.
But what if your program goes idle and stops allocating altogether? The pacer only checks during allocations, so it would never fire — even if the heap is full of garbage from the last burst of activity.
The System Monitor
This is where the system monitor (sysmon) comes in. We saw in the scheduler article that sysmon is a background thread that periodically checks on the health of the runtime. One of its jobs is to force a GC cycle if it’s been too long since the last one — typically if more than 2 minutes have passed without a collection. This ensures that even idle programs eventually clean up their heap.
There’s one more way — you can also take matters into your own hands.
Explicit Calls
You can trigger a GC cycle manually by calling runtime.GC(). This forces a full collection cycle regardless of heap size or timing. It’s occasionally useful for benchmarking or for situations where you know a large batch of objects just became unreachable and you want to reclaim the memory immediately — but in general, you should trust the pacer to do the right thing.
That covers everything — from what triggers a GC cycle to how it marks, terminates, and sweeps. Let’s wrap it all up.
Summary
Go’s garbage collector is a non-moving, concurrent, tri-color, mark-and-sweep collector that manages to do most of its work without stopping your program. The two stop-the-world pauses — one to set up the mark phase and one to finalize it — are extremely short.
The whole cycle starts when the GC pacer decides the heap has grown enough (or sysmon forces a collection, or someone calls runtime.GC()). The runtime stops the world briefly to finish any leftover sweeping, enables the write barrier, and kicks off the concurrent mark phase.
During marking, the GC finds the roots — goroutine stacks, global variables, and finalizer/cleanup registrations — and traces the entire object graph from there. It uses a work queue where each P has its own local queue of spans to process. Instead of scanning objects one by one, the GC batches them by span: it sets mark bits in the span’s inline bitmap and defers scanning until multiple objects accumulate. When a span is finally scanned, the GC picks a strategy based on density — scanning objects individually for sparse spans, or using AVX-512 instructions for dense ones on x86-64. The write barrier runs throughout this phase, catching any pointer mutations from the application and feeding them back into the marking process. And if the program is allocating faster than the GC can mark, mark assist drafts allocating goroutines into helping with the work.
Once marking is complete, the runtime stops the world one last time to verify no work remains, flush allocation counters for the pacer, disable the write barrier, and set up the sweep state. Then sweeping begins — lazily and concurrently. Each P flushes its mcache before its next allocation, forcing spans to be swept before reuse. As each span is swept, its inline mark bits are merged into the traditional bitmaps. A dedicated background sweeper goroutine helps clean up between allocations, and sweep termination at the start of the next cycle ensures any remaining stragglers are taken care of.
Want to dive deeper? Check out: