Notes from Optimizing CPU-Bound Go Hot Paths

Original link: https://blog.andr2i.com/posts/2026-05-03-notes-from-optimizing-cpu-bound-go-hot-paths


Go does a lot of things right, and I love Go because of that. But while porting Brotli to pure Go for go-brrr, I kept hitting the same pattern: idiomatic abstractions made hot paths slower, and the fastest version was often hand-duplicated and specialized.

Lack of zero-cost abstractions

In the hot loops I was optimizing, generics, polymorphic dispatch (via interfaces), and closures often prevented the compiler from producing the same code as the concrete version. The reason is that Go doesn't inline these calls in the shapes I was using (inlining problems will come up repeatedly in this post, because inlining matters a great deal). Yes, the compiler can sometimes inline a direct closure call or devirtualize an interface call, but in the patterns I actually ran into it didn't, and I ate the call overhead. It's clear why interface calls are not inlined: they allow swapping the implementation at runtime rather than at compile time. But generics allow swapping the implementation at compile time. If you are coming from languages like C++ or Rust, you'd expect generic functions to be monomorphized (all variants pre-generated as concrete functions at compile time), but in Go that doesn't happen, at least not in that form. Go uses an approach called GC shape stenciling, where some parts are pre-generated at compile time but method calls on type parameters end up going through interface-style dispatch (technically the itab is reached via a generics dictionary rather than an ordinary interface argument, but the effect on the hot path is the same). The inlining limitation is acknowledged in the proposal:

The one exception is that method calls won't be fully resolvable at compile time... inlining won't happen in situations where it could happen with a fully stenciled implementation.

So what do we do? Actually, it's not a big problem: we simply avoid abstractions like generics and duplicate concrete functions instead. We take a concrete function and copy it wholesale, changing only the parts we wanted to parametrize. Needless to say, this causes a lot of duplication. In the Brotli port there were 16 almost identical functions whose only difference was that they called different versions of the hash function. The 16 variants couldn't be collapsed into one through an abstraction because the function sits on a hot path.
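To make the pattern concrete, here is a minimal sketch of what two such hand-duplicated variants look like, reusing the Table type and constants from the deep-dive example below; the second hash expression is hypothetical and only there to show where the copies diverge.

func storeH5(t *Table, data []byte) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := (v * HashMul32) >> (32 - BucketBits) // H5 hash, written out by hand
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

// storeH6 is a byte-for-byte copy of storeH5 except for the hash line.
func storeH6(t *Table, data []byte) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := (v * HashMul32) >> (32 - BucketBits + 1) // hypothetical second hash
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}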

So the performance problem is solved by duplication, but this introduces a potentially big maintenance problem. It can be somewhat mitigated by code generation, of course, but you will very likely have many places with only 2-3 duplicated variants, which don't justify introducing codegen.
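When codegen is justified, a single template driven by go generate can produce all the variants. Here is a minimal sketch of what that could look like; the file names, generator layout, and second hash expression are all my own choices, not from the actual port:

// gen/main.go, invoked via a directive such as: //go:generate go run ./gen
package main

import (
	"os"
	"text/template"
)

// variant is one specialization: a name suffix plus the hash expression to
// substitute into the hot loop.
type variant struct {
	Name     string
	HashExpr string
}

var storeTmpl = template.Must(template.New("store").Parse(`
func Store{{.Name}}(t *Table, data []byte) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := {{.HashExpr}}
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}
`))

func main() {
	out, err := os.Create("store_generated.go")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	out.WriteString("// Code generated by gen/main.go. DO NOT EDIT.\n\npackage hashdemo\n\nimport \"encoding/binary\"\n")

	variants := []variant{
		{Name: "H5", HashExpr: "(v * HashMul32) >> (32 - BucketBits)"},
		{Name: "H6", HashExpr: "(v * HashMul32) >> (32 - BucketBits + 1)"}, // hypothetical
	}
	for _, v := range variants {
		if err := storeTmpl.Execute(out, v); err != nil {
			panic(err)
		}
	}
}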

The next section is a deep dive into benchmarking the concrete vs. generic vs. interface vs. closure approaches, with some exploration of the underlying assembly, which you can happily skip.

Deep dive

Let's illustrate all of the above with an example. Here is a function I used in the real codebase, reduced to strip out the unimportant stuff.

func StoreConcrete(t *Table, data []byte) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := (v * HashMul32) >> (32 - BucketBits)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

Now imagine that we need to use several versions of the hash function, and which one to use is known at compile time. There are several ways to parametrize this.

One option is to use generics:

type Hasher interface {
	Hash(v uint32) uint32
}

type H5Hasher struct{}
func (H5Hasher) Hash(v uint32) uint32 {
	return (v * HashMul32) >> (32 - BucketBits)
}

func StoreGeneric[H Hasher](t *Table, data []byte) {
	// ... same loop as StoreConcrete, except:
	key := h.Hash(v)
	// ...
}

Another option is to use polymorphic dispatch:

func StoreInterface(t *Table, data []byte, h Hasher) {
	// ... same loop, except:
	key := h.Hash(v)
	// ...
}

Another option is to pass a closure to the function:

func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) {
	// ... same loop, except:
	key := hash(v)
	// ...
}
Full code:
import "encoding/binary"

const HashMul32 = 0x1E35A7BD

const (
	BucketBits = 14
	BlockBits  = 4
	BucketSize = 1 << BucketBits
	BlockSize  = 1 << BlockBits
)

type Table struct {
	Num     [BucketSize]uint16
	Buckets [BucketSize * BlockSize]uint32
}

func StoreConcrete(t *Table, data []byte) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := (v * HashMul32) >> (32 - BucketBits)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

type Hasher interface {
	Hash(v uint32) uint32
}

type H5Hasher struct{}

func (H5Hasher) Hash(v uint32) uint32 {
	return (v * HashMul32) >> (32 - BucketBits)
}

func StoreGeneric[H Hasher](t *Table, data []byte) {
	var h H
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := h.Hash(v)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

func StoreInterface(t *Table, data []byte, h Hasher) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := h.Hash(v)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

func StoreClosure(t *Table, data []byte, hash func(uint32) uint32) {
	end := uint32(len(data))
	for i := uint32(0); i+4 <= end; i++ {
		v := binary.LittleEndian.Uint32(data[i:])
		key := hash(v)
		minor := uint32(t.Num[key]) & (BlockSize - 1)
		t.Buckets[minor+key<<BlockBits] = i
		t.Num[key]++
	}
}

Since the hash function is known at compile time, the compiler can produce optimal code, right? Wrong!

Let's benchmark it first.

Environment:

go version go1.26.2-X:nodwarf5 linux/amd64
goos: linux
goarch: amd64
pkg: hashdemo
cpu: 12th Gen Intel(R) Core(TM) i5-12500

Run with:

go test -bench=. -benchmem -count 6 -cpu 1 | tee bench.txt
benchstat -filter '.unit:B/s' -col .name bench.txt

The throughput numbers in the table below are the benchstat-reported values across 6 runs.

Full benchmarking code:
import "testing"

const benchSize = 1 << 16

func BenchmarkConcrete(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreConcrete(t, data)
	}
}

func BenchmarkGeneric(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreGeneric[H5Hasher](t, data)
	}
}

func BenchmarkInterface(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	var h Hasher = H5Hasher{}
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreInterface(t, data, h)
	}
}

func BenchmarkClosure(b *testing.B) {
	data := makeData(benchSize)
	t := new(Table)
	hash := func(v uint32) uint32 { return (v * HashMul32) >> (32 - BucketBits) }
	b.SetBytes(int64(len(data)))
	for b.Loop() {
		StoreClosure(t, data, hash)
	}
}

func makeData(n int) []byte {
	b := make([]byte, n)
	x := uint32(0xDEADBEEF)
	for i := range b {
		x = x*1664525 + 1013904223
		b[i] = byte(x >> 24)
	}
	return b
}
Variant     Throughput     Δ vs Concrete
Concrete    378.0 MiB/s    (baseline)
Generic     320.6 MiB/s    -15.18%
Closure     322.0 MiB/s    -14.82%
Interface   274.3 MiB/s    -27.44%

Whoa! That's a pretty dramatic difference.

Assembly related to the Concrete function
PUSHQ BP
MOVQ SP, BP
MOVQ BX, 0x18(SP)
XORL DX, DX
JMP 0x52e4ed
TESTB AL, 0(AX)
MOVL 0(BX)(DX*1), R9
IMULL $0x1e35a7bd, R9, R9
SHRL $0x12, R9
MOVZX 0(AX)(R9*2), R10
ANDL $0xf, R10
MOVL R9, R11
SHLL $0x4, R9
ADDL R10, R9
MOVL R8, 0x8000(AX)(R9*4)
MOVZX 0(AX)(R11*2), R9
INCL R9
MOVW R9, 0(AX)(R11*2)
LEAL 0x1(R8), DX
MOVQ SI, CX
LEAL 0x4(DX), SI
CMPL CX, SI
JB 0x52e514
CMPQ CX, DX
JB 0x52e51b
MOVQ CX, SI
SUBQ DX, CX
MOVL DX, R8
SUBQ DI, DX
SARQ $0x3f, DX
ANDQ R8, DX
CMPQ CX, $0x3
JA 0x52e4ad
JMP 0x52e516
POPQ BP
RET
CALL runtime.panicBounds(SB)
NOPL 0(AX)(AX*1)
CALL runtime.panicBounds(SB)
NOPL				
Assembly related to the Generic function
IMULL $0x1e35a7bd, AX, AX
SHRL $0x12, AX
RET

CMPQ SP, 0x10(R14)
JBE 0x52f0f0
PUSHQ BP
MOVQ SP, BP
SUBQ $0x10, SP
MOVQ AX, 0x20(SP)
MOVQ BX, 0x28(SP)
MOVQ CX, 0x30(SP)
MOVQ DI, 0x38(SP)
MOVQ SI, 0x40(SP)
XORL DX, DX
JMP 0x52f072
MOVZX 0(CX)(BX*2), R8
ANDL $0xf, R8
SHLL $0x4, AX
ADDL AX, R8
MOVL 0xc(SP), DX
MOVL DX, 0x8000(CX)(R8*4)
MOVZX 0(CX)(BX*2), R8
INCL R8
MOVW R8, 0(CX)(BX*2)
INCL DX
MOVQ 0x20(SP), AX
MOVQ 0x30(SP), CX
MOVQ 0x28(SP), BX
MOVQ 0x40(SP), SI
MOVQ 0x38(SP), DI
LEAL 0x4(DX), R8
CMPL DI, R8
JB 0x52f0cf
NOPL 0(AX)(AX*1)
CMPQ DI, DX
JB 0x52f0ea
SUBQ DX, DI
MOVL DX, R9
SUBQ SI, DX
SARQ $0x3f, DX
ANDQ R9, DX
CMPQ DI, $0x3
JBE 0x52f0e5
MOVL R9, 0xc(SP)
MOVQ 0(AX), BX
MOVL 0(CX)(DX*1), CX
MOVQ AX, DX
MOVL CX, AX
CALL BX
MOVQ 0x28(SP), CX
TESTB AL, 0(CX)
MOVL AX, BX
NOPW 0(AX)(AX*1)
NOPL
CMPQ BX, $0x4000
JB 0x52f02f
JMP 0x52f0d5
ADDQ $0x10, SP
POPQ BP
RET
MOVQ $0x4000, AX
NOPL 0(AX)
CALL runtime.panicBounds(SB)
CALL runtime.panicBounds(SB)
CALL runtime.panicBounds(SB)
NOPL
MOVQ AX, 0x8(SP)
MOVQ BX, 0x10(SP)
MOVQ CX, 0x18(SP)
MOVQ DI, 0x20(SP)
MOVQ SI, 0x28(SP)
CALL runtime.morestack_noctxt.abi0(SB)
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), BX
MOVQ 0x18(SP), CX
MOVQ 0x20(SP), DI
MOVQ 0x28(SP), SI
JMP hashdemo.StoreGeneric[go.shape.struct {}](SB)


PUSHQ BP
MOVQ SP, BP
TESTQ AX, AX
JE 0x52f154
IMULL $0x1e35a7bd, BX, AX
SHRL $0x12, AX

POPQ BP
RET
CALL runtime.panicwrap(SB)
NOPL
Assembly related to the Closure function
CMPQ SP, 0x10(R14)
JBE 0x52e779
PUSHQ BP
MOVQ SP, BP
SUBQ $0x10, SP
MOVQ AX, 0x20(SP)
MOVQ SI, 0x40(SP)
MOVQ CX, 0x30(SP)
MOVQ BX, 0x28(SP)
MOVQ DI, 0x38(SP)
XORL DX, DX
JMP 0x52e710
MOVZX 0(CX)(BX*2), R8
ANDL $0xf, R8
SHLL $0x4, AX
ADDL AX, R8
MOVL 0xc(SP), DX
MOVL DX, 0x8000(CX)(R8*4)
MOVZX 0(CX)(BX*2), R8
INCL R8
MOVW R8, 0(CX)(BX*2)
INCL DX
MOVQ CX, AX
MOVQ 0x30(SP), CX
MOVQ 0x28(SP), BX
MOVQ 0x40(SP), SI
MOVQ 0x38(SP), DI
LEAL 0x4(DX), R8
CMPL CX, R8
JB 0x52e75c
CMPQ CX, DX
JB 0x52e773
SUBQ DX, CX
MOVL DX, R9
SUBQ DI, DX
SARQ $0x3f, DX
ANDQ R9, DX
CMPQ CX, $0x3
JBE 0x52e76e
MOVL R9, 0xc(SP)
MOVQ 0(SI), CX
MOVL 0(BX)(DX*1), AX
MOVQ SI, DX
CALL CX
MOVQ 0x20(SP), CX
TESTB AL, 0(CX)
MOVL AX, BX
CMPQ BX, $0x4000
JB 0x52e6cf
JMP 0x52e762
ADDQ $0x10, SP
POPQ BP
RET
MOVQ $0x4000, AX
CALL runtime.panicBounds(SB)
CALL runtime.panicBounds(SB)
CALL runtime.panicBounds(SB)
NOPL
MOVQ AX, 0x8(SP)
MOVQ BX, 0x10(SP)
MOVQ CX, 0x18(SP)
MOVQ DI, 0x20(SP)
MOVQ SI, 0x28(SP)
CALL runtime.morestack_noctxt.abi0(SB)
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), BX
MOVQ 0x18(SP), CX
MOVQ 0x20(SP), DI
MOVQ 0x28(SP), SI
JMP hashdemo.StoreClosure(SB)

IMULL $0x1e35a7bd, AX, AX
SHRL $0x12, AX
RET
Assembly related to the Interface function
IMULL $0x1e35a7bd, AX, AX
SHRL $0x12, AX
RET

CMPQ SP, 0x10(R14)
JBE 0x52e650
PUSHQ BP
MOVQ SP, BP
SUBQ $0x18, SP
MOVQ AX, 0x28(SP)
MOVQ CX, 0x38(SP)
MOVQ BX, 0x30(SP)
MOVQ DI, 0x40(SP)
MOVQ R8, 0x50(SP)
MOVQ SI, 0x48(SP)
XORL DX, DX
JMP 0x52e5dd
MOVZX 0(CX)(DX*2), R9
ANDL $0xf, R9
SHLL $0x4, AX
ADDL AX, R9
MOVL 0x14(SP), R10
MOVL R10, 0x8000(CX)(R9*4)
MOVZX 0(CX)(DX*2), R9
INCL R9
MOVW R9, 0(CX)(DX*2)
LEAL 0x1(R10), DX
MOVQ CX, AX
MOVQ 0x38(SP), CX
MOVQ 0x30(SP), BX
MOVQ 0x48(SP), SI
MOVQ 0x40(SP), DI
MOVQ 0x50(SP), R8
LEAL 0x4(DX), R9
CMPL CX, R9
JB 0x52e62f
CMPQ CX, DX
JB 0x52e64a
SUBQ DX, CX
MOVL DX, R10
SUBQ DI, DX
SARQ $0x3f, DX
ANDQ R10, DX
NOPL 0(AX)(AX*1)
CMPQ CX, $0x3
JBE 0x52e645
MOVL R10, 0x14(SP)
MOVQ 0x18(SI), CX
MOVL 0(BX)(DX*1), BX
MOVQ R8, AX
CALL CX
MOVQ 0x28(SP), CX
TESTB AL, 0(CX)
MOVL AX, DX
CMPQ DX, $0x4000
JB 0x52e594
JMP 0x52e635
ADDQ $0x18, SP
POPQ BP
RET
MOVQ $0x4000, AX
NOPL 0(AX)
CALL runtime.panicBounds(SB)
CALL runtime.panicBounds(SB)
CALL runtime.panicBounds(SB)
NOPL
MOVQ AX, 0x8(SP)
MOVQ BX, 0x10(SP)
MOVQ CX, 0x18(SP)
MOVQ DI, 0x20(SP)
MOVQ SI, 0x28(SP)
MOVQ R8, 0x30(SP)
CALL runtime.morestack_noctxt.abi0(SB)
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), BX
MOVQ 0x18(SP), CX
MOVQ 0x20(SP), DI
MOVQ 0x28(SP), SI
MOVQ 0x30(SP), R8
JMP hashdemo.StoreInterface(SB)

What we notice immediately is that all the variants contain almost double the number of instructions of the original concrete function. In this case the extra call, the arguments being reloaded from the stack every iteration, the nil check, and the extra bounds check are enough to show up clearly in throughput. But let's compare side by side what happens inside the hot loop, the most important and performance-sensitive part of the code.

Concrete vs Generic, inside the hot loop:

Loop condition

Concrete:
LEAL 0x4(DX), SI
CMPL CX, SI
JB 0x52e514
...

Generic:
LEAL 0x4(DX), R8
CMPL DI, R8
JB 0x52f0cf
...

Making the call to the non-inlined hash function (Generic only)

MOVL R9, 0xc(SP)
MOVQ 0(AX), BX
MOVL 0(CX)(DX*1), CX
MOVQ AX, DX
MOVL CX, AX
CALL BX

The hash function: simply inlined in the concrete version, a separate non-inlined function in the generic version

Concrete (inlined):
MOVL 0(BX)(DX*1), R9
IMULL $0x1e35a7bd, R9, R9
SHRL $0x12, R9

Generic (called):
IMULL $0x1e35a7bd, AX, AX
SHRL $0x12, AX
RET

Extra bounds check and nil check that the concrete version doesn't need (Generic only)

MOVQ 0x28(SP), CX
TESTB AL, 0(CX)
MOVL AX, BX
NOPW 0(AX)(AX*1)
NOPL
CMPQ BX, $0x4000
JB 0x52f02f
JMP 0x52f0d5

Real work inside the loop

Concrete:
MOVZX 0(AX)(R9*2), R10
...
MOVW R9, 0(AX)(R11*2)

Generic:
MOVZX 0(CX)(BX*2), R8
...
MOVW R8, 0(CX)(BX*2)

Reloading the function arguments from the stack every iteration, because the call trashes the registers (Generic only)

MOVQ 0x20(SP), AX
MOVQ 0x30(SP), CX
MOVQ 0x28(SP), BX
MOVQ 0x40(SP), SI
MOVQ 0x38(SP), DI

The hot loop assembly shows clearly that the CPU handles more instructions in the generic version, due to the machinery required to execute a call to the non-inlined hash function. There's no need to also include the interface and closure versions in the comparison above: their hot loops are nearly identical to the generic version.

Most of the problems below come back to the same root cause we just saw: the compiler isn't inlining where you need it to, and there's no way to tell it to. So the rest of the post is mostly variations on this.

Lack of intrinsics

The previous problem could easily have been side-stepped by code duplication. This one, however, truly hurts performance. The underlying mechanism, though, is again the inability to inline.

Many CPUs support instructions that load memory into the L1, L2, or L3 cache ahead of time. This is super useful, because not having the needed data in the CPU cache causes the CPU to stall for roughly 100 cycles while the data is loaded. If you know in advance that you will definitely need some piece of data a handful of statements later, you can prefetch that memory and do useful work while it is being loaded in the background.

In other languages prefetch is exposed through intrinsics, pseudo-functions that the compiler recognizes and replaces with a single machine instruction emitted right at the call site. C and C++ have __builtin_prefetch in GCC/Clang and _mm_prefetch from the Intel intrinsics headers; Rust has core::intrinsics::prefetch_read_data and friends. They look like a function call in source but compile to one instruction with zero call overhead (yes, inlined).

Go doesn't expose a prefetch intrinsic to user code. The only way to get a PREFETCHT0 (or one of its friends) into your binary is to switch to assembly. But Go assembly functions can't be inlined: every call to your prefetch helper compiles to a real CALL with the full calling-convention machinery around it.
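To make that concrete, here is roughly what such a helper looks like; a minimal sketch, with the function name and file layout being my own choices rather than anything from the original codebase:

// prefetch_amd64.go
package hot

import "unsafe"

// prefetchT0 is implemented in prefetch_amd64.s. Being an assembly function,
// it can never be inlined: every use of it is a real CALL.
//
//go:noescape
func prefetchT0(p unsafe.Pointer)

// prefetch_amd64.s
#include "textflag.h"

TEXT ·prefetchT0(SB), NOSPLIT, $0-8
	MOVQ p+0(FP), AX
	PREFETCHT0 (AX)
	RET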

Because the prefetch helper can't be inlined, it is again slowed down by all that call machinery, which very often defeats the purpose of adding the prefetch in the first place.

The funny thing is that the prefetch intrinsic is right there in the internals of the stdlib. Just expose it to us, please. There is a GitHub issue asking to expose it, but it is still sitting there as an open proposal.

SIMD is the same story, same mechanism. But this time, great news: things are moving. Go 1.26 ships an experimental SIMD package for AMD64 behind GOEXPERIMENT=simd; see this GitHub issue. It's not yet the stable, portable thing you'd want for production code across architectures, but it's progress.

SIMD (Single Instruction, Multiple Data), by the way, is a mechanism widely supported on modern CPUs where a single instruction operates on several data elements at once, packed into a wide vector register. More info is widely available on the internet.

Lack of //go:inline

There is a //go:noinline compiler hint. It forbids the compiler from inlining the function that follows it. But there is no //go:inline hint to do the opposite and instruct the compiler to inline the function that follows. This asymmetry kills me. I don't know the reason for it; most probably there is, again, some trade-off that the Go team decided to resolve in a way that rules out //go:inline.

How do we deal with this problem? The Go compiler calculates a "cost" for every function (based on its complexity), and if the cost is below the heuristically chosen limit of 80, the function is inlined (unless some other condition forbids inlining; see the generics, closures, and interface cases above). PGO can push the compiler to be more aggressive for hot calls, so 80 isn't the whole story, but in regular non-PGO builds it's still the budget you run into. So if a function that we need inlined in the hot path is above the inlining cost, we try to reshape it so that it squeezes under the limit. Of course, if you can't squeeze it, you just manually inline it, which causes the duplication problem again.
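One way to check whether a given function makes it under the budget is the compiler's own inlining diagnostics; the exact output wording varies between Go versions, but the flag itself is stable:

go build -gcflags='-m=2' ./... 2>&1 | grep -i inline

The -m=2 form prints the computed cost alongside the decision, which is handy when you are trying to shave a function down to the limit.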

One more important technique: extracting the cold part of a hot function into a separate, non-inlinable function (you actually want to make sure the cold function is not inlined by accident, by hinting with //go:noinline). This can reduce the "cost" of the hot function. In fact, the technique matters well beyond making a hot function inlinable; I'll probably write a separate post about it, but the idea is to make things intentionally un-inlined to reduce icache misses.
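A minimal sketch of that shape, with hypothetical names (this is not code from go-brrr): the rare error path is pushed into its own //go:noinline function so its cost doesn't count against the hot function's inline budget.

import "fmt"

// decodeFailSlow is the cold path. //go:noinline makes sure it is never
// folded back into the hot function.
//
//go:noinline
func decodeFailSlow(pos int) (uint32, error) {
	return 0, fmt.Errorf("decode: bad byte at offset %d", pos)
}

// decodeByte is the hot path. With the error construction moved out, its body
// is small enough to be inlined into callers.
func decodeByte(b []byte, pos int) (uint32, error) {
	if pos >= len(b) || b[pos] >= 0x80 {
		return decodeFailSlow(pos)
	}
	return uint32(b[pos]), nil
}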

Lack of //go:nobounds (and other opt-in hints)

Every slice or array access in Go gets a bounds check. The compiler can skip it when it can prove the index is in range; this is called bounds check elimination (BCE). In tight loops the elided version is meaningfully faster: the check itself costs something, and the panic branch also stops the optimizer from doing more aggressive things with the surrounding code.

Sometimes the compiler can't see the proof but you, the programmer, can. The usual trick is to insert a "hint load" early, like the _ = b[3] line inside binary.LittleEndian.Uint32, whose effect you can spot as the single length check in the assembly listings above. That one check tells the compiler that all of b[0] .. b[3] are in range, and the per-byte checks below it disappear.
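The pattern itself looks like this (a minimal sketch; this is essentially what the stdlib helper does):

// load32 assumes the caller guarantees len(b) >= 4.
func load32(b []byte) uint32 {
	_ = b[3] // one bounds check here proves b[0]..b[3] are all in range
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}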

Another interesting anecdote related to the compiler inserting additional instructions to guarantee safety: having x << n in the code will make the compiler emit four instructions (SHLQ + CMPQ + SBBQ + ANDQ) instead of a single SHLQ if it can't prove that n < 64. The workaround is to write x << (n & 63). The mask is a no-op for any value n could actually take, but it convinces the compiler the shift is in range. Of course, this is a valid workaround only if you truly know that n < 64 in all cases.
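Side by side, the two spellings look like this (assuming n really is always below 64 at runtime):

// shiftChecked compiles to SHLQ plus the CMPQ/SBBQ/ANDQ guard, because Go
// defines x << n to be 0 when n >= 64 and the compiler can't rule that out.
func shiftChecked(x uint64, n uint) uint64 { return x << n }

// shiftMasked compiles to a single SHLQ: the mask proves the shift amount is
// in range, and changes nothing if n is genuinely always < 64.
func shiftMasked(x uint64, n uint) uint64 { return x << (n & 63) }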

These tricks only work when you can phrase your invariant as something the compiler already understands - another bounds check, a mask. Which is not always the case.

When that doesn't work, you're stuck. There is no //go:nobounds directive that says "trust me, this access is in range, skip the check". C and C++ have __builtin_assume, Rust has get_unchecked / unreachable_unchecked. Go gives you nothing.

There is one more option: do unsafe pointer arithmetic on the underlying memory, which sidesteps the bounds checks entirely. It often works, but it's a topic for another post.
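For completeness, the shape of that option is roughly the following; a sketch, not a recommendation, since it trades away exactly the safety the checks provide:

import "unsafe"

// loadUnchecked reads b[i] with no bounds check at all. The caller must
// guarantee 0 <= i < len(b); get that wrong and you corrupt memory instead of
// getting a panic.
func loadUnchecked(b []byte, i int) byte {
	return *(*byte)(unsafe.Add(unsafe.Pointer(unsafe.SliceData(b)), i))
}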

This is the same shape of problem as //go:inline: Go gives you the opt-out (//go:noinline) but not the opt-in. And it's not just inlining and BCE. There is also no //go:unroll to force loop unrolling, no way to mark a branch as unlikely, no way to assert a value's range. If the compiler's heuristics happen to land in the right place, great. If they don't, you reshape your source code until they do, or you give up and write assembly.

Lack of layout tooling

This one is less about a specific missing knob and more about a general pain I had through the whole project.

CPUs are weird. They don't just take instructions and run them in order. There are caches, branch predictors, prefetchers, yada yada, and all of it is sensitive to where exactly your code sits in memory. The same hot loop at one address can be a few percent slower at another, just because it crossed some invisible boundary somewhere.

So you make a change that has nothing to do with your hot path, e.g. add a helper somewhere else in the file, and your benchmark moves. In either direction. For no real reason.

This drove me nuts. I'd do an "optimization", see +4%, commit, push, open champagne. A few days later I'd add some unrelated change, rerun, and the +4% was now -2%. Or the other way around: I'd make a change I was sure was right, see a small regression, revert it, and only later realize the regression was just layout noise.

I never really solved this. The best strategy I've found:

  • run benchmarks for a lot longer than felt necessary,
  • be very suspicious of any number under ~3%,
  • when a result looked surprising, add or remove a tiny unrelated function on purpose and rerun, just to see how much the number changes.
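One concrete way to apply those rules is a before/after comparison with benchstat over many runs; the file names and the count here are arbitrary, not a prescription:

go test -bench=. -benchmem -count 20 -cpu 1 | tee old.txt
# apply the change under test, then:
go test -bench=. -benchmem -count 20 -cpu 1 | tee new.txt
benchstat old.txt new.txt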

What makes it worse in Go specifically is that other languages have tools for exactly this. C++ has BOLT and Propeller, which take profiling info from a real run and rearrange the layout of the compiled binary. Rust gets a lot of that for free via LLVM. Go has its own toolchain and most of those tools just don't work with it.

Some "optimizations" I shipped I'm honestly still not 100% sure are actually optimizations. They just happened to look like one on the day I measured.

Conclusion

In my opinion Go shines in the IO-bound world. It has also, in my opinion, made very good trade-off decisions that make it a really great language: a batteries-included stdlib, a good package manager, easy-to-use concurrency. However, some of those trade-off decisions have made life a bit harder for people who try to optimize CPU-bound workloads.

The first problem I described might not even be considered an issue by some people. Codegen exists, after all. And duplication isn't always pure cost: in go-brrr, skipping codegen let each copy specialize to the exact workload it handled. The variants ended up diverging far enough that a single template was not an option, but the specialization paid off.

Because of these trade-offs (I also didn't cover runtime and GC effects, which are separate topics), some CPU-bound Go code can require more specialization and lower-level work to approach what C, C++, or Rust expose more directly, and the result won't look very idiomatic. Your code will very likely have:

  • giant functions that would normally be split up,
  • duplicated loops where a shared helper would force a slow path,
  • hand-specialized code for hot shapes,
  • APIs structured around escape analysis and inlining rather than aesthetics.

My conclusion is not "don't write CPU-bound code in Go." I did, and the result is fast. But the path to fast Go often looks less like elegant abstraction and more like specialization, duplication, BCE tricks, and occasionally assembly.

Update

A reader on Reddit added several good observations worth preserving here (comment).

On why the gap exists:

The mechanism behind the generic/interface penalty isn't compiler laziness, it's tied to Go's runtime model. GC Shape Stenciling exists because the GC walks stacks and needs every value to have a known size and pointer-bitmap; full monomorphization across all type parameters would force either heavy binary bloat or punt some shapes to interface-style dispatch anyway. The 27% interface gap is the cost of preserving uniform stack-walking, not a fixable bug. PGO is the right escape hatch here: the 80-unit budget is conservative because the compiler has no idea which sites are hot, and PGO raises the threshold per-edge on profile evidence. Without it, the compiler is forced to be conservative everywhere.

My take on PGO: it helps much more when you ship the final binary. As a library author you can't be sure that the end user will build with a PGO profile.

On the duplication problem:

The "16 nearly identical functions" pattern is hand-rolled monomorphization. You've rebuilt what Rust and C++ do automatically, paying in maintenance cost (16 copies that drift over time). The standard mitigation is go generate from a single template, which gives back the maintenance story without losing the inline-budget win. Worth doing if those specializations stay around long term.

This is fair. I considered templating the implementations, but the copies diverged significantly during tuning and I was worried the template would turn into something harder to maintain than the duplication itself.

On when abstraction overhead actually matters:

The 27% interface number is hot-loop-specific. Interface dispatch is roughly 3 to 4 ns per call from the itab lookup; in a per-byte hash kernel that's a huge fraction, but in a per-cache-line kernel (64 bytes of work per dispatch) it shrinks to low single digits. So the abstraction tax scales inversely with work-per-call, which is the right framing for deciding when to specialize.

I think this is the right framing. In tiny byte-oriented kernels, even small dispatch costs become visible quickly.

On the asm pattern that actually works:

The PREFETCHT0 problem also explains why Go's asm escape hatch feels fake. Plan 9 asm functions never inline, so any asm helper carries a full call and ret. The pattern that actually works is build-tagged _amd64.s files for entire hot loops, with portable Go fallbacks; you commit to maintaining both, but you stop fighting the inliner.

This also matches my experience. Small asm escape hatches often end up fighting the inliner more than helping it.
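For readers who haven't seen that pattern, it looks roughly like this; the file names are illustrative, and the assembly file with the actual loop body is omitted:

// store_amd64.go
//go:build amd64 && !purego

package hashdemo

// storeHot is implemented in store_amd64.s. The entire hot loop lives in
// assembly, so the unavoidable CALL is paid once per buffer rather than once
// per element.
func storeHot(t *Table, data []byte)

// store_fallback.go
//go:build !amd64 || purego

package hashdemo

// Portable fallback with identical behavior.
func storeHot(t *Table, data []byte) { StoreConcrete(t, data) }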

On layout sensitivity:

Last one: the 3 to 4% layout sensitivity is the alignment-frontend issue every unmanaged compiler has. C++ has BOLT, Rust gets some via LLVM. Go's internal linker doesn't expose those hooks. The pragmatic fix isn't tooling, it's benchmarking protocol: average across N small unrelated commits and treat single-commit deltas under 5% as noise.

Agreed, though in practice this stretches the optimization feedback loop quite a bit while iterating.
