Benchmarks for concurrent hash map implementations in Go

Original link: https://github.com/puzpuzpuz/go-concurrent-map-bench

## Summary of the Go Concurrent Hash Map Benchmarks

This article presents a benchmark comparison of several concurrent hash map implementations in Go: `sync.Map`, `xsync.Map`, `cornelk/hashmap`, `alphadose/haxmap`, and `orcaman/concurrent-map`. The goal is a fair performance comparison across different read/write ratios and map sizes (100 to 1,000,000 entries).

The main findings: **`xsync.Map` is generally the fastest** for read, write, and iteration operations, and has **the fewest allocations among the non-sharded maps**. **`sync.Map`**, backed by a HashTrieMap implementation since Go 1.24, offers excellent read scaling and is a strong general-purpose choice. **`cornelk/hashmap`** is competitive at small sizes but degrades significantly on larger maps and write-heavy workloads. **`alphadose/haxmap`** does well in read-only scenarios but struggles under write contention. **`orcaman/concurrent-map`** shows zero allocations because its shards use standard Go maps, but its fixed shard count limits scalability.

The benchmarks were run at several `GOMAXPROCS` values and workloads (from 100% reads down to 75% reads, with corresponding store/delete percentages), in both warm-up and no-warm-up scenarios. All libraries report zero allocations per operation across the read/write benchmarks.

## Concurrent Hash Map Benchmarks in Go

A recent Hacker News discussion focused on benchmarks comparing the performance of different concurrent hash map implementations in Go. Specifically, `xsync.Map` was compared to `orcaman/concurrent-map`. The benchmarks revealed that `orcaman/concurrent-map` excels at minimizing memory allocation, particularly in pure overwrite scenarios (0 B/op vs. 24 B/op for `xsync.Map`). However, under a mixed workload (80% overwrites, 20% new keys), the difference narrows. `xsync.Map` trades higher allocation rates for speed, achieving roughly 2x performance under contention at a cost of 24 bytes per operation.

The optimal choice depends on the application's workload. If writes dominate, a standard Go map with an `RWMutex` or `orcaman/concurrent-map` is recommended for minimal allocation. If reads are more frequent, `xsync.Map`'s read scalability may be preferable, as seen in its use within the Otter cache library. Further considerations include CPU usage, latency, and the potential downsides of libraries that rely on `unsafe` operations. There is ongoing discussion about potential improvements to the standard library's `sync.Map`.

Original article

Benchmarks for concurrent hash map implementations in Go.

Disclaimer: I'm the author of xsync, one of the libraries benchmarked here. I did my best to keep this benchmark neutral and fair to all implementations. If you spot any issues or have suggestions for improvements, please open an issue.

Since Go 1.24, sync.Map is backed by a HashTrieMap — a concurrent hash trie with 16-way branching at each level. Reads are lock-free via atomic pointer traversal through the trie nodes. Writes acquire a per-node mutex, affecting only a small subtree. The trie grows lazily as entries are inserted.

xsync.Map: a hash table organized into cache-line-sized buckets, each holding up to 5 entries. Each bucket has its own mutex for writes, while reads are fully lock-free using atomic loads. Lookups use SWAR (SIMD Within A Register) techniques on per-entry metadata bytes for fast key filtering. Table resizing is cooperative: all goroutines help migrate buckets during growth.

cornelk/hashmap: a lock-free hash map combining a hash table index with sorted linked lists for collision resolution. All mutations (insert, update, delete) use atomic CAS operations. A background goroutine triggers table resize when the fill factor exceeds 50%. The sorted list ordering enables efficient concurrent traversal.

alphadose/haxmap: based on Harris's lock-free linked list algorithm with a hash table index layer. Uses xxHash for hashing and atomic CAS for all mutations. Deletions are lazy (nodes are logically marked before physical removal). Auto-resizes when the load factor exceeds 50%.

orcaman/concurrent-map: a straightforward sharded design with 32 fixed shards. Each shard is a regular Go map protected by a sync.RWMutex. Keys are assigned to shards using FNV-32 hashing. The fixed shard count makes it simple and predictable, but limits scalability under high parallelism.

Each benchmark uses permille-based random operation selection:

  • 100% reads — all loads (warm-up only)
  • 99% reads — 99% loads, 0.5% stores, 0.5% deletes
  • 90% reads — 90% loads, 5% stores, 5% deletes
  • 75% reads — 75% loads, 12.5% stores, 12.5% deletes
  • Range under contention — all goroutines iterate the map while a single background goroutine continuously updates random keys
Two key types are benchmarked:

  • string keys (with a long prefix to stress hashing)
  • int keys
| Size | Approx. footprint | Target |
|---|---|---|
| 100 | ~15 KB | Fits in L1 |
| 1,000 | ~150 KB | Fits in L2 |
| 100,000 | ~15 MB | Fits in L3 |
| 1,000,000 | ~150 MB | Spills to RAM |
  • WarmUp — map is pre-populated before the benchmark starts (all workloads)
  • NoWarmUp — map starts empty (mixed workloads only, 100% reads is skipped)
```shell
# Run all benchmarks
go test -bench . -benchtime 5s

# Run a specific library
go test -bench BenchmarkXsyncMapOf -benchtime 5s

# Run only string key benchmarks
go test -bench 'StringKeys' -benchtime 5s

# Run only warm-up benchmarks at size=1000
go test -bench 'WarmUp.*size=1000' -benchtime 5s
```

The benchmark results in this repository were collected on the following setup:

  • CPU: AMD Ryzen 9 7900 12-Core Processor (24 threads)
  • OS: Linux (amd64)
  • Go: go1.26.0
  • GOMAXPROCS: 1, 4, 8, 12

Each benchmark was run with -benchtime 3s -count 3 and results were collected for each GOMAXPROCS value separately.

Each plot shows ops/s (Y axis) vs GOMAXPROCS (X axis) for all libraries. For read/write workloads, rows correspond to read percentages and columns to map sizes. Range plots have a single row with columns for map sizes.

cornelk/hashmap was benchmarked at sizes 100 and 1,000 only due to significant performance degradation at larger sizes.

Warm-up, string keys

Warm-up, int keys

No warm-up, string keys

No warm-up, int keys

Range under contention, string keys

Range under contention, int keys

All libraries report 0 allocs/op across every read/write benchmark. The tables below show B/op (bytes allocated per operation, amortized) for the WarmUp variant. Values are consistent across map sizes and GOMAXPROCS values. Range benchmarks are excluded as they allocate due to iteration overhead (closures, internal snapshots, channel buffers).

String keys:

| Library | 100% reads | 99% reads | 90% reads | 75% reads |
|---|---|---|---|---|
| sync.Map | 0 | 0 | 3 | 9 |
| xsync.Map | 0 | 0 | 1 | 2 |
| cornelk/hashmap* | 0 | 0 | 1 | 4 |
| alphadose/haxmap | 0 | 0 | 1 | 3 |
| orcaman/concurrent-map | 0 | 0 | 0 | 0 |

Int keys:

| Library | 100% reads | 99% reads | 90% reads | 75% reads |
|---|---|---|---|---|
| sync.Map | 0 | 0 | 3 | 8 |
| xsync.Map | 0 | 0 | 0 | 2 |
| cornelk/hashmap* | 0 | 0 | 1 | 4 |
| alphadose/haxmap | 0 | 0 | 1 | 3 |
| orcaman/concurrent-map | 0 | 0 | 0 | 0 |

* cornelk/hashmap was benchmarked at sizes 100 and 1,000 only due to significant performance degradation at larger sizes.

orcaman/concurrent-map shows zero allocations because its shards use regular Go maps, which don't allocate when overwriting existing keys. Other libraries allocate small amounts during writes due to their internal data structure overhead. sync.Map has the highest per-write allocation cost, while xsync.Map has the lowest among the non-sharded implementations.

| Library | Strengths | Weaknesses |
|---|---|---|
| sync.Map | Stdlib, no dependencies; excellent read scaling; solid all-round since Go 1.24 | Highest per-write allocation cost; slower than xsync.Map under all workloads |
| xsync.Map | Fastest in nearly every scenario; best read, write, and iteration scaling; lowest allocations among non-sharded designs | External dependency; writes allocate |
| cornelk/hashmap | Competitive at small sizes with read-heavy workloads | Significant performance degradation at sizes ≥100K with writes; limited to small maps in practice |
| alphadose/haxmap | Good read-only performance at small sizes; lock-free design | Poor write scaling under contention; falls behind at higher parallelism |
| orcaman/concurrent-map | Zero allocations (read/write); simple and predictable; decent single-threaded performance | Fixed 32 shards limit scalability; worst read-only throughput due to mutex overhead; write scaling plateaus early; slowest iteration due to channel-based API |