# High-performance C++ hash table using grouped SIMD metadata scanning

Original link: https://github.com/Cranot/grouped-simd-hashtable

## Grouped SIMD Elastic Hash Table: A New SOTA

This C++17 hash table implementation achieves state-of-the-art performance on larger tables (>500k elements) by exploiting **grouped SIMD metadata scanning**. It beats `ankerl::unordered_dense` (a Robin Hood hash table) on read-heavy workloads, with lookups up to **1.69x** faster at 1 million elements.

The key innovation addresses SIMD's requirement for contiguous memory access. Traditional quadratic probing scatters memory accesses, defeating SIMD efficiency. This implementation uses **grouped probing**, scanning 16 contiguous slots at a time, mirroring the approach of Google's Swiss Tables. A 1-byte metadata tag per slot quickly filters out non-matches before full key comparisons are performed.

Although insertion is somewhat slower (0.72x relative to `ankerl`), the lookup gains make the table well suited to read-dominated applications. It is header-only, requires SSE2 (standard on x86-64), and currently offers only a fixed capacity, with deletion not yet implemented. The work grew out of research into a paper that disproved a long-standing conjecture about uniform probing.

This Hacker News discussion revolves around a new C++ hash table implementation ([github.com/cranot](https://github.com/cranot)) that uses grouped SIMD metadata scanning to improve performance. Many commenters point out that the techniques involved, in particular using SIMD for faster bucket checks, are already employed by mature hash table libraries such as Boost.Unordered and Google's SwissTable (the basis of Rust's Hashbrown). Some raise doubts about the validity of the published benchmarks and argue that the project's restriction to static sizing (no deletion) limits its practical applicability. There is debate over the merits of quadratic versus linear probing, with some arguing that linear probing's locality advantages outweigh its clustering problems. A recurring theme is skepticism about the project's originality, with one commenter suggesting the code may have been generated by AI (Claude). The discussion also touches on concurrent hash table design and potential performance problems in multithreaded environments. Finally, some commenters appreciate the project's focus on efficiency over web-style bloat.

## Original

A high-performance C++ hash table that beats state-of-the-art at scale using grouped SIMD metadata scanning.

vs ankerl::unordered_dense (Robin Hood, considered SOTA):

| Size | ankerl | GroupedSIMD | Winner |
|------|--------|-------------|--------|
| 100k | 2.76 ms | 3.36 ms | ankerl |
| 500k | 29.2 ms | 34.0 ms | ankerl |
| 1M | 72.3 ms | 71.9 ms | GroupedSIMD |
| 2M | 157.4 ms | 156.3 ms | GroupedSIMD |

Operation breakdown at 1M elements (80% load):

| Operation | ankerl | GroupedSIMD | Speedup |
|-----------|--------|-------------|---------|
| Insert | 50.8 ms | 70.1 ms | 0.72x |
| Lookup Hit | 104 ms | 61.5 ms | 1.69x |
| Lookup Miss | 5.4 ms | 4.5 ms | 1.21x |

Use this when: Table size > 500k elements, lookup-heavy workloads. Insert overhead (0.72x) is acceptable when lookups dominate.

#include "grouped_simd_elastic.hpp"

// Create table with capacity for 1M elements
GroupedSIMDElastic<uint64_t, uint64_t> table(1200000);

// Insert
table.insert(key, value);

// Lookup
uint64_t* result = table.find(key);
if (result) {
    // Found: *result is the value
}

// Check existence
if (table.contains(key)) { ... }

// Subscript operator
table[key] = value;

## The Problem with SIMD in Hash Tables

Traditional quadratic probing accesses scattered memory locations:

```
Probe sequence: h, h+1, h+4, h+9, h+16, h+25...
```

To use SIMD, you'd need to gather from 16 random positions—slower than scalar code.
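
For contrast, a minimal scalar version of that probe loop might look like the sketch below (hypothetical names, power-of-two capacity assumed; not the repository's code). Each iteration touches one increasingly distant slot, so a 16-wide SIMD compare would first need an expensive gather:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: scalar quadratic probing over a power-of-two table.
// Tag 0 is assumed to mean "empty"; indices scatter as h, h+1, h+4, h+9, ...
inline size_t quadratic_probe(const uint8_t* tags, size_t capacity,
                              size_t h, uint8_t tag) {
    for (size_t i = 0; i < capacity; ++i) {
        size_t idx = (h + i * i) & (capacity - 1); // scattered probe sequence
        if (tags[idx] == tag) return idx;          // candidate: verify key next
        if (tags[idx] == 0)  return capacity;      // empty slot: key is absent
    }
    return capacity; // table exhausted
}
```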

## The Solution: Grouped Probing

Probe 16 contiguous slots as a group, then jump to the next group:

```
Group 0: [h+0,  h+1,  ..., h+15]   ← SIMD scan (1 load)
Group 1: [h+16, h+17, ..., h+31]   ← SIMD scan (1 load)
Group 2: [h+32, h+33, ..., h+47]   ← SIMD scan (1 load)
```

Within each group, SSE2 scans all 16 metadata bytes in ~3 instructions:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

__m128i metadata = _mm_loadu_si128((__m128i*)&metadata_[base]); // load 16 metadata bytes
__m128i matches  = _mm_cmpeq_epi8(metadata, target);            // byte-wise equality
uint32_t mask    = _mm_movemask_epi8(matches);                  // 1 bit per matching slot
```

This is the same insight behind Google's Swiss Tables.
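
The match mask then drives the full key comparisons. A sketch of that step as a self-contained function, with hypothetical member-array names and GCC/Clang's `__builtin_ctz` (the build line below uses g++):

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>

// Sketch: full lookup step for one 16-slot group (hypothetical arrays).
template <typename K, typename V>
V* probe_group(const uint8_t* metadata_, const K* keys_, V* values_,
               size_t base, uint8_t tag, const K& key) {
    __m128i group  = _mm_loadu_si128((const __m128i*)&metadata_[base]);
    __m128i target = _mm_set1_epi8((char)tag);
    uint32_t mask  = _mm_movemask_epi8(_mm_cmpeq_epi8(group, target));
    while (mask != 0) {
        int lane = __builtin_ctz(mask);              // lowest candidate lane
        size_t idx = base + (size_t)lane;            // slot within the group
        if (keys_[idx] == key) return &values_[idx]; // full key comparison
        mask &= mask - 1;                            // clear bit, next candidate
    }
    return nullptr; // no match in this group
}
```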

Each slot has a 1-byte metadata tag:

  - Bit 7: Occupied flag (1 = occupied, 0 = empty)
  - Bits 0-6: 7-bit hash fragment

This filters out 127/128 non-matches before comparing keys.
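
As a rough sketch of how such a tag could be built and broadcast for the group compare (the exact bit placement here is an assumption based on the layout above, not code from the repository):

```cpp
#include <cstdint>
#include <emmintrin.h>

// Sketch: 1-byte tag = occupied bit (bit 7) | 7-bit hash fragment (bits 0-6).
inline uint8_t make_tag(uint64_t hash) {
    return static_cast<uint8_t>(0x80u | (hash >> 57)); // top 7 hash bits
}

// Broadcast the tag so one _mm_cmpeq_epi8 checks all 16 slots at once.
inline __m128i broadcast_tag(uint8_t tag) {
    return _mm_set1_epi8(static_cast<char>(tag));
}
```

A random non-matching occupied slot passes this filter with probability 1/128, which is where the 127/128 rejection rate comes from.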

```cpp
template <typename K, typename V, typename Hash = std::hash<K>>
class GroupedSIMDElastic {
    // Constructor: capacity and delta (1 - max_load_factor)
    explicit GroupedSIMDElastic(size_t capacity, double delta = 0.15);

    // Insert key-value pair. Returns false if table is full.
    bool insert(const K& key, const V& value);

    // Find value by key. Returns nullptr if not found.
    V* find(const K& key);
    const V* find(const K& key) const;

    // Check if key exists
    bool contains(const K& key) const;

    // Subscript operator (inserts default value if not found)
    V& operator[](const K& key);

    // Statistics
    size_t size() const;
    size_t capacity() const;
    double load_factor() const;
    size_t max_probe_used() const;
};
```

Requirements:

  - C++17 or later
  - SSE2 support (standard on all x86-64 CPUs)
  - Header-only, no dependencies

```bash
# Compile
g++ -O3 -march=native -msse2 -std=c++17 -o benchmark benchmark_final_sota.cpp

# Run
./benchmark
```

This implementation emerged from exploring the February 2025 "Elastic Hashing" paper that disproved Yao's 40-year-old conjecture about uniform probing.

What we tried:

| Variant | Result | Learning |
|---------|--------|----------|
| Non-greedy probing | 5% faster at 1M | O(1) amortized works |
| SIMD (scattered) | 0.18x (5x slower) | Gather overhead kills SIMD |
| Memory prefetching | 0.41x (2.5x slower) | Hardware prefetcher already wins |
| Robin Hood | 2x faster miss | Low probe variance matters |
| Grouped SIMD | 1.5x faster lookup | Contiguous access enables SIMD |

Key insight: SIMD requires contiguous memory access. Quadratic probing scatters accesses, defeating SIMD. Grouped probing (Swiss Tables' approach) solves this.

| Trade-off | Impact | Notes |
|-----------|--------|-------|
| Insert overhead | 0.72x vs ankerl | Non-greedy candidate collection |
| Small tables | Loses below 500k | Crossover at ~500k-1M elements |
| No deletion | Not implemented | Planned for future |
| No resizing | Fixed capacity | Must pre-size |
| SSE2 only | x86-64 only | No ARM NEON version |
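
Because capacity is fixed, pre-sizing follows directly from the constructor's delta parameter (delta = 1 - max_load_factor, default 0.15, i.e. a maximum load factor of 0.85). A small helper sketch, not part of the library:

```cpp
#include <cmath>
#include <cstddef>

// Sketch: minimum capacity to hold `expected` elements,
// given delta = 1 - max_load_factor (default 0.15 per the API above).
inline size_t required_capacity(size_t expected, double delta = 0.15) {
    return static_cast<size_t>(std::ceil(expected / (1.0 - delta)));
}

// e.g. required_capacity(1000000) == 1176471, consistent with the
// capacity of 1200000 used in the usage example above.
```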

## Technical: Why Quadratic Group Jumps?

Groups use quadratic jumps to avoid clustering:

```
Group 0: h + 16×0² = h
Group 1: h + 16×1² = h+16
Group 2: h + 16×2² = h+64
Group 3: h + 16×3² = h+144
```

Linear jumps (h, h+16, h+32...) caused 42% insert failure rate due to probe sequence overlap. Quadratic jumps spread groups across the table, ensuring all slots are reachable.
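
In code form, the group-base computation is a one-liner; a sketch under the same power-of-two-capacity assumption as above (hypothetical names):

```cpp
#include <cstddef>

// Sketch: base slot of the i-th 16-wide group for home index h.
// i = 0, 1, 2, 3 gives h, h+16, h+64, h+144, matching the table above.
inline size_t group_base(size_t h, size_t i, size_t capacity) {
    return (h + 16 * i * i) & (capacity - 1); // capacity assumed a power of two
}
```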

```
grouped_simd_elastic.hpp    # Main implementation (ship this)
hybrid_elastic.hpp          # Non-SIMD baseline
benchmark_final_sota.cpp    # Benchmark vs ankerl
INSIGHTS.md                 # Full research log
EXPERIMENT_RESULTS.md       # All experiment data
```

License: MIT


Grouped SIMD Hash Table - SOTA hashing at scale
Made by Dimitris Mitsos & AgentsKB.com
Using Deep Research
