A high-performance CUDA implementation of the Cuckoo Filter data structure, developed as part of the thesis "Design and Evaluation of a GPU-Accelerated Cuckoo Filter".
This library provides a GPU-accelerated Cuckoo Filter implementation optimized for high-throughput batch operations. Cuckoo Filters are space-efficient probabilistic data structures that support insertion, lookup, and deletion operations with a configurable false positive rate.
- CUDA-accelerated batch insert, lookup, and delete operations
- Configurable fingerprint size and bucket size
- Multiple eviction policies (DFS, BFS)
- Sorted insertion mode for improved memory coalescing
- Multi-GPU support via gossip
- IPC support for cross-process filter sharing
- Header-only library design
Benchmarks at 80% load factor on an NVIDIA GH200 (H100 HBM3, 3.4 TB/s). The GPU Cuckoo Filter is compared against:
| Comparison | Insert | Query | Delete |
|---|---|---|---|
| GPU vs CPU Cuckoo | 360× faster | 973× faster | N/A |
| Cuckoo vs TCF | 6× faster | 42× faster | 100× faster |
| Cuckoo vs GQF | 585× faster | 6× faster | 273× faster |
| Cuckoo vs Bloom | 0.6× (slower) | 1.4× faster | N/A |
| Comparison | Insert | Query | Delete |
|---|---|---|---|
| GPU vs CPU Cuckoo | 583× faster | 1504× faster | N/A |
| Cuckoo vs TCF | 1.9× faster | 11.3× faster | 35.3× faster |
| Cuckoo vs GQF | 9.6× faster | 2.6× faster | 3.8× faster |
| Cuckoo vs Bloom | 0.7× (slower) | 1.0× (equal) | N/A |
Note
For a more comprehensive evaluation including additional systems and analysis, see the accompanying thesis.
- CUDA Toolkit (>= 12.9)
- C++20 compatible compiler
- Meson build system (>= 1.3.0)
meson setup build
meson compile -C buildBenchmarks and tests are built by default. To disable them:
meson setup build -DBUILD_BENCHMARKS=false -DBUILD_TESTS=false#include <CuckooFilter.cuh>
// Configure the filter: key type, fingerprint bits, max evictions, block size, bucket size
using Config = CuckooConfig<uint64_t, 16, 500, 256, 16>;
// Create a filter with the desired capacity
CuckooFilter<Config> filter(1 << 20); // capacity for ~1M items
// Insert keys (d_keys is a device pointer)
filter.insertMany(d_keys, numKeys);
// Or use sorted insertion
filter.insertManySorted(d_keys, numKeys);
// Check membership
filter.containsMany(d_keys, d_results, numKeys);
// Delete keys
filter.deleteMany(d_keys, d_results, numKeys);The CuckooConfig template accepts the following parameters:
| Parameter | Description | Default |
|---|---|---|
T |
Key type | - |
bitsPerTag |
Fingerprint size in bits (8, 16, 32) | - |
maxEvictions |
Maximum eviction attempts before failure | 500 |
blockSize |
CUDA block size | 256 |
bucketSize |
Slots per bucket (must be power of 2) | 16 |
AltBucketPolicy |
Alternate bucket calculation policy | XorAltBucketPolicy |
evictionPolicy |
Eviction strategy (DFS or BFS) | BFS |
WordType |
Atomic type (uint32_t or uint64_t) | uint64_t |
For workloads that exceed single GPU capacity:
#include <CuckooFilterMultiGPU.cuh>
CuckooFilterMultiGPU<Config> filter(numGPUs, totalCapacity);
filter.insertMany(h_keys, numKeys);
filter.containsMany(h_keys, h_results, numKeys);include/ - Header files
CuckooFilter.cuh - Main filter implementation
CuckooFilterMultiGPU.cuh - Multi-GPU implementation
CuckooFilterIPC.cuh - IPC support
bucket_policies.cuh - Alternative bucket policies
helpers.cuh - Helper functions
src/ - Example applications
benchmark/ - benchmarks
tests/ - Unit tests
scripts/ - Scripts for running/plotting benchmarks