A cache-friendly IPv6 LPM with AVX-512 (linearized B+-tree, real BGP benchmarks)

Original link: https://github.com/esutcu/planb-lpm



IPv6 longest-prefix-match (LPM) using a linearized B+-tree with AVX-512 SIMD and a scalar fallback. Based on the algorithm from:

Zhihao Zhang, Lanzheng Liu, Chen Chen, Huiba Li, Jiwu Shu, Windsor Hsu, Yiming Zhang. PlanB: Efficient Software IPv6 Lookup with Linearized B+-Tree. NSDI '26. arXiv:2604.14650. Author's reference code: https://github.com/nicexlab/planb.

This repository is an independent, clean-room reimplementation of the algorithm as a portable, MIT-licensed C++17 library with extras that were not part of the reference:

  • Portable header-only core (include/lpm6.hpp) that compiles without AVX-512 and transparently falls back to a scalar path.
  • Dynamic FIB (lpm6::Dynamic) using the paper's rebuild-and-swap model with std::atomic<std::shared_ptr>; lookups are wait-free.
  • Python bindings via pybind11 (pip install -e .).
  • Correctness tests (tests/test_correctness.cpp) that verify the tree against a brute-force LPM reference on synthetic FIBs.
  • Sample FIB/trace generator (examples/generate_sample.py).
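The rebuild-and-swap model behind lpm6::Dynamic can be illustrated with a small sketch (a hypothetical Python stand-in, not the library's API): writers rebuild an immutable snapshot under a lock and publish it with a single reference swap, so readers only ever load the current reference and never block.

```python
import ipaddress
import threading

class DynamicLpm:
    """Rebuild-and-swap sketch: writers rebuild an immutable snapshot
    and publish it atomically; the read path is wait-free.
    Illustrative stand-in for lpm6::Dynamic, not its actual API."""

    def __init__(self, routes=()):
        self._routes = {}                      # (prefix_int, plen) -> nexthop
        self._lock = threading.Lock()          # serializes writers only
        for p, l, nh in routes:
            self._routes[(int(ipaddress.IPv6Address(p)), l)] = nh
        self._publish()

    def _publish(self):
        # "Rebuild": longest prefixes first, so the first match wins.
        # (The real library rebuilds the linearized B+-tree here.)
        self._snapshot = sorted(self._routes.items(), key=lambda kv: -kv[0][1])

    def insert(self, prefix, plen, nexthop):
        with self._lock:                       # writer path
            self._routes[(int(ipaddress.IPv6Address(prefix)), plen)] = nexthop
            self._publish()                    # single reference swap

    def lookup(self, addr):
        snap = self._snapshot                  # one reference load, wait-free
        a = int(ipaddress.IPv6Address(addr))
        for (p, l), nh in snap:
            if l == 0 or (a >> (128 - l)) == (p >> (128 - l)):
                return nh
        return None
```

In C++ the published reference is a `std::atomic<std::shared_ptr>`; the pattern is the same — the old tree stays alive until the last in-flight reader drops it.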

The PlanB paper is a solid engineering idea, but the reference code on GitHub is not something you can drop into another project: it is Linux- and AVX-512-only, has no license, no Python layer, and no dynamic-FIB path. planb-lpm re-expresses the algorithm from the paper as a portable, tested, and permissively licensed library so the technique is actually usable in research, teaching, and production software.

  • Research / simulation — drop-in fast LPM backend for ns-3 or a custom packet-level simulator, or as a reference to compare newer LPM structures against.
  • Control-plane tooling — SDN controllers and route-monitoring services that need to answer "which prefix covers this address?" over a large FIB without pulling in a full routing stack.
  • Network analysis — IPv6 scanners, traffic classifiers, and log-enrichment pipelines that tag flows with their covering prefix.
  • Teaching — a compact, readable implementation of a modern SIMD lookup structure suitable for networking and computer-architecture courses.
Build and test:

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
ctest --test-dir build --output-on-failure

Options:

CMake option Default Effect
LPM6_BUILD_PYTHON OFF Build the pybind11 extension

AVX-512 is auto-detected via check_cxx_compiler_flag; on CPUs or compilers without support, the scalar path is used.

Generate a synthetic workload and run the benchmark:

python3 examples/generate_sample.py 100000 1000000 42
./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20
  • Synthetic FIB, 100 000 prefixes. Prefix length distribution is weighted to match the RIPE rrc00-25 table (roughly /48 45%, /32 11%, /40 10%, /44 10%, /36 4%, others), generated by examples/generate_sample.py.
  • Synthetic trace, 1 M uniform-random 64-bit addresses (upper half of an IPv6 address, which is all that drives /0../64 LPM).
  • Each benchmark does one warmup pass (discarded) and 20 timed passes. We report min / q1 / median / q3 / max + IQR across the 20 runs so variance is visible, not hidden under an average.
  • Pin to a single core with taskset -c <N> to cut scheduler noise. The binaries print the CPU they ran on, the affinity-mask width, the cpufreq governor, and the Intel turbo state so numbers are reproducible. (On WSL the /sys probes are unavailable and those lines are silently omitted.)
  • Hardware: laptop Intel i5-1035G7 (Ice Lake, 4C/8T, AVX-512), Ubuntu 24.04 on WSL, GCC 13.3, -O3. This is a consumer CPU; the paper's numbers are from a 24-core Xeon 8331C and a Zen5 Ryzen 9 370HX.
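The min/q1/median/q3/max + IQR reporting described above can be sketched as follows (a hypothetical helper mirroring what tests/bench_stats.hpp is described as doing, with linear interpolation between order statistics):

```python
def five_num(samples):
    """Five-number summary + IQR over a set of timed runs, so variance
    is visible rather than hidden under an average."""
    xs = sorted(samples)

    def q(p):
        # Quantile with linear interpolation between order statistics.
        i = p * (len(xs) - 1)
        lo = int(i)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (xs[hi] - xs[lo]) * (i - lo)

    q1, med, q3 = q(0.25), q(0.5), q(0.75)
    return {"min": xs[0], "q1": q1, "median": med,
            "q3": q3, "max": xs[-1], "iqr": q3 - q1}
```

For 20 timed runs this reports exactly the columns used in the tables below.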

taskset -c 2 ./build/planb-lpm examples/sample_fib.txt examples/sample_trace.txt 20 (20 timed runs + 1 warmup, batch paths use the real public API returning next-hops):

Path min q1 median q3 max IQR
single 34 45 48 51 62 6.6
batch<1> 32 44 47 51 60 7.6
batch<2> 36 48 54 58 68 10.4
batch<4> 42 60 66 71 84 11.3
batch<8> 53 66 73 81 101 13.4
batch<16> 43 57 62 66 78 9.1
batch<32> 41 70 76 83 94 13.4

Units: MLPS; medians from a rolling average of four 20-run sweeps. Build time for that FIB is ~25 ms, tree depth 6, 531 k keys across 200 k edges.

Observations:

  • batch<1> ≈ single — the batched entry point adds no overhead at M=1, confirming the dispatch cost is negligible.
  • Sweet spot at M=8: ~1.5× single-core throughput over single. M=32 is essentially flat; going further would need larger-batch unrolling and register-allocation work.
  • M=16 dip is real and reproducible across runs — likely register pressure at the #pragma GCC unroll 16 boundary interacting with the 7-level traversal. A known-to-investigate item.

Paper's ablation reports batching worth 3–4.5× on AMD Zen5; we see ~1.5× on Ice Lake. Part of the gap is the paper's lookup_batch_checksum-style fast path (our binary uses the public API that also writes next-hops), part is AVX-512 sustained frequency differences between Ice Lake and Zen5. The paper reports 191–197 MLPS single-core on Xeon and 374–393 MLPS on Zen5 — a real server should land near those numbers.

Against a conventional baseline

The bench_naive binary builds a Patricia radix trie (tests/patricia.hpp) on the same FIB and runs the same trace through it. Patricia is a standard path-compressed binary trie — representative of the pointer-chasing LPM structures PlanB is designed to replace.

taskset -c 2 ./build/bench_naive … (same 20-run discipline):

Implementation min median max IQR
lpm6::Tree 23 50 63 8.2
Patricia trie 2.4 2.4 2.6 0.06

Units: MLPS. Tree-vs-Patricia ≈ 20× on the median, bracketing the upper end of the paper's reported 1.6–14× advantage over PopTrie, CP-Trie, Neurotrie, and HBS. Patricia isn't one of those baselines — it's a conventional-radix-trie stand-in — so the exact multiplier will shift when we benchmark against the paper's actual algorithms (on the roadmap, see plan.md). The binary also prints a linear-scan number on a 50 000-address slice; that is O(N) vs. O(log N) and is only a sanity check, not a comparison.

Same 100 000-prefix workload. footprint_bytes() counts the algorithmic cost (our flat key + next-hop arrays for the tree; one Node struct per Patricia node); RSS delta is the process-level VmRSS change across the build and captures allocator overhead:

Implementation Algorithmic RSS delta Per-prefix
lpm6::Tree 4.82 MB 5.05 MB ~50 B/prefix
Patricia trie 6.04 MB 9.02 MB ~90 B/prefix

At 100 K, the tree's flat cache-aligned layout costs roughly half the RSS of the pointer-chasing Patricia. This does not hold at scale — see FIB-size sweep below: the linearized B+-tree grows in discrete depth steps (9^depth buckets), so its footprint is bounded from below by depth, not prefix count. Once depth transitions 6→7 (around 250 K prefixes) the tree's algorithmic size jumps to ~38 MB and stays there until depth 7 fills up. Patricia scales linearly with prefixes, so at 250 K the tree is actually larger than Patricia. Paper reports 56–92% memory reductions vs PopTrie/CP-Trie/Neurotrie/HBS; those are heavier than plain Patricia, so the paper's advantage likely still holds against them even at our 250 K crossover.
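The depth-step growth is easy to check numerically: a 9-ary tree of depth d holds 9^d sentinel-padded 8-byte keys, which reproduces the 531 K (depth 6) and 4.78 M / ~38 MB (depth 7) figures quoted above. This is a back-of-the-envelope sketch; footprint_bytes() also counts next-hop arrays, hence the slightly larger reported total.

```python
def key_slots(depth: int) -> int:
    """Sentinel-padded key capacity of a 9-ary linearized B+-tree of
    the given depth: 9^depth leaf key slots."""
    return 9 ** depth

def key_megabytes(depth: int, key_size: int = 8) -> float:
    """Size of the 8-byte key arrays alone, in MB (10^6 bytes)."""
    return key_slots(depth) * key_size / 1e6
```

key_slots(6) = 531 441 and key_slots(7) = 4 782 969, matching the quoted key counts; the depth-7 keys alone come to ~38.3 MB, so the footprint really is pinned by depth, not prefix count.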

examples/run_sweep.sh drives planb-lpm, bench_update, and bench_naive across five FIB sizes sharing one 1 M-address trace, same 20-run discipline, pinned to core 2. Numbers are medians of the timed runs; throughput units are MLPS:

FIB depth single batch<8> batch<32> rebuild tree alg. Patricia alg.
10 K 5 94 165 148 1.5 ms 0.53 MB 0.61 MB
100 K 6 46 69 75 17.7 ms 4.82 MB 6.04 MB
250 K 7 20 29 41 51.8 ms 38.40 MB 15.00 MB
500 K 7 12 18 27 107.9 ms n/m n/m
1 M 7 10 14 22 213.4 ms n/m n/m

(n/m = not measured; Patricia allocation and the brute-force sanity check become impractical at 500 K+ on a laptop, so the sweep script only runs them for ≤250 K. Tree footprint at 500 K / 1 M is dominated by the same depth-7 bucket skeleton as 250 K and grows only with the filled-leaf count; we haven't added a separate probe yet.)

Observations:

  • Throughput falls as the FIB exceeds L3. The laptop has ~6 MB L3; the tree's flat footprint is below that at 100 K (4.8 MB) and above it from 250 K onward. Single-lookup throughput drops 2× going from 10 K → 100 K and another 2.3× going from 100 K → 250 K (4.7× overall), dominated by memory traffic once we spill out of L3.
  • Batching helps most at scale. At 10 K, batch<8> beats batch<32> (165 vs 148 MLPS) — everything's in L1/L2 and large unrolls just add register pressure. At 1 M, batch<32> is the best path (22 vs 14 for batch<8>); the long-latency DRAM fetches benefit from more in-flight loads.
  • Depth transitions matter. Depth goes 5→6 between 10 K and 100 K, and 6→7 between 100 K and 250 K. Each extra level is one more cache-line indirection per lookup and bumps the sentinel-padded key count (531 K → 4.78 M going 6→7), which drives the tree-above- Patricia memory crossover.
  • Rebuild scales roughly linearly. 1.5 / 17.7 / 51.8 / 107.9 / 213.4 ms for 10 K / 100 K / 250 K / 500 K / 1 M; the paper's Zen5 number is 850 ms for 1 M, so our Ice Lake laptop is actually ~4× faster on the rebuild path (likely because our tree build elides some of the preprocessing the paper bundles into rebuild time).

Dataset caveat: these are synthetic FIBs with a RIPE-weighted length distribution, not real BGP tables. Real BGP has stronger prefix-length clustering and spatial locality in advertised ranges, both of which change cache behavior — see the next section for a reproduction on an actual RIPE RIS RIB.

Real BGP reproduction (RIPE RIS rrc00)

examples/mrt_to_fib.py parses a TABLE_DUMP_V2 / RIB_IPV6_UNICAST MRT dump (RFC 6396) into a planb-lpm FIB. On the rrc00 full-table latest-bview.gz of 2026-04-19 we get 254 197 unique IPv6 prefixes — close to the ≈235 K the paper cites for rrc00 in its evaluation. Same 1 M uniform-random 64-bit trace, same 20-run discipline, pinned to core 2:

Path min median max IQR
single 48.7 67.5 74.4 3.6
batch<8> 115.3 137.2 151.0 8.7
batch<32> 100.5 118.6 137.1 8.3

Units: MLPS. Tree depth 7, 4.78 M sentinel-padded keys, 38.4 MB algorithmic footprint (same skeleton as the synthetic 250 K row). Build is 63 ms, full rebuild under bench_update is median 40 ms (paper 850 ms/1 M on Zen5 → ~213 ms/250 K scale; our Ice Lake laptop at 40 ms is ~5× faster — likely because the paper bundles more preprocessing into the rebuild number).

Throughput is roughly 3× higher on real BGP than on the same-size synthetic (67 vs 20 MLPS single). The synthetic distribution is length-weighted but picks each prefix's address uniformly; real BGP concentrates advertisements in a small fraction of the address space, so a uniform-random lookup trace against a real RIB tends to hit the same hot tree paths and stays L2-resident. This is the opposite direction of "real data is worse" that we expected — it mainly tells us our trace is the limiting factor, not the FIB.

Against real BGP the Patricia baseline gets faster too — by more:

Implementation min median max IQR
lpm6::Tree 43.5 65.3 75.4 5.1
Patricia trie 51.2 95.7 110.9 24.2

Patricia jumps from 2.4 MLPS on the 100 K synthetic to 95 MLPS on real BGP — 40× better on the same machine, and now ahead of the tree. Two things are going on:

  1. Uniform 64-bit lookups bail out early in Patricia on a BGP table. Most random upper-64-bit addresses fall into a handful of heavily covered root regions (think /6, /8, /12) — Patricia returns after 1–2 pointer chases staying in L1. The tree always does 7 cache-line fetches no matter which prefix ultimately wins.
  2. Our Patricia is a stand-in, not one of the paper's baselines. The paper's 1.6–14× advantage is over PopTrie / CP-Trie / Neurotrie / HBS — those structures add per-node SIMD and bitmap acceleration that a plain pointer trie doesn't have; on uniform lookups they pay more than our Patricia does. The gap to a genuine paper baseline is almost certainly larger than our gap to Patricia.

The honest conclusion: on this hardware, on a real BGP table, against a uniform 64-bit trace, a plain Patricia trie is roughly at parity with the tree. The tree's advantage shows up with non-uniform traces (real packet captures concentrate on long prefixes where Patricia walks deep) or with the paper's actual baselines. Both are on the roadmap.

bench_mt runs batch<8> across T threads, each pinned to a distinct logical CPU, all sharing one immutable lpm6::Tree. Threads release in lockstep per run so L3 / DRAM contention is measured, not dodged. Same 20-run discipline. Laptop: 4 physical cores / 8 logical (SMT):

FIB T per-thread min per-thread max aggregate MLPS efficiency
BGP 254 K 1 95 140 130 1.00×
BGP 254 K 2 95 143 278 1.07×
BGP 254 K 4 68 142 396 0.76×
BGP 254 K 8 17 98 626 0.60×
synth 1 M 1 12 18 15 1.00×
synth 1 M 2 13 18 31 1.02×
synth 1 M 4 7 16 52 0.85×
synth 1 M 8 3 15 85 0.69×

Efficiency = aggregate at T ÷ (T × aggregate at T=1). Units MLPS.
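The efficiency column is just that formula applied to the aggregate column; spelled out as a sketch:

```python
def efficiency(aggregate_t: float, aggregate_1t: float, threads: int) -> float:
    """Scaling efficiency: aggregate throughput at T threads divided by
    T times the single-thread aggregate."""
    return aggregate_t / (threads * aggregate_1t)
```

For the BGP rows: efficiency(278, 130, 2) ≈ 1.07 (super-linear at T=2), efficiency(396, 130, 4) ≈ 0.76, efficiency(626, 130, 8) ≈ 0.60.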

Observations:

  • T=2 is super-linear on the BGP FIB (1.07×). At 1T, a single memory-bound lookup chain leaves functional units idle during DRAM waits; with 2T on 2 cores the hardware has more in-flight loads and can hide more of that latency. Similar effect is visible on synth.
  • 4-of-4 physical cores is the cleanest scaling regime. At 4T the efficiency is ~0.76–0.85×, mainly limited by shared L3 once the hot set no longer fits per-core.
  • 8T uses SMT (hyperthreading). Per-thread throughput collapses (the two sibling threads contend for L1/L2 and AVX-512 port bandwidth on the same physical core), but aggregate still rises because the second sibling occupies stall cycles. Efficiency vs single-core baseline drops to 0.60–0.69×.
  • Real BGP scales worse relatively than synth 1 M. BGP's hot set at T=1 already hits ~130 MLPS and stays in L2; more threads push it out and each one pays the L3 tax. Synth-1M is DRAM-bound even at T=1 (footprint 38 MB ≫ L3), so adding threads is cheaper in relative terms — the DRAM subsystem was the bottleneck anyway.

Paper reports 3.4 BLPS on 12 Xeon cores (§5.4). Our 8-thread laptop (4 real cores + SMT) hits 0.63 BLPS on the same-scale BGP table — roughly 5× below the paper, which tracks with 12 vs 4 real cores plus Xeon's wider L2 / higher sustained AVX-512 frequency. A proper comparison needs a server-class box; see Phase 2.6 in plan.md.

PlanB's update model is batched: N pending prefix changes are coalesced into one full rebuild (§3.6 of the paper). bench_update measures rebuild time and per-op latency on the 100 000-prefix FIB:

  tree rebuild          : min 17.2  med 18.8  max 25.0  ms    (IQR 1.2)
  dynamic insert (incl. rebuild) : med 14.4 ms
  amortized if coalesced per 1 K : ~14 µs / update
  amortized if coalesced per 10 K: ~1.4 µs / update
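The amortization is straightforward arithmetic — one rebuild's cost divided across the coalesced batch (a sketch, not the bench_update code):

```python
def amortized_us_per_update(rebuild_ms: float, batch: int) -> float:
    """Per-update cost in microseconds when `batch` pending prefix
    changes are coalesced into one full rebuild."""
    return rebuild_ms * 1000.0 / batch
```

With the median 14.4 ms dynamic-insert rebuild: ~14.4 µs/update at a 1 K batch, ~1.4 µs/update at 10 K, matching the figures above.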

The paper reports 850 ms to rebuild a 1 M-prefix FIB on an AMD Zen5 server; linear scaling gives ~85 ms for our 100 K setup and we see ~19 ms on Ice Lake, so the rebuild path is at parity within hardware noise.

Limitations — what this is not

  • Not on the paper's hardware. Laptop Ice Lake (4 physical cores, ~6 MB L3) vs. the paper's 12-core Xeon 8331C / Zen5 server — ≈5–15× raw single-core throughput gap before algorithm and ~3× core-count gap for multi-core aggregate. A server-class run on GCP c2-standard-16 (Cascade Lake, AVX-512) is on the roadmap (Phase 2.6 in plan.md).
  • Not the paper's baselines. Patricia is a stand-in conventional pointer-chasing trie; we do not implement PopTrie, CP-Trie, Neurotrie, or HBS. PopTrie is the next item in plan.md (Phase 2.7); the others are tracked but not scheduled.
  • Synthetic FIBs are length-weighted, not address-distributed like real BGP. The "Real BGP reproduction" section above shows how much this matters: throughput on a real rrc00 table is ~3× higher than on a same-size synthetic for uniform 64-bit traces, and Patricia's relative position against the tree flips entirely. Both the synthetic and the real-BGP runs use a uniform-random lookup trace — real packet captures concentrate on long prefixes, which would penalize Patricia further and is the regime where the linearized tree's fixed 7 cache-line cost looks best.

Treat these numbers as a reproducibility check that the algorithm behaves as the paper predicts on commodity hardware, not as a competitive evaluation.

Install the Python bindings in editable mode and run the example:

pip install pybind11 scikit-build-core
pip install -e .
python examples/basic_usage.py
from planb_lpm import Tree, Dynamic

t = Tree()
t.build([("2001:db8::", 32, 10), ("fe80::", 10, 20)])
t.lookup("2001:db8::1")   # -> 10

d = Dynamic()
d.load([("::", 0, 1)])
d.insert("2001:db8::", 32, 7)
d.lookup("2001:db8::abcd") # -> 7
  1. Expand each prefix/len into a [start, end) interval on the upper 64 bits of the IPv6 address space.
  2. Sort the 2·N boundary points; resolve prefix nesting with a stack so that each elementary interval knows its active next-hop.
  3. Pack the sorted boundaries into the leaves of a 9-ary linearized B+-tree (8 keys per node, 64-byte cache-line aligned). Internal nodes hold separator keys copied from the leaves at strides of 9^L · 8.
  4. Lookup is a pure predecessor search: at each internal node use a single AVX-512 vpcmpuq + popcnt to pick the child, then do the same at the leaf to locate the predecessor edge.
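Steps 1–2 and the predecessor-search idea of step 4 can be reproduced in a few lines of pure Python (a reference sketch over the upper 64 bits, using a sorted array + bisect where the real tree uses the SIMD B+-tree; names are illustrative, not the library API):

```python
import bisect
import ipaddress

def linearize(fib):
    """Steps 1-2: expand (prefix, plen<=64, nexthop) routes into aligned
    [start, end) intervals on the upper 64 bits, sort the boundary
    points, and resolve nesting with a stack so every elementary
    interval knows its active (longest-prefix) next-hop."""
    ivals = []
    for pfx, plen, nh in fib:
        hi = int(ipaddress.IPv6Address(pfx)) >> 64   # upper 64 bits
        span = 1 << (64 - plen)
        start = (hi // span) * span                  # align to /plen
        ivals.append((start, start + span, plen, nh))
    ivals.sort(key=lambda t: (t[0], t[2]))           # parents before children
    bounds = sorted({b for s, e, _, _ in ivals for b in (s, e)})
    starts, hops, stack = [], [], []                 # stack top = longest prefix
    i = 0
    for b in bounds:
        while stack and stack[-1][0] <= b:           # leave finished intervals
            stack.pop()
        while i < len(ivals) and ivals[i][0] == b:   # enter new intervals
            stack.append(ivals[i][1:])               # (end, plen, nexthop)
            i += 1
        starts.append(b)
        hops.append(stack[-1][2] if stack else None)
    return starts, hops

def lookup(starts, hops, addr):
    """Step 4 without SIMD: a predecessor search over the sorted
    boundaries (the tree replaces bisect with vpcmpuq + popcnt)."""
    hi = int(ipaddress.IPv6Address(addr)) >> 64
    i = bisect.bisect_right(starts, hi) - 1
    return hops[i] if i >= 0 else None
```

This is exactly the brute-force reference the correctness tests are described as checking the tree against; the tree's contribution is doing the predecessor search in 8-key cache-line-aligned nodes with one AVX-512 compare per level.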
include/lpm6.hpp            core header-only library
src/main.cpp                CLI benchmark
src/ipv6_util.hpp           FIB/trace loaders
tests/test_correctness.cpp  brute-force vs tree vs patricia
tests/patricia.hpp          baseline path-compressed radix trie (benchmark only)
tests/bench_naive.cpp       three-way throughput: naive / patricia / tree
tests/bench_update.cpp      rebuild + dynamic-update latency
tests/bench_mt.cpp          multi-core scaling (shared const tree, per-thread slice)
tests/bench_stats.hpp       warmup / percentile / env-probe helpers
examples/mrt_to_fib.py      RFC 6396 TABLE_DUMP_V2 → FIB converter for RIPE RIS
python/binding.cpp          pybind11 module
examples/                   sample data + usage scripts

All files in this repository are original work under the MIT license — see LICENSE.md. The author's reference implementation at https://github.com/nicexlab/planb is not vendored; see the paper cited above for the algorithm itself.

The 0.1.0 tree has been built and tested in the following configurations on Ubuntu 24.04 / GCC 13.3 (Intel Ice Lake):

Configuration Result
Release, AVX-512 auto-detect ctest 1/1, bench OK
Release, AVX-512 forced off (scalar fallback) ctest 1/1, 0 zmm instructions in binary
Debug + -fsanitize=address,undefined ctest 1/1, no leaks / UB
-Wall -Wextra -Wpedantic -Wshadow -Wconversion -Werror Clean build
pip install -e . in a fresh venv + examples/basic_usage.py All lookups correct

The core algorithm is from the PlanB paper by Zhang et al. (see top of file); all source in this repository is an independent reimplementation.

Contributions of any size are welcome — bug reports, benchmark results on new hardware, alternative SIMD backends, language bindings, documentation fixes. See CONTRIBUTING for the dev workflow, commit conventions, and how to run the full verification suite (scalar fallback, ASAN/UBSAN, -Werror, pip install -e .) before opening a pull request.
