Hann: A Fast Approximate Nearest Neighbor Search Library for Go

Original link: https://github.com/habedi/hann

Hann is a Go library for efficient approximate nearest neighbor (ANN) search, intended to serve as a core component of a vector database. It provides a unified interface over several indexes, including HNSW, PQIVF, and RPT, and supports indexing and searching vectors of arbitrary dimension. It uses SIMD (AVX) instructions for fast distance computation and supports bulk operations (insert, delete, update) as well as persisting indexes to disk. HNSW exposes tunable parameters (M, Ef) for balancing accuracy and speed. PQIVF uses coarse clustering (coarseK) and product quantization (numSubquantizers, pqK) to search large datasets efficiently. RPT partitions data with random hyperplanes (leafCapacity, candidateProjections) and can optionally be built in parallel. Hann can be installed with `go get github.com/habedi/hann@main` and requires Go 1.21+, a C compiler, and AVX support. The HANN_LOG environment variable controls the verbosity of Hann's logs, and setting the HANN_SEED environment variable makes results more consistent across runs. Example code demonstrating the indexes on various datasets is provided.


Original Article

Hann is a high-performance approximate nearest neighbor (ANN) search library for Go. It provides a collection of index data structures for efficient similarity search in high-dimensional spaces. Currently, the supported indexes include Hierarchical Navigable Small World (HNSW), Product Quantization Inverted File (PQIVF), and Random Projection Tree (RPT).

Hann can be seen as a core component of a vector database (like Milvus, Pinecone, Weaviate, Qdrant, etc.). It can be used to add fast in-memory similarity search capabilities to your Go applications.

  • Unified interface for different indexes (see core/index.go; a rough sketch of such an interface follows this list)
  • Support for indexing and searching vectors of arbitrary dimension
  • Fast distance computation using SIMD (AVX) instructions (see core/simd_distance.c)
  • Support for bulk insertion, deletion, and update of vectors
  • Support for saving indexes to disk and loading them back
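
Based on the features listed above, a unified index interface in Go might look roughly like the sketch below. The method names and signatures are illustrative assumptions made for this sketch, not Hann's actual API; see core/index.go for the real definition.

// Hypothetical sketch of a unified ANN index interface; the names and
// signatures are assumptions for illustration, not Hann's actual API.
package sketch

// Neighbor pairs a stored vector's ID with its distance to the query.
type Neighbor struct {
	ID       int
	Distance float64
}

type Index interface {
	// Add inserts a single vector under the given ID.
	Add(id int, vector []float32) error

	// Bulk operations insert, delete, or update many vectors at once.
	BulkInsert(vectors map[int][]float32) error
	BulkDelete(ids []int) error
	BulkUpdate(vectors map[int][]float32) error

	// Search returns the k approximate nearest neighbors of the query vector.
	Search(query []float32, k int) ([]Neighbor, error)

	// Save and Load persist the index to disk and restore it.
	Save(path string) error
	Load(path string) error
}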
Index   Space Complexity   Build Complexity   Search Complexity
HNSW    $O(nd + nM)$       $O(n\log n)$       $O(\log n)$ average case, $O(n)$ worst case
PQIVF   $O(nk + kd)$       $O(nki)$           $O(\frac{n}{k})$
RPT     $O(nd)$            $O(n\log n)$       $O(\log n)$ average case, $O(n)$ worst case
  • $n$: number of vectors
  • $d$: number of dimensions (vector length)
  • $M$: links per node (HNSW)
  • $k$: number of clusters (PQIVF)
  • $i$: iterations for clustering (PQIVF)
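
To make the space column concrete, here is a rough back-of-the-envelope estimate for HNSW based on the $O(nd + nM)$ term. The byte sizes (4-byte float32 components, 8-byte neighbor links) and the values of $n$, $d$, and $M$ are illustrative assumptions, not measurements of Hann.

// Rough, illustrative estimate of HNSW memory use from the O(nd + nM) term.
// Byte sizes and parameter values are assumptions, not Hann measurements.
package main

import "fmt"

func main() {
	n := 1_000_000 // number of vectors
	d := 128       // dimensions (vector length)
	M := 16        // links per node (HNSW)

	vectorBytes := n * d * 4 // n*d float32 components
	linkBytes := n * M * 8   // roughly n*M neighbor references
	fmt.Printf("vectors: ~%d MB, links: ~%d MB\n",
		vectorBytes/(1<<20), linkBytes/(1<<20))
	// Prints: vectors: ~488 MB, links: ~122 MB
}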

The HNSW index supports Euclidean, squared Euclidean, Manhattan, and cosine distances. If cosine distance is used, the vectors are normalized on read (before they are used in the index or for search). Note that squared Euclidean distance is slightly faster to compute than Euclidean distance and yields the same ordering of closest vectors, so it can be used in place of Euclidean distance when only the order of the closest vectors to the query vector is needed, not the actual distances.
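
The ordering claim holds because squaring is monotonic for non-negative values, as the standalone snippet below illustrates (it does not use Hann; the points are arbitrary examples).

// Standalone illustration: squared Euclidean distance ranks neighbors in the
// same order as Euclidean distance while skipping the square root.
package main

import (
	"fmt"
	"math"
)

func squaredEuclidean(a, b []float32) float64 {
	var s float64
	for i := range a {
		d := float64(a[i] - b[i])
		s += d * d
	}
	return s
}

func euclidean(a, b []float32) float64 {
	return math.Sqrt(squaredEuclidean(a, b))
}

func main() {
	query := []float32{0, 0}
	points := [][]float32{{3, 4}, {1, 1}, {6, 8}}

	for i, p := range points {
		fmt.Printf("point %d: squared=%.2f euclidean=%.2f\n",
			i, squaredEuclidean(query, p), euclidean(query, p))
	}
	// Both metrics rank the points identically: {1, 1} < {3, 4} < {6, 8}.
}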

The PQIVF and RPT indexes support Euclidean distance only.

Hann can be installed as a typical Go module using the following command:

go get github.com/habedi/hann@main

Hann requires Go 1.21 or later, a C (or C++) compiler, and a CPU that supports AVX instructions.

Example File      Description
simple_hnsw.go    Create and use an HNSW index with inline data
hnsw.go           Create and use an HNSW index
hnsw_large.go     Create and use an HNSW index (using large datasets)
pqivf.go          Create and use a PQIVF index
pqivf_large.go    Create and use a PQIVF index (using large datasets)
rpt.go            Create and use an RPT index
rpt_large.go      Create and use an RPT index (using large datasets)
load_data.go      Helper functions for loading example datasets
utils.go          Extra helper functions for the examples
run_datasets.go   The code to create different indexes and try them with different datasets

Use the following commands to download the datasets used in the examples:

# Only needed to run the examples that use large datasets
make download-data-large

Note that running the examples that use large datasets may require a machine with a large amount of memory (for example, 32 GB or more).

Check the data directory for information about the datasets.


The detailed documentation for Hann packages is available on pkg.go.dev.

The hnsw package provides an implementation of the HNSW graph index introduced by Malkov and Yashunin (2016). HNSW organizes data into multiple layers of a proximity graph, which allows fast approximate nearest neighbor searches by greedily traversing the graph from top to bottom.

The index has the following parameters:

  • M: Controls the maximum number of neighbor connections per node. Higher values improve accuracy but increase memory and indexing time (typical range: 5–48).
  • Ef: Defines search breadth during insertion and searching. Higher values improve accuracy but increase computational cost (typical range: 10–200).
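
To give an intuition for how Ef acts as a search breadth, here is a simplified, single-layer, best-first search over a toy proximity graph. It is a standalone sketch of the general idea, not Hann's HNSW implementation; the graph, distance function, and stopping rule are illustrative.

// Simplified single-layer sketch of best-first graph search with a breadth
// parameter ef; illustrative only, not Hann's HNSW implementation.
package main

import (
	"fmt"
	"sort"
)

type graph struct {
	vectors   [][]float32
	neighbors [][]int // adjacency lists; in HNSW each node keeps at most M links
}

func dist(a, b []float32) float64 {
	var s float64
	for i := range a {
		d := float64(a[i] - b[i])
		s += d * d
	}
	return s
}

// searchLayer explores the graph from an entry point, keeping up to ef
// candidates; a larger ef explores more of the graph before stopping.
func (g *graph) searchLayer(query []float32, entry, ef int) []int {
	visited := map[int]bool{entry: true}
	frontier := []int{entry} // nodes waiting to be expanded
	best := []int{entry}     // closest candidates seen so far (kept sorted)

	for len(frontier) > 0 {
		// Expand the frontier node closest to the query.
		sort.Slice(frontier, func(a, b int) bool {
			return dist(query, g.vectors[frontier[a]]) < dist(query, g.vectors[frontier[b]])
		})
		cur := frontier[0]
		frontier = frontier[1:]

		// Stop once the closest unexpanded node is worse than the worst kept candidate.
		worst := best[len(best)-1]
		if len(best) >= ef && dist(query, g.vectors[cur]) > dist(query, g.vectors[worst]) {
			break
		}

		for _, nb := range g.neighbors[cur] {
			if !visited[nb] {
				visited[nb] = true
				frontier = append(frontier, nb)
				best = append(best, nb)
			}
		}
		// Keep only the ef closest candidates.
		sort.Slice(best, func(a, b int) bool {
			return dist(query, g.vectors[best[a]]) < dist(query, g.vectors[best[b]])
		})
		if len(best) > ef {
			best = best[:ef]
		}
	}
	return best
}

func main() {
	// Five points on a line, each linked to its immediate neighbors.
	g := &graph{
		vectors:   [][]float32{{0, 0}, {1, 0}, {2, 0}, {3, 0}, {4, 0}},
		neighbors: [][]int{{1}, {0, 2}, {1, 3}, {2, 4}, {3}},
	}
	fmt.Println(g.searchLayer([]float32{3.2, 0}, 0, 3)) // prints [3 4 2]
}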

The pqivf package provides an implementation of the PQIVF index introduced by Jegou et al. (2011). PQIVF first clusters data into coarse groups (inverted lists), then compresses vectors in each cluster using product quantization. This allows fast approximate nearest neighbor searches by limiting queries to relevant clusters and efficiently comparing compressed vectors, which reduces search time and storage requirements.

The index has the following parameters:

  • coarseK: Controls the number of coarse clusters for initial quantization. Higher values improve search performance but increase indexing time (typical range: 50–4096).
  • numSubquantizers: Determines the number of subspaces for product quantization. More subquantizers improve compression and accuracy at the cost of increased indexing time (typical range: 4–16).
  • pqK: Sets the number of codewords per subquantizer. Higher values increase accuracy and storage usage (typical value: 256).
  • kMeansIters: Number of iterations used to train the product quantization codebooks (recommended value: 25).
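
To see why product quantization reduces storage, consider the code sizes these parameters imply. The arithmetic below is generic PQ accounting, not a measurement of Hann; the vector dimension and dataset size are illustrative assumptions. With pqK = 256, each subquantizer code fits in one byte, so a vector is stored as numSubquantizers bytes plus a small shared codebook.

// Generic product-quantization size arithmetic for illustration; the
// dimension and dataset size are assumptions, not Hann measurements.
package main

import (
	"fmt"
	"math"
)

func main() {
	d := 128              // vector dimension
	n := 1_000_000        // number of vectors
	numSubquantizers := 8 // subspaces per vector
	pqK := 256            // codewords per subquantizer

	rawBytes := n * d * 4 // float32 storage without compression

	// Each code indexes one of pqK codewords: log2(pqK) bits per subspace.
	bitsPerCode := int(math.Log2(float64(pqK))) // 8 bits when pqK = 256
	codeBytes := n * numSubquantizers * bitsPerCode / 8

	// Codebooks: pqK centroids of dimension d/numSubquantizers per subquantizer.
	codebookBytes := numSubquantizers * pqK * (d / numSubquantizers) * 4

	fmt.Printf("raw: ~%d MB, codes: ~%d MB, codebooks: ~%d KB\n",
		rawBytes/(1<<20), codeBytes/(1<<20), codebookBytes/(1<<10))
	// Prints: raw: ~488 MB, codes: ~7 MB, codebooks: ~128 KB
}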

The rpt package provides an implementation of the RPT index introduced by Dasgupta and Freund (2008). RPT recursively partitions data using randomly generated hyperplanes to build a tree structure, which allows efficient approximate nearest neighbor searches through a tree traversal process.

The index has the following parameters:

  • leafCapacity: Controls the maximum number of vectors stored in each leaf node. Lower values increase tree depth, improving search speed but slightly increasing indexing time (typical range: 5–50).
  • candidateProjections: Number of random projections considered at each tree split. Higher values improve split quality at the cost of increased indexing time (typical range: 1–10).
  • parallelThreshold: Minimum number of vectors in a subtree to trigger parallel construction. Higher values lead to better concurrency during indexing but use more memory (typical value: 100).
  • probeMargin: Margin used to determine additional branches probed during searches. Higher values improve recall but increase search overhead because of additional distance computations (typical range: 0.1–0.5).
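
As a toy illustration of how leafCapacity shapes the tree (a standalone sketch of random-projection splitting, not Hann's RPT implementation): splitting at the median projection onto a random direction roughly halves each subtree, so the depth ends up around log2(n / leafCapacity) + 1. The point count and dimension below are arbitrary examples.

// Standalone sketch of random-projection partitioning; illustrative only,
// not Hann's RPT implementation.
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

func project(p, dir []float64) float64 {
	var s float64
	for i := range p {
		s += p[i] * dir[i]
	}
	return s
}

// buildDepth splits points by the median of their projections onto a random
// direction until a node holds at most leafCapacity points, and returns the
// resulting tree depth.
func buildDepth(points [][]float64, leafCapacity int, rng *rand.Rand) int {
	if len(points) <= leafCapacity {
		return 1
	}
	dim := len(points[0])
	dir := make([]float64, dim) // random hyperplane normal
	for i := range dir {
		dir[i] = rng.NormFloat64()
	}
	sort.Slice(points, func(a, b int) bool {
		return project(points[a], dir) < project(points[b], dir)
	})
	mid := len(points) / 2
	left := buildDepth(points[:mid], leafCapacity, rng)
	right := buildDepth(points[mid:], leafCapacity, rng)
	if left > right {
		return left + 1
	}
	return right + 1
}

func main() {
	rng := rand.New(rand.NewSource(42))
	points := make([][]float64, 10_000)
	for i := range points {
		points[i] = []float64{rng.Float64(), rng.Float64(), rng.Float64()}
	}
	// With n = 10000 and leafCapacity = 20, the depth is about
	// log2(10000/20) + 1 ≈ 10.
	fmt.Println("tree depth:", buildDepth(points, 20, rng))
}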

The verbosity level of logs produced by Hann can be controlled using the HANN_LOG environment variable. Possible values include:

  • 0, false, or off to disable logging altogether;
  • full or all to enable full logging (DEBUG level);
  • any other value (including leaving HANN_LOG unset) to enable basic logging (INFO level; this is the default behavior).

For more consistent indexing and search results across runs, set the HANN_SEED environment variable to an integer. This seeds the random number generator, though some variation is still possible (for example, due to multithreading).


See CONTRIBUTING.md for details on how to make a contribution.

The logo is named the "Hiking Gopher" and was created by Egon Elbre.

Hann is licensed under the MIT License (LICENSE).
