Show HN:约500行代码实现用于向量嵌入的HNSW索引
Show HN: HNSW index for vector embeddings in approx 500 LOC

原始链接: https://github.com/dicroce/hnsw

这段C++代码实现了一个基于分层可导航小世界图(HNSW)的最近邻搜索索引。该索引旨在对高维向量空间进行快速近似最近邻搜索。它使用多层图结构,较高层具有较粗糙(稀疏)的图,较低层具有较密集的图。搜索从顶层开始,导航图以找到一个近邻节点,然后下降到较低层,细化搜索。Eigen用于SIMD加速距离计算,从而优化性能。示例代码演示了如何为128维浮点向量创建HNSW索引,添加10,000个随机生成的向量,然后对给定的查询向量执行最近邻搜索,检索k个最近邻。然后打印结果,包括ID和距离。提供的构建命令设置构建目录,执行CMake配置项目,使用Make编译代码,并返回到之前的目录。

Hacker News用户“dicroce”分享了他用大约500行代码实现的向量嵌入HNSW(分层可导航小世界)索引。HNSW被描述为一种分层图结构,其顶层节点稀疏,底层节点密集。节点连接到同一层内的附近节点。插入操作涉及随机选择一个层级,并将节点插入到该层级及其下方的所有层级。搜索从顶层开始,找到最近的节点,然后向下搜索附近的节点,同时跟踪已看到的K个最近节点。搜索返回level 0上的值或最近值以及K个最近节点。一位评论者赞赏了对该数据结构的清晰解释。另一位建议使用ann-benchmarks.com将该实现的性能和召回特性与现有实现进行比较。

原文

Nearest neighbor search for vector embeddings

  • Single header, modern C++.
  • Approximately 500 lines of code.
  • Fast (uses Eigen for SIMD acceleration of distance calculations).
// Level 2           *                    *        *                                        *
// Level 1        *  *              *     *        *        *          **                 * *
// Level 0  *     ** *     * * *    *    **   *  * *  * ** **          **     ** *    *   * *   * * *

HNSW is a graph structure that consists of levels that are more sparsely populated at the top and more densely populated at the bottom. Nodes within a layer have connections to other nodes that are near them on the same level. When a node is inserted a random level is picked and the node is inserted there. It is also inserted into all levels beneath that level down to 0.

When searches arrive they start at the top and search that level (following connections) until they find the closest node in the top level. The search then descends and keeps searching nearby nodes. As the search progresses the code keeps track of the K nearest nodes it has seen. Eventually it either finds the value OR it finds the closest value on level 0 and the K nearest nodes seen are returned.

#include "hnsw/hsnw.h"

int main(int argc, char* argv[])
{
    // Create an index for 128-dimensional vectors
    dicroce::hnsw<float> index(128);
        
    const size_t num_vectors = 10000;
    std::vector<std::vector<float>> vectors(num_vectors) = generate_random_vv(num_vectors, 128);
    
    // Build the index    
    for (size_t i = 0; i < vectors.size(); ++i)
        index.add_item(vectors[i]);
    
    // Search for nearest neighbors
    std::vector<float> query(128) = generate_random_v(128);
    
    const size_t k = 10;
    auto results = index.search(query, k);
    
    for (const auto& result : results)
        printf("ID: %lu, Distance: %f\n", result.first, result.second);
}

mkdir build && pushd build && cmake .. && make && popd

联系我们 contact @ memedata.com