显示 HN：使用 ACORN-1 的预过滤 HNSW DuckDB 社区扩展

显示 HN：使用 ACORN-1 的预过滤 HNSW DuckDB 社区扩展
Show HN: DuckDB community extension for prefiltered HNSW using ACORN-1

原始链接: https://github.com/cigrainger/duckdb-hnsw-acorn

这个DuckDB扩展通过使用HNSW索引增强了向量相似性搜索，解决了原始`duckdb/duckdb-vss`实现中的一个限制。核心改进是集成了ACORN-1算法，该算法**将过滤谓词推入HNSW图遍历中**。这确保了带有`WHERE`子句的查询（例如，`SELECT ... WHERE category = 'X' ORDER BY distance LIMIT 10`）返回预期的结果数量，而上游版本通常返回的结果较少。该分支采用了基于选择性的策略：对于选择性在1-60%之间的过滤器使用ACORN-1，对于小于1%的选择性使用暴力扫描，对于大于60%的选择性使用后过滤。可配置的阈值（`hnsw_acorn_threshold`，`hnsw_bruteforce_threshold`）允许进行调整。基准测试（使用228k电影和Nomic嵌入）表明，召回率得到了显著提高，尤其是在高度选择性的过滤器（例如，日本或韩国电影）的情况下。该扩展不需要特殊的SQL语法；优化器会自动为合适的查询利用过滤后的HNSW搜索。它支持预处理语句，并允许通过`CREATE INDEX ... USING HNSW`创建索引。

一个新的 DuckDB 社区扩展程序，实现了 ACORN-1，用于预过滤近似最近邻 (ANN) 搜索，现已发布。该扩展程序由一位对现有混合搜索解决方案（如 pgvector）的局限性感到沮丧的用户开发，它允许在 ANN 搜索*之前*使用“WHERE”子句进行预过滤，从而显著提高效率。目前，带有过滤器的标准 HNSW 往往会导致对所有候选对象的扫描，从而抵消了 ANN 的优势。ACORN-1 旨在解决这个问题，但对于高度选择性的过滤器（例如，消除索引的 95%）的有效性仍有待考量。该扩展程序现在可以通过 `INSTALL hnsw_acorn FROM community; LOAD hnsw_acorn;` 获取。讨论强调了其在分析工作负载（如金融时间序列数据，可能替代 BigQuery 用于较小的数据集）中的潜在用例，以及与 Lance 等数据格式的集成。进一步的开发可能会探索高效的二进制向量支持和 RaBitQ 算法。

原文

This is a fork of duckdb/duckdb-vss that adds ACORN-1 filtered HNSW search.

The upstream extension has a critical limitation: WHERE clauses are applied after the HNSW index returns results, so SELECT ... WHERE category = 'X' ORDER BY distance LIMIT 10 often returns fewer than 10 rows. This fork pushes filter predicates into the HNSW graph traversal using the ACORN-1 algorithm, ensuring filtered queries return the correct number of results with high recall.

What changed:

Filter predicates are evaluated during HNSW graph traversal, not after

ACORN-1 two-hop expansion through failed neighbors recovers graph connectivity under selective filtering

Selectivity-based strategy switching: >60% selectivity uses post-filter, 1-60% uses ACORN-1, <1% uses brute-force exact scan

Per-node expansion threshold (Lucene's 90% rule) skips two-hop when the neighborhood is already well-connected

Configurable thresholds: SET hnsw_acorn_threshold = 0.6 and SET hnsw_bruteforce_threshold = 0.01

Benchmark (228k movies, 768-dim Nomic embeddings):

Filter Selectivity Upstream ACORN-1

English only ~60% ~10/10 10/10

Japanese only ~3% 0-1/10 10/10

Korean only ~1% 0/10 10/10

Rating >= 8.0 ~5% 0/10 10/10

Query: movies similar to The Matrix, filtered by language → returns Matrix Revolutions, Gunhed (ja), Savior of the Earth (ko). See test/benchmark/movies_real_benchmark.sql for the full benchmark.

Filter	Selectivity	Upstream	ACORN-1
English only	~60%	~10/10	10/10
Japanese only	~3%	0-1/10	10/10
Korean only	~1%	0/10	10/10
Rating >= 8.0	~5%	0/10	10/10

Original README follows.

Vector Similarity Search for DuckDB

This is an experimental extension for DuckDB that adds indexing support to accelerate Vector Similarity Search using DuckDB's new fixed-size ARRAY type added in version v0.10.0. This extension is based on the usearch library and serves as a proof of concept for providing a custom index type, in this case a HNSW index, from within an extension and exposing it to DuckDB.

Filtered Search (ACORN-1)

This fork adds support for filtered vector search. Queries with WHERE clauses now push filter predicates into the HNSW index traversal:

-- This now returns exactly 10 results with category = 'X',
-- ordered by distance. Upstream would return 0-2 results.
SELECT * FROM my_table
WHERE category = 'X'
ORDER BY array_distance(vec, [1,2,3]::FLOAT[3])
LIMIT 10;

No special syntax required — the optimizer automatically detects WHERE + ORDER BY distance + LIMIT patterns and uses filtered HNSW search.

Prepared statements work for parameterized query vectors:

PREPARE search AS SELECT * FROM my_table
WHERE category = $2
ORDER BY array_distance(vec, $1::FLOAT[3])
LIMIT 10;

EXECUTE search([1,2,3], 'X');

Note: Subquery query vectors (ORDER BY distance(vec, (SELECT q FROM ...))) currently fall back to sequential scan. Use prepared statements or literal vectors instead.

-- Selectivity above which ACORN-1 is skipped (default 0.6 = 60%)
SET hnsw_acorn_threshold = 0.6;

-- Selectivity below which brute-force exact scan is used (default 0.01 = 1%)
SET hnsw_bruteforce_threshold = 0.01;

To create a new HNSW index on a table with an ARRAY column, use the CREATE INDEX statement with the USING HNSW clause. For example:

CREATE TABLE my_vector_table (vec FLOAT[3]);
INSERT INTO my_vector_table SELECT array_value(a,b,c) FROM range(1,10) ra(a), range(1,10) rb(b), range(1,10) rc(c);
CREATE INDEX my_hnsw_index ON my_vector_table USING HNSW (vec);

The index will then be used to accelerate queries that use a ORDER BY clause evaluating one of the supported distance metric functions against the indexed columns and a constant vector, followed by a LIMIT clause. For example:

SELECT * FROM my_vector_table ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) LIMIT 3;

# We can verify that the index is being used by checking the EXPLAIN output 
# and looking for the HNSW_INDEX_SCAN node in the plan

EXPLAIN SELECT * FROM my_vector_table ORDER BY array_distance(vec, [1,2,3]::FLOAT[3]) LIMIT 3;

┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            vec            │
│array_distance(vec, [1.0, 2│
│         .0, 3.0])         │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│      HNSW_INDEX_SCAN      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   t1 (HNSW INDEX SCAN :   │
│           my_idx)         │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            vec            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 3           │
└───────────────────────────┘

By default the HNSW index will be created using the euclidean distance l2sq (L2-norm squared) metric, matching DuckDBs array_distance function, but other distance metrics can be used by specifying the metric option during index creation. For example:

CREATE INDEX my_hnsw_cosine_index ON my_vector_table USING HNSW (vec) WITH (metric = 'cosine');

The following table shows the supported distance metrics and their corresponding DuckDB functions

Description	Metric	Function
Euclidean distance	`l2sq`	`array_distance`
Cosine similarity	`cosine`	`array_cosine_distance`
Inner product	`ip`	`array_negative_inner_product`

Inserts, Updates, Deletes and Re-Compaction

The HNSW index does support inserting, updating and deleting rows from the table after index creation. However, there are two things to keep in mind:

Its faster to create the index after the table has been populated with data as the initial bulk load can make better use of parallelism on large tables.
Deletes are not immediately reflected in the index, but are instead "marked" as deleted, which can cause the index to grow stale over time and negatively impact query quality and performance.

To address this, you can call the PRAGMA hnsw_compact_index('<index name>') pragma function to trigger a re-compaction of the index pruning deleted items, or re-create the index after a significant number of updates.

Only vectors consisting of FLOATs are supported at the moment.
The index itself is not buffer managed and must be able to fit into RAM memory.

With that said, the index will be persisted into the database if you run DuckDB with a disk-backed database file. But there is no incremental updates, so every time DuckDB performs a checkpoint the entire index will be serialized to disk and overwrite its previous blocks. Similarly, the index will be deserialized back into main memory in its entirety after a restart of the database, although this will be deferred until you first access the table associated with the index. Depending on how large the index is, the deserialization process may take some time, but it should be faster than simply dropping and re-creating the index.

To build the extension, run:

The main binaries that will be built are:

./build/release/duckdb
./build/release/test/unittest
./build/release/extension/vss/vss.duckdb_extension

duckdb is the binary for the duckdb shell with the extension code automatically loaded.
unittest is the test runner of duckdb. Again, the extension is already linked into the binary.
vss.duckdb_extension is the loadable binary as it would be distributed.

To run the extension code, simply start the shell with ./build/release/duckdb.

Thes SQL tests can be run using: