Hamming Distance for Hybrid Search in SQLite

原始链接: https://notnotp.com/notes/hamming-distance-for-hybrid-search-in-sqlite/

## Semantic Search in SQLite: A Hybrid Approach

This article details how to implement semantic search directly in SQLite, without an external vector database. The core idea is to combine SQLite's FTS5 text search with meaning-based retrieval, implemented via **binary embeddings** and **Hamming distance**.

Traditional semantic search uses float-based embeddings, which require substantial storage. This implementation quantizes embeddings down to single bits, dramatically reducing storage (1024 dimensions fit in 128 bytes) at the cost of some accuracy, a tradeoff the author considers worthwhile for the speed and space gains. Hamming distance measures similarity between these binary vectors efficiently using bit operations that modern CPUs optimize.

The solution is implemented as a custom SQLite extension that adds a `hamming_distance` function. Performance tests on 1 million rows show query times of about 35 ms including sorting, or 28 ms without, a viable speed even with a full table scan.

Finally, **Reciprocal Rank Fusion (RRF)** combines results from FTS5 keyword search (BM25) and semantic search into a robust **hybrid search** that handles both exact and nuanced queries. This approach is well suited to datasets under roughly 10 million rows, or to cases where avoiding external dependencies outweighs the cost of a linear scan.

Hacker News: 12 points, submitted by enz 3 hours ago.

Original article

This article shows how I implemented semantic search in SQLite using binary embeddings and Hamming distance, enabling hybrid search without external vector databases.

SQLite's FTS5 extension provides excellent text search with BM25 ranking. However, it doesn't support semantic search, which means you can't combine keyword matching with meaning-based retrieval, a technique known as hybrid search.

Binary embeddings and Hamming distance

Semantic search works by converting text into numerical vectors (embeddings) that capture meaning (in my case, I do that via an API). Similar texts (i.e., texts about the same things) produce similar vectors. Embeddings generally use float32 values, so a 1024-dimensional embedding requires 4 KiB per document.

Binary embeddings quantize each dimension to a single bit (0 or 1). This reduces storage dramatically: 1024 dimensions become only 128 bytes. The similarity metric changes from cosine distance to Hamming distance, which also replaces relatively expensive floating-point arithmetic with fast, simple bit operations.

The tradeoff is accuracy: binary quantization loses information. For many applications (including mine), this is acceptable given the storage and speed benefits, especially if it's combined with another classic text search algorithm like BM25.
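The article gets its embeddings from an API and doesn't show the quantization step, but a minimal sketch looks like this (the sign-based threshold is an assumption on my part; it is one common choice):

```c
#include <stdint.h>
#include <string.h>

/* Pack a float embedding into a binary vector: one bit per dimension,
   set when the value is positive. `dims` is assumed to be a multiple
   of 8, so 1024 dimensions pack into 128 bytes. */
static void quantize(const float *emb, int dims, uint8_t *out) {
    memset(out, 0, dims / 8);
    for (int i = 0; i < dims; i++) {
        if (emb[i] > 0.0f)
            out[i / 8] |= (uint8_t)(1u << (i % 8));
    }
}
```

A common alternative is thresholding each dimension at its corpus-wide mean instead of at zero.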

Hamming Distance

Hamming distance counts the number of bit positions where two binary vectors differ.

Example with 8-bit vectors:

Position:  0 1 2 3 4 5 6 7
Vector A:  1 0 1 1 0 1 1 0
Vector B:  1 0 0 1 1 0 1 0
               ^   ^ ^

Bits differ at positions 2, 4, and 5. Hamming distance = 3. For semantic search, lower Hamming distance means higher similarity.

To compute Hamming distance efficiently, we XOR the two vectors (1 where bits differ, 0 where same), then count the 1s (population count or "popcount"). Modern CPUs have dedicated popcount instructions, making this very fast.

Example:

Vector A:  1 0 1 1 0 1 1 0
Vector B:  1 0 0 1 1 0 1 0
A XOR B:   0 0 1 0 1 1 0 0

popcount(A XOR B) = 3 ones = distance of 3
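Outside SQLite, this computation is a one-liner using GCC/Clang's popcount builtin (the same builtin the extension below relies on); here for a single byte:

```c
#include <stdint.h>

/* Hamming distance between two bytes: XOR yields a 1 wherever the
   bits differ, popcount counts those 1s. */
static int hamming8(uint8_t a, uint8_t b) {
    return __builtin_popcount((unsigned)(a ^ b));
}
```

With the vectors above read MSB-first, `hamming8(0xB6, 0x9A)` returns 3.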

Implementation as a SQLite extension

SQLite extensions are dynamically loaded shared libraries (.so on Linux, .dylib on macOS) that can register custom SQL functions. This is simpler than forking SQLite or using external tools. The extension needs to:

  • Register the function name with SQLite
  • Implement the computation logic
  • Handle BLOB inputs and return an integer result

Registration code:

#include <stdint.h>
#include <sqlite3ext.h>
SQLITE_EXTENSION_INIT1

/* Forward declaration; the function itself is defined below. */
static void hamming_distance(sqlite3_context*, int, sqlite3_value**);

int sqlite3_extension_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
  SQLITE_EXTENSION_INIT2(pApi);
  return sqlite3_create_function(
    db,
    "hamming_distance",            /* name when called from SQL */
    2,                             /* number of arguments */
    SQLITE_UTF8 | SQLITE_DETERMINISTIC,
    0,
    hamming_distance,
    0, 0
  );
}

The core function receives two BLOB arguments (the binary vectors) and returns their Hamming distance:

static void hamming_distance(sqlite3_context *context, int argc, sqlite3_value **argv) {
    if (argc != 2) {
        sqlite3_result_error(context, "hamming_distance() requires 2 arguments", -1);
        return;
    }

    const unsigned char *blob1 = sqlite3_value_blob(argv[0]);
    const unsigned char *blob2 = sqlite3_value_blob(argv[1]);
    int len1 = sqlite3_value_bytes(argv[0]);
    int len2 = sqlite3_value_bytes(argv[1]);

    if (len1 != len2) {
        sqlite3_result_error(context, "vectors must be same length", -1);
        return;
    }

    /*
       This cast assumes the architecture supports unaligned 64-bit
       access, which is true for x86_64 and ARMv8-A+, including the
       Apple Silicon and Raspberry Pi 4 machines I tested against.
    */
    const uint64_t *v1 = (const uint64_t *)blob1;
    const uint64_t *v2 = (const uint64_t *)blob2;

    int chunks = len1 / 8;
    uint64_t distance = 0;

    // Process 8-byte chunks
    for (int i = 0; i < chunks; i++) {
        distance += __builtin_popcountll(v1[i] ^ v2[i]);
    }

    // Remaining bytes
    for (int i = chunks * 8; i < len1; i++) {
        distance += __builtin_popcount(blob1[i] ^ blob2[i]);
    }

    sqlite3_result_int64(context, distance);
}

This implementation does the following:

  • XORs corresponding 64-bit chunks from both vectors
  • Uses __builtin_popcount[ll]() for bit counting (compiles to the appropriate CPU instruction)
  • Accumulates the total distance
  • Handles any remaining bytes that don't fit in 64-bit chunks

Here are the Makefile rules I used to compile the extension for macOS (hamming.dylib) and for Linux (hamming.so):

hamming.dylib: hamming.c
        clang -g -fPIC -dynamiclib hamming.c -o hamming.dylib \
              -undefined dynamic_lookup -march=native -O3

hamming.so: hamming.c
        gcc -g -fPIC -shared hamming.c -o hamming.so \
        -march=native -O3

Performance on large datasets

Testing with 1 million rows of 128-byte binary vectors (1024 dimensions):

-- Load hamming.so or hamming.dylib
.load hamming

CREATE TABLE documents (
    rowid INTEGER PRIMARY KEY,
    embedding BLOB NOT NULL  -- 128 bytes for 1024-bit binary
);

-- Populate with 1M random embeddings
WITH RECURSIVE 
  cnt(x) AS (
    SELECT 1
    UNION ALL
    SELECT x+1 FROM cnt
    LIMIT 1000000
  )
INSERT INTO documents (rowid, embedding)
SELECT x, randomblob(128) FROM cnt;

.timer on

-- Searching
SELECT rowid, hamming_distance(:vec, embedding) as dist
FROM documents
ORDER BY dist
LIMIT 10;

Note: Instead of :vec, I hardcoded a raw 128-byte blob by hand. I omitted it to avoid visual clutter.

Having .timer on tells SQLite to report the time taken by the following queries. To avoid disk latency and get consistent results, I ran tests on in-RAM (`:memory:`) databases.

% sqlite3 :memory: < bench.sql
Run Time: real 0.035056 user 0.035006 sys 0.000053

Not bad: 35ms for 1 million rows on my Apple M4 chip, including the sort to get top-10 results. To measure just the Hamming distance computation without sorting overhead, we can add an impossible WHERE clause that is always false, forcing SQLite to scan all rows while returning none:

SELECT rowid, hamming_distance(:vec, embedding) as dist
FROM documents
WHERE dist < 0; -- Impossible condition, no sorting needed

% sqlite3 :memory: < bench-no-sort.sql
Run Time: real 0.028491 user 0.028477 sys 0.000017

28ms without sorting. The ORDER BY adds approximately 7ms to sort 1M rows.

Is O(n) acceptable?

Whether or not we use ORDER BY + LIMIT, this scans every row: there is no indexing and no pruning. But it's still pretty fast.

Implementing HNSW or IVF indexing would add complexity that didn't seem justified for my project, along with memory overhead (maybe 2-5x the data size) and build-time cost. The boring O(n) simplicity is often worth it at this scale.

Limitations

This implementation has a significant limitation: it's just a function callable from SQL, not a full-fledged virtual table. SQLite computes hamming_distance() for every row, then sorts all results, then takes the top 10.

For a true top-k selection, even without any indexing, I believe the extension would need to be a virtual table that can:

  • Recognize the ORDER BY + LIMIT k pattern
  • Maintain a heap of the k best candidates during scanning
  • Avoid sorting the full result set
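As a sketch of what that bounded selection looks like (independent of SQLite; the names here are mine, not the article's), a scan can keep only the k best (rowid, distance) pairs and replace the current worst whenever a closer row arrives:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { int64_t rowid; int dist; } Candidate;

/* Offer one scanned row to a fixed-size top-k buffer of capacity k
   holding `*count` entries; smallest distances win. A max-heap would
   make the replacement O(log k), but for small k a linear scan of the
   buffer is fine. */
static void topk_offer(Candidate *best, size_t k, size_t *count,
                       int64_t rowid, int dist) {
    if (*count < k) {                 /* buffer not full: just append */
        best[(*count)++] = (Candidate){rowid, dist};
        return;
    }
    size_t worst = 0;                 /* find the current worst entry */
    for (size_t i = 1; i < k; i++)
        if (best[i].dist > best[worst].dist) worst = i;
    if (dist < best[worst].dist)      /* replace it if we beat it */
        best[worst] = (Candidate){rowid, dist};
}
```

This keeps memory at O(k) regardless of table size and never materializes, let alone sorts, the full result set.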

Hybrid search with RRF

Combining FTS5 BM25 ranking with semantic search requires merging two different relevance signals. The standard approach is Reciprocal Rank Fusion (RRF). I went with that.

Run both searches independently: BM25 search on tokenized text (keyword matching), and then Hamming distance on binary embeddings (semantic similarity).
Each returns its top-k candidates (e.g., 50 results each).

The query text needs to be transformed into a binary vector first. I use an external API for this, and run the BM25 query while awaiting the response to hide the API latency.

RRF merges ranked lists without needing to normalize scores across different retrieval systems. It works by converting ranks into scores. For each document, calculate:

score(doc) = 1/(k + rank_bm25 + 1) + 1/(k + rank_semantic + 1)

where rank is the document's position in that retrieval method's results (0-indexed) and k is a constant (typically 60) that controls how quickly scores decay. Documents appearing in both lists receive additive boosts; documents in only one list still contribute. The final ranking sorts by combined score. For example, with k = 60, a document ranked first (rank 0) by both methods scores 1/61 + 1/61 ≈ 0.0328.


function mergeRRF(
  semanticIds: number[], // Sorted by relevance, assume no dup
  bm25Ids: number[],
  k: number = 60, // RRF constant
): number[] {
  const scores = new Map<number, number>();

  semanticIds.forEach((id, rank) => {
    scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
  });

  bm25Ids.forEach((id, rank) => {
    scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1));
  });

  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

Example real queries demonstrating hybrid benefits

  • "Apple founder": Semantic search disambiguates (Steve Jobs, not fruit)
  • "Python snake habitat": BM25 keyword "snake" filters out programming content
  • "Python programming tutorial": Semantic understanding recognizes programming context

The combination handles both precise keyword requirements and fuzzy semantic understanding, all within a single SQLite database.

When to use this approach

This works well when the dataset is small (say, under 10M rows on beefy hardware), or larger still if queries are infrequent enough that the O(n)-scan-plus-sort latency is acceptable, and when you want to avoid external vector database dependencies.
