深入 FAISS：十亿级相似度搜索

深入 FAISS：十亿级相似度搜索
Inside FAISS: Billion-Scale Similarity Search

原始链接: https://fremaconsulting.ch/blog/faiss

为了压缩高维向量（例如 128 维的 SIFT 描述符），乘积量化（Product Quantization, PQ）提供了一种强大的替代方案，避免了存储未压缩数据时可能耗费的数百 GB 内存。类似于 GIF 的调色板——将 24 位像素映射到 256 色索引，PQ 通过使用来自“码本（codebook）”的代表性质心索引来替换向量。虽然使用单一的扁平码本来实现高压缩率（例如 64 位编码）需要天文数字般的质心数量，这会超出存储限制和训练数据的可用性，但 PQ 通过一种分解技巧解决了这个问题。通过将高维向量拆分为 $m$ 个较小的子向量，并为每个子向量分配一个小型的专属码本，PQ 在仅占用极小、可缓存内存的情况下，实现了巨大的有效搜索空间（例如 $256^8$ 种组合）。这使得压缩比达到 64 倍，将包含十亿级 SIFT 描述符的海量数据集压缩至可管理的范围内，同时还能保持有效的距离估算能力。

这篇 Hacker News 帖子讨论了一份名为“深入 FAISS：十亿级相似度搜索”的新交互式指南，该指南旨在作为 2017 年 FAISS（Facebook AI 相似度搜索）研究论文的可视化配套资料。作者解释说，该指南专注于简化诸如 IVFPQ/IVFADC 等复杂概念，这些概念通常很难仅从学术文本中理解。该项目因其高质量的可视化效果而受到了社区的高度赞扬。在讨论中，评论者谈到了 FAISS 的现状。尽管该库仍然是向量搜索的基础工具，但一些用户指出其维护力度有所下降。作者认为，这可能是由于 Meta 在组织架构调整后内部优先级的改变，以及行业向托管向量数据库服务（如 Pinecone、MongoDB Atlas 和 pgvector）的转向所致。虽然作者计划在未来的文章中介绍 NSG 和 FastScan 等更高级的 FAISS 主题，但当前的资源为那些希望了解大规模相似度搜索底层机制的人提供了一个绝佳的切入点。

IVF makes search fast by skipping most of the database, but leaves every vector uncompressed. One billion SIFT descriptors still cost 512 GB of RAM. Product Quantization (PQ), introduced by Jégou, Douze and Schmid (2011), is the compression trick FAISS builds on to shrink each vector to 8 bytes while keeping distance estimates meaningful.

Same centroids, a different job

§4 used centroids to partition the search space. PQ uses them to compresseach vector. Same K-Means math, new purpose: the centroid's index becomes the vector. If your codebook has centroids, any vector collapses to one integer in

A pixel-sized analogy

A 24-bit RGB pixel can be any of 16.7 million colors. A GIF gives up some of that range: it picks a 256-color palette and replaces every pixel with an 8-bit index into it. Storage drops 3×, the picture still looks like the picture. That palette is a codebook; the index is a code.

A 128-D SIFT descriptor is just a very long pixel. We want the same trick: find a palette of representative vectors, replace every descriptor with its nearest palette index.

How many bits does an index cost?

To label centroids uniquely you need enough bits to distinguish all of them. Each bit doubles the number of labels, so the code length is $bits. Concretely:$

$centroids \to$
$centroids \to$

More centroids means the quantizer discriminates finer detail, and the code grows accordingly. The codebook designer is playing a bit-budget game.

The paper's aggressive target

Jégou et al. want each 128-D SIFT descriptor to become a 64-bit code, which is 0.5 bit per dimension, a 64× compression ratio against the 512-byte original. Half a bit per dimension is ambitious: it means the quantizer has to encode all the variation along each axis with a single binary-ish choice.

Hitting 64 bits with one flat codebook means picking

Flat codebook feasibility

Pick a vector dimension and a bit budget per dimension to see the centroid count and storage footprint a single flat quantizer would need.

Vector dimension (D)

Bits per dimension

Total bits per code64bits

Centroids needed (k = 2^64)1.84 × 10^19centroids

Codebook storage9.44ZB

Impossible

Bigger than all cloud storage on Earth. Lloyd's algorithm would need roughly 5.53 × 10^20 training samples.

Why 2⁶⁴ is not a codebook

Three walls hit simultaneously when gets that large:

Storage.Each centroid is 128 floats × 4 bytes = 512 bytes. Times $centroids gives about 9.4 zettabytes, more than all cloud storage on Earth in 2025. You cannot persist the codebook, let alone page through it at query time.$
Training data.Lloyd's algorithm needs at least tens of samples per centroid to converge. That is
Query cost.To encode one descriptor you compare it to every centroid. 18 quintillion distance computations per vector is not a latency you can ship.

The flat codebook dies on all three fronts. The paper's move is to keep the effective vocabulary the same size while shrinking the stored codebook by many orders of magnitude.

The factoring trick

Back to the GIF analogy. Instead of finding one 256-color palette that works for the whole image, cut the image into 8 tiles and give each tile its own 256-color palette. Each tile is now one byte; the image is 8 bytes; every tile had access to a full 256-color palette tuned to its own pixels. PQ does exactly this for vectors.

Formally: split into sub-vectors

With $and$

Each encoded vector fits in

The walkthrough below builds the full index on the paper's reference SIFT setting (