Asymmetric Quantization: Near-Lossless Retrieval with 97% Storage Reduction

Original link: https://www.mixedbread.com/blog/asymmetric-quant

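Late interaction models such as Wholembed v3 significantly improve retrieval precision by preserving fine-grained document information, but because each document produces hundreds of vectors, they are expensive to store. To make the technique practical at billion-document scale, the Mixedbread Search team implemented **asymmetric quantization** in its "Silo" engine. By keeping query vectors at higher precision (int8) and storing document vectors as 1-bit binary signs, the system achieves a 32x reduction in per-document storage, from 393 KiB to 12.28 KiB. The approach preserves ranking quality with only a small drop in NDCG@10 (from 90.26 to 89.65) while significantly improving performance. Because document vectors are persistent and queries are short-lived, the tradeoff targets the system's main cost drivers: storage, IO, and cache space. The binary document format also allows optimized scoring kernels that replace multiplications with simple select-and-sum operations. The result keeps the benefits of high-quality multi-vector representations while retaining the efficiency and low cost that large-scale production search systems require.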


Original text

Late interaction models like Wholembed v3 make retrieval much more precise, because they preserve fine-grained information instead of compressing a whole document into one vector. But they also change the storage economics. A single document produces more than one embedding: depending on its complexity, it can produce hundreds or thousands of vectors. Each vector has to be stored and later used for retrieval.

Mixedbread Search runs on silo, our retrieval engine for multimodal late interaction at billion-document scale. Silo stores vectors for more than 2.5 billion documents in object storage and hydrates them into faster tiers as queries need them. At that scale, every extra byte per document is repeated billions of times, and it shows up directly in cost per stored document, shard cold-start time, and the bytes each query has to read. We need to work around this tradeoff: make the whole system cheap while maintaining the quality that makes late interaction worth running.

This post walks through asymmetric quantization, one of the optimizations that make running late interaction practical in production. We keep the query vectors at higher precision and store the document vectors as binary signs. In our internal benchmark suite, that cuts raw document-vector storage by 32x on average, from 393 KiB to 12.28 KiB per document, while holding retrieval quality at 89.65 NDCG@10 versus 90.26 for fp32.

[Figure: per-document storage for a multi-vector document drops from 393 KiB in fp32 to 12.28 KiB with an int8 query against binary documents, a 32-fold (97%) reduction. fp32 is drawn as 32 cells, each the size of the single int8-by-binary cell.]

Quantization means representing high-precision floating point vectors with lower-precision values such as int8, or even 1-bit signs. The goal is to preserve ranking quality while reducing payload size. This matters especially for silo. Object storage gives us durable, low-cost persistence. In order to make it suitable for real workloads, we need compact indexes to serve it fast enough. And on the document side, payload size is what dominates the cost.

Naive late interaction is expensive because it stores more vectors. A standard single-vector embedding with 3072 dimensions in fp32 takes 12 KiB per document. A multi-vector representation with 786 vectors of 128 dimensions carries much more information, but uncompressed it is about 33x larger.

| Representation | Dimensions | Storage per document | Relative to 3072-d fp32 single vector |
| --- | --- | --- | --- |
| Single vector, fp32 | 3072 | 12,288 B / 12 KiB | 1.0x |
| Single vector, int8 | 3072 | 3,072 B / 3 KiB | 0.25x |
| Multi-vector, fp32 | 786 × 128 | 402,432 B / 393 KiB | 32.75x |
| Multi-vector, int8 | 786 × 128 | 100,608 B / 98.25 KiB | 8.19x |
| Multi-vector, binary | 786 × 128 | 12,576 B / 12.28 KiB | 1.02x |

Storage numbers here refer to raw vector payloads only. Production indexes also include document IDs, metadata, and layout overhead.
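The table's figures follow from simple arithmetic: vectors × dimensions × bits per dimension. As a quick check, here is a minimal sketch that recomputes the raw payload sizes; the helper name `payload_bytes` and the `ROWS` list are illustrative and not part of silo.

```python
def payload_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Raw vector payload in bytes (no document IDs, metadata, or layout overhead)."""
    return num_vectors * dims * bits_per_dim // 8


# (label, number of vectors, dimensions per vector, bits per dimension)
ROWS = [
    ("Single vector, fp32", 1, 3072, 32),
    ("Single vector, int8", 1, 3072, 8),
    ("Multi-vector, fp32", 786, 128, 32),
    ("Multi-vector, int8", 786, 128, 8),
    ("Multi-vector, binary", 786, 128, 1),
]

# Baseline used for the "relative" column: one 3072-d fp32 vector.
BASELINE = payload_bytes(1, 3072, 32)

for name, n, dims, bits in ROWS:
    size = payload_bytes(n, dims, bits)
    print(f"{name:<22} {size:>9,} B  {size / 1024:>7.2f} KiB  {size / BASELINE:>6.2f}x")
```

The binary multi-vector payload is exactly 1/32 of the fp32 multi-vector payload, a 96.875% reduction, which rounds to the 97% quoted in the title.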

With binary document vectors, a 786-token multi-vector document is only about 2% larger than a 3072-dimensional fp32 single vector. That means you pay roughly single-vector storage and get late interaction quality. This changes the tradeoff: late interaction becomes practical to run by default, instead of something reserved for cases that justify the storage.

This is not a new direction for late interaction. ColBERTv2 showed that aggressive compression can reduce the footprint of late interaction models while preserving quality. PLAID showed that late interaction retrieval can be engineered down to practical latency using optimized retrieval and pruning. For production systems, both lessons matter: the model has to be precise, and the representation has to be cheap enough to move through hardware.

Compressing the document vectors saves storage, IO, cache space, and cold-start time across the entire corpus. Compressing the query vectors saves almost nothing because the query is small, short-lived, and never stored in the index.

This is also why we do not binarize both sides. Fully binary retrieval is the most compact option, but dropping the query to single bits throws away the magnitude information the ranking depends on, and it costs far more quality than binarizing documents alone (as shown later).

So we keep the query in int8 and store only the document vectors as binary signs. The query stays precise enough to preserve ranking, while the document side gets the storage reduction that matters for serving.
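To make the two sides concrete, here is a minimal pure-Python sketch of the asymmetric scheme. The post does not describe silo's actual quantizer, so the details are assumptions: a symmetric per-vector scale for the int8 query, zero treated as a positive sign, and bits packed least-significant-bit first. The function names are hypothetical.

```python
from typing import List


def quantize_query_int8(vec: List[float]) -> List[int]:
    """Map a float query vector to int8 values in [-127, 127].

    Assumption: symmetric per-vector scaling by the largest magnitude.
    """
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0.0:
        return [0] * len(vec)
    scale = 127.0 / max_abs
    return [max(-127, min(127, round(x * scale))) for x in vec]


def binarize_document(vec: List[float]) -> bytes:
    """Keep only the sign of each dimension, packed 8 dimensions per byte.

    Assumption: bit (i % 8) of byte (i // 8) is 1 when dimension i is >= 0,
    which stands for b_i = +1; a 0 bit stands for b_i = -1.
    """
    packed = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x >= 0.0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


query = quantize_query_int8([1.0, -1.0, 0.0])
doc = binarize_document([0.1] * 128)
print(query, len(doc))
```

Under these assumptions a 128-dimensional document vector occupies 16 bytes, while the query keeps one signed byte per dimension.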

Binary document vectors are smaller and thus cheaper to store.

For int8 x int8 scoring, modern ARM CPUs give us direct support through NEON dot-product instructions. Our AArch64 kernel uses SDOT to accumulate sixteen int8 multiplications into int32 lanes, then horizontally reduces the result with vaddvq_s32. For int8 x binary scoring, the useful identity is simpler. If each document dimension is stored as a sign bit, with b_i in {-1, +1}, then:

$$
\begin{aligned}
\mathbf{q} \cdot \mathbf{b} &= \sum_i q_i b_i \\
&= 2 \sum_{i\,:\,b_i = +1} q_i - \sum_i q_i
\end{aligned}
$$
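The identity holds because the sum over dimensions with b_i = -1 equals the total sum of q_i minus the sum over dimensions with b_i = +1. The sketch below is a scalar illustration of that select-and-sum form, checked against the direct ±1 product. It is not silo's kernel: the function names and the LSB-first bit layout are assumptions made for the example.

```python
from typing import List


def sign_at(doc_bits: bytes, i: int) -> int:
    """Decode dimension i of a packed sign vector as +1 or -1.

    Assumption: bit (i % 8) of byte (i // 8) is 1 for +1 and 0 for -1.
    """
    return 1 if (doc_bits[i // 8] >> (i % 8)) & 1 else -1


def dot_reference(query: List[int], doc_bits: bytes) -> int:
    """Direct form: multiply each q_i by its decoded sign and add."""
    return sum(q * sign_at(doc_bits, i) for i, q in enumerate(query))


def dot_select_and_sum(query: List[int], doc_bits: bytes, query_total: int) -> int:
    """Identity form: 2 * (sum of q_i where b_i = +1) - (sum of all q_i).

    query_total depends only on the query, so it can be computed once
    and reused for every document vector scored against that query.
    """
    selected = sum(q for i, q in enumerate(query) if sign_at(doc_bits, i) == 1)
    return 2 * selected - query_total


query = [3, -5, 7, 1, -2, 0, 4, -6]   # int8 query values
doc_bits = bytes([0b10110101])        # signs: + - + - + + - +
total = sum(query)

assert dot_select_and_sum(query, doc_bits, total) == dot_reference(query, doc_bits)
print(dot_select_and_sum(query, doc_bits, total))
```

The inner loop contains no per-dimension multiplication: each query value is either added or skipped, and the single doubling and subtraction happen once per vector pair.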