Are we at peak vector database?

原始链接: https://softwaredoug.com/blog/2024/01/24/are-we-at-peak-vector-db

Summary: With so many vector databases and libraries to choose from, one has to wonder whether we have reached "peak" supply. Scanning the landscape turns up pure vector DBs like Pinecone, QDrant, Milvus, Weaviate, Turbopuffer, and MyScale; open-source search engines such as Solr, Elasticsearch, and Vespa; libraries like Annoy, FAISS, NMSLib, HNSWLib, Lucene, and Chroma; plus Redis, PGVector, Cassandra, Mongo, Azure AI Search, and Google Vertex. The sheer number of options is overwhelming. Yet while efficient vector retrieval is largely solved, the harder challenges lie in intent classification, inference and reranking, diversity, and lexical retrieval / hybrid search. Building vector-based applications also demands new ways of evaluating quality and interacting with data, and the lack of a unified solution underscores the need for expertise in complex, varied retrieval problems. Rather than focusing narrowly on improving traditional ANN methods, the broader effort must be to build fully integrated retrieval and ranking systems that handle many of these problems at once. Doing so could yield an entirely new kind of retrieval system, one designed to power AI and chat experiences, and a significant step forward for relevance-driven application development.

hnswlib is a library (C++ with Python bindings) implementing the Hierarchical Navigable Small World (HNSW) algorithm for approximate nearest-neighbor search, commonly used for cosine-similarity search over embeddings. While it can technically run on devices with minimal porting effort, it is usually deployed in backend server environments. Running complex vector-search algorithms directly on mobile or IoT devices is generally reserved for special-purpose applications with specific compute requirements, such as real-time video classification or object-tracking algorithms. Lightweight variants intended for edge environments are in development, but they remain early-stage and are not yet widely adopted in mainstream consumer applications. Still, running inference or search on-device can offer lower latency and better energy efficiency than shipping every task to a cloud or backend server, depending on the architecture and use case. The overall goal is high accuracy and fast response times with minimal computational complexity and resource consumption, delivering a seamless, responsive experience regardless of device location or network conditions.
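To make the underlying operation concrete, here is a minimal pure-Python sketch of the exact cosine-similarity search that ANN libraries like hnswlib approximate. The function names and toy vectors are illustrative, not hnswlib's actual API; HNSW exists precisely to avoid this full linear scan on large corpora.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, corpus, k=3):
    """Exact k-nearest-neighbor search by cosine similarity.

    O(n * d) per query: scores every vector, sorts, takes the top k.
    HNSW builds a graph index so it can skip most of this work.
    """
    scored = [(cosine_similarity(query, vec), idx)
              for idx, vec in enumerate(corpus)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(brute_force_search([1.0, 0.05], corpus, k=2))  # → [0, 1]
```

The trade-off an ANN index makes is giving up the exactness of this scan in exchange for sublinear query time, which is why recall (not just speed) is a key benchmark for these libraries.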

Original text

As both a search person and something of a veteran of the NoSQL days, I wonder to myself, often, “how can so many vector databases need to exist?”.

Even in our current AI age, where everyone and their mom is trying to build the next chat / AI powered whatever. Even today when it seems everyone is putting vectors somewhere to retrieve them…

I have to ask the tough question - when have we reached peak vector DB? Do we have enough choices for the specific task of storing and retrieving vectors?

Just on cursory listing, I can think of the following vector databases, libraries, whatever:

Pure Vector DBs

  • Pinecone
  • QDrant
  • Milvus / Zilliz
  • Weaviate
  • Turbopuffer
  • MyScale

Open Source search engines

  • Solr
  • Elasticsearch
  • OpenSearch
  • Vespa

Libraries

  • Annoy
  • FAISS
  • NMSLib
  • HNSWLib
  • Lucene
  • Chroma
  • (a million others)

Open Source DBs

  • Redis
  • PGVector
  • Cassandra, Mongo, etc etc (every DB seems to be getting its vector index :wink: )

Cloud solutions

  • Google Vertex
  • Azure AI Search

(more at awesome vector search)

If we take vector search as one type of data store in the NoSQL paradigm, we might put it in its own category. We would say Mongo is a document database alongside CouchDB and pals. We would say Cassandra is a columnar data store, alongside Scylla or HBase.

So in each category, we have a handful (2-3). Yet in vector search, we have dozens upon dozens of options. As a “customer” of such options, the field becomes overwhelming.

And vector retrieval increasingly isn't the problem. The hard problems of solving real-world retrieval aren't about just getting vectors; they're everything around it:

  • Intent classification - given a "query", how do I know whether I can solve the problem or not (RAG), or how to route the query to the correct place?
  • Inference and reranking - given a "query" and some candidate retrieved vectors / items, how do I perform inference on, say, an arbitrary TensorFlow model to give the most relevant items?
  • Diversity - given a "query", how do I broaden the candidate pool to more than just "similar to vectors" - to get at not just one intent, but all possible intents?
  • Lexical retrieval / hybrid search - given natural language, how do I use direct lexical signals (boring old BM25, just filtering, whatever) to give relevant results?
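One common approach to the hybrid-search item above is reciprocal rank fusion (RRF), which merges a lexical (BM25) ranking with a vector ANN ranking without needing to reconcile their incomparable score scales. A minimal sketch, with illustrative document ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked result lists (e.g. BM25 and vector ANN) into one.

    Each ranking is a list of doc ids, best first. A doc is scored by
    sum(1 / (k + rank)) over every list it appears in; k=60 is the
    constant proposed in the original RRF paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_a", "doc_b", "doc_c"]   # BM25 order
vector  = ["doc_b", "doc_d", "doc_a"]   # ANN order
print(reciprocal_rank_fusion([lexical, vector]))
# → ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF only consumes ranks, not raw scores, it sidesteps the calibration problem of mixing BM25 scores with cosine similarities, which is one reason several search engines ship it as their default hybrid-fusion method.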

OK and that’s just the lexical side. We’re inventing new ways of interacting with data. Nobody I talk to has really created a robust way to evaluate quality of context for RAG. There’s new UX paradigms out there - chat and chat-adjacent - that we’re experimenting with.

The challenge being, the world is wide open for experimentation, yet on first blush, all the money is being concentrated in one part of the stack. We’re not looking at the problem holistically.

OK, that’s one argument, sure.

Here’s the other point of view.

There needs to be a place to focus on, and rethink, retrieval problems. In the same way NoSQL forced us to rethink databases. Capital and brainpower need a place to zero in on how to solve this next generation of retrieval + relevance problems.

So the old, curmudgeonly search person in me would say “well whatever, people realize they need search engines and use Solr / Elasticsearch anyway”.

But that’s not good enough. These search tools feel esoteric; in the average “AI Engineer’s” mind, they will think first of vector retrieval, then stumble into all the problems I list above. They’ll learn they need to care about all the things beyond ANN, but only after their app is stood up. In the same way the search engineers of yore backed into all kinds of problems only after committing to a big Solr or Elasticsearch installation.

Additionally, I suspect, more and more surfaces will be driven by some retrieval-ish thing. Search-but-not-search. Real-time recommendations, but driven by vector (and other kinds of) retrieval that looks more like a search engine - not the batch-computed, nightly jobs common these days. I wrote about the coming revolution of “Relevance Driven Applications” in 2016, and now, it’s happening - not with boring old search engines as I once thought - but by reinventing our whole notion of the retrieval layer that drives user experiences.

So, I suspect the smart people at these companies will branch out beyond “making ANN better / more scalable” to building complete retrieval and ranking systems solving a tremendous array of problems. The money and effort will flow to where the customers see the problem.

In the end, like in NoSQL, we may end up with SQL again (but with all the NoSQL innovations). Or, in other words, we may end up with these vendors building a full-blown “search engine”. Yet these new search engines will have many more batteries included for the AI/Chat/RAG/whatever experiences people increasingly reach for.
