Contextual Retrieval

Original link: https://www.anthropic.com/news/contextual-retrieval

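For AI models to be useful in specific settings, they need specific background knowledge. Examples include customer service chatbots, which need to know about the specific business they serve, and legal analysis bots, which need detailed knowledge of past legal cases. One way developers extend an AI model's knowledge is Retrieval-Augmented Generation (RAG), which pulls relevant information from a knowledge base and merges it into the user's request, improving the quality of the model's response. Although RAG has proven beneficial, current approaches often lose key context when encoding information, which leads to irrelevant responses.

A recent innovation called "Contextual Retrieval" addresses this problem. By using two complementary techniques (Contextual Embeddings and Contextual BM25), Contextual Retrieval reduces failed retrievals by 49%, and by 67% when combined with reranking. These gains yield better retrieval accuracy, which ultimately improves performance on downstream tasks. Deploying Contextual Retrieval takes minimal effort, since you can use it with Claude as described in our cookbook.

If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), a simpler option is available: you can include the entire knowledge base in the prompt, with no need for RAG or similar methods. We recently introduced prompt caching for Claude, which makes this approach faster and cheaper. As the knowledge base grows, however, you will need a more scalable option. That is where Contextual Retrieval comes in.

Traditionally, RAG has been essential for larger knowledge bases that do not fit in a single prompt. The knowledge base is first broken into many smaller pieces of text, called chunks. An embedding model converts each chunk into a vector representation that captures its meaning. Storing these vectors in a database allows fast search by semantic similarity. At runtime, the user's query triggers a search of the database for the most relevant chunks, and the top chunks are then added to the generative model's input to produce an answer. While embedding models excel at capturing semantic relationships, they can miss important exact matches. BM25 (Best Match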

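A team developed a retrieval-augmented generation (RAG) system for a government organization. They tried various techniques, including hybrid retrieval (combining semantic and vector search) and reranking based on a language model (LLM). When tested with synthetically generated evaluation questions, however, these methods produced negligible changes. Another method, called HYDRA, significantly reduced answer quality and retrieval quality. Although further evaluation on expert and real-world questions is needed, the study confirmed that semantic search in Azure AI Search, complementing vector similarity search, is sufficient for enterprise RAG applications.

Experimental tests of RAG with different methods showed that hybrid retrieval had almost no effect on synthetic evaluation questions. In addition, when evaluated with synthetic queries, the HYDRA technique caused a severe drop in answer and retrieval quality. Confirming these findings requires further testing with real-world and expert questions. In any case, the researchers found that combining semantic search in Azure AI Search with vector similarity gave favorable results for enterprise RAG applications. Depending on the situation, other methods (such as Best Match 25 or a fine-tuned post-query score normalization matrix) may also prove effective. In essence, the key message is to explore, experiment, and choose the appropriate option for each application's particular needs.

Future plans include implementing a recursive answer propagation transformer for RAG (RAPTOR-RAG), a self-relevance evaluation generator, Agentic RAG, query refinement (expansion and sub-queries), and GraphRAG learning algorithms. The researchers stress baselines and experiments when trying to reject the null hypothesis using tools such as RAGAS or similar metrics. They also suggest dividing evaluation questions into three categories: 1) expert-written questions and answers, 2) questions derived from user log data, and 3) questions and answers synthetically generated from source material. Despite being familiar with traditional search practices, the author found limitations when considering vector search. Nevertheless, the article compares Contextual BM25 with other hybrid methods before moving on to reranking with Fusion (rrf). For question-answering tasks, there is no doubt that vector search and semantic search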

Original text

For an AI model to be useful in specific contexts, it often needs access to background knowledge. For example, customer support chatbots need knowledge about the specific business they're being used for, and legal analyst bots need to know about a vast array of past cases.

Developers typically enhance an AI model's knowledge using Retrieval-Augmented Generation (RAG). RAG is a method that retrieves relevant information from a knowledge base and appends it to the user's prompt, significantly enhancing the model's response. The problem is that traditional RAG solutions remove context when encoding information, which often results in the system failing to retrieve the relevant information from the knowledge base.

In this post, we outline a method that dramatically improves the retrieval step in RAG. The method is called “Contextual Retrieval” and uses two sub-techniques: Contextual Embeddings and Contextual BM25. This method can reduce the number of failed retrievals by 49% and, when combined with reranking, by 67%. These represent significant improvements in retrieval accuracy, which directly translates to better performance in downstream tasks.

You can easily deploy your own Contextual Retrieval solution with Claude with our cookbook.

A note on simply using a longer prompt

Sometimes the simplest solution is the best. If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can just include the entire knowledge base in the prompt that you give the model, with no need for RAG or similar methods.

A few weeks ago, we released prompt caching for Claude, which makes this approach significantly faster and more cost-effective. Developers can now cache frequently used prompts between API calls, reducing latency by more than 2x and costs by up to 90% (you can see how it works by reading our prompt caching cookbook).

However, as your knowledge base grows, you'll need a more scalable solution. That’s where Contextual Retrieval comes in.

A primer on RAG: scaling to larger knowledge bases

For larger knowledge bases that don't fit within the context window, RAG is the typical solution. RAG works by preprocessing a knowledge base using the following steps:

1. Break down the knowledge base (the “corpus” of documents) into smaller chunks of text, usually no more than a few hundred tokens;
2. Use an embedding model to convert these chunks into vector embeddings that encode meaning;
3. Store these embeddings in a vector database that allows for searching by semantic similarity.

At runtime, when a user inputs a query to the model, the vector database is used to find the most relevant chunks based on semantic similarity to the query. Then, the most relevant chunks are added to the prompt sent to the generative model.
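
The preprocessing and runtime steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (it merely counts words, so it captures no real semantics), words stand in for tokens, and a plain Python list stands in for a vector database.

```python
import math
from collections import Counter

def chunk_text(text, max_words=50):
    """Split text into chunks of at most max_words words (a crude proxy for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents, max_words=50):
    """Preprocessing: chunk every document and store (chunk, embedding) pairs."""
    index = []
    for doc in documents:
        for chunk in chunk_text(doc, max_words):
            index.append((chunk, toy_embed(chunk)))
    return index

def retrieve(index, query, k=3):
    """Runtime: rank stored chunks by similarity to the query and return the top k."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

docs = [
    "Refunds are processed within five business days.",
    "Our support line is open on weekdays.",
]
index = build_index(docs)
top = retrieve(index, "how long do refunds take", k=1)
```

In a real system the retrieved chunks in `top` would then be added to the prompt sent to the generative model.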

While embedding models excel at capturing semantic relationships, they can miss crucial exact matches. Fortunately, there’s an older technique that can assist in these situations. BM25 (Best Matching 25) is a ranking function that uses lexical matching to find precise word or phrase matches. It's particularly effective for queries that include unique identifiers or technical terms.

BM25 works by building upon the TF-IDF (Term Frequency-Inverse Document Frequency) concept. TF-IDF measures how important a word is to a document in a collection. BM25 refines this by considering document length and applying a saturation function to term frequency, which helps prevent common words from dominating the results.

Here’s how BM25 can succeed where semantic embeddings fail: Suppose a user queries "Error code TS-999" in a technical support database. An embedding model might find content about error codes in general, but could miss the exact "TS-999" match. BM25 looks for this specific text string to identify the relevant documentation.
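
To make the scoring idea concrete, here is a minimal sketch of a BM25 scorer applied to the "TS-999" example. It follows one common formulation of BM25; the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, the whitespace tokenizer, and the three sample documents are all assumptions chosen for illustration, not details taken from this post.

```python
import math
from collections import Counter

def tokenize(text):
    """Naive whitespace tokenizer; real systems use more careful text analysis."""
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "General guide to error codes and troubleshooting",
    "Error code TS-999 indicates a timeout in the sync service",
    "How to reset your password",
]
scores = bm25_scores("Error code TS-999", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Here the second document ranks first because it contains the rare exact term "ts-999", while the general guide only matches the more common word "error".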

RAG solutions can more accurately retrieve the most applicable chunks by combining the embeddings and BM25 techniques using the following steps:

1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens;
2. Create TF-IDF encodings and semantic embeddings for these chunks;
3. Use BM25 to find top chunks based on exact matches;
4. Use embeddings to find top chunks based on semantic similarity;
5. Combine and deduplicate results from (3) and (4) using rank fusion techniques;
6. Add the top-K chunks to the prompt to generate the response.

By leveraging both BM25 and embedding models, traditional RAG systems can provide more comprehensive and accurate results, balancing precise term matching with broader semantic understanding.
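
The post does not say which rank fusion technique was used. As one possibility, the combine-and-deduplicate step can be sketched with reciprocal rank fusion (RRF), a widely used method. The constant `k=60` is a conventional default, and the chunk IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs (best first) into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in, so chunks
    ranked highly by both retrievers rise to the top, and duplicates are
    merged because scores are accumulated per chunk ID.
    """
    fused = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda cid: fused[cid], reverse=True)

bm25_top = ["c3", "c1", "c7"]        # hypothetical BM25 results
embedding_top = ["c1", "c4", "c3"]   # hypothetical embedding results
merged = reciprocal_rank_fusion([bm25_top, embedding_top])
top_k = merged[:2]
```

Chunks `c1` and `c3` appear in both lists, so they outrank `c4` and `c7`, which each appear in only one.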

*A standard Retrieval-Augmented Generation (RAG) system that uses both embeddings and Best Matching 25 (BM25) to retrieve information. TF-IDF (term frequency-inverse document frequency) measures word importance and forms the basis for BM25.*

This approach allows you to cost-effectively scale to enormous knowledge bases, far beyond what could fit in a single prompt. But these traditional RAG systems have a significant limitation: they often destroy context.

The context conundrum in traditional RAG

In traditional RAG, documents are typically split into smaller chunks for efficient retrieval. While this approach works well for many applications, it can lead to problems when individual chunks lack sufficient context.

For example, imagine you had a collection of financial information (say, U.S. SEC filings) embedded in your knowledge base, and you received the following question: "What was the revenue growth for ACME Corp in Q2 2023?"

A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter." However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.

Introducing Contextual Retrieval

Contextual Retrieval solves this problem by prepending chunk-specific explanatory context to each chunk before embedding (“Contextual Embeddings”) and creating the BM25 index (“Contextual BM25”).

Let’s return to our SEC filings collection example. Here's an example of how a chunk might be transformed:

original_chunk = "The company's revenue grew by 3% over the previous quarter."

contextualized_chunk = "This chunk is from an SEC filing on ACME Corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."

It is worth noting that other approaches to using context to improve retrieval have been proposed in the past. Other proposals include: adding generic document summaries to chunks (we experimented and saw very limited gains), hypothetical document embedding, and summary-based indexing (we evaluated and saw low performance). These methods differ from what is proposed in this post.

Implementing Contextual Retrieval

Of course, it would be far too much work to manually annotate the thousands or even millions of chunks in a knowledge base. To implement Contextual Retrieval, we turn to Claude. We’ve written a prompt that instructs the model to provide concise, chunk-specific context that explains the chunk using the context of the overall document. We used the following Claude 3 Haiku prompt to generate context for each chunk:

<document> 
{{WHOLE_DOCUMENT}} 
</document> 
Here is the chunk we want to situate within the whole document 
<chunk> 
{{CHUNK_CONTENT}} 
</chunk> 
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.

The resulting contextual text, usually 50-100 tokens, is prepended to the chunk before embedding it and before creating the BM25 index.
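
A minimal sketch of how the prompt above might be applied to every chunk is shown below. The `generate` argument is a hypothetical callable (prompt in, text out) that would wrap whichever model client you use; `fake_generate` is a placeholder that returns a canned string so the example runs without any API access.

```python
CONTEXT_PROMPT = """<document>
{document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def contextualize_chunks(document, chunks, generate):
    """Prepend model-generated context to each chunk.

    generate is a hypothetical callable (prompt -> str), not a real library API.
    """
    contextualized = []
    for chunk in chunks:
        prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
        context = generate(prompt).strip()
        # The combined text is what gets embedded and added to the BM25 index.
        contextualized.append(f"{context} {chunk}")
    return contextualized

def fake_generate(prompt):
    """Placeholder for a real model call; returns a canned context string."""
    return "This chunk is from an SEC filing on ACME Corp's performance in Q2 2023."

chunks = ["The company's revenue grew by 3% over the previous quarter."]
contextualized = contextualize_chunks("(full filing text)", chunks, fake_generate)
```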

Here’s what the preprocessing flow looks like in practice:

*Contextual Retrieval is a preprocessing technique that improves retrieval accuracy.*

If you’re interested in using Contextual Retrieval, you can get started with our cookbook.

Using Prompt Caching to reduce the costs of Contextual Retrieval

Contextual Retrieval is uniquely possible at low cost with Claude, thanks to the special prompt caching feature we mentioned above. With prompt caching, you don’t need to pass in the reference document for every chunk. You simply load the document into the cache once and then reference the previously cached content. Assuming 800 token chunks, 8k token documents, 50 token context instructions, and 100 tokens of context per chunk, the one-time cost to generate contextualized chunks is $1.02 per million document tokens.
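
The post does not spell out the arithmetic behind this figure. The sketch below shows one accounting that lands close to it. The per-million-token prices are assumptions (they are not stated in this post, so check current Claude 3 Haiku pricing before relying on them), and it is also an assumption that each document is written to the cache once and then read from the cache for every one of its chunks.

```python
# Assumed prices in USD per million tokens (not given in this post).
PRICE_INPUT = 0.25        # regular input tokens
PRICE_OUTPUT = 1.25       # output tokens
PRICE_CACHE_WRITE = 0.30  # writing tokens to the prompt cache
PRICE_CACHE_READ = 0.03   # reading tokens from the prompt cache

def contextualization_cost_per_million_doc_tokens(
    doc_tokens=8_000, chunk_tokens=800, instruction_tokens=50, context_tokens=100
):
    """Estimate the one-time cost of contextualizing one million document tokens."""
    chunks_per_doc = doc_tokens // chunk_tokens
    docs = 1_000_000 / doc_tokens
    per_doc = (
        doc_tokens * PRICE_CACHE_WRITE                       # cache the document once
        + chunks_per_doc * doc_tokens * PRICE_CACHE_READ     # read it back per chunk
        + chunks_per_doc * (chunk_tokens + instruction_tokens) * PRICE_INPUT
        + chunks_per_doc * context_tokens * PRICE_OUTPUT     # generated context
    ) / 1_000_000
    return docs * per_doc

cost = contextualization_cost_per_million_doc_tokens()
```

Under these assumptions `cost` comes out at roughly 1.02, in line with the figure quoted above.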

Methodology

We experimented across various knowledge domains (codebases, fiction, ArXiv papers, Science Papers), embedding models, retrieval strategies, and evaluation metrics. We’ve included a few examples of the questions and answers we used for each domain in Appendix II.

The graphs below show the average performance across all knowledge domains with the top-performing embedding configuration (Gemini Text 004) and retrieving the top 20 chunks. We use 1 minus recall@20 as our evaluation metric, which measures the percentage of relevant documents that fail to be retrieved within the top 20 chunks. You can see the full results in the appendix: contextualizing improves performance in every embedding-source combination we evaluated.
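
As a rough illustration of the metric, the sketch below computes 1 minus recall@k from retrieved and relevant chunk IDs. Pooling the counts over all queries (rather than averaging per query) is an assumption, since the post does not specify the aggregation, and the sample IDs are made up.

```python
def retrieval_failure_rate(retrieved_per_query, relevant_per_query, k=20):
    """Return 1 - recall@k, pooled over all queries.

    retrieved_per_query: ranked lists of retrieved chunk IDs, best first.
    relevant_per_query: lists of the chunk IDs that should have been found.
    """
    total = 0
    found = 0
    for retrieved, relevant in zip(retrieved_per_query, relevant_per_query):
        top_k = set(retrieved[:k])
        total += len(relevant)
        found += len(top_k & set(relevant))
    return 1 - found / total if total else 0.0

retrieved = [["a", "b", "c"], ["d", "e"]]
relevant = [["a", "x"], ["e"]]
rate = retrieval_failure_rate(retrieved, relevant, k=20)
```

In this toy case two of the three relevant chunks are retrieved, so one third of them count as failures.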

Performance improvements

Our experiments showed that:

Contextual Embeddings reduced the top-20-chunk retrieval failure rate by 35% (5.7% → 3.7%).
Combining Contextual Embeddings and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 49% (5.7% → 2.9%).
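
Note that these percentages are relative reductions of the baseline failure rate, not percentage-point differences. A quick check of the arithmetic, using only the numbers quoted above:

```python
def relative_reduction(before, after):
    """Fraction by which a failure rate dropped relative to its starting value."""
    return (before - after) / before

embeddings_only = relative_reduction(5.7, 3.7)  # Contextual Embeddings
with_bm25 = relative_reduction(5.7, 2.9)        # plus Contextual BM25
```

Rounded to whole percentages, these give 35% and 49% respectively.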

*Combining Contextual Embeddings and Contextual BM25 reduces the top-20-chunk retrieval failure rate by 49%.*

Implementation considerations

When implementing Contextual Retrieval, there are a few considerations to keep in mind:

Chunk boundaries: Consider how you split your documents into chunks. The choice of chunk size, chunk boundary, and chunk overlap can affect retrieval performance¹.
Embedding model: Whereas Contextual Retrieval improves performance across all embedding models we tested, some models may benefit more than others. We found Gemini and Voyage embeddings to be particularly effective.
Custom contextualizer prompts: While the generic prompt we provided works well, you may be able to achieve even better results with prompts tailored to your specific domain or use case (for example, including a glossary of key terms that might only be defined in other documents in the knowledge base).
Number of chunks: Adding more chunks into the context window increases the chances that you include the relevant information. However, more information can be distracting for models so there's a limit to this. We tried delivering 5, 10, and 20 chunks, and found using 20 to be the most performant of these options (see appendix for comparisons) but it’s worth experimenting on your use case.

Always run evals: Response generation may be improved by passing it the contextualized chunk and distinguishing between what is context and what is the chunk.
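
To make the chunk size and overlap consideration above concrete, here is a minimal sketch of fixed-size chunking with overlap. The default of 800 tokens echoes the chunk size used in the cost estimate earlier, while the overlap of 100 tokens is an arbitrary assumption, and the input is assumed to be already tokenized into a list.

```python
def chunk_with_overlap(tokens, chunk_size=800, overlap=100):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks

small = chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1)
```

Each chunk in `small` starts with the last token of the previous one, so text that straddles a boundary appears intact in at least one chunk when it is shorter than the overlap.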

Further boosting performance with Reranking

In a final step, we can combine Contextual Retrieval with another technique to give even more performance improvements. In traditional RAG, the AI system searches its knowledge base to find the potentially relevant information chunks. With large knowledge bases, this initial retrieval often returns a lot of chunks—sometimes hundreds—of varying relevance and importance.

Reranking is a commonly used filtering technique to ensure that only the most relevant chunks are passed to the model. Reranking provides better responses and reduces cost and latency because the model is processing less information. The key steps are:

1. Perform initial retrieval to get the top-N potentially relevant chunks (we used the top 150);
2. Pass the top-N chunks, along with the user's query, through the reranking model;
3. Using a reranking model, give each chunk a score based on its relevance and importance to the prompt, then select the top-K chunks (we used the top 20);
4. Pass the top-K chunks into the model as context to generate the final result.
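
The reranking steps above can be sketched as follows. The `score_fn` argument is a hypothetical callable standing in for a real reranking model, and `word_overlap` is a toy scorer used only so the example runs on its own; neither reflects the API of any particular reranker.

```python
def rerank(query, candidates, score_fn, top_k=20):
    """Score each candidate chunk against the query and keep the best top_k.

    score_fn is a hypothetical callable (query, chunk) -> float.
    """
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

def word_overlap(query, chunk):
    """Toy relevance score: number of distinct query words found in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

candidates = [  # in practice, the top-N chunks from initial retrieval
    "Shipping usually takes three days",
    "Refund requests are reviewed within five days",
    "Refund policy for digital goods",
]
top = rerank("refund policy", candidates, word_overlap, top_k=2)
```

The chunks in `top` are what would be passed to the model as context.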

*Combine Contextual Retrieval and Reranking to maximize retrieval accuracy.*

Performance improvements

There are several reranking models on the market. We ran our tests with the Cohere reranker. Voyage also offers a reranker, though we did not have time to test it. Our experiments showed that, across various domains, adding a reranking step further optimizes retrieval.

Specifically, we found that Reranked Contextual Embedding and Contextual BM25 reduced the top-20-chunk retrieval failure rate by 67% (5.7% → 1.9%).

*Reranked Contextual Embedding and Contextual BM25 reduces the top-20-chunk retrieval failure rate by 67%.*

Cost and latency considerations

One important consideration with reranking is the impact on latency and cost, especially when reranking a large number of chunks. Because reranking adds an extra step at runtime, it inevitably adds a small amount of latency, even though the reranker scores all the chunks in parallel. There is an inherent trade-off between reranking more chunks for better performance vs. reranking fewer for lower latency and cost. We recommend experimenting with different settings on your specific use case to find the right balance.

Conclusion

We ran a large number of tests, comparing different combinations of all the techniques described above (embedding model, use of BM25, use of contextual retrieval, use of a reranker, and total # of top-K results retrieved), all across a variety of different dataset types. Here’s a summary of what we found:

Embeddings+BM25 is better than embeddings on their own;
Voyage and Gemini have the best embeddings of the ones we tested;
Passing the top-20 chunks to the model is more effective than just the top-10 or top-5;
Adding context to chunks improves retrieval accuracy a lot;
Reranking is better than no reranking;
All these benefits stack: to maximize performance improvements, we can combine contextual embeddings (from Voyage or Gemini) with contextual BM25, add a reranking step, and add the top 20 chunks to the prompt.

We encourage all developers working with knowledge bases to use our cookbook to experiment with these approaches to unlock new levels of performance.

Appendix I

Below is a breakdown of results across datasets, embedding providers, use of BM25 in addition to embeddings, use of contextual retrieval, and use of reranking for Retrievals @ 20.

See Appendix II for the breakdowns for Retrievals @ 10 and @ 5 as well as example questions and answers for each dataset.