Production RAG: what I learned from processing 5M+ documents

Original link: https://blog.abdellatif.io/production-rag-processing-5m-documents

## Summary of RAG lessons (from 9M- and 4M-page projects)

After eight months building retrieval-augmented generation (RAG) systems for Usul AI and a legal enterprise, this summary covers the key takeaways. Early prototyping with Langchain and LlamaIndex was misleading: small-scale tests looked encouraging, but production results were initially poor. The highest-ROI improvements came from **query generation** (using an LLM to expand the initial query), **reranking** (which significantly improved chunk relevance, with a 50-in / 15-out setup working best), and a carefully designed **chunking strategy** focused on logical units that avoids mid-sentence breaks. Injecting **metadata** into the chunk text also noticeably improved answer quality. A **query router** was added to handle questions that are a poor fit for RAG, directing them to a simpler API + LLM path. The team migrated from Azure/Pinecone to **Turbopuffer** for cost-effective vector storage with native keyword search. Their experience is available as open source at [agentset-ai/agentset](https://github.com/agentset-ai/agentset).

Hacker News discussion (posted by tifa2up, 1 comment):

manishsharan: Thanks for sharing. I just learned about rerankers. Chunking strategy is a big deal. I've found that sending large texts directly to Gemini Flash and having it summarize and extract the chunks works better than any text splitter I've tried. I use the approach Anthropic published at https://www.anthropic.com/engineering/contextual-retrieval, i.e. including the full summary along with the chunk in each embedding. I also built a tool that lets the LLM run vector searches on its own. I don't use Langchain or Python; I use Clojure plus the LLMs' REST APIs.
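
A rough sketch of the approach described in the comment (embedding each chunk together with a document-level summary, per Anthropic's contextual-retrieval write-up). It uses the OpenAI Python SDK purely for illustration; the commenter uses Gemini Flash for summarization and Clojure with REST APIs rather than Python.

```python
# Sketch of the commenter's contextual-retrieval setup: summarize the document,
# then embed "summary + chunk" so each vector carries document-level context.
# Model choices are illustrative (the commenter uses Gemini Flash, not OpenAI).
from openai import OpenAI

client = OpenAI()


def summarize(document: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": f"Summarize this document:\n\n{document}"}],
    )
    return resp.choices[0].message.content


def embed_with_context(summary: str, chunks: list[str]) -> list[list[float]]:
    contextualized = [f"{summary}\n\n{chunk}" for chunk in chunks]
    resp = client.embeddings.create(model="text-embedding-3-large", input=contextualized)
    return [item.embedding for item in resp.data]
```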

Original article

October 20, 2025 · 3 min read

I've spent the past 8 months in the RAG trenches, and I want to share what actually worked vs. what wasted our time. We built RAG for Usul AI (9M pages) and an unnamed legal AI enterprise (4M pages).

Langchain + Llamaindex

We started out with YouTube tutorials: first Langchain, then Llamaindex. We got to a working prototype in a couple of days and were optimistic about the progress. We ran tests on a subset of the data (100 documents) and the results looked great. We spent the next few days running the pipeline on the production dataset and got everything working in a week — incredible.

Except it wasn't: the results were subpar, and only the end users could tell. We spent the following few months rewriting pieces of the system, one at a time, until the performance was at the level we wanted. Here are the things we did, ranked by ROI.

What moved the needle

  1. Query Generation: not all of the needed context is captured by the user's last query. We had an LLM review the thread and generate a number of semantic + keyword queries. We processed all of those queries in parallel and passed the results to a reranker. This covered a larger surface area without depending on a single computed hybrid-search score (see the first sketch after this list).
  2. Reranking: the highest-value 5 lines of code you'll add. The chunk ranking shifted a lot, more than you'd expect. Reranking can often make up for a weak retrieval setup if you pass in enough chunks. We found the ideal reranker setup to be 50 chunks in -> 15 out (also shown in the first sketch below).
  3. Chunking Strategy: this takes a lot of effort, and you'll probably spend most of your time on it. We built a custom flow for both enterprises. Make sure to understand the data, review the chunks, and check that a) chunks are not cut mid-word or mid-sentence, and b) each chunk is roughly a logical unit that captures information on its own (a rough automated audit is sketched below).
  4. Metadata to LLM: we started by passing only the chunk text to the LLM. We ran an experiment and found that injecting relevant metadata as well (title, author, etc.) improves context and answers by a lot (see the metadata formatting sketch below).
  5. Query routing: many users asked questions that can't be answered by RAG (e.g. summarize the article, who wrote this). We created a small router that detects these questions and answers them using an API call + LLM instead of the full-blown RAG setup (see the router sketch below).
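
A minimal sketch of items 1 and 2, assuming the OpenAI chat API for query expansion and Cohere's rerank endpoint (the post mentions Cohere 3.5 before the team moved to Zerank). The `search_semantic` and `search_keyword` callables are hypothetical placeholders for your own retrieval layer; this illustrates the 50-in / 15-out flow, not the authors' actual code.

```python
# Sketch of query generation (item 1) feeding a 50 -> 15 reranking step (item 2).
# API keys are read from environment variables; search_semantic / search_keyword
# are hypothetical async callables returning lists of {"id": ..., "text": ...} chunks.
import asyncio
import json

import cohere
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()
cohere_client = cohere.Client()


async def generate_queries(thread: list[dict]) -> dict:
    """Have an LLM expand the conversation into semantic + keyword queries."""
    resp = await openai_client.chat.completions.create(
        model="gpt-4.1",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Given this conversation, return JSON with two lists, "
                "'semantic_queries' and 'keyword_queries', to use for retrieval."
            )},
            {"role": "user", "content": json.dumps(thread)},
        ],
    )
    return json.loads(resp.choices[0].message.content)


async def retrieve_and_rerank(thread, user_query, search_semantic, search_keyword):
    queries = await generate_queries(thread)

    # Run every expanded query in parallel against both indexes.
    tasks = [search_semantic(q) for q in queries["semantic_queries"]]
    tasks += [search_keyword(q) for q in queries["keyword_queries"]]
    results = await asyncio.gather(*tasks)

    # Deduplicate and cap at ~50 candidates before reranking.
    seen, candidates = set(), []
    for chunk in (c for batch in results for c in batch):
        if chunk["id"] not in seen:
            seen.add(chunk["id"])
            candidates.append(chunk)
    candidates = candidates[:50]

    # Rerank 50 in -> 15 out against the user's actual question.
    reranked = cohere_client.rerank(
        model="rerank-v3.5",
        query=user_query,
        documents=[c["text"] for c in candidates],
        top_n=15,
    )
    return [candidates[r.index] for r in reranked.results]
```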
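
For item 3, some of the chunk review can be automated. The checks below are illustrative heuristics of my own, not the authors' pipeline, for flagging chunks that end mid-sentence or are too small to stand on their own.

```python
# Illustrative heuristics for auditing chunks: flag chunks that are too short
# to be a logical unit or that appear to be cut mid-word or mid-sentence.
SENTENCE_ENDINGS = ('.', '!', '?', '"', "'", ')', ']')


def audit_chunks(chunks: list[str], min_words: int = 30) -> list[tuple[int, str]]:
    problems = []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        if len(text.split()) < min_words:
            problems.append((i, "too short to stand alone as a logical unit"))
        if text and not text.endswith(SENTENCE_ENDINGS):
            problems.append((i, "does not end at a sentence boundary"))
        if text and text[0].islower():
            problems.append((i, "starts mid-sentence (lowercase first character)"))
    return problems


# Review flagged chunks by hand before re-tuning the chunker.
for idx, reason in audit_chunks(["a chunk that got cut mid-sen"]):
    print(f"chunk {idx}: {reason}")
```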
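
For item 4, a minimal sketch of injecting metadata into the context sent to the LLM. The field names (title, author, page) are illustrative; use whatever metadata your corpus actually has.

```python
# Sketch of item 4: prepend chunk metadata before building the LLM context.
# Field names are illustrative, not a fixed schema.
def format_chunk_for_llm(chunk: dict) -> str:
    meta = chunk.get("metadata", {})
    header = [f"{key}: {meta[key]}" for key in ("title", "author", "page") if key in meta]
    return "\n".join(header + ["", chunk["text"]])


def build_context(chunks: list[dict]) -> str:
    return "\n\n---\n\n".join(format_chunk_for_llm(c) for c in chunks)
```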
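
For item 5, a sketch of a small router, assuming a single LLM classification call; `fetch_document_text` and `run_full_rag` are hypothetical placeholders for the document API and the full RAG pipeline.

```python
# Sketch of item 5: route "summarize / who wrote this" style questions to a
# plain document API + LLM call instead of the full RAG pipeline.
from openai import OpenAI

client = OpenAI()


def route_question(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": (
                "Classify the question as 'retrieval' if it needs searching across "
                "the corpus, or 'document' if it is about the current document as a "
                "whole (summaries, authorship, length). Answer with one word."
            )},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip().lower()


def answer(question: str, document_id: str, fetch_document_text, run_full_rag) -> str:
    if route_question(question) == "document":
        text = fetch_document_text(document_id)  # simple API call, no retrieval
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": f"{question}\n\n{text}"}],
        )
        return resp.choices[0].message.content
    return run_full_rag(question)
```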

Our stack

  • Vector database: Azure -> Pinecone -> Turbopuffer (cheap, supports keyword search natively)
  • Document Extraction: Custom
  • Chunking: Unstructured.io by default, custom for enterprises (heard that Chonkie is good)
  • Embedding: text-embedding-3-large, haven't tested others (a minimal embedding sketch follows this list)
  • Reranker: None -> Cohere 3.5 -> Zerank (less known but actually good)
  • LLM: GPT 4.1 -> GPT 5 -> GPT 4.1, covered by Azure credits
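
A minimal sketch of the embedding step, assuming OpenAI's `text-embedding-3-large` via the official Python SDK; the batch size and surrounding plumbing are illustrative.

```python
# Sketch of batch-embedding chunks with text-embedding-3-large.
# Batch size is illustrative; the upsert into the vector store is left out.
from openai import OpenAI

client = OpenAI()


def embed_chunks(texts: list[str], batch_size: int = 128) -> list[list[float]]:
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = client.embeddings.create(model="text-embedding-3-large", input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return vectors
```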

Going Open-source

We put all our learning into an open-source project: agentset-ai/agentset under an MIT license. Feel free to reach out if you have any questions.
