An intelligent, agentic system for automated architectural analysis and semantic code search.
This project transcends traditional "Chat with Code" paradigms by implementing an autonomous Agent that mimics the cognitive process of a Senior Tech Lead. Instead of statically indexing a repository, the system treats the Large Language Model (LLM) as the CPU and the Vector Store as a high-speed Context Cache. The agent dynamically traverses the repository structure, pre-fetching critical contexts into the "cache" (RAG) and performing Just-In-Time (JIT) reads when semantic gaps are detected.
In traditional code assistants, RAG (Retrieval-Augmented Generation) is often a static lookup table. In this architecture, we redefine RAG as a Dynamic L2 Cache for the LLM:
- Cold Start (Repo Map): The agent first parses the entire repository into Abstract Syntax Trees (ASTs) to build a lightweight symbol map (Classes/Functions), as sketched after this list. This map serves as the "index" to the file system.
- Prefetching (Analysis Phase): During the initial analysis, the agent autonomously selects the most critical 10-20 files based on architectural relevance, parses them, and "warms up" the vector store (the cache).
- Cache Miss Handling (ReAct Loop): During user Q&A, if the retrieval mechanism (BM25 + Vector) returns insufficient context, the agent triggers a Just-In-Time (JIT) file read: it calls the GitHub API as a tool to fetch the missing files, updates the cache in real time, and regenerates the answer.
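As a concrete illustration of the Cold Start step, the sketch below builds a repository symbol map with nothing but the standard `ast` module. The function name and data shape are illustrative assumptions, not the project's actual API:

```python
import ast

def build_repo_map(files: dict[str, str]) -> dict[str, list[str]]:
    """Cold start: parse each file into an AST and record the classes/functions it defines.

    `files` maps repository paths to source text; the result is the lightweight
    symbol index the agent browses before deciding which files to prefetch.
    """
    repo_map: dict[str, list[str]] = {}
    for path, source in files.items():
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip non-Python or unparsable files
        repo_map[path] = [
            node.name
            for node in ast.walk(tree)
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
        ]
    return repo_map
```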
Standard text chunking destroys code logic. We utilize Python's `ast` module to implement Structure-Aware Chunking.
- Logical Boundaries: Code is split by Class and Method definitions, ensuring that a function is never severed in the middle.
- Context Injection: Large classes are decomposed into methods, but the parent class's signature and docstrings are injected into every child chunk. This ensures the LLM understands the "why" (class purpose) even when looking at the "how" (method implementation).
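A minimal sketch of this chunking strategy using only the standard `ast` module; the repository's real implementation likely handles more node types and attaches richer metadata:

```python
import ast

def chunk_by_structure(source: str) -> list[str]:
    """Split Python source along class/method boundaries instead of fixed character sizes."""
    tree = ast.parse(source)
    chunks: list[str] = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            # Parent context injected into every child chunk: class signature + docstring.
            doc = ast.get_docstring(node)
            context = f"class {node.name}:" + (f'\n    """{doc}"""' if doc else "")
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    body = ast.get_source_segment(source, child) or ""
                    chunks.append(context + "\n\n" + body)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append(ast.get_source_segment(source, node) or "")
    return chunks
```

Because chunk boundaries come from AST nodes, no function is ever cut mid-body, and every method chunk still carries its parent class's signature and docstring.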
Built on top of `asyncio` and `httpx`, the system is designed for high-throughput I/O operations.
- Non-Blocking Ingestion: Repository parsing, AST extraction, and vector embedding occur concurrently.
- Worker Scalability: The application runs behind Gunicorn with Uvicorn workers, utilizing a stateless design pattern where the Vector Store Manager synchronizes context via persistent disk storage and shared ChromaDB instances. This allows multiple workers to serve requests without race conditions.
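To illustrate the Non-Blocking Ingestion point above, the sketch below fetches several files concurrently from the GitHub contents API with `httpx`; the function names and the minimal error handling are simplified assumptions rather than the actual service code:

```python
import asyncio
import httpx

async def fetch_file(client: httpx.AsyncClient, repo: str, path: str, token: str) -> str:
    """Fetch one file's raw content from the GitHub API without blocking the event loop."""
    resp = await client.get(
        f"https://api.github.com/repos/{repo}/contents/{path}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github.raw+json",
        },
    )
    resp.raise_for_status()
    return resp.text

async def ingest(repo: str, paths: list[str], token: str) -> list[str]:
    """Download many files concurrently; chunking and embedding can be overlapped the same way."""
    async with httpx.AsyncClient(timeout=30) as client:
        return await asyncio.gather(*(fetch_file(client, repo, p, token) for p in paths))
```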
The Chat Service implements a sophisticated Reasoning + Acting (ReAct) loop:
- Query Rewrite: User queries (often vague or in different languages) are first rewritten by an LLM into precise, English-language technical keywords for optimal BM25/Vector retrieval.
- Self-Correction: If the retrieved context is insufficient, the model does not hallucinate. Instead, it issues a `<tool_code>` command to fetch specific file paths from the repository. The system intercepts this command, pulls the fresh data, indexes it, and feeds it back to the model in a single inference cycle (see the sketch below).
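Putting query rewriting and self-correction together, one ReAct turn could look like the sketch below. The `<tool_code>` payload format, the `read_file(...)` call, and the `llm`/`retriever`/`fetch_and_index` interfaces are hypothetical placeholders, not the project's actual contracts:

```python
import re

# Hypothetical tool-call format the model is prompted to emit when it detects a context gap.
TOOL_RE = re.compile(r"<tool_code>\s*read_file\((.*?)\)\s*</tool_code>", re.S)

def react_answer(question: str, llm, retriever, fetch_and_index) -> str:
    """One ReAct turn: rewrite -> retrieve -> (optional JIT file read) -> answer."""
    keywords = llm.complete(f"Rewrite as precise English technical keywords: {question}")
    context = retriever.search(keywords)                 # hybrid BM25 + vector retrieval
    draft = llm.complete(question, context=context)
    match = TOOL_RE.search(draft)
    if match:                                            # the model asked for missing files
        for path in (p.strip(" '\"") for p in match.group(1).split(",")):
            fetch_and_index(path)                        # JIT read + cache update
        context = retriever.search(keywords)
        draft = llm.complete(question, context=context)  # regenerate with the fresh context
    return draft
```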
To balance semantic understanding with exact keyword matching, the retrieval engine employs a weighted hybrid approach:
- Dense Retrieval (Vector): Uses `BAAI/bge-m3` embeddings to find conceptually similar code (e.g., matching "authentication" to "login logic").
- Sparse Retrieval (BM25): Captures exact variable names, error codes, and specific function signatures that vector embeddings might miss.
- Reciprocal Rank Fusion (RRF): Results are fused and re-ranked to ensure the highest fidelity context is provided to the LLM.
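Reciprocal Rank Fusion itself is a small, well-known formula. The reference implementation below (with the conventional k = 60 constant) shows how the sparse and dense rankings can be merged; the project's exact weighting may differ:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists (e.g. BM25 and vector results) into one ordering.

    Each document scores sum(1 / (k + rank)) over the lists it appears in, so
    items ranked highly by several retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a sparse (BM25) and a dense (vector) ranking.
fused = reciprocal_rank_fusion([
    ["auth.py#login", "models/user.py", "utils/jwt.py"],   # BM25 order
    ["utils/jwt.py", "auth.py#login", "middleware.py"],    # vector order
])
```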
The architecture is completely language-agnostic but optimized for dual-language environments (English/Chinese).
- Dynamic Prompt Engineering: The system detects the user's input language and hot-swaps the System Prompts to ensure the output format, tone, and technical terminology align with the user's locale.
- UI Integration: The frontend includes a dedicated language toggle that influences the entire generation pipeline, from the initial architectural report to the final Q&A.
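A minimal sketch of the Dynamic Prompt Engineering point above, using a naive CJK-character heuristic for language detection; the real pipeline may rely on the UI toggle or an LLM-based detector, and the prompt texts here are placeholders:

```python
# Placeholder prompts; the production prompts are far more detailed.
SYSTEM_PROMPTS = {
    "en": "You are a senior tech lead. Answer in English with precise terminology.",
    "zh": "你是一位资深技术负责人，请用中文回答并保持术语准确。",
}

def pick_system_prompt(user_input: str) -> str:
    """Hot-swap the system prompt based on whether the input contains CJK characters."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in user_input)
    return SYSTEM_PROMPTS["zh" if has_cjk else "en"]
```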
- Core: Python 3.10+, FastAPI, AsyncIO
- LLM Integration: OpenAI SDK (compatible with DeepSeek/SiliconFlow)
- Vector Database: ChromaDB (Persistent Storage)
- Search Algorithms: BM25Okapi (via the rank-bm25 package), Reciprocal Rank Fusion (RRF)
- Parsing: Python `ast` (Abstract Syntax Trees)
- Frontend: HTML5, Server-Sent Events (SSE) for real-time streaming, Mermaid.js for architecture diagrams.
- Deployment: Docker, Gunicorn, Uvicorn.
- Session Management: Uses browser `sessionStorage` coupled with server-side persistent contexts, allowing users to refresh pages without losing the "warm" cache state.
- Network Resilience: Implements robust error handling for GitHub API rate limits (403/429) and network timeouts during long-context generation.
- Memory Efficiency: The `VectorStoreManager` is designed to be stateless in memory but stateful on disk, preventing memory leaks in long-running container environments (see the sketch below).
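A minimal sketch of the stateless-in-memory, stateful-on-disk pattern with ChromaDB's persistent client; the storage path and collection naming scheme are assumptions for illustration:

```python
import chromadb

def open_collection(repo_id: str, path: str = "./chroma_data"):
    """Each Gunicorn/Uvicorn worker opens the same on-disk store instead of holding
    vectors in process memory, so workers stay stateless and can be restarted freely."""
    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=f"repo_{repo_id}")
```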
- Prerequisites:
  - Python 3.9+
  - A valid GitHub token
  - LLM API keys (DeepSeek-V3 & SiliconFlow bge-m3 recommended)
- Clone the Repository

  ```bash
  git clone https://github.com/tzzp1224/RepoReaper.git
  cd RepoReaper
  ```

- Install Dependencies

  Using a virtual environment is recommended:

  ```bash
  # Create and activate venv
  python -m venv venv
  source venv/bin/activate   # Windows: venv\Scripts\activate

  # Install requirements
  pip install -r requirements.txt
  ```
- Configure Environment

  Create a `.env` file in the root directory:

  ```bash
  # GitHub Personal Access Token
  GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxx

  # LLM API Key (e.g., DeepSeek)
  DEEPSEEK_API_KEY=sk-xxxxxxxxxxxxxxx

  # Embedding API Key (SiliconFlow)
  SILICON_API_KEY=sk-xxxxxxxxxxxxxxx
  ```
- Start the Service

  Option A: Local Run (Universal). Compatible with Windows, macOS, and Linux; recommended for development. (Note: Linux users can still use `gunicorn -c gunicorn_conf.py app.main:app` for production deployment.)

  Option B: Docker Run 🐳. Run in an isolated container:

  ```bash
  # 1. Build Image
  docker build -t reporeaper .

  # 2. Run Container (loading env vars)
  docker run -d -p 8000:8000 --env-file .env --name reporeaper reporeaper
  ```
- Access Dashboard

  Navigate to `http://localhost:8000` and enter a GitHub repository URL to trigger the autonomous analysis agent.