This repository contains our submission to the MMU-RAG Competition: TTD-RAG, a deep research agent (this README was generated with Gemini 2.5). Our system is a faithful implementation of the framework proposed in the paper "Deep Researcher with Test-Time Diffusion (TTD-DR)".
It conceptualizes report generation as an iterative "denoising" process, starting with a preliminary draft and progressively refining it through cycles of targeted search, synthesis, and revision. This approach is designed to excel at complex, multi-hop reasoning tasks that require coherent, long-form answers.
- Test-Time Diffusion Framework: Models research report generation as an iterative process of refining a "noisy" draft with external information, ensuring coherence and reducing information loss.
- Report-Level Denoising with Retrieval: Uses an evolving draft to dynamically guide the search process, ensuring each retrieval step is targeted at filling specific knowledge gaps.
- Component-wise Self-Evolution: Enhances the quality of each step in the workflow (planning, synthesis) by generating diverse variants, critiquing them, and merging them into a superior output (see the sketch after this list).
- High-Performance Serving: Utilizes vLLM to serve both the generative (`Qwen/Qwen3-4B-Instruct-2507`) and reranking (`tomaarsen/Qwen3-Reranker-0.6B-seq-cls`) models for high throughput and low latency.
- Competition Compliant: Fully supports both the dynamic (streaming) and static evaluation endpoints required by the competition rules, validated with the provided `local_test.py` script.
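To make the self-evolution idea concrete, here is a minimal sketch assuming an OpenAI-compatible client pointed at the vLLM server described below. The prompts, the `llm` helper, and the port are illustrative, not the repo's actual code:

```python
# Sketch of component-wise self-evolution: sample diverse variants of one
# workflow step, critique each, then merge them into a superior output.
# Assumes the vLLM OpenAI-compatible server on its default port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen3-4B-Instruct-2507"

def llm(prompt: str, temperature: float = 0.7) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def self_evolve(task: str, n_variants: int = 3) -> str:
    # 1. Generate diverse candidates with a higher sampling temperature.
    variants = [llm(task, temperature=0.9) for _ in range(n_variants)]
    # 2. Critique each candidate so the merge step can weigh its strengths.
    critiques = [llm(f"Critique this output for the task.\n"
                     f"Task: {task}\nOutput: {v}") for v in variants]
    # 3. Merge the variants, guided by the critiques, into one improved output.
    numbered = "\n\n".join(f"Variant {i}:\n{v}\nCritique {i}:\n{c}"
                           for i, (v, c) in enumerate(zip(variants, critiques), 1))
    return llm(f"Task: {task}\n\n{numbered}\n\n"
               "Merge these variants into a single, superior output.",
               temperature=0.2)
```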
The agent operates in a structured, multi-stage process orchestrated by `src/pipeline.py`:
- Stage 1: Planning & Initial Drafting
  - An initial Research Plan is generated to outline the key areas of investigation.
  - A preliminary Noisy Draft is created from the LLM's internal knowledge, serving as the starting point for the diffusion process.
- Stage 2: Iterative Search & Denoising
  - The system enters a loop where, in each iteration:
    - A new search query is generated, informed by the current draft's deficiencies and the overall plan.
    - Documents are retrieved from the FineWeb Search API.
    - The retrieved documents are chunked and reranked with a specialized model to surface the most relevant information.
    - The top-ranked chunks are synthesized into a concise answer to the search query.
    - The draft is revised ("denoised") by integrating this new information.
- Stage 3: Final Report Generation
  - After the iterations complete, the agent synthesizes the final refined draft, the initial plan, and the full history of questions and answers into a single, comprehensive report (the whole flow is sketched below).
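Condensed into code, the three stages look roughly like the sketch below. This is an illustration, not the actual contents of `src/pipeline.py`: the prompts are invented, and retrieval is abstracted into a caller-supplied `search` function standing in for the FineWeb retrieval, chunking, and reranking steps.

```python
# Illustrative sketch of the three-stage TTD-RAG loop (not the repo's code).
from typing import Callable

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen/Qwen3-4B-Instruct-2507"

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def research(question: str, search: Callable[[str], str], n_iter: int = 3) -> str:
    # Stage 1: plan, then write a preliminary "noisy" draft from the
    # model's internal knowledge alone.
    plan = llm(f"Write a research plan for: {question}")
    draft = llm(f"Using this plan, draft a preliminary report on: {question}\n{plan}")

    qa_history = []
    for _ in range(n_iter):
        # Stage 2a: target the draft's biggest remaining gap with one query.
        query = llm("Given the plan and current draft, write one search query "
                    f"that fills the biggest gap.\nPlan:\n{plan}\nDraft:\n{draft}")
        # Stage 2b: retrieve + chunk + rerank (abstracted into `search`), then
        # synthesize a concise answer from the top-ranked chunks.
        answer = llm(f"Answer '{query}' using only these sources:\n{search(query)}")
        qa_history.append((query, answer))
        # Stage 2c: "denoise" the draft by integrating the new information.
        draft = llm(f"Revise the draft to integrate this finding.\n"
                    f"Draft:\n{draft}\nFinding ({query}):\n{answer}")

    # Stage 3: fuse the refined draft, the plan, and the full Q&A history.
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_history)
    return llm("Write the final comprehensive report.\n"
               f"Plan:\n{plan}\nDraft:\n{draft}\nFindings:\n{qa}")
```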
- Backend Framework: FastAPI
- LLM Serving: vLLM
- Generative LLM:
Qwen/Qwen3-4B-Instruct-2507 - Reranker Model:
tomaarsen/Qwen3-Reranker-0.6B-seq-cls - Retrieval Source: FineWeb Search API
- Containerization: Docker
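For the reranking step in Stage 2, the seq-cls reranker can be exercised locally as a cross-encoder. A minimal sketch using `sentence-transformers` follows; the query and chunks are made-up examples, and the deployed system serves this model through vLLM instead of loading it in-process:

```python
# Sketch of the chunk-reranking step: score (query, chunk) pairs with the
# sequence-classification reranker and keep the highest-scoring chunks.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("tomaarsen/Qwen3-Reranker-0.6B-seq-cls")

query = "What causes auroras?"
chunks = [
    "Auroras are caused by charged solar particles hitting the atmosphere.",
    "The aurora borealis is visible at high northern latitudes.",
    "Solar panels convert sunlight into electricity.",
]

# Higher score = more relevant to the query.
scores = reranker.predict([(query, chunk) for chunk in chunks])
ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:2]]
print(top_chunks)
```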
- Docker and Docker Compose
- An NVIDIA GPU with 24GB+ VRAM
- NVIDIA Container Toolkit
First, create a local environment file from the example template. This file will store your API keys.
Now, open `.env` and add your API keys for:

- `FINEWEB_API_KEY`
- `OPENROUTER_API_KEY` (used as a fallback for the generator)
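A quick way to confirm both keys are actually reaching the application environment is a snippet like this (illustrative, not part of the repo):

```python
# Hypothetical sanity check that both keys from .env are set in the
# environment (e.g., after Docker Compose loads the env file).
import os

for key in ("FINEWEB_API_KEY", "OPENROUTER_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```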
We recommend using Docker Compose, which handles building the image and running the services as defined in `compose.yml`.
```bash
docker compose up --build
```

This command will:

- Build the Docker image from the `Dockerfile`.
- Start the container.
- Execute the `start.sh` script, which launches the vLLM OpenAI-compatible server in the background to serve the Qwen models and then, after a brief pause to allow the models to load, starts the FastAPI application on port `5053`.
Your API is now running and accessible at http://localhost:5053.
You can verify that your service is compliant with the competition requirements using the provided `local_test.py` script.
```bash
uv sync
source .venv/bin/activate

# Test both the /run and /evaluate endpoints (full test)
python local_test.py --base-url http://localhost:5053

# Test only the dynamic /run endpoint
python local_test.py --base-url http://localhost:5053 --test-mode run

# Test only the static /evaluate endpoint
python local_test.py --base-url http://localhost:5053 --test-mode evaluate
```

A successful run confirms that both endpoints function correctly and that the `result.jsonl` file is generated as expected for the static evaluation.
- Health Check: `GET /health`
  - A simple endpoint to confirm the service is running. Returns `{"status": "ok"}`.
- Dynamic Evaluation: `POST /run`
  - Input: `{"question": "string"}`
  - Output: A Server-Sent Events (SSE) stream that provides real-time updates on the agent's progress, including intermediate steps, citations, and the final report.
- Static Evaluation: `POST /evaluate`
  - Input: `{"query": "string", "iid": "string"}`
  - Output: A single JSON response: `{"query_id": "string", "generated_response": "string"}`.
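For illustration, here is a minimal client sketch exercising all three endpoints, assuming the service is running locally on port 5053 (`requests` and the sample payload values are placeholders, not part of the repo or the competition data):

```python
# Minimal client sketch for the three endpoints above.
import requests

BASE = "http://localhost:5053"

# Health check: expects {"status": "ok"}.
print(requests.get(f"{BASE}/health").json())

# Dynamic evaluation: consume the SSE stream line by line.
with requests.post(f"{BASE}/run",
                   json={"question": "What causes auroras?"},
                   stream=True) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())

# Static evaluation: a single JSON response.
resp = requests.post(f"{BASE}/evaluate",
                     json={"query": "What causes auroras?", "iid": "example-1"})
print(resp.json())  # {"query_id": "...", "generated_response": "..."}
```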
The following AWS CLI commands are provided for pushing your final Docker image to the competition's ECR repository.
- Sign in to AWS ECR

  ```bash
  aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com
  ```

- Build the Image (if not already built). Ensure you build for the correct platform.

  ```bash
  docker build --platform linux/amd64 -t ttt-dr:latest .
  ```

- Tag the Image for ECR

  ```bash
  docker tag ttt-dr:latest <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
  ```

- Push the Image to ECR

  ```bash
  docker push <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
  ```