Show HN: Mdarena – Benchmark your Claude.md against your own PRs

Original link: https://github.com/HudsonGri/mdarena

## mdarena: evaluate how well your CLAUDE.md actually works

**mdarena** is a tool for measuring how effective your `CLAUDE.md` file (the instructions you give agents like Claude) is against real PRs from your own codebase. Research shows these files often *reduce* agent success rates and increase cost.

**How it works:**

1. **`mdarena mine`**: Extracts merged PRs to build a task set, auto-detecting test commands from your CI/CD configuration.
2. **`mdarena run`**: Tests different `CLAUDE.md` configurations (or no context at all) by applying each one to the pre-PR commit and evaluating the resulting agent patches. It can run your existing tests (as SWE-bench does) or fall back to diff-overlap scoring.
3. **`mdarena report`**: Compares agent patches against the original PR diffs, measuring test pass/fail rates, code overlap, cost, and statistical significance.

**Key finding from a production-monorepo benchmark:** per-directory `CLAUDE.md` files providing targeted context significantly improved test resolution (~27%) over a baseline without them, and outperformed a consolidated single-file approach.

**mdarena prioritizes safety** through isolated checkouts that prevent access to future commits. It supports SWE-bench tasks and requires Python 3.11+, the `gh` CLI, and the `claude` CLI.

Author's comment (hudsongr):

Hey everyone! I built this because everyone is writing CLAUDE.md files right now, but nobody knows whether theirs actually works. The research is contradictory, too: one paper says they hurt performance, another says they help. So I made a tool that measures it using your own repo, your own PRs, and your own test suite.

It turns out you rarely get to point at a markdown file and say "this made the agent 27% better at solving real tasks." That's what we saw on a production monorepo. I see this as a way for teams to actually improve agent-written code instead of guessing.

Original README:

Benchmark your CLAUDE.md against your own PRs.

Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks from your actual codebase.

pip install mdarena

# Mine 50 merged PRs into a test set
mdarena mine owner/repo --limit 50 --detect-tests

# Benchmark multiple CLAUDE.md files + baseline (no context)
mdarena run -c claude_v1.md -c claude_v2.md -c agents.md

# See who wins
mdarena report
mdarena mine     ->  Fetch merged PRs, filter, build task set
                     Auto-detect test commands from CI/package files

mdarena run      ->  For each task x condition:
                       - Checkout repo at pre-PR commit
                       - Baseline: all CLAUDE.md files stripped
                       - Context: inject CLAUDE.md, let Claude discover it
                       - Run tests if available, capture git diff

mdarena report   ->  Compare patches against gold (actual PR diff)
                       - Test pass/fail (same as SWE-bench)
                       - File/hunk overlap, cost, tokens
                       - Statistical significance (paired t-test)
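The fallback "file/hunk overlap" metric from the pipeline above can be sketched as a Jaccard overlap between the file sets each patch touches. This is a minimal illustration, not mdarena's actual scoring code; the function names are hypothetical:

```python
import re

def changed_files(diff_text: str) -> set[str]:
    """Extract the set of file paths touched by a unified diff."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", diff_text, flags=re.MULTILINE))

def file_overlap(agent_diff: str, gold_diff: str) -> float:
    """Jaccard overlap between the files the agent patch and gold patch touch."""
    agent, gold = changed_files(agent_diff), changed_files(gold_diff)
    if not agent and not gold:
        return 1.0  # both patches empty: treat as perfect agreement
    return len(agent & gold) / len(agent | gold)

gold = "--- a/src/app.py\n+++ b/src/app.py\n@@ -1 +1 @@\n-x\n+y\n"
agent = ("--- a/src/app.py\n+++ b/src/app.py\n@@ -1 +1 @@\n-x\n+z\n"
         "--- a/README.md\n+++ b/README.md\n@@ -1 +1 @@\n-a\n+b\n")
print(file_overlap(agent, gold))  # 0.5: one shared file out of two total
```

A hunk-level version would compare overlapping line ranges the same way; the paired t-test then runs over these per-task scores across conditions.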

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does it.

# Auto-detect from CI/CD
mdarena mine owner/repo --detect-tests

# Or specify manually
mdarena mine owner/repo --test-cmd "make test" --setup-cmd "npm install"

Parses .github/workflows/*.yml, package.json, pyproject.toml, Cargo.toml, and go.mod. When tests aren't available, falls back to diff overlap scoring.

Pass a directory to benchmark a full CLAUDE.md tree:

mdarena run -c ./configs-v1/ -c ./configs-v2/

Each directory mirrors your repo structure. Baseline strips ALL CLAUDE.md and AGENTS.md files from the entire tree.
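The baseline strip described above amounts to deleting every matching file in the working tree. A minimal sketch (hypothetical helper, not mdarena's implementation):

```python
from pathlib import Path

def strip_agent_files(repo: Path) -> list[Path]:
    """Delete every CLAUDE.md / AGENTS.md in the tree (baseline condition)."""
    removed = []
    for name in ("CLAUDE.md", "AGENTS.md"):
        for path in repo.rglob(name):
            path.unlink()
            removed.append(path)
    return removed
```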

We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, existing CLAUDE.md, hand-written alternative). Patches graded against real test suites. Not string matching, not LLM-as-judge.

Key findings:

  • The existing CLAUDE.md improved test resolution by ~27% over bare baseline
  • A consolidated alternative that merged all per-directory guidance into one file performed no better than no CLAUDE.md at all
  • On hard tasks, per-directory instruction files gave the agent targeted context, while the consolidated version introduced noise that caused regressions

The winning CLAUDE.md wasn't the longest or most detailed. It was the one that put the right context in front of the agent at the right time.

# Import SWE-bench tasks
pip install datasets
mdarena load-swebench lite --limit 50
mdarena run -c my_claude.md

# Or export your tasks as SWE-bench JSONL
mdarena export-swebench
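The JSONL export presumably follows the SWE-bench task schema. A record mined from a merged PR would look roughly like this; field names come from the public SWE-bench dataset, and all values here are illustrative, not from mdarena:

```python
import json

# Hypothetical task record; one JSON object per line in the exported JSONL.
task = {
    "instance_id": "owner__repo-123",
    "repo": "owner/repo",
    "base_commit": "abc123",  # parent commit of the merged PR
    "problem_statement": "PR title and description",
    "patch": "--- a/src/app.py\n+++ b/src/app.py\n...",  # gold diff
    "FAIL_TO_PASS": ["tests/test_app.py::test_fix"],   # must flip to passing
    "PASS_TO_PASS": ["tests/test_app.py::test_other"], # must stay green
}
print(json.dumps(task))
```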

Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via shell=True, Claude Code runs with --dangerously-skip-permissions). Sandboxes are isolated temp directories under /tmp but processes run as your user.

Benchmark integrity: Because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: git archive exports a snapshot at base_commit into a fresh single-commit repo. Future commits don't exist in the object database at all. See tests/test_isolated_checkout.py for the integrity assertions.
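The history-free checkout described above can be sketched with `git archive` plus a fresh `git init`. This is a simplified version under stated assumptions (the real logic lives in mdarena's source; `isolated_checkout` is a hypothetical name):

```python
import subprocess
from pathlib import Path

def isolated_checkout(src_repo: Path, base_commit: str, dest: Path) -> None:
    """Export a history-free snapshot: future commits never reach the sandbox."""
    dest.mkdir(parents=True)
    # git archive writes only the tree at base_commit -- no .git history.
    archive = subprocess.run(
        ["git", "-C", str(src_repo), "archive", base_commit],
        check=True, capture_output=True,
    ).stdout
    subprocess.run(["tar", "-xf", "-"], input=archive, cwd=dest, check=True)
    # Re-init as a fresh single-commit repo so the agent can still use git,
    # but the object database contains nothing beyond this snapshot.
    for cmd in (
        ["git", "init", "-q"],
        ["git", "add", "-A"],
        ["git", "-c", "user.email=mdarena@local", "-c", "user.name=mdarena",
         "commit", "-qm", "snapshot"],
    ):
        subprocess.run(cmd, cwd=dest, check=True)
```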

  • Python 3.11+
  • gh CLI (authenticated)
  • claude CLI (Claude Code)
  • git
Command                              Description
mdarena mine <repo>                  Mine merged PRs into a task set
mdarena mine <repo> --detect-tests   Mine with auto-detected test extraction
mdarena run -c file.md               Benchmark a single CLAUDE.md
mdarena run -c a.md -c b.md          Compare multiple files head-to-head
mdarena run --no-run-tests           Skip test execution, diff overlap only
mdarena report                       Analyze results, show comparison
mdarena load-swebench [dataset]      Import SWE-bench tasks
mdarena export-swebench              Export tasks as SWE-bench JSONL
git clone https://github.com/HudsonGri/mdarena.git
cd mdarena
uv sync
uv run pytest
uv run ruff check src/

See ROADMAP.md.

MIT. See LICENSE.
