CC-Canary: Detect early signs of regressions in Claude Code

Original link: https://github.com/delta-hq/cc-canary

## cc-canary: model-drift detection for Claude Code

cc-canary is a local, privacy-respecting tool for detecting drift in Claude Code models. It analyzes your existing Claude Code session logs (~/.claude/projects/) and needs no network access, account, or telemetry. It is packaged as two installable Agent Skills: `cc-canary` (Markdown report) and `cc-canary-html` (interactive dashboard).

The tool produces a forensic report describing model behavior over a chosen window (7–180 days), surfacing potential regressions through metrics such as cost, read:edit ratio, reasoning loops, and token usage. Each report includes a verdict (HOLDING, SUSPECTED/CONFIRMED REGRESSION, INCONCLUSIVE) and detailed comparisons across model versions.

cc-canary works by aggregating session data, detecting inflection points in model health, and pre-rendering the report, which Claude then fills with narrative analysis. It requires Python 3.8+ and installs with `npx skills add delta-hq/cc-canary`.

The project is currently pre-alpha (0.x); output format and metrics may change. For more information or to contribute, visit [github.com/delta-hq/cc-canary/issues](https://github.com/delta-hq/cc-canary/issues).

CC-Canary, a new tool built by Delta-HQ and shared on Hacker News, aims to detect regressions in Claude Code. It helps developers track effects on code-generation quality, for example after adding skills or adjusting prompts. The discussion highlighted a common challenge: evaluating LLM performance is expensive, especially for individual developers. One commenter questioned whether using an LLM itself to do the measurement is reliable, likening it to the model checking its own work. Others expressed interest in similar tools for other coding environments.
License: MIT

Drift detection for Claude Code, packaged as two installable Agent Skills. Reads the JSONL session logs Claude Code already writes to ~/.claude/projects/, detects whether the model has been drifting on your own work, and produces a shareable forensic report.

No network, no account, no telemetry, no background daemon. Runs on the data already on your disk.

Status: 0.x / pre-alpha — output format and metric set may change.

| Skill | Invocation | Output |
| --- | --- | --- |
| `cc-canary` | `/cc-canary [window]` | forensic markdown writeup (`./cc-canary-<date>.md`) — paste-ready for GitHub issues or gists |
| `cc-canary-html` | `/cc-canary-html [window]` | same report as a dark-theme HTML dashboard (`./cc-canary-<date>.html`), auto-opens in your browser |

Window defaults to 60d. Accepts 7d / 14d / 30d / 60d / 90d / 180d.

Each report includes:

- Verdict — HOLDING / SUSPECTED REGRESSION / CONFIRMED REGRESSION / INCONCLUSIVE
- Headline metrics table (pre vs post, with 🟢/🟡/🔴 band verdicts)
- Weekly trend bars — cost (USD, verified against ccusage to the cent), read:edit ratio, reasoning loops, tokens/turn
- Cross-version comparison — same user, different model versions, controlling for task mix
- Auto-detected inflection date — composite health-score break
- Findings with model-side / user-side / ambiguous classification
- Appendices — hour-of-day thinking depth, word-frequency shift, three-period thinking-visibility transition, per-turn behavior rates, and more
Install both skills:

```
npx skills add delta-hq/cc-canary
```

Install just one:

```
npx skills add delta-hq/cc-canary --skill cc-canary
npx skills add delta-hq/cc-canary --skill cc-canary-html
```

Then from any Claude Code session:

```
/cc-canary 60d
/cc-canary-html 30d
```

Requirements: python3 ≥ 3.8 on your PATH. macOS / Linux / WSL for the cc-canary-html auto-open step (it falls back to printing the path if open / xdg-open / start fails).
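The auto-open fallback described above can be sketched as a small opener that tries the platform's open command and falls back to printing the path. This is a sketch, not the skill's actual implementation; `open_report` is a hypothetical name:

```python
import shutil
import subprocess
import sys

def open_report(path):
    """Try a platform opener for the report; fall back to printing the path."""
    # Candidate openers: macOS `open`, Linux/WSL `xdg-open`, Windows `start`.
    candidates = ["start"] if sys.platform == "win32" else ["open", "xdg-open"]
    for cmd in candidates:
        if shutil.which(cmd):  # only attempt commands that exist on PATH
            try:
                subprocess.run([cmd, path], check=True)
                return True
            except subprocess.CalledProcessError:
                continue  # opener exists but failed; try the next one
    print("Report written to: " + path)  # fallback: just print the location
    return False
```

Note that `start` is a cmd.exe builtin rather than an executable, which is one reason the real skill prints the path when no opener works.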

  1. Scan. A bundled Python script (stdlib only — no pip, no Node) walks ~/.claude/projects/**/*.jsonl, filters by window and excludes subagent sessions by default.
  2. Dedupe. Assistant messages are deduped on (message.id, requestId) — same scheme ccusage uses, because Claude Code writes the same message into multiple JSONLs when sessions are resumed or branched.
  3. Aggregate. Per-session metrics: tool-mix, read:edit ratio, reasoning-loop phrases, self-admitted errors, premature stops, interrupts, token usage, cost (current Claude 4.x rates), hour-of-day thinking depth.
  4. Detect inflection. Composite health score per day; argmax of |before − after| over candidate dates with a 0.75σ floor. Falls back to median-timestamp split if no break clears.
  5. Pre-render the report. Script writes a markdown / HTML skeleton with every table and bar chart filled in. Only ~20 short narrative slots (marked `<!-- C: ... -->`) are left for Claude to fill — verdict line, summary, per-finding reasoning, root-cause, appendix paragraphs.
  6. Fill & save. Claude reads the skeleton, writes the narrative, saves the final file.
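Steps 1–2 can be sketched as a JSONL walk with dedupe on (message.id, requestId). The record field names here are assumed from the description above; the real script filters by window and handles more record types:

```python
import json
from pathlib import Path

def load_assistant_messages(root):
    """Walk session JSONL logs under `root` and dedupe assistant messages
    on (message.id, requestId), since resumed or branched sessions write
    the same message into multiple files."""
    seen = set()
    messages = []
    for log in sorted(Path(root).glob("**/*.jsonl")):
        for line in log.read_text().splitlines():
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed lines
            if record.get("type") != "assistant":
                continue  # only assistant messages carry the dedupe key
            key = (record.get("message", {}).get("id"), record.get("requestId"))
            if key not in seen:
                seen.add(key)
                messages.append(record)
    return messages
```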

Total runtime: ~2.5s for the script + 10–20s for Claude to fill narrative.
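Step 4's inflection detection might look roughly like this: take the argmax of the before/after mean gap over candidate split dates, require the break to clear a 0.75σ floor, and fall back to a median split otherwise. A sketch under those stated rules; `find_inflection` is a hypothetical name:

```python
import statistics

def find_inflection(daily_scores):
    """Given {date: composite health score}, return the date where the
    |mean(before) - mean(after)| gap is largest, if it clears a
    0.75-sigma floor; otherwise fall back to a median split."""
    days = sorted(daily_scores)
    if len(days) < 4:
        return None  # not enough days to form two sides
    scores = [daily_scores[d] for d in days]
    sigma = statistics.pstdev(scores)
    best_day, best_gap = None, 0.0
    for i in range(2, len(days) - 1):  # keep >= 2 days on each side
        gap = abs(statistics.mean(scores[:i]) - statistics.mean(scores[i:]))
        if gap > best_gap:
            best_day, best_gap = days[i], gap
    if best_day is not None and best_gap > 0.75 * sigma:
        return best_day  # first day of the "after" period
    return days[len(days) // 2]  # no break clears: median split
```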

Metrics in the headline table (with published healthy/transition/concerning bands where applicable):

- Read:Edit ratio — file reads per edit. Proxy for how thoroughly the model investigates before mutating.
- Write share of mutations — Write / (Edit + Write). A high share means the model is rewriting files instead of making surgical edits.
- Reasoning loops / 1K tool calls — phrases like "let me try again", "oh wait", "actually,".
- Frustration rate — rate of frustration words in your prompts.
- Thinking redaction rate — fraction of thinking blocks that are redacted vs visible.
- Mean thinking length — reasoning-depth proxy (via cryptographic signature length, r=0.97 with content length when visible).
- API turns per user turn — how many API calls the model makes per user message.
- Tokens per user turn — total token volume (input + output + cache) per user message.

Plus appendices with additional signals: premature stopping, self-admitted errors, shortcut vocabulary, user interrupts, hour-of-day thinking depth, per-word frequency shift, three-period thinking-visibility transition, per-turn behavior rates.
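The ratio metrics above come down to simple arithmetic over per-session tallies. A minimal sketch, assuming the session has already been reduced to a flat list of tool names plus turn counts (hypothetical input shapes; the real script derives these from the JSONL records):

```python
from collections import Counter

def headline_metrics(tool_calls, api_turns, user_turns):
    """Compute headline ratios from a list of tool names and turn counts."""
    counts = Counter(tool_calls)
    reads, edits, writes = counts["Read"], counts["Edit"], counts["Write"]
    return {
        # File reads per edit: how much investigation precedes mutation.
        "read_edit_ratio": reads / edits if edits else float("inf"),
        # Write / (Edit + Write): whole-file rewrites vs surgical edits.
        "write_share": writes / (edits + writes) if edits + writes else 0.0,
        # API calls the model makes per user message.
        "api_turns_per_user_turn": api_turns / user_turns if user_turns else 0.0,
    }
```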

The script accepts flags you can pass via `Bash(python3 scripts/compute_stats.py …)` for custom runs:

| Flag | Default | Purpose |
| --- | --- | --- |
| `--window {Nd}` | `60d` | Window size (7d / 14d / 30d / 60d / 90d / 180d) |
| `--include-agents` | off | Include subagent sessions (default: excluded — they have no natural user prompts) |
| `--min-user-words N` | 10 | Drop sessions with fewer user-prompt words than this (filters trivial sessions) |
| `--render-md PATH` | — | Write the markdown skeleton to PATH |
| `--render-html PATH` | — | Write the HTML dashboard to PATH |
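That flag surface could be reproduced with stdlib argparse roughly like this (a sketch; the actual compute_stats.py may differ in details):

```python
import argparse

def build_parser():
    """Build a CLI parser matching the documented compute_stats flags."""
    p = argparse.ArgumentParser(prog="compute_stats.py")
    p.add_argument("--window", default="60d",
                   choices=["7d", "14d", "30d", "60d", "90d", "180d"],
                   help="analysis window size")
    p.add_argument("--include-agents", action="store_true",
                   help="include subagent sessions (excluded by default)")
    p.add_argument("--min-user-words", type=int, default=10, metavar="N",
                   help="drop sessions with fewer user-prompt words than N")
    p.add_argument("--render-md", metavar="PATH",
                   help="write the markdown skeleton to PATH")
    p.add_argument("--render-html", metavar="PATH",
                   help="write the HTML dashboard to PATH")
    return p
```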
- Fully local. Zero network calls.
- The script reads ~/.claude/projects/*.jsonl only. Nothing else.
- Narrative prose is written by Claude during the skill invocation (inside your Claude Code session); it has access only to the on-disk files you explicitly point it at.
- User-prompt content is truncated to ≤180 chars before being included in the skeleton, and redacted for /Users/… paths, emails, hex-like tokens.
- Output files (./cc-canary-<date>.{md,html}) live in the directory where you invoked the skill. Nothing is uploaded anywhere.
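The truncate-and-redact step could be sketched like this. The regex patterns are hypothetical approximations of the scrubbing described above (/Users/… paths, emails, hex-like tokens), not the skill's actual rules:

```python
import re

# Assumed patterns, ordered so path scrubbing runs before the others.
_PATTERNS = [
    (re.compile(r"/Users/\S+"), "[PATH]"),           # macOS home paths
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b[0-9a-fA-F]{16,}\b"), "[HEX]"),  # hex-like tokens/keys
]

def redact_prompt(text, limit=180):
    """Truncate a user prompt to `limit` chars, then scrub obvious
    identifiers before it is embedded in the report skeleton."""
    text = text[:limit]
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```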

Issues, metric suggestions, and PRs welcome: github.com/delta-hq/cc-canary/issues. Output format and metric set may change during 0.x.

Canaries were used in coal mines to detect early signs of danger. cc-canary does the same for drift in your Claude Code sessions.

MIT
