I benchmarked Claude Code's caveman plugin against "be brief."

Original link: https://www.maxtaylor.me/articles/i-benchmarked-caveman-against-two-words

## Caveman Plugin: A Compression Benchmark

A recent benchmark compared the Caveman plugin, which aims to compress Claude's responses, against a simple "be brief" prompt and Claude's defaults. Using a strict rubric, the study measured performance across six categories: bug diagnosis, concept explanation, architecture tradeoffs, multi-step setup, security/destructive operations, and error interpretation, assessing both quality and retention of key information.

The results show that Caveman did not consistently beat "be brief" on either overall token reduction or quality. While "be brief" achieved a 34% token reduction, Caveman's "lite" and "full" modes merely matched it. And Caveman's "ultra" mode, despite targeting maximum compression, sometimes *increased* token counts because of its built-in "Auto-Clarity" feature, which deliberately relaxes compression for safety-critical content (such as security warnings or multi-step setup) to preserve clarity.

Ultimately, Caveman's value lies beyond compression. Through automatic rule re-injection it offers **consistent output structure** and **persistence across sessions**, which a simple prompt does not. While a two-word prompt can match Caveman on token count and quality, Caveman provides more control and predictability, making it valuable for applications that need structured Claude output. The benchmark code is open source and available for further testing.

A recent benchmark evaluated how effectively "Caveman," a Claude Code plugin, compresses responses, comparing it against simply instructing Claude to "be brief." The author, max-t-dev, ran 24 prompts under 5 different configurations and scored the responses on factual accuracy, key terminology, and avoidance of harmful output.

The results show that "be brief" surprisingly matched Caveman on both token usage (419 vs 401-449) and quality (0.985 vs 0.970-0.976). While Caveman offers advantages such as consistent structure and safety features, its core compression was not much better than a simple prompt instruction.

Commenters noted that, given the inherent variability of AI responses, each prompt would need multiple runs, questioning the statistical significance of single-run tests. Others criticized the blog post's own over-condensed, over-edited writing style, and some users found Caveman unhelpful, preferring the simplicity of "be brief." The benchmark code is available on GitHub for further study.

Caveman is a popular Claude Code compression plugin. The pitch is in the name: ultra-compressed responses, ~75% fewer tokens, all the technical accuracy. Six modes, slash commands, intensity dials, classical Chinese variants.

I benchmarked it against two words: "be brief."

Same quality. Same range of tokens. The plugin didn't beat the boring default on either axis.

This article is the long version of the video. If you want the verdict in two minutes, watch it.

## What I tested

| Category | Failure mode | Skill claim tested | n |
| --- | --- | --- | --- |
| Bug diagnosis | Drops the why, gives fix without cause | | 5 |
| Concept explanation | Strips nuance, edge cases, or compresses technical terms into plain English | Technical terms exact | 5 |
| Architectural tradeoffs | Drops caveats that change the advice | | 4 |
| Multi-step setup | Collapses or reorders steps | | 4 |
| Security / destructive ops | Missing warnings on irreversible actions | Auto-Clarity escape | 3 |
| Error interpretation | Paraphrases or truncates the error string | Errors quoted exact | 3 |

24 prompts across six categories: bug diagnosis, concept explanations, architecture tradeoffs, multi-step setup, security and destructive ops, error interpretation. Each prompt has a per-prompt rubric: facts the answer must cover (`key_points`), terms it must use (`must_use_terms`), and dangerous wrong claims to avoid (`must_avoid`).

The dataset shape:

```ts
interface PromptCase {
  id: string;
  category: string;
  prompt: string;
  key_points: string[];
  must_use_terms?: string[];
  must_avoid?: string[];
}
```

A real entry:

```json
{
  "id": "bug_01",
  "category": "bug_diagnosis",
  "prompt": "I have `const [count, setCount] = useState(0); function handleClick() { setCount(count + 1); setCount(count + 1); }`. I expected count to go up by 2 per click but it only goes up by 1. Why?",
  "key_points": [
    "stale closure on count",
    "both calls set count to same value",
    "functional updater setCount(c => c + 1)"
  ]
}
```

Five arms (sketched in code below):

- baseline. Claude default, no instruction.
- brief. "Be brief." prepended to every prompt.
- lite, full, ultra. Caveman plugin at three intensity levels.
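A minimal sketch of those arms as data, just to make the setup concrete. This is illustrative, not the harness's actual code; the `Arm` type and `wrap` helper are hypothetical, and the three caveman arms differ by plugin intensity rather than by prompt text.

```ts
// Hypothetical arm definitions. "brief" prepends the two words;
// the caveman arms leave the prompt untouched because compression
// comes from the plugin at the session level, not the prompt.
type Arm = { name: string; wrap: (prompt: string) => string };

const arms: Arm[] = [
  { name: "baseline", wrap: (p) => p },
  { name: "brief", wrap: (p) => `Be brief.\n\n${p}` },
  { name: "lite", wrap: (p) => p },  // caveman plugin, lite intensity
  { name: "full", wrap: (p) => p },  // caveman plugin, full intensity
  { name: "ultra", wrap: (p) => p }, // caveman plugin, ultra intensity
];
```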

Each arm ran the full 24-prompt dataset through claude -p on claude-opus-4-7. A separate Claude (claude-sonnet-4-6) scored every response against its prompt's rubric. Semantic match on key points, literal match on required terms, trap detection on avoided claims.
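For a rough idea of what that per-response check looks like, here's a sketch of the scoring logic as described above. It is not the harness's actual code; `judge` stands in for the claude-sonnet-4-6 grader call, and `ScoreResult` is a made-up shape.

```ts
// Hypothetical rubric check. Required terms are matched literally;
// key points and traps are delegated to a judge model (stubbed here).
interface ScoreResult {
  keyPointCoverage: number; // fraction of key_points the judge found
  missingTerms: string[];   // must_use_terms absent from the response
  trapsTriggered: string[]; // must_avoid claims the judge detected
}

async function scoreResponse(
  pc: PromptCase,
  response: string,
  judge: (question: string) => Promise<boolean>,
): Promise<ScoreResult> {
  // Literal match: required terms must appear verbatim.
  const missingTerms = (pc.must_use_terms ?? []).filter(
    (t) => !response.includes(t),
  );

  // Semantic match: ask the judge whether each key point is covered.
  let covered = 0;
  for (const kp of pc.key_points) {
    if (await judge(`Does this response cover: "${kp}"?\n\n${response}`)) covered++;
  }

  // Trap detection: ask the judge whether any forbidden claim appears.
  const trapsTriggered: string[] = [];
  for (const trap of pc.must_avoid ?? []) {
    if (await judge(`Does this response assert: "${trap}"?\n\n${response}`)) {
      trapsTriggered.push(trap);
    }
  }

  return {
    keyPointCoverage: covered / pc.key_points.length,
    missingTerms,
    trapsTriggered,
  };
}
```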

*[Figure: harness diagram]*

The harness (cc-compression-bench) is open source.

## Quality didn't move

First check: did compression hurt correctness?

*[Figure: quality chart]*

Every arm scored within 1.5% of every other arm. Baseline 0.985. Brief 0.985. Lite 0.976. Full 0.975. Ultra 0.970. Every arm hit 100% of its `key_points`. Zero `must_avoid` triggers in 120 responses.

Compression didn't drop substantive content. With quality effectively tied, the only axis left to compare is tokens.

## The headline result

*[Figure: mean tokens chart]*

| Arm | Mean tokens |
| --- | --- |
| baseline | 636 |
| brief | 419 |
| lite | 401 |
| full | 404 |
| ultra | 449 |

"Be brief." cut tokens 34% versus baseline. Caveman lite and full landed close to brief. Ultra, the strictest mode, produced the longest answers of the three caveman arms.

This looked bad for ultra. But the aggregate tells a false story.

## The category split

Splitting tokens by category gives a clearer picture.

*[Figure: tokens by category]*

On bug diagnosis, concept explanations, architecture tradeoffs, and error interpretation, ultra is shortest or tied with the other caveman arms. Compression is working as advertised.

On multi-step setup and security warnings, every caveman mode gets more variable. Ultra catches the eye in the aggregate, but it's not specifically worse. All three caveman arms swing hard on these categories.

*[Figure: auto-clarity chart]*

The reason is in the skill itself. Caveman has an "Auto-Clarity" rule that explicitly drops compression for safety warnings, irreversible actions, and multi-step sequences. Exactly these two categories. When the safety escape engages, all three modes loosen toward natural prose. The compression just isn't running.

That's not a bug. It's a designed feature: caveman knowing when to stop compressing.

## So what's caveman actually for?

If a two-word prompt matches it on tokens and quality, the value isn't compression. It's structure.

### Consistent output shape

Every caveman response follows the same pattern:

*[Figure: caveman response pattern]*

Predictable in a way that "be brief." isn't. If you want a uniform feel across sessions, or have downstream tooling that consumes Claude output, that consistency is real value.
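That consistency is what downstream tooling can lean on. A hypothetical consumer, assuming for illustration that the first line carries the verdict and subsequent dash lines carry details (the plugin's actual pattern is the one in the figure above):

```ts
// Hypothetical parser for a fixed-shape response. The shape itself
// (verdict line + "- " detail lines) is an assumption for illustration.
function parseStructuredResponse(raw: string): { verdict: string; details: string[] } {
  const lines = raw.trim().split("\n");
  return {
    verdict: lines[0] ?? "",
    details: lines.filter((l) => l.startsWith("- ")).map((l) => l.slice(2)),
  };
}
```

The point isn't this particular shape; it's that "be brief." gives you no shape to parse at all.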

### The intensity dial

A slash command switches between lite, full, and ultra mid-session. Two words can't do that.

### Persistence across long sessions

Caveman re-injects the ruleset on every prompt via SessionStart and UserPromptSubmit hooks.

*[Figure: hook re-injection]*

The goal is to keep the pattern from drifting across long sessions. My benchmark didn't test this. Every run was single-shot via claude -p. But the mechanism is real, and "be brief." in CLAUDE.md doesn't have an equivalent.
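For readers who haven't used hooks: Claude Code lets you register shell commands against events like SessionStart and UserPromptSubmit in a settings file, and for UserPromptSubmit the command's stdout is added to the model's context. A minimal sketch of that shape, with a made-up rules-file path (see the plugin itself for its real hook config):

```json
{
  "hooks": {
    "SessionStart": [
      { "hooks": [{ "type": "command", "command": "cat ~/.claude/caveman-rules.md" }] }
    ],
    "UserPromptSubmit": [
      { "hooks": [{ "type": "command", "command": "cat ~/.claude/caveman-rules.md" }] }
    ]
  }
}
```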

### The safety escape

Auto-Clarity dropping compression on destructive ops is the variance you saw in the chart above. Caveman explicitly encodes when to stop compressing. Two words don't make that distinction. On my data this didn't change outcomes. "be brief." never tripped a `must_avoid` trap either. But the design exists.

## What I cut from the video

A few findings that didn't earn their place in a two-minute video but are worth flagging here.

Lite missed a required term once. On a queue tradeoff question (SQS vs BullMQ vs Kafka), lite's markdown-table format compressed the comparison so tight it dropped the term "at-least-once". Score 0.70. The only row below 0.90 in the 120-row sweep. n=1, but it's a real failure mode for benchmarks that enforce specific terminology.

Ultra triggered tool-use behaviour the other modes didn't. On a Dockerfile setup question, ultra opened with "Need write perms. Retry after approve, or paste inline:". It tried to call the Write tool, got blocked, and dumped the file inline anyway. That single response added ~1300 tokens to ultra's setup category mean. Caveman's terse examples seem to prime tool-first behaviour, which is a side-effect of compression style I didn't see coming.

The arch_tradeoffs token inflation isn't what I thought. My initial findings doc claimed caveman's [thing] [action] [reason] pattern pushed the model toward bulleted enumerations on N-way comparison questions. Looking closer, lite and full have the same pattern but produced cleaner outputs (lite often wrote tables, full wrote prose). The pattern isn't the cause. I don't have a clean attribution.

## What you should actually do

If all you want is shorter outputs, start with "be brief." in your prompt or CLAUDE.md. Two words. Matched caveman's tokens and quality.

Reach for caveman when you need consistent output structure across sessions. That's the differentiator that survived the benchmark.

The bigger lesson: most prompt-engineering advice hasn't been measured against the boring default. Measure it.


Repo: cc-compression-bench · Video: youtu.be/wijoYNiZq3M · Caveman plugin: juliusbrussee/caveman

If you've got a compression strategy you want benchmarked against the same dataset, the harness is strategy-agnostic. Adding an arm is one shell script. PRs welcome.
