You hired an AI to write the tests. Of course they passed.
Agents that run while I sleep

Original link: https://www.claudecodecamp.com/p/i-m-building-agents-that-run-while-i-sleep

## The Trust Challenge of AI-Generated Code

As AI agents like Claude increasingly write code autonomously, a key question emerges: how do we *know* the code is correct? Traditional code review can't keep up with the volume of changes, and relying on the AI to test its own work creates a self-validating loop that misses the underlying misunderstanding.

The solution is not more reviewers but a return to the core principle of test-driven development (TDD): **define what "correct" means *before* the code is written.** Rather than prompting for a solution, engineers should first write clear, specific **acceptance criteria**: plain-English descriptions of the desired behavior (e.g., "On wrong credentials, the user sees 'Invalid email or password'"). The AI then builds against these criteria, and automated verification (e.g., Playwright for the frontend, `curl` for the backend) tests rigorously against them. This shifts review from sprawling diffs to a simple pass/fail report that surfaces only actual failures.

This doesn't eliminate every error (an incorrect spec still produces a flawed result), but it meaningfully improves reliability by catching integration problems and confirming the code behaves as *intended*, offering a more dependable approach than relying on AI-driven code review alone. Tools like `opslane/verify`, built on Claude and Playwright, streamline the process.

## AI-Generated Code and the Testing Challenge

A Hacker News discussion centers on the pitfalls of using AI (Claude in particular) to write both the code *and* the tests. The core problem: AI-generated tests often merely confirm the AI's own work ("test theater") and provide no real verification. Several users suggested mitigations, including a red-green-refactor workflow with dedicated AI sub-agents for each phase, and using different LLMs (such as Gemini and Codex) for code generation and review to introduce distinct perspectives.

A recurring challenge is the sheer volume of AI-produced code, which makes thorough human review difficult. Proposed solutions include automated verification tools (such as the author's [verify skill](https://github.com/opslane/verify)), "digital twins" for testing against external services, and stricter file-permission controls to keep the AI from modifying tests. Ultimately the conversation points to a mindset shift: focus on clear specifications and verification *against* them, rather than relying on AI-generated tests alone. There is growing concern that prioritizing speed over correctness will make software increasingly unreliable.

Original article

I've been building agents that write code while I sleep. Tools like Gastown run for hours without me watching. Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.

I care about this. I don't want to push slop, and I had no real answer.

I've run Claude Code workshops for over 100 engineers in the last six months. Same problem everywhere, just at different scales. Teams using Claude for everyday PRs are merging 40-50 a week instead of 10. Teams are spending a lot more time in code reviews.

As systems get more autonomous, the problem compounds. At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

So the question I kept coming back to: what do you actually trust when you can't review everything?

The obvious answers don't work

You could hire more reviewers. But you can't hire fast enough. And making senior engineers read AI-generated code all day isn't worth it.

When Claude writes tests for code Claude just wrote, it's checking its own work. The tests prove the code does what Claude thought you wanted. Not what you actually wanted. They catch regressions but not the original misunderstanding.

When you use the same AI for both, you've built a self-congratulation machine.

This is exactly the problem code review was supposed to solve: a second set of eyes that wasn't the original author. But one AI writing and another AI checking isn't a fresh set of eyes. They come from the same place. They'll miss the same things.

The thing TDD got right

Write the test first, write the code second, stop when the test passes. Most teams don't do this because thinking through what the code should do before writing it takes time they don't have.

AI removes that excuse, because Claude handles the speed. The slow part is now figuring out if the code is right. That's what TDD was built for: write down what correct looks like, then check it.

Classic TDD asks you to write unit tests, which means thinking about how the code will work before you write it. Acceptance criteria are easier: write down what the feature should do in plain English, and let the machine figure out how to check it.

"Users can authenticate with email and password. On wrong credentials they see 'Invalid email or password.' On success they land on /dashboard. The session token expires after 24 hours." You can write that before you open a code editor. The agent builds it. Something else checks it.

What this looks like in practice

For frontend changes, we generated acceptance criteria based on the spec file:

# Task
Add email/password login.

## Acceptance Criteria

### AC-1: Successful login
- User at /login with valid credentials gets redirected to /dashboard
- Session cookie is set

### AC-2: Wrong password error
- User sees exactly "Invalid email or password"
- User stays on /login

### AC-3: Empty field validation
- Submit disabled when either field is empty, or inline error on empty submit

### AC-4: Rate limiting
- After 5 failed attempts, login blocked for 60 seconds
- User sees a message with the wait time

Each criterion is specific enough that it either passes or fails. Once the agent builds the feature, verification runs Playwright browser agents against each AC, takes screenshots, and produces a report with per-criterion verdicts. If something fails you see exactly which criterion and what the browser saw.

For backend changes the same pattern works without a browser. You specify observable API behavior (status codes, response headers, error messages) that curl commands can check.
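As a sketch of what those backend checks can look like: the host, route, and payload below are assumptions for illustration, matching the login spec above; only the pattern (assert on status codes and error messages with `curl`) comes from the post.

```shell
# Hypothetical curl checks for the login API's acceptance criteria.
# Host, route, and payload are assumptions; adapt them to your backend.
check() {  # usage: check <description> <expected> <actual>
  if [ "$2" = "$3" ]; then
    echo "PASS: $1"
  else
    echo "FAIL: $1 (expected $2, got $3)"
  fi
}

# AC-2: wrong password returns 401 with the exact error message
status=$(curl -s -o /tmp/login_body.json -w '%{http_code}' \
  -X POST http://localhost:3000/api/login \
  -H 'Content-Type: application/json' \
  -d '{"email":"a@b.com","password":"wrong"}')
check "AC-2 status code" 401 "$status"
grep -q 'Invalid email or password' /tmp/login_body.json 2>/dev/null \
  && echo "PASS: AC-2 error message" \
  || echo "FAIL: AC-2 error message"
```

Each check prints a single PASS/FAIL line, so the report stays a list of verdicts rather than a diff to read.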

One thing worth being honest about: this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass even when the feature is wrong. What Playwright does catch is integration failures, rendering bugs, and behavior that works in theory but breaks in a real browser. That's a narrower claim than "verified correct," but it's more than code review was reliably catching anyway.

The workflow: write acceptance criteria before you prompt, let the agent build against them, run verification, review only the failures. You review failures instead of diffs.

How to build it

I started building a Claude Skill (github.com/opslane/verify) that runs using claude -p (Claude Code's headless mode) plus Playwright MCP. No custom backend, no extra API keys beyond your existing Claude OAuth token. Four stages:

Pre-flight is pure bash, no LLM. Is the dev server running? Is the auth session valid? Does a spec file exist? Fail fast before spending any tokens.
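A minimal pre-flight might look like the following. The URL and file names are assumptions; the real skill's checks may differ, but the point is the same: pure bash, fail fast, zero tokens spent.

```shell
# Pre-flight sketch: no LLM calls, just environment checks.
# localhost:3000, spec.md, and .verify/ are assumed names.
preflight() {
  local failed=0
  # Is the dev server running?
  curl -sf --max-time 2 -o /dev/null http://localhost:3000 \
    || { echo "pre-flight: dev server not running"; failed=1; }
  # Does a spec file exist?
  [ -f spec.md ] || { echo "pre-flight: spec.md not found"; failed=1; }
  # Can we write evidence?
  mkdir -p .verify/evidence \
    || { echo "pre-flight: cannot create .verify/evidence"; failed=1; }
  return "$failed"
}

if preflight; then
  echo "pre-flight OK"
else
  echo "pre-flight failed: fix the environment before spending tokens"
fi
```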

The planner is one Opus call. It reads your spec and the files you changed. It figures out what each check needs and how to run it. It also reads your code to find the right selectors, so it's not guessing at class names.

Browser agents are one Sonnet call per AC, all running in parallel. Five ACs, five agents, each navigating and screenshotting independently. Sonnet costs 3-4x less than Opus here and works just as well for clicking around.
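The fan-out itself can be plain shell job control. A sketch, where the Sonnet model name and the prompt wording are assumptions (the post only names the Opus judge model); only the one-agent-per-AC `claude -p` pattern comes from the text:

```shell
# One headless Sonnet agent per acceptance criterion, all in parallel.
# Model name and prompt are illustrative assumptions.
for ac in AC-1 AC-2 AC-3 AC-4; do
  mkdir -p ".verify/evidence/$ac"
  claude -p --model claude-sonnet-4-5 \
    "Verify $ac from spec.md in a real browser via the Playwright MCP.
     Screenshot each step and summarize what you observed as JSON." \
    > ".verify/evidence/$ac/result.json" 2>/dev/null &
done
wait  # the judge runs only after every agent has finished
```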

The judge is one final Opus call that reads all the evidence and returns a verdict per criterion: pass, fail, or needs-human-review.

claude -p --model claude-opus-4-6 \
"Review this evidence and return a verdict for each AC.
Evidence: $(cat .verify/evidence/*/result.json)
Return JSON: {verdicts: [{id, passed, reasoning}]}"
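Once the judge returns that JSON, "review only the failures" can be a one-liner. In this sketch the verdict file is hand-written for illustration (the real one comes from the judge call), and `python3` is used purely as a JSON parser:

```shell
# Filter the judge's verdicts down to failures.
# The sample verdict file below is fabricated for illustration.
mkdir -p .verify
cat > .verify/verdict.json <<'EOF'
{"verdicts": [
  {"id": "AC-1", "passed": true,  "reasoning": "redirected to /dashboard"},
  {"id": "AC-4", "passed": false, "reasoning": "no lockout after 5 failed attempts"}
]}
EOF

python3 -c '
import json
verdicts = json.load(open(".verify/verdict.json"))["verdicts"]
for v in verdicts:
    if not v["passed"]:
        print("FAIL %s: %s" % (v["id"], v["reasoning"]))
'
# prints: FAIL AC-4: no lockout after 5 failed attempts
```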

Install it as a Claude Code plugin:

/plugin marketplace add opslane/verify
/plugin install opslane-verify@opslane/verify

Or clone the repo and adapt it. Each stage is a single claude -p call with a clear input and structured output. You can swap models, add stages, or wire it into CI with --dangerously-skip-permissions.

The thing I keep coming back to: you can't trust what an agent produces unless you told it what "done" looks like before it started. Writing acceptance criteria is harder than writing a prompt, because it forces you to think through edge cases before you've seen them. Engineers resist it for the same reason they resisted TDD, because it feels slower at the start.

Without them, all you can do is read the output and hope it's right.
