The gauge broke: devs felt 20% faster with AI, measured 19% slower

Original link: https://intrepidkarthi.com/writing/the-gauge-broke/

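A recent METR study reveals a dangerous disconnect between how developers perceive AI's effect on software development and what it actually does. Experienced developers felt about 20% faster when using AI tools, but controlled measurement showed they were about 19% slower. This "broken gauge" means the feeling of speed is not merely noisy but actively misleading.

AI speeds up the typing stage, which was never the main bottleneck in software development. Instead, it shifts the burden to verification and review, a stage that is costly, slow, and error-prone. Data from broader industry sources such as DORA and GitClear points the same way: code generation volume and pull request counts have surged, yet delivery has stayed flat while code churn has risen and stability has fallen.

The industry is now stuck in a "verification bottleneck", where reviewing AI-generated output costs more than the time the tools save. In response, engineering leaders must stop relying on subjective team velocity and self-reported productivity. They should instead track objective output, meaning code that actually reaches production and stays stable, and re-staff the critical review stages that AI has inadvertently overloaded.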

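This Hacker News thread discusses a 2025 study claiming that developers felt 20% faster when using AI but were actually 19% slower. The discussion is highly polarized: many users consider the study outdated or flawed and point to newer data suggesting an 18% speedup. The main points of the debate:

* **Methodological doubts:** Critics argue that measuring "speedup" across different tasks is an oversimplification that ignores the diversity of software engineering. AI may be good at mechanical tasks yet struggle with architecture work that needs a lot of context.
* **The "verification" bottleneck:** Some argue that AI has lowered the cost of generating code, but developers now spend more time on expensive verification work. They expect productivity to rise as trust in AI-generated code grows.
* **Perception versus reality:** Users compared the AI experience with other tools, such as IDEs or keyboard shortcuts, and noted that the sense of "productivity" is often a subjective state of mind rather than a quantified metric.
* **Outdated context:** Many participants criticized the post for citing stale data, arguing that AI models iterate so quickly that a productivity baseline from a year ago no longer applies to today's development environment.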

Original text

For two years I argued the feeling of AI speed had come apart from the fact of it, from watching my own teams. This summer it stopped being an anecdote. A controlled trial measured experienced developers feeling about 20% faster while running about 19% slower. The instrument we steer by reads backward.

Short answer: AI speeds up typing, which was never the bottleneck for an expert in a codebase they already know. It adds overhead (prompting, waiting, and reviewing output that is often subtly wrong) at the exact stage that was already expensive. A controlled trial measured experienced developers feeling about 20% faster while running about 19% slower.

In December 2023 I wrote that the feeling of speed and the fact of speed had come apart on my teams, and I admitted it was an anecdote, a thing I saw but could not yet prove. This summer the anecdote got a stopwatch on it, and the result is worse than I guessed.

METR ran a randomized controlled trial on experienced open-source developers, working in codebases they knew well, with current frontier AI tools. Before the work, the developers expected the tools to speed them up. After the work, they reported the tools had sped them up, by around 20%. Measured against the clock, they were about 19% slower. The self-report and the stopwatch pointed in opposite directions by nearly 40 points. The study is small, 16 developers across 246 tasks, and the authors are careful to say it does not prove AI slows everyone everywhere. The effect flips positive for juniors and for greenfield work. Read the caveats. Then read the part that does not have a caveat: the people most confident the tool was speeding them up were the ones it was measurably slowing down.
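The "nearly 40 points" is simple arithmetic, and it can be sketched in a few lines. This is only an illustration using the rounded figures quoted above, not METR's analysis. The function names and the sign convention (positive means faster, negative means slower) are assumptions made for this sketch.

```python
def perception_gap(felt_change_pct: float, measured_change_pct: float) -> float:
    """Gap in percentage points between the felt and the measured change.

    Sign convention (an assumption of this sketch):
    positive = faster, negative = slower.
    """
    return felt_change_pct - measured_change_pct


def reads_backward(felt_change_pct: float, measured_change_pct: float) -> bool:
    """True when the feeling and the stopwatch have opposite signs."""
    return felt_change_pct * measured_change_pct < 0


if __name__ == "__main__":
    felt = 20.0       # developers reported feeling about 20% faster
    measured = -19.0  # the clock showed them about 19% slower

    print(f"gap: {perception_gap(felt, measured):.0f} points")
    print(f"gauge reads backward: {reads_backward(felt, measured)}")
```

A gauge with noise in it would give a small gap of either sign. The case that matters is the second function returning true: the reading is not just off, it points the wrong way.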

That is the gauge breaking. The instrument every engineering leader steers by, the team’s own felt sense of velocity, does not just have noise in it. It reads backward, under exactly the conditions most real work happens in: experienced people, in code that already exists.

I had the shape of this for two years and I want to be precise about what I had wrong. I thought the felt-versus-real gap was a measurement problem, a thing you could fix by looking harder at the dashboard. It is worse than that. The feeling is not a noisy version of the truth. It is actively misleading, and it is the single input most decisions about AI adoption are running on. Every leadership deck claiming a team is twice as fast now is built on the one reading the data says is inverted.

The team-level telemetry says the same thing from the other side, and at a scale the small trial cannot. Faros AI, looking across more than 10,000 developers, found pull requests merged up 98%, pull request size up over 150%, and review time up 91%, for roughly no net change in delivery. 31% of pull requests merged with no review at all. DORA’s research found higher AI adoption associated with a measurable drop in delivery stability, and the damage persisted into this year. GitClear, reading 200 million changed lines, found copy-pasted code rising, code churn rising, and refactoring collapsing to under 10% of changes, with 2024 the first year on record that developers pasted more code than they reorganized. The pattern across every one of these is identical. More generated, more merged, more churned. Same amount delivered, shakier when it lands.

The sentence I have been circling since 2022 is now just true, with measurements under it. Generation got cheap. Verification got expensive. We removed the old bottleneck and shipped the work straight into a new one, and the new one is review. The volume exploded at the one stage we did not re-staff, and the dashboards we trust cannot see the cost because the cost lands downstream, in incidents and churn and reviewer burnout, on a different page from the velocity chart everyone is cheering.
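A toy model makes the trade concrete. It is a sketch in the spirit of Amdahl's law, not something taken from any of the studies above, and every number in it is hypothetical: `generation_share` is the fraction of task time spent writing code, `generation_speedup` is how much faster the tool makes that part, and `verification_overhead` is the extra prompting, waiting, and reviewing expressed as a fraction of the original task time.

```python
def relative_time(generation_share: float,
                  generation_speedup: float,
                  verification_overhead: float) -> float:
    """Task time with the tool, relative to 1.0 without it.

    Only the generation share gets faster. The verification overhead
    is added on top, as a fraction of the original task time.
    """
    if not 0.0 <= generation_share <= 1.0:
        raise ValueError("generation_share must be between 0 and 1")
    if generation_speedup <= 0.0:
        raise ValueError("generation_speedup must be positive")
    untouched = 1.0 - generation_share
    accelerated = generation_share / generation_speedup
    return untouched + accelerated + verification_overhead


def break_even_overhead(generation_share: float,
                        generation_speedup: float) -> float:
    """Largest verification overhead that still leaves the task no slower."""
    return generation_share * (1.0 - 1.0 / generation_speedup)


if __name__ == "__main__":
    # Hypothetical inputs: a quarter of the task is typing, the tool makes
    # typing five times faster, and review adds a quarter of the task back.
    share, speedup, overhead = 0.25, 5.0, 0.25
    print(f"relative time: {relative_time(share, speedup, overhead):.2f}")
    print(f"break-even overhead: {break_even_overhead(share, speedup):.2f}")
```

With those made-up inputs the task takes about 1.05 times as long as before, and any overhead above about 0.20 of the original task time wipes out the gain. The smaller the share of the work that was typing to begin with, the less room the overhead has before the net effect turns negative.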

You can watch the tool-builders concede the same point in where the money went this summer. Windsurf, the editor I have lived in since January, got pulled apart in a single weekend in July. Google paid billions to move its founders and core researchers into DeepMind, the remainder was absorbed by the maker of Devin, and the thing the founders left to build is an agent-first IDE. Strip the branding off agent-first and it says the quiet part out loud. You stop sitting at the keyboard generating, and you move to a dashboard where the job is to review what the agents produced and decide what to keep. The most aggressive bet in the tooling market is a bet that the work is now verification. They are building the cockpit for the exact bottleneck this study just put a stopwatch on.

The honest counter, and it matters here more than usual. This is most likely the dip in a J-curve, not the destination. New tools cost you before they pay you, and most of the felt-versus-real gap is the cost showing up before the payoff does. The trial’s effect flips for juniors and new code, which is a growing share of what gets built. DORA’s throughput recovered even as stability lagged, which is what a team climbing out of the dip looks like. I am not arguing the tool is bad. I am arguing the gauge is broken, which is a different and more dangerous claim, because a bad tool you eventually notice and a broken gauge you keep trusting.

So the discipline for whoever is running a team through this is one line. Stop steering by how fast it feels. The feeling is the one number we now know reads backward. Measure what reaches production and stays standing, re-staff the stage where the work actually piles up, and treat any productivity claim that lives in a feeling as unproven until the stopwatch agrees. The gauge broke this summer. The teams that win the next stretch are the ones that notice, and replace it, before they report a number they got from a feeling.
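One way to turn "reaches production and stays standing" into a number is sketched below. It is a minimal sketch under stated assumptions, not a standard metric: the `Change` record, its field names, and the 21-day window are all hypothetical, and real data would have to come from your own deploy and revert history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Change:
    """One merged change. All field names are hypothetical."""
    change_id: str
    deployed_at: Optional[datetime]  # None if it never reached production
    reworked_at: Optional[datetime]  # None if never reverted or rewritten


def stayed_standing(change: Change, window: timedelta) -> bool:
    """True if the change was deployed and not reworked within the window."""
    if change.deployed_at is None:
        return False
    if change.reworked_at is None:
        return True
    return change.reworked_at - change.deployed_at > window


def stable_delivery_rate(changes: list[Change],
                         window: timedelta = timedelta(days=21)) -> float:
    """Share of merged changes that reached production and stayed there."""
    if not changes:
        return 0.0
    return sum(stayed_standing(c, window) for c in changes) / len(changes)


if __name__ == "__main__":
    day0 = datetime(2025, 7, 1)
    sample = [
        Change("a", None, None),                       # merged, never deployed
        Change("b", day0, None),                       # deployed, untouched
        Change("c", day0, day0 + timedelta(days=3)),   # reverted within days
        Change("d", day0, day0 + timedelta(days=40)),  # reworked much later
    ]
    print(f"stable delivery rate: {stable_delivery_rate(sample):.2f}")
```

The sketch counts a change as standing if it has not been reworked yet, so it flatters anything deployed more recently than the window. A real version would exclude those changes until the window has passed. The point of the design is that the denominator is everything merged, so a surge in generated pull requests cannot raise the number unless the work also survives.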
