The role of developer skills in agentic coding

Original link: https://martinfowler.com/articles/exploring-gen-ai.html#memo-13

Agentic coding assistants are impressive, but they require careful developer supervision. While they can increase speed, human intervention is often needed to ensure code quality, team collaboration, and long-term maintainability. Common AI missteps include: generating non-working code, misdiagnosing problems, doing too much work up front, applying brute-force fixes instead of root-cause analysis, complicating the developer workflow, misunderstanding requirements, creating redundant tests, lacking code reuse, and producing overly complex code. To mitigate these risks, developers must carefully review AI-generated code, fall back to "artisanal coding" when needed, and practice pair programming. Organizations should strengthen code quality monitoring, adopt pre-commit hooks and IDE-integrated code review, revisit coding best practices, configure custom rules in their coding assistants, and foster a culture of trust and open communication to deal effectively with this disruptive technology.

This Hacker News thread discusses the role of developer skills in "agentic coding", referencing the article on martinfowler.com. usrbinbash uses AI as an in-IDE writing assistant and "rubber duck" and finds it very effective for improving specific pieces of code. However, they avoid using it for broad autonomous tasks, because of the time required to make sure the AI's output matches their intent. Ironically, the efficiency gains from using AI as an advanced writing tool have made them more critical of the effort needed to manage more autonomous AI systems. decompiled_dev compares modern developers to shepherds, guiding AI systems ("the machines") by providing structure and keeping them on track. bob1029 stresses the importance of "agents with human supervision", contrasting it with the ineffective approach of issuing broad instructions full of prohibitions. He sees value in using AI for specific tasks, such as generating custom code implementations within defined constraints.

Original article

Generative AI and particularly LLMs (Large Language Models) have exploded into the public consciousness. Like many software developers I am intrigued by the possibilities, but unsure what exactly it will mean for our profession in the long run. I have now taken on a role in Thoughtworks to coordinate our work on how this technology will affect software delivery practices. I'll be posting various memos here to describe what my colleagues and I are learning and thinking.

Latest Memo: The role of developer skills in agentic coding

25 March 2025

As agentic coding assistants become more capable, reactions vary widely. Some extrapolate from recent advancements and claim, “In a year, we won’t need developers anymore.” Others raise concerns about the quality of AI-generated code and the challenges of preparing junior developers for this changing landscape.

In the past few months, I have regularly used the agentic modes in Cursor, Windsurf and Cline, almost exclusively for changing existing codebases (as opposed to creating Tic Tac Toe from scratch). I am overall very impressed by the recent progress in IDE integration and how those integrations massively boost the way in which the tools can assist me. They

  • execute tests and other development tasks, and try to immediately fix the errors that occur
  • automatically pick up on and try to fix linting and compile errors
  • can do web research
  • some even have browser preview integration, to pick up on console errors or check DOM elements

All of this has led to impressive collaboration sessions with AI, and sometimes helps me build features and figure out problems in record time.

However.

Even in those successful sessions, I intervened, corrected and steered all the time. And often I decided to not even commit the changes. In this memo, I will list concrete examples of that steering, to illustrate what role the experience and skills of a developer play in this “supervised agent” mode. These examples show that while the advancements have been impressive, we’re still far away from AI writing code autonomously for non-trivial tasks. They also give ideas of the types of skills that developers will still have to apply for the foreseeable future. Those are the skills we have to preserve and train for.

Where I’ve had to steer

I want to preface this by saying that AI tools are not categorically and always bad at the things that I’m listing. Some of the examples can even be easily mitigated with additional prompting or custom rules. Mitigated, but not fully controlled: LLMs frequently don’t listen to the letter of the prompt. The longer a coding session gets, the more hit-and-miss it becomes. So the things I’m listing absolutely have a non-negligible probability of happening, regardless of the rigor in prompting, or the number of context providers integrated into the coding assistant.

I am categorising my examples into 3 types of impact radius, AI missteps that:

a. slow down my speed of development and time to commit instead of speeding it up (compared to unassisted coding), or
b. create friction for the team flow in that iteration, or
c. negatively impact long-term maintainability of the code.

The bigger the impact radius, the longer the feedback loop for a team to catch those issues.

(Figure: the 3 impact radius categories as concentric circles: Commit, Iteration, and Codebase Lifetime)

Impact radius: Time to commit

These are the cases where AI hindered me more than it helped. This is actually the least problematic impact radius, because it’s the most obvious failure mode, and the changes most probably will not even make it into a commit.

No working code

At times my intervention was necessary to make the code work, plain and simple. So my experience either came into play because I could quickly correct where it went wrong, or because I knew early when to give up, and either start a new session with AI or work on the problem myself.

Misdiagnosis of problems

AI goes down rabbit holes quite frequently when it misdiagnoses a problem. Many of those times I can pull the tool back from the edge of those rabbit holes based on my previous experience with those problems.

Example: It assumed a Docker build issue was due to architecture settings for that Docker build and changed those settings based on that assumption — when in reality, the issue stemmed from copying node_modules built for the wrong architecture. As that is a typical problem I have come across many times, I could quickly catch it and redirect.
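
A minimal sketch of the kind of fix this usually calls for, assuming a Node.js project (the file contents below are illustrative, not the setup from that session): keep the host-built node_modules out of the build context and install dependencies inside the image, so native modules are compiled for the image's architecture.

    # .dockerignore -- keep host-built dependencies out of the build context
    node_modules

    # Dockerfile -- install dependencies inside the image so native modules
    # are compiled for the image's architecture, not the host's
    FROM node:20
    WORKDIR /app
    COPY package.json package-lock.json ./
    RUN npm ci
    COPY . .
    CMD ["npm", "start"]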

Impact radius: Team flow in the iteration

This category is about cases where a lack of review and intervention leads to friction on the team during that delivery iteration. My experience of working on many delivery teams helps me correct these before committing, as I have run into these second order effects many times. I imagine that even with AI, new developers will learn this by falling into these pitfalls and learning from them, the same way I did. The question is whether the increased coding throughput with AI exacerbates this to a point where a team can no longer absorb it sustainably.

Too much up-front work

AI often goes broad instead of incrementally implementing working slices of functionality. This risks wasting large upfront work before realizing a technology choice isn’t viable, or a functional requirement was misunderstood.

Example: During a frontend tech stack migration task, it tried converting all UI components at once rather than starting with one component and a vertical slice that integrates with the backend.

Brute-force fixes instead of root cause analysis

AI sometimes took brute-force approaches to solve issues rather than diagnosing what actually caused them. This pushes the underlying problem to a later stage, and onto other team members who then have to analyse it without the context of the original change.

Example: When encountering a memory error during a Docker build, it increased the memory settings rather than questioning why so much memory was used in the first place.
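
To make the contrast concrete, a hypothetical before and after (the flag and the number are illustrative, not from the actual session):

    # Brute-force: raise the limit inside the Dockerfile and move on
    ENV NODE_OPTIONS=--max-old-space-size=8192

    # Root-cause alternative: ask what is consuming the memory in the first
    # place. In a Node.js build it is often a bundler processing files that
    # should not be in the build context at all; trimming .dockerignore or
    # the build inputs removes the need for a bigger limit.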

Complicating the developer workflow

In one case, AI generated build workflows that create a bad developer experience. Pushing these changes almost immediately would have an impact on other team members’ development workflows.

Example: Introducing two commands to run an application’s frontend and backend, instead of one (a consolidated sketch follows these examples).

Example: Failing to ensure hot reload works.

Example: Complicated build setups that confused both me and the AI itself.

Example: Handling errors in Docker builds without considering how these errors could be caught earlier in the build process.
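
For the first example above, a minimal sketch of the consolidation I would expect, assuming an npm-based project (the script names and the concurrently helper are illustrative, not taken from the actual codebase): a single npm run dev should start both processes, so other developers' workflow stays unchanged.

    {
      "scripts": {
        "dev:backend": "nodemon server.js",
        "dev:frontend": "vite",
        "dev": "concurrently \"npm run dev:backend\" \"npm run dev:frontend\""
      }
    }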

Misunderstood or incomplete requirements

Sometimes when I don’t give a detailed description of the functional requirements, AI jumps to the wrong conclusions. Catching this and redirecting the agent doesn’t necessarily need special development experience, just attention. However, it happened to me frequently, and is an example of how fully autonomous agents can fail when they don’t have a developer watching them work and intervening at the beginning, rather than at the end. In either case, be it the developer who doesn’t think along, or an agent who is fully autonomous, this misunderstanding will be caught later in the story lifecycle, and it will cause a bunch of back and forth to correct the work.

Impact radius: Long-term maintainability

This is the most insidious impact radius because it has the longest feedback loop: these issues might only be caught weeks or months later. These are the types of cases where the code will work fine for now, but will be harder to change in the future. Unfortunately, it’s also the category where my 20+ years of programming experience mattered the most.

Verbose and redundant tests

While AI can be fantastic at generating tests, I frequently find that it creates new test functions instead of adding assertions to existing ones, or that it adds too many assertions, i.e. some that were already covered in other tests. Counterintuitively for less experienced programmers, more tests are not necessarily better. The more tests and assertions get duplicated, the harder they are to maintain, and the more brittle the tests get. This can lead to a state where whenever a developer changes part of the code, multiple tests fail, leading to more overhead and frustration. I have tried to mitigate this behaviour with custom instructions, but it still happens frequently.
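
A small sketch of the pattern, assuming a Python test suite (the order functions and names are made up for illustration):

    from dataclasses import dataclass, field
    from datetime import datetime

    # Hypothetical production code, just enough to make the tests runnable
    @dataclass
    class Order:
        customer_id: int
        status: str = "open"
        items: list = field(default_factory=list)
        created_at: datetime = field(default_factory=datetime.utcnow)

    def create_order(customer_id: int) -> Order:
        return Order(customer_id=customer_id)

    # Existing test
    def test_create_order_sets_defaults():
        order = create_order(customer_id=1)
        assert order.status == "open"
        assert order.items == []

    # What the assistant tends to generate: a new test that repeats the
    # setup and assertions already covered above, plus one new check
    def test_create_order_is_open_with_timestamp():
        order = create_order(customer_id=1)
        assert order.status == "open"        # duplicate
        assert order.items == []             # duplicate
        assert order.created_at is not None  # the only new assertion

    # Lower-maintenance alternative: add the created_at assertion to the
    # existing test instead of creating a second, overlapping test.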

Lack of reuse

AI-generated code sometimes lacks modularity, making it difficult to apply the same approach elsewhere in the application.

Example: Not realising that a UI component is already implemented elsewhere, and therefore creating duplicate code.

Example: Use of inline CSS styles instead of CSS classes and variables
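
A trivial but representative sketch (selectors and values are made up): repeated inline styles get replaced by a class backed by shared variables, so the styling can be reused elsewhere.

    /* Instead of repeating style="color: #c0392b; padding: 8px" inline on
       each element, define shared variables and a class: */
    :root {
      --color-error: #c0392b;
      --spacing-s: 8px;
    }
    .error-banner {
      color: var(--color-error);
      padding: var(--spacing-s);
    }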

Overly complex or verbose code

Sometimes AI generates too much code, requiring me to remove unnecessary elements manually. This can either be code that is technically unnecessary and makes the codebase more complex, which will lead to problems when changing the code in the future. Or it can be more functionality than I actually need at that moment, which increases maintenance cost for lines of code that are not needed yet.

Example: Every time AI does CSS changes for me, I then go and remove sometimes massive amounts of redundant CSS styles, one by one.

Example: AI generated a new web component that could dynamically display data inside of a JSON object, and it built a very elaborate version that was not needed at that point in time.

Example: During refactoring, it failed to recognize the existing dependency injection chain and introduced unnecessary additional parameters, making the design more brittle and harder to understand. E.g., it introduced a new parameter to a service constructor that was unnecessary, because the dependency that provided the value was already injected. (value = service_a.get_value(); ServiceB(service_a, value=value))
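
Spelled out around the article's inline snippet, in a hypothetical reconstruction (the ServiceA stub and the method bodies are mine, not the actual code):

    # Hypothetical stand-in for the already-injected dependency
    class ServiceA:
        def get_value(self) -> int:
            return 42

    # What the assistant introduced: ServiceB receives both the dependency
    # and a value that the dependency already provides
    class ServiceB:
        def __init__(self, service_a: ServiceA, value: int):
            self.service_a = service_a
            self.value = value  # redundant; service_a can supply this

    service_a = ServiceA()
    value = service_a.get_value()
    service_b = ServiceB(service_a, value=value)

    # The existing injection chain already makes the extra parameter
    # unnecessary: ServiceB only needs service_a, and can call get_value()
    # itself wherever the value is actually used.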

Conclusions

These experiences mean that by no stretch of my personal imagination will we have AI that writes 90% of our code autonomously in a year. Will it assist in writing 90% of the code? Maybe. For some teams, and some codebases. It assists me in 80% of the cases today (in a moderately complex, relatively small 15K LOC codebase).

(Figure: overview of the article's main points, concentric circles showing the impact radiuses with each category of misstep listed)

What can you do to safeguard against AI missteps?

So how do you safeguard your software and team against the capriciousness of LLM-backed tools, to take advantage of the benefits of AI coding assistants?

Individual coder

  • Always carefully review AI-generated code. It’s very rare that I do NOT find something to fix or improve.

  • Stop AI coding sessions when you feel overwhelmed by what’s going on. Either revise your prompt and start a new session, or fall back to manual implementation - “artisanal coding”, as my colleague Steve Upton calls it.

  • Stay cautious of “good enough” solutions that were miraculously created in a very short amount of time, but introduce long-term maintenance costs.

  • Practice pair programming. Four eyes catch more than two, and two brains are less complacent than one.

Team and organization

  • Good ol’ code quality monitoring. If you don’t have them already, set up tools like Sonarqube or Codescene to alert you about code smells. While they can’t catch everything, it’s a good building block of your safety net. Some code smells become more prominent with AI tools and should be more closely monitored than before, e.g. code duplication.

  • Pre-commit hooks and IDE-integrated code review. Remember to shift-left as much as possible - there are many tools that review, lint and security-check your code during a pull request, or in the pipeline. But the more you can catch directly during development, the better.

  • Revisit good code quality practices. In light of the types of pitfalls described here, and other pitfalls a team experiences, create rituals that reiterate practices to mitigate the outer two impact radiuses. For example, you could keep a “Go-wrong” log of events where AI-generated code led to friction on the team, or affected maintainability, and reflect on them once a week.

  • Make use of custom rules. Most coding assistants now support the configuration of rule sets or instructions that will be sent along with every prompt. You can make use of those as a team to iterate on a baseline of prompt instructions to codify your good practices and mitigate some of the missteps listed here; a sketch of what such a rules file can look like follows this list. However, as mentioned at the beginning, it is by no means guaranteed that the AI will follow them. The larger a session and therefore a context window gets, the more hit and miss it becomes.

  • A culture of trust and open communication. We are in a transition phase where this technology is seriously disrupting our ways of working, and everybody is a beginner and learner. Teams and organizations with a trustful culture and open communication are better equipped to learn and deal with the vulnerability this creates. For example, an organization that puts high pressure on their teams to deliver faster “because you now have AI” is more exposed to the quality risks mentioned here, because developers might cut corners to fulfill the expectations. And developers on teams with high trust and psychological safety will find it easier to share their challenges with AI adoption, and help the team learn faster to get the most out of the tools.
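
To illustrate the custom-rules point, here is the kind of project rules file a team might iterate on, derived from the missteps described in this article. The file name and mechanism depend on the assistant (for example .cursorrules for Cursor or .clinerules for Cline), and the content is an example rather than a recommended baseline.

    - Implement one thin, end-to-end slice at a time; do not convert or
      generate many components in a single change.
    - Before creating a new component or function, check whether an
      existing one already does the same thing, and reuse it.
    - Prefer adding assertions to an existing test over creating a new
      test that duplicates its setup and assertions.
    - Use CSS classes and variables instead of inline styles.
    - When a build or test fails, state the suspected root cause before
      proposing a fix; do not just raise memory or resource limits.
    - Keep the developer workflow intact: one command to run the app,
      and hot reload must keep working.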

Thanks to Jim Gumbley, Karl Brown, Jörn Dinkla, Matteo Vaccari and Sarah Taraporewalla for their feedback and input.
