Scaling long-running autonomous coding

Original link: https://cursor.com/blog/scaling-agents

## Scaling autonomous coding with AI agents

Recent experiments show that it is feasible to coordinate hundreds of AI agents on complex software projects that would traditionally take human teams months. Initial "flat" coordination attempts, in which agents self-organized, failed because of locking bottlenecks and a tendency toward risk-averse, inefficient work.

The breakthrough was a role-based pipeline: **planners** recursively define tasks while **workers** focus on completing them, minimizing coordination overhead. The system built a web browser from scratch (over a million lines of code) in about a week, carried out a Solid-to-React codebase migration, and significantly improved video rendering performance. Several other large projects, including an emulator and spreadsheet software, are also in progress.

Key lessons include the importance of GPT-5.2 models for long-running tasks, tailoring model choice to each role, and favoring simplicity in system design. Effective prompting is critical for agent coordination and focus. Challenges remain, such as preventing drift and optimizing planning cycles, but the results suggest that autonomous coding *can* be scaled by leveraging parallel agents, paving the way for advanced AI-assisted development tools like Cursor.

## The Cursor AI-built browser: a skeptical view

Cursor recently showcased an AI-driven attempt to build a web browser "from scratch", sparking debate on Hacker News. While the scope is impressive (30K commits, 1 million lines of code), many commenters were skeptical of the project's real success and practical value.

A central concern is the lack of detail and verification. The code reportedly does not compile consistently, tests fail, and large numbers of warnings are ignored, all common problems with AI-generated code. There is doubt about whether the demonstrated functionality (a screenshot of Apple.com) represents a genuine achievement or merely a lucky result.

The discussion centers on whether "autonomous coding" is viable, or whether AI is best used as a powerful *tool* for human developers rather than as a fully independent builder. Many commenters stressed the importance of code quality, maintainability, and rigorous testing, areas where AI still struggles. The unmerged PR for the Solid-to-React migration was cited as an example of the challenges of integrating large AI-generated changes.

Ultimately, the consensus leans toward requiring concrete, reproducible results and a deeper analysis of the browser's functionality before declaring the experiment a success. The project raises questions about the future of software development, but it also underscores the continued need for human oversight and expertise.

Original article

We've been experimenting with running coding agents autonomously for weeks.

Our goal is to understand how far we can push the frontier of agentic coding for projects that typically take human teams months to complete.

This post describes what we've learned from running hundreds of concurrent agents on a single project, coordinating their work, and watching them write over a million lines of code and trillions of tokens.

The limits of a single agent

Today's agents work well for focused tasks, but are slow for complex projects. The natural next step is to run multiple agents in parallel, but figuring out how to coordinate them is challenging.

Our first instinct was that planning ahead would be too rigid. The path through a large project is ambiguous, and the right division of work isn't obvious at the start. We began with dynamic coordination, where agents decide what to do based on what others are currently doing.

Learning to coordinate

Our initial approach gave agents equal status and let them self-coordinate through a shared file. Each agent would check what others were doing, claim a task, and update its status. To prevent two agents from grabbing the same task, we used a locking mechanism.
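A minimal sketch of this flat, lock-based scheme, assuming a JSON coordination file and a file lock (the file layout, lock, and function names here are illustrative assumptions, not Cursor's actual harness):

```python
import fcntl
import json
from pathlib import Path

# Hypothetical shared coordination file that every agent reads and writes.
COORD_FILE = Path("coordination.json")


def claim_next_task(agent_id: str) -> dict | None:
    """Take an exclusive lock, claim the first open task, then release the lock."""
    with COORD_FILE.open("r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # every agent serializes on this one lock
        try:
            state = json.load(f)
            for task in state["tasks"]:
                if task["status"] == "open":
                    task["status"] = "in_progress"
                    task["owner"] = agent_id
                    f.seek(0)
                    f.truncate()
                    json.dump(state, f, indent=2)
                    return task
            return None
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)  # an agent that forgets this blocks everyone
```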

This failed in interesting ways:

  1. Agents would hold locks for too long, or forget to release them entirely. Even when locking worked correctly, it became a bottleneck. Twenty agents would slow down to the effective throughput of two or three, with most time spent waiting.

  2. The system was brittle: agents could fail while holding locks, try to acquire locks they already held, or update the coordination file without acquiring the lock at all.

We tried replacing locks with optimistic concurrency control. Agents could read state freely, but writes would fail if the state had changed since they last read it. This was simpler and more robust, but there were still deeper problems.
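A sketch of the optimistic variant, assuming the shared state carries a simple version counter; names and layout are illustrative, and a real implementation would make the check-and-write step atomic:

```python
import json
from pathlib import Path

COORD_FILE = Path("coordination.json")  # hypothetical shared state with a "version" field


class StaleWriteError(Exception):
    """The state changed between our read and our write."""


def read_state() -> dict:
    return json.loads(COORD_FILE.read_text())


def write_state(new_state: dict, expected_version: int) -> None:
    """Write only if nobody else has written since we read; otherwise fail fast."""
    if read_state()["version"] != expected_version:
        raise StaleWriteError("state changed since last read")
    new_state["version"] = expected_version + 1
    COORD_FILE.write_text(json.dumps(new_state, indent=2))


def claim_task(agent_id: str) -> dict | None:
    """Retry the read-modify-write loop until a claim lands or no open tasks remain."""
    while True:
        state = read_state()
        task = next((t for t in state["tasks"] if t["status"] == "open"), None)
        if task is None:
            return None
        task["status"] = "in_progress"
        task["owner"] = agent_id
        try:
            write_state(state, state["version"])
            return task
        except StaleWriteError:
            continue  # another agent won the race; re-read and try again
```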

With no hierarchy, agents became risk-averse. They avoided difficult tasks and made small, safe changes instead. No agent took responsibility for hard problems or end-to-end implementation. This led to work churning for long periods without progress.

Planners and workers

Our next approach was to separate roles. Instead of a flat structure where every agent does everything, we created a pipeline with distinct responsibilities.

  • Planners continuously explore the codebase and create tasks. They can spawn sub-planners for specific areas, making planning itself parallel and recursive.

  • Workers pick up tasks and focus entirely on completing them. They don't coordinate with other workers or worry about the big picture. They just grind on their assigned task until it's done, then push their changes.

At the end of each cycle, a judge agent determined whether to continue, then the next iteration would start fresh. This solved most of our coordination problems and let us scale to very large projects without any single agent getting tunnel vision.
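The overall loop looks roughly like the skeleton below; the three agent calls are placeholders standing in for model invocations, and the concurrency is simplified. This is a sketch of the structure, not the actual harness:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Task:
    description: str


def run_planner(goal: str) -> list[Task]:
    """An agent that explores the codebase and emits tasks.

    In the real system, planners can spawn sub-planners for specific
    areas, making planning itself parallel and recursive.
    """
    raise NotImplementedError("placeholder for the planner model call")


def run_worker(task: Task) -> None:
    """An agent that grinds on one task until it's done, then pushes its changes.

    Workers never coordinate with each other or reason about the big picture.
    """
    raise NotImplementedError("placeholder for the worker model call")


def judge_should_continue(goal: str) -> bool:
    """A judge agent that decides whether the project needs another cycle."""
    raise NotImplementedError("placeholder for the judge model call")


def run_project(goal: str, max_workers: int = 200) -> None:
    while True:
        tasks = run_planner(goal)                 # plan a batch of work
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(run_worker, tasks))     # workers run concurrently
        if not judge_should_continue(goal):
            break                                 # otherwise the next cycle starts fresh
```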

Running for weeks

To test this system, we pointed it at an ambitious goal: building a web browser from scratch. The agents ran for close to a week, writing over 1 million lines of code across 1,000 files. You can explore the source code on GitHub.

Despite the codebase size, new agents can still understand it and make meaningful progress. Hundreds of workers run concurrently, pushing to the same branch with minimal conflicts.
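One plausible way a worker could push to a shared branch with minimal conflicts is a rebase-and-retry loop; this is an assumption about the mechanism, not a description of Cursor's actual setup:

```python
import subprocess


def push_with_retry(branch: str, max_attempts: int = 5) -> bool:
    """Push to the shared branch; if the remote moved first, rebase and retry."""
    for _ in range(max_attempts):
        if subprocess.run(["git", "push", "origin", branch]).returncode == 0:
            return True
        # Another worker pushed first: pull its commits and replay ours on top.
        # If the rebase hits a real conflict, the worker agent resolves it itself.
        subprocess.run(["git", "pull", "--rebase", "origin", branch])
    return False
```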

While it might seem like a simple screenshot, building a browser from scratch is extremely difficult.

Another experiment was an in-place migration from Solid to React in the Cursor codebase. It took over 3 weeks with +266K/-193K edits. As we've started testing the changes, we believe it will be possible to merge them.

Another experiment was to improve an upcoming product. A long-running agent made video rendering 25x faster with an efficient Rust version. It also added support for smooth zooming and panning, with natural spring transitions and motion blur that follow the cursor. This code was merged and will be in production soon.

We have a few other interesting examples still running:

  • Java LSP: 7.4K commits, 550K LoC
  • Windows 7 emulator: 14.6K commits, 1.2M LoC
  • Excel: 12K commits, 1.6M LoC
  • FX1: 9.5K commits, 1.2M LoC

What we've learned

We've deployed billions of tokens across these agents toward a single goal. The system isn't perfectly efficient, but it's far more effective than we expected.

Model choice matters for extremely long-running tasks. We found that GPT-5.2 models are much better at extended autonomous work: following instructions, keeping focus, avoiding drift, and implementing things precisely and completely.

Opus 4.5 tends to stop earlier and take shortcuts when convenient, yielding back control quickly. We also found that different models excel at different roles. GPT-5.2 is a better planner than GPT-5.1-codex, even though the latter is trained specifically for coding. We now use the model best suited for each role rather than one universal model.

Many of our improvements came from removing complexity rather than adding it. We initially built an integrator role for quality control and conflict resolution, but found it created more bottlenecks than it solved. Workers were already capable of handling conflicts themselves.

The best system is often simpler than you'd expect. We initially tried to model the system on patterns from distributed computing and organizational design, but not all of them translate to agents.

The right amount of structure is somewhere in the middle. Too little structure and agents conflict, duplicate work, and drift. Too much structure creates fragility.

A surprising amount of the system's behavior comes down to how we prompt the agents. Getting them to coordinate well, avoid pathological behaviors, and maintain focus over long periods required extensive experimentation. The harness and models matter, but the prompts matter more.

What's next

Multi-agent coordination remains a hard problem. Our current system works, but we're nowhere near optimal. Planners should wake up when their tasks complete to plan the next step. Agents occasionally run for far too long. We still need periodic fresh starts to combat drift and tunnel vision.

But the core question, whether we can scale autonomous coding by throwing more agents at a problem, has a more optimistic answer than we expected. Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects.

The techniques we're developing here will eventually inform Cursor's agent capabilities. If you're interested in working on the hardest problems in AI-assisted software development, we'd love to hear from you at [email protected].
