软件工厂与能动性时刻

软件工厂与能动性时刻
Software factories and the agentic moment

## StrongDM 的软件工厂：摘要 StrongDM 构建了一个“软件工厂”——一个完全自动化的软件开发系统，由规范和场景驱动，无需人工编码或审查。这一突破得益于 LLM（如 Claude 3.5，尤其是在 2024 年 10 月修订后）的进步，关键在于该模型能够在扩展的编码流程中*积累正确性*，而不是错误。核心原则是“不干预”——不编写任何人工代码。最初的尝试失败了，直到强大的测试演变为“场景”——详细的用户故事，其有效性不是由简单的通过/失败测试来验证，而是由概率性的“满意度”评分来验证。重要的是，验证发生在“数字孪生宇宙”（DTU）中——外部服务（Okta、Jira 等）的行为克隆，允许进行高容量、安全的测试，超越生产限制。这种方法极大地改变了软件经济，使以前不可行的任务（例如构建完整的 SaaS 副本）成为例行公事。 StrongDM 强调对 LLM 代币的大量投资——表明至少 1000 美元/工程师/天表明了真正自动化工厂的充足资源。该团队的成功源于挑战传统的软件开发约束，并拥抱“深思熟虑的幼稚”来释放新的可能性。

StrongDM正在开创一种名为“软件工厂”的新软件开发方法，利用人工智能自动化编码过程中的重要部分。Simon Willison对此表示，他们的工作是他见过的最雄心勃勃的“人工智能辅助软件工程”探索，并提到了“黑暗工厂”模式。该工厂的核心位于factory.strongdm.ai，但早期用户报告称存在性能问题——具体表现为加载缓慢以及在移动设备（iOS/Safari）上的可访问性问题。尽管存在这些初期问题，该项目仍引发了人们对软件构建未来形态的期待，并推动了人工智能在开发领域可能性的边界。Willison在他的博客文章中进一步阐述了他的想法：[https://simonwillison.net/2026/Feb/7/software-factory/](https://simonwillison.net/2026/Feb/7/software-factory/)。

原文

We built a Software Factory: non-interactive development where specs + scenarios drive agents that write code, run harnesses, and converge without human review.

The narrative form is included below. If you'd prefer to work from first principles, I offer a few constraints & guidelines that, applied iteratively, will accelerate any team toward the same intuitions, convictions¹, and ultimately a factory² of your own. In kōan or mantra form:

Why am I doing this? (implied: the model should be doing this instead)

In rule form:

Code must not be written by humans
Code must not be reviewed by humans

Finally, in practical form:

If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement

The StrongDM AI Story

On July 14th, 2025, Jay Taylor and Navan Chauhan joined me (Justin McCarthy, co-founder, CTO) in founding the StrongDM AI team.

The catalyst was a transition observed in late 2024: with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error.

Compounding correctness vs compounding error

By December of 2024, the model's long-horizon coding performance was unmistakable via Cursor's YOLO mode.

Prior to this model improvement, iterative application of LLMs to coding tasks would accumulate errors of all imaginable varieties (misunderstandings, hallucinations, syntax, version DRY violations, library incompatibility, etc). The app or product would decay and ultimately "collapse": death by a thousand cuts, etc.

Together with YOLO mode, the updated model from Anthropic provided the first glimmer of what we now refer to internally as non-interactive development or grown software.

Find Knobs, Turn To Eleven

In the first hour of the first day of our AI team, we established a charter which set us on a path toward a series of findings (which we refer to as our "unlocks"). In retrospect, the most important line in the charter document was the following:

Initially it was just a hunch. An experiment. How far could we get, without writing any code by hand?

Not very far! At least: not very far, until we added tests. However, the agent, obsessed with the immediate task, soon began to take shortcuts: return true is a great way to pass narrowly written tests, but probably won't generalize to the software you want.

Tests were not enough. How about integration tests? Regression tests? End-to-end tests? Behavior tests?

From Tests to Scenarios and Satisfaction

One recurring theme of the agentic moment: we need new language. For example, the word "test" has proven insufficient and ambiguous. A test, stored in the codebase, can be lazily rewritten to match the code. The code could be rewritten to trivially pass the test.

We repurposed the word scenario to represent an end-to-end "user story", often stored outside the codebase (similar to a "holdout" set in model training), which could be intuitively understood and flexibly validated by an LLM.

Synthetic scenario curation and shaping interface

Because much of the software we grow itself has an agentic component, we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term satisfaction to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?

Validating Scenarios in the Digital Twin Universe

In previous regimes, a team might rely on integration tests, regression tests, UI automation to answer "is it working?"

We noticed two limitations of previously reliable techniques:

Tests are too rigid - we were coding with agents, but we're also building with LLMs and agent loops as design primitives; evaluating success often required LLM-as-judge
Tests can be reward hacked - we needed validation that was less vulnerable to the model cheating

The Digital Twin Universe is our answer: behavioral clones of the third-party services our software depends on. We built twins of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, replicating their APIs, edge cases, and observable behaviors.

With the DTU, we can validate at volumes and rates far exceeding production limits. We can test failure modes that would be dangerous or impossible against live services. We can run thousands of scenarios per hour without hitting rate limits, triggering abuse detection, or accumulating API costs.

Digital Twin Universe: behavioral clones of Okta, Jira, Google Docs, Slack, Drive, and Sheets
(click to enlarge)

Unconventional Economics

Our success with DTU illustrates one of the many ways in which the Agentic Moment has profoundly changed the economics of software. Creating a high fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against, but self-censored the proposal to build it. They didn't even bring it to their manager, because they knew the answer would be no.

Those of us building software factories must practice a deliberate naivete: finding and removing the habits, conventions, and constraints of Software 1.0. The DTU is our proof that what was unthinkable six months ago is now routine.