What the success of coding agents tells us about AI systems in general.
Software is mostly all you need

原始链接: https://softwarefordays.com/post/software-is-mostly-all-you-need/

## The Rise of AI-Native Software (Summary)

Recent advances in AI coding agents have demonstrated astonishing automation of software development, completing in hours work that once took weeks. Notably, software *built* with these agents begins to exhibit characteristics of the neural networks that created it, learning through a loop of code (the policy), deployment (the episode), and bug reports (the reward).

Success, however, depends on separating *judgment* (best handled by neural networks) from *execution* (best handled by traditional software). AI excels at fuzzy classification and decision-making, while traditional code provides the determinism, auditability, and precision that reliable operation requires.

The most effective architecture delegates judgment to AI at *buildtime*, producing durable, version-controlled code that executes at runtime. This contrasts with architectures such as browser-use and Stagehand, which conflate the two, yielding brittle, opaque systems.

Ultimately, this approach enables adaptive software systems that benefit from reinforcement-learning-style adaptation without sacrificing the key properties of traditional code. The future is not AI *replacing* software, but *accelerating* its creation and extending it to a far wider range of tasks.

From the Hacker News thread (6 points, submitted by jbmilgrom):

verdverm: Lost me at the claim that AI is good at judgment. That is the exact opposite of my experience; they make good and bad decisions with the same reliability.

wrs: In other words, this is a higher-level JIT compiler: it still generates code dynamically based on runtime observations, but the code is written in a language higher-level than assembly, and the observed context is richer than mere runtime data types.

Original article

January 8, 2026

Neural Networks at Buildtime, Software at Runtime

Over the last 6 months, and the last 6 weeks in particular, AI coding agents have proven incredibly capable at writing software. Tasks that traditionally required weeks of human labor can now be done in days if not hours. Even more remarkably, software systems that are designed from the start to harness AI coding agents exhibit many of the characteristics of the neural nets that were integral to their creation in the first place. These AI-native software systems are learned, not designed. Code is the policy, deployment is the episode, and the bug report is the reward signal - well-architected coding agents can drive this loop with little human intervention. Unlike traditional reinforcement learning architectures, they are encoded in CPU instruction sets instead of neural network weights, but they are learned just the same.

The success of coding agents, and of the software systems built thereon, carries lessons about where to apply AI agents in general. Coding, like many other creative tasks, requires judgment: how best to implement some function with input A and output B; how to name some variable; whether to reuse some function or implement a new version; etc. Neural networks excel at judgment (more on why below). Yet many of the agentic deployments we are seeing in the wild target tasks that can be fully specified as explicit instructions. Of course, traditional software excels at executing explicit instructions; any programming language can be executed on today's machinery at billions of instructions per second.

Coding agents get this exactly right, since by definition they are making a series of judgments when writing code at buildtime and leaving the execution of such code to machines operating at runtime. The best performing architectures follow suit, delegating judgment to neural networks and execution to traditional software, even when the executable artifacts are produced entirely by AI.

## Some Agents in Practice

Many agentic AI projects are failing — agentic drift, opaque debugging, brittle autonomy. Meanwhile, Claude Code has driven significant productivity gains by doing something different: it writes code that humans review and deploy, producing artifacts that are durable, version-controlled, and deterministic.

These failures and successes reflect a fundamental architectural difference.

## Judgment and Execution Historically

Humans have historically done two different types of jobs for different reasons, and AI changes each differently.

Judgment is fuzzy classification that cannot be specified as explicit rules. This variable should be made private, not public; this handwritten letter is a “B”, not a “P”; this customer complaint is about a refund, not fraud; this image contains a receipt; this element on some unfamiliar page is “the login button.” Humans did these tasks because traditional CPU-based Von Neumann machines simply could not. The rules could not be written down, and even today exist only as learned boundaries in high-dimensional space. Minimization of a loss function via gradient descent in a vastly high-dimensional space draws these boundaries inside neural networks without confinement to the nouns and verbs of English, C, or even Rust (lol).

Execution is discrete logic that can be specified as explicit rules. If complaint type is refund and days since purchase is less than 30, approve; if machine type is CPAP and facility code is X, the SKU is ABC-123; click the element with selector a[href="/login"]. Humans did these tasks, even though Von Neumann machines theoretically could do them faster and more reliably, because writing and operating software systems that encode these rules was expensive. The investment simply was not worth the savings; the barrier was cost, not any fuzziness inherent to the task.

## Common Conflations Today

Dominant agent architectures conflate judgment and execution, frequently using neural networks for both. The consensus definition of an agent — “an LLM runs tools in a loop to achieve a goal” — clarifies the mechanism but not the problem space.

Frameworks like browser-use and Stagehand embody this conflation. Consider browser-use:

agent = Agent(task="Find the top HN post", llm=llm, browser=browser)
await agent.run()

Or Stagehand:

await stagehand.act("click on the stagehand repo");
await agent.execute("Get to the latest PR");

In both cases, the LLM performs judgment (which element is “the stagehand repo”?) and execution (click it, figure out the next step, click that). The entire loop is neural. No durable artifact emerges. The LLM is the runtime.

## Why Execution Requires Traditional Software

Neural networks lack the properties that execution requires: determinism, auditability, and precision on edge cases.

Consider this business logic from a system that processes medical equipment orders (from Docflow Labs, my startup):


// Prefer the machine code attached to the scripted order
const scriptedMachineCode = extractMachineCodeFromScripted(scriptedMachine);
if (scriptedMachineCode && machineType) {
  const machineMake = lookupMachineSku(
    machineType,
    scriptedMachineCode,
    classification
  );
  if (machineMake) {
    machineSku = machineMake;
  }
}

// Fall back to the facility's machine code when no SKU was found
if (!machineSku || machineSku.trim() === "") {
  const facilityMachineCode = extractMachineCodeFromFacility(facility);
  if (facilityMachineCode && machineType) {
    const machineMake = lookupMachineSku(
      machineType,
      facilityMachineCode,
      classification
    );
    if (machineMake) {
      machineSku = machineMake;
    }
  }
}

This code handles combinations that may occur once a year — a rare facility, an unusual machine type, a specific classification. The code provides 100% precision even for edge cases. When a billing dispute arises and someone asks why the system chose rental versus purchase for a particular patient, the logic can be traced line by line. It lives in version control and is semantically transparent, deterministic, and auditable.

A neural network approximating this function cannot provide these properties. Sparse training data will never cover the combinatorial space. Moreover, it blurs boundaries that business requires to be sharp. And it fails opaquely — gradients and activations offer no affordance for debugging. Decisions in this substrate are semantically opaque, non-deterministic, and untraceable.

## The Stagehand Example: Half Right

Stagehand’s act("click on the stagehand repo") correctly implements judgment via a neural network in some sense. Which element on any dynamically chosen page corresponds to the “stagehand repo” cannot be represented in traditional software. There are too many permutations of page layout. The fuzziness of these boundaries is best approached by neural networks in massively multidimensional space minimizing some loss function against many examples.

In another sense, however, Stagehand’s architecture is limited. We may know ahead of time which webpage we are attempting a click against and it may change infrequently, requiring only a one-time (or few-time) judgment.

Yet Stagehand produces no executable artifact by design. Instead, the LLM returns a selector, which gets cached opaquely outside version control. On cache miss, the LLM re-engages at runtime to re-interpret the instruction, invoking a neural net.

A better architecture might still allow the LLM to make a judgment and return a selector, but afford positioning this judgment squarely at buildtime. The selector gets emitted as code into a Playwright script. The script is committed to version control, reviewed, and deployed. On failure — because the site changed and the selector broke — the development process re-engages. An AI agent rewrites the script. Same judgment, different artifact. The selector becomes a semantically transparent piece of the underlying software system, not ephemeral runtime state.
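As a rough sketch of that buildtime step, consider a helper that freezes the LLM's one-time judgment (a CSS selector) into a generated Playwright script. The template, URL, selector, and function names here are illustrative assumptions, not Stagehand's or Playwright's actual output:

```python
# Sketch: an LLM answers "which element is the stagehand repo?" exactly once,
# at buildtime; its answer is emitted into a Playwright script that gets
# committed, reviewed, and deployed like any other code. All names and the
# selector value are hypothetical.
PLAYWRIGHT_TEMPLATE = """\
from playwright.sync_api import sync_playwright

def run():
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto({url!r})
        page.click({selector!r})  # buildtime judgment, frozen as code
"""

def emit_click_script(url: str, selector: str) -> str:
    """Freeze a one-time LLM judgment into a durable, reviewable artifact."""
    return PLAYWRIGHT_TEMPLATE.format(url=url, selector=selector)

script = emit_click_script(
    "https://github.com/browserbase",
    'a[href="/browserbase/stagehand"]',  # hypothetical LLM-returned selector
)
```

When the selector breaks, a coding agent regenerates this file and the diff goes through review; no model runs while the script executes.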

## A Better Architecture

Neural nets may remain at runtime when tackling judgments that can only be made dynamically at runtime. Every other LLM agent belongs at buildtime, accelerating the production of executable software.

complaint_text = get_complaint()

# Judgment: fuzzy classification, delegated to a neural network
complaint_type = llm.classify(
    complaint_text,
    categories=["refund", "fraud", "shipping", "other"]
)

# Execution: explicit, deterministic business logic in traditional code
if complaint_type == "refund":
    if days_since_purchase < 30 and item_condition == "unopened":
        approve_refund()
    else:
        escalate_to_human()
elif complaint_type == "fraud":
    flag_for_review()
The workflow orchestrator is traditional software. It calls out to neural networks for judgment tasks: classification, extraction, interpretation — the fuzzy pattern matching that cannot be specified as rules. Then it executes business logic itself, deterministically. The execution paths are explicit, auditable, and version-controlled.
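The judgment boundary can be kept deliberately narrow: ask the model only to pick one label from a closed set, and degrade safely when it strays. A minimal sketch, where `classify` and `call_model` are hypothetical names, not any real library's API:

```python
# Hypothetical judgment wrapper: the model sees a closed category set and
# anything it returns outside that set falls back to "other", so the
# deterministic business logic downstream never receives a surprise label.
def classify(call_model, text: str, categories: list[str]) -> str:
    prompt = (
        f"Classify the complaint into exactly one of {categories}. "
        f"Reply with the label only.\n\n{text}"
    )
    label = call_model(prompt).strip().lower()
    return label if label in categories else "other"

fake_model = lambda prompt: " Refund\n"  # stand-in for a real LLM call
assert classify(fake_model, "I want my money back",
                ["refund", "fraud", "shipping", "other"]) == "refund"
```

Everything after the `classify` call is ordinary, auditable control flow.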

This is not a new pattern. Production ML systems already work this way: a model classifies, code acts. What’s new is that AI agents can write the code, dissolving the apparent tradeoff between RPA (deterministic but brittle) and AI agents (adaptive but unpredictable).

## Development Time Approaching Runtime

Software systems have historically maintained a clear separation between two domains: development (humans writing code, days or weeks) and execution (CPUs running code, nanoseconds). AI coding agents close this gap. The theoretical limit as development time approaches zero is runtime:

$$\lim_{\text{devtime} \to 0} \text{buildtime} = \text{runtime}$$

Even if AI never achieves nanosecond times for writing software, timescales of hours, minutes, and perhaps even seconds allow software systems to adapt to feedback as it arrives.

As AI agents get more capable, the distinction between “writing code” and “running code” may dissolve. What emerges resembles reinforcement learning with a different substrate. In traditional RL, a neural network observes state, outputs an action, receives a reward signal, and updates its weights. The network is the adaptive element.

Substitute software for the neural network and the structure remains identical. The system observes data — requests, errors, metrics, user complaints. Code executes a response. Feedback arrives. An AI agent updates the code. Same adaptive loop, different computable substrate.
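The RL-shaped loop with code as the policy can be sketched schematically. Here the "policy" is a plain rules table standing in for source code, and the "agent" is a stub that patches the rule that failed; in the real system it would be a coding agent rewriting files. All names are illustrative:

```python
# Schematic of the adaptive loop: code is the policy, deployment is the
# episode, the bug report is the reward signal, and a (stubbed) coding
# agent performs the update step.
def run_adaptive_loop(policy, episodes, rewrite):
    """policy: symbolic rules (stand-in for code); episodes: (input, expected)
    pairs; rewrite: the 'coding agent' that patches the policy on failure."""
    for observed, expected in episodes:                   # deployment is the episode
        action = policy.get(observed, "escalate")         # code executes a response
        if action != expected:                            # bug report as reward signal
            policy = rewrite(policy, observed, expected)  # agent updates the code
    return policy

# Stub agent: patch exactly the rule that failed, nothing else.
rewrite = lambda p, k, v: {**p, k: v}
learned = run_adaptive_loop({}, [("refund", "approve"), ("refund", "approve")], rewrite)
assert learned == {"refund": "approve"}
```

The update is a reviewable diff to a symbolic structure, which is precisely what the weight-based version of the loop cannot offer.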

The difference in representation matters. Neural networks encode behavior in opaque weight matrices. Software encodes behavior in symbolic, human-readable form. Software can be audited, debugged, and surgically modified if necessary. A single fallback chain can be altered without retraining an entire model and hoping it generalizes correctly. The symbolic substrate preserves the properties that production systems often require: interpretability, debuggability, auditability, and surgical modifiability. When the learned update mechanism provides adaptability, you get the benefits of RL without the costs.

## Adaptable Software Systems

Ironically, software is still mostly all you need at runtime.

Neural networks are best reserved for judgment — the fuzzy tasks we cannot otherwise specify in language — and for buildtime acceleration. Neural networks will not replace traditional software, but rather enable its proliferation into corners of the economy that could benefit from reliable discrete logical execution at a fraction of historical costs.

An architecture where neural networks handle runtime judgment, software handles execution, and AI agents accelerate buildtime creates a symbolic substrate that is nonetheless adaptable — auditability, determinism, and precision alongside the adaptability of learned systems.


This is what we’re building at Docflow Labs: adaptive systems with a symbolic substrate. If this resonates, say hello!
