Ornith-1.0: Self-scaffolding LLMs for agentic coding

Original link: https://deep-reinforce.com/ornith_1_0.html

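Ornith-1.0 is a new family of open-source models optimized for agentic coding tasks, ranging from a lightweight 9B-parameter model for edge deployment to a 397B-parameter frontier-scale model. The family is built on Gemma 4 and Qwen 3.5 and performs strongly on mainstream benchmarks such as SWE-Bench Verified and Terminal-Bench 2.1, with the 397B version rivaling Claude Opus 4.7. Ornith-1.0's breakthrough lies in its self-improving training framework. Rather than relying on human-written harnesses, the model co-evolves its problem-solving strategy and the task-specific scaffold (orchestration logic) that guides each task. Through reinforcement learning, the model keeps refining these scaffolds so that they induce higher-reward search trajectories. To prevent reward hacking, the framework uses three layers of defense: an immutable environment boundary, a deterministic tool-use monitor, and a frozen LLM judge that acts as a veto. The model also handles long, asynchronous training rollouts with a pipeline-RL strategy that applies staleness-based token weighting. This approach lets Ornith-1.0 arrive at high-quality, automated coding strategies without human intervention and keep improving itself, delivering strong and efficient performance across deployment scales.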

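The Hacker News discussion focuses on **Ornith-1.0**, a self-scaffolding large language model designed for agentic coding. A user-run benchmark that specifically tests a model's ability to find security vulnerabilities showed mixed results. Ornith performed poorly when its tools were restricted to read/grep/ls, but its performance doubled once it was given full shell and Python access, which confirms its design intent of solving problems by building tools. It was nevertheless reported to trail the Qwen AgentWorld model. A major point of contention was the model's performance claims. Users clarified that Ornith 9B is a fine-tune of Qwen 3.5, and the claim that it rivals 35B-class models was met with skepticism. Some users found the 35B version of Ornith reasonably capable, while others noted that it cannot surpass industry leaders such as DeepSeek. The conversation also prompted a technical debate over whether the "self-scaffolding" approach is a genuine architectural innovation or merely a sophisticated form of prompt optimization centered on code execution.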

Original text

Aloha! 🌺

Today, we are introducing Ornith-1.0, a self-improving family of open-source models built specifically for agentic coding tasks. Ornith-1.0 spans the full spectrum, from compact 9B Dense models suitable for edge device deployment to 397B MoE frontier-scale models optimized for maximum performance, with variants including 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.

The key innovation behind Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts. By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions.

Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across a broad range of agentic coding benchmarks. Ornith-1.0-397B (77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified) matches the performance of Claude Opus 4.7 (70.3 on TB-2.1 and 80.8 on SWE-Bench Verified) and outperforms leading open-source models of similar size, including MiniMax M3 (66.0 on TB-2.1 and 80.5 on SWE-Bench Verified) and DeepSeek-V4-Pro (67.9 on TB-2.1 and 80.6 on SWE-Bench Verified). Ornith-1.0-9B, which can be easily deployed on edge devices, matches or exceeds the performance of much larger models such as Gemma 4-31B and Qwen 3.6-35B.

At the flagship scale, Ornith-1.0-397B achieves 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks and outperforming leading open-source models of similar size, including MiniMax M3 and DeepSeek-V4-Pro.

Ornith-1.0-35B significantly outperforms similarly sized models, including Qwen 3.5-35B, Qwen 3.6-35B, and Gemma 4-31B. Despite having only 35B parameters, it even surpasses Qwen 3.5-397B on Terminal-Bench 2.1 (64.4 vs. 53.5) while matching its performance across several other coding and agentic benchmarks.

The edge-deployable Ornith-1.0-9B also delivers remarkably strong results, achieving 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. Despite being a compact 9B-parameter model, it matches or exceeds the performance of much larger models such as Gemma 4-31B, demonstrating that strong agentic coding capabilities can be achieved even in resource-efficient deployments.

At the core of Ornith-1.0 is a self-improving training framework that jointly learns to solve tasks and to construct the scaffolds that guide those solutions. Rather than relying on a fixed, human-designed harness shared across a task category, Ornith-1.0 treats the scaffold as a learnable object that co-evolves with the policy.

Each RL step proceeds in two stages: conditioned on a task and the scaffold previously used for it, the model first proposes a refined scaffold; conditioned on that scaffold and the task description, it then generates a solution rollout. Reward from the rollout is propagated to both stages, so the model is optimized not only to produce better answers but to author the orchestration that elicits them.

Repeated over training, this yields a feedback loop in which scaffolds are continually mutated and selected toward those that induce higher-reward trajectories, allowing per-task-category strategies to emerge automatically and driving sustained capability gains without hand-engineered harness design.
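The two-stage step and its repetition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `rl_step`, `StageSample`, the three callables, and the per-task `scaffold_store` are hypothetical names, and the post does not say how scaffolds are stored between steps or how the shared reward is turned into gradients.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StageSample:
    """One trainable sample; both stages of a step carry the same rollout reward."""
    stage: str     # "scaffold" or "solution"
    context: str   # what the model was conditioned on
    output: str    # what the model generated
    reward: float


def rl_step(
    task: str,
    scaffold_store: Dict[str, str],
    propose_scaffold: Callable[[str, str], str],
    generate_solution: Callable[[str, str], str],
    score_rollout: Callable[[str, str], float],
) -> List[StageSample]:
    """Run one two-stage step for a task (hypothetical interface)."""
    previous = scaffold_store.get(task, "")

    # Stage 1: refine the scaffold, conditioned on the task and the old scaffold.
    refined = propose_scaffold(task, previous)

    # Stage 2: generate a solution rollout, conditioned on the task and new scaffold.
    solution = generate_solution(task, refined)

    # The rollout's reward is attached to both stages.
    reward = score_rollout(task, solution)

    # Assumption: the scaffold just used becomes the "previous" one next time.
    scaffold_store[task] = refined

    return [
        StageSample("scaffold", f"{task}\n{previous}", refined, reward),
        StageSample("solution", f"{task}\n{refined}", solution, reward),
    ]
```

Calling `rl_step` repeatedly with the same `scaffold_store` gives the loop described above: each call starts from the scaffold the previous call left behind, and the samples it returns reward the scaffold proposal and the solution together.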

Addressing Reward Hacking in Self-improvement

Allowing the model to author its own scaffold naturally invites reward hacking. A self-generated scaffold can learn to satisfy the verifier without performing the task, for example by reading the visible test files and hardcoding the expected artifacts (touching the checked-for file or writing the literal expected output), or by copying an oracle solution present in the environment.

We defend against this in three layers. First, we fix the outer trust boundary: the environment, the tool surface, and test isolation are immutable and outside the model's reach. The model therefore evolves only the inner policy scaffold, meaning its memory, error-handling, and orchestration logic.

Second, a deterministic monitor enforces that boundary at the level where it can be specified exactly. It flags any attempt to read withheld paths, modify verification scripts, or invoke actions outside the sanctioned tool surface, assigns such trajectories zero reward, and excludes them from the advantage computation.

Third, because intent-level gaming can occur entirely within the allowed tool surface, a frozen LLM judge acts as a veto on top of the verifier rather than as the primary reward signal.
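A rough sketch of how the second and third layers might combine, assuming a simple action log per trajectory. Everything named here is hypothetical: the tool names, the withheld and verifier paths, the `Action` record, and the choice to give a judge-vetoed trajectory zero reward while keeping it in the advantage baseline are illustrative assumptions, not details from the post. The advantage shown is a plain group-normalized one.

```python
import posixpath
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

# Hypothetical policy values; the post does not list the real ones.
SANCTIONED_TOOLS = {"read_file", "write_file", "run_shell"}
WITHHELD_ROOTS = ("/hidden_tests", "/oracle")
VERIFIER_ROOTS = ("/verify",)


@dataclass
class Action:
    tool: str
    path: str = ""
    writes: bool = False


def _under(path: str, roots: Sequence[str]) -> bool:
    # Normalize "a/../b" style paths; not hardened against symlinks or shell tricks.
    norm = posixpath.normpath(path)
    return any(norm == root or norm.startswith(root + "/") for root in roots)


def monitor_violation(actions: Sequence[Action]) -> Optional[str]:
    """Deterministic check: return the first rule broken, or None if clean."""
    for action in actions:
        if action.tool not in SANCTIONED_TOOLS:
            return "unsanctioned tool"
        if action.path and _under(action.path, WITHHELD_ROOTS):
            return "withheld path"
        if action.writes and action.path and _under(action.path, VERIFIER_ROOTS):
            return "verifier modified"
    return None


def gated_reward(
    actions: Sequence[Action], verifier_reward: float, judge_vetoes: bool
) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage) for one trajectory."""
    if monitor_violation(actions) is not None:
        return 0.0, False          # zero reward and excluded, as the post describes
    if judge_vetoes:
        return 0.0, True           # assumption: a veto zeroes the reward only
    return verifier_reward, True   # the judge can lower a reward, never raise it


def group_advantages(
    rewards: Sequence[float], included: Sequence[bool], eps: float = 1e-6
) -> List[float]:
    """Group-normalized advantages computed over included trajectories only."""
    kept = [r for r, keep in zip(rewards, included) if keep]
    if not kept:
        return [0.0] * len(rewards)
    mu, sigma = mean(kept), pstdev(kept)
    # Excluded trajectories get advantage 0.0, so they contribute no gradient.
    return [
        (r - mu) / (sigma + eps) if keep else 0.0
        for r, keep in zip(rewards, included)
    ]
```

The point of the ordering is that the verifier stays the source of positive reward: the monitor and the judge can only remove reward from a trajectory, which keeps the frozen judge from becoming a second objective the policy could learn to flatter.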

Asynchronous RL Training

For RL training, to address the off-policy problem that arises with long rollouts, Ornith-1.0 adopts a pipeline-RL strategy. To control the effect of earlier-generated off-policy tokens, we apply a staleness weight \(w(d_t)\) that downweights tokens according to their age \(d_t\) and drops them entirely once a threshold is exceeded:

\[ w(d_t)= \begin{cases} \!1, & \text{if } d_t \le K_1,\\ \!\exp\!\bigl(-\lambda(d_t-K_1)\bigr), & \text{if } K_1 < d_t \le K_2,\\ \!0, & \text{if } d_t > K_2. \end{cases} \]
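The piecewise weight above translates directly into code. A minimal sketch, assuming the age \(d_t\) is counted in some discrete unit such as policy updates since the token was sampled (the post only calls it the token's age); the parameter names `k1`, `k2`, and `lam` stand for \(K_1\), \(K_2\), and \(\lambda\), and the values in the usage note are made up for illustration.

```python
import math


def staleness_weight(d_t: float, k1: float, k2: float, lam: float) -> float:
    """Weight w(d_t) for a token of age d_t, following the three cases above."""
    if d_t <= k1:
        return 1.0                              # fresh enough: full weight
    if d_t <= k2:
        return math.exp(-lam * (d_t - k1))      # exponential decay past K_1
    return 0.0                                  # too stale: dropped entirely
```

With illustrative values `k1=4`, `k2=8`, `lam=0.5`, a token of age 2 keeps full weight, a token of age 6 is scaled by `exp(-1.0)`, and a token of age 9 contributes nothing.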

The token-level GRPO loss is weighted as follows:

\[ L_t=\min\!\bigl(r_t A_t,\; \mathrm{clip}(r_t,1-\epsilon^{-},1+\epsilon^{+})A_t\bigr)\cdot w(d_t), \]

where

\[ r_t= \frac{\pi_{\theta}(y_t \mid x, y_{<t})} {\pi_{\theta_t^{\mathrm{beh}}}(y_t \mid x, y_{<t})} \]
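As a worked illustration, the per-token term \(L_t\) can be computed from log-probabilities. This is a sketch under assumptions: the function name and its arguments are hypothetical, the staleness weight is passed in already computed, and the value is returned as written in the formula. In PPO-style training the clipped surrogate is maximized, so a training loop would typically minimize its negative.

```python
import math


def weighted_token_term(
    logp_current: float,   # log pi_theta(y_t | x, y_<t)
    logp_behavior: float,  # log-prob under the behavior policy that sampled y_t
    advantage: float,      # A_t
    weight: float,         # staleness weight w(d_t)
    eps_low: float,        # epsilon^- (lower clip range)
    eps_high: float,       # epsilon^+ (upper clip range)
) -> float:
    """Clipped surrogate for one token, scaled by its staleness weight."""
    ratio = math.exp(logp_current - logp_behavior)            # r_t
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)  # clip(r_t, ...)
    return min(ratio * advantage, clipped * advantage) * weight
```

Because the weight multiplies the whole term, a token with \(w(d_t)=0\) drops out of the update regardless of its ratio or advantage, while a fresh token with \(w(d_t)=1\) reduces to the usual clipped term.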