Aloha! 🌺
Today, we are introducing Ornith-1.0, a self-improving family of open-source models built specifically for agentic coding tasks. Ornith-1.0 spans the full spectrum, from compact 9B Dense models suitable for edge deployment to 397B MoE frontier-scale models optimized for maximum performance. The family has four variants: 9B Dense, 31B Dense, 35B MoE, and 397B MoE. Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.
The key innovation behind Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts. By jointly optimizing the scaffold and the resulting solution, the model can discover better search trajectories and generate higher-quality solutions.
Ornith-1.0 achieves state-of-the-art performance among open-source models of comparable size across a broad range of agentic coding benchmarks. Ornith-1.0-397B (77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified) surpasses Claude Opus 4.7 (70.3 on TB-2.1 and 80.8 on SWE-Bench Verified) and outperforms leading open-source models of similar size, including MiniMax M3 (66.0 on TB-2.1 and 80.5 on SWE-Bench Verified) and DeepSeek-V4-Pro (67.9 on TB-2.1 and 80.6 on SWE-Bench Verified). Ornith-1.0-9B, which can be easily deployed on edge devices, matches or exceeds the performance of much larger models such as Gemma 4-31B and Qwen 3.6-35B.
At the flagship scale, Ornith-1.0-397B achieves 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified, surpassing Claude Opus 4.7 on both benchmarks and outperforming leading open-source models of similar size, including MiniMax M3 and DeepSeek-V4-Pro.
Ornith-1.0-35B significantly outperforms similarly sized models, including Qwen 3.5-35B, Qwen 3.6-35B, and Gemma 4-31B. Despite having only 35B parameters, it even surpasses Qwen 3.5-397B on Terminal-Bench 2.1 (64.4 vs. 53.5) while matching its performance across several other coding and agentic benchmarks.
The edge-deployable Ornith-1.0-9B also delivers remarkably strong results, achieving 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified. Despite being a compact 9B-parameter model, it matches or exceeds the performance of much larger models such as Gemma 4-31B, demonstrating that strong agentic coding capabilities can be achieved even in resource-efficient deployments.
At the core of Ornith-1.0 is a self-improving training framework that jointly learns to solve tasks and to construct the scaffolds that guide those solutions. Rather than relying on a fixed, human-designed harness shared across a task category, Ornith-1.0 treats the scaffold as a learnable object that co-evolves with the policy.
Each RL step proceeds in two stages: conditioned on a task and the scaffold previously used for it, the model first proposes a refined scaffold; conditioned on that scaffold and the task description, it then generates a solution rollout. Reward from the rollout is propagated to both stages, so the model is optimized not only to produce better answers but to author the orchestration that elicits them.
Repeated over training, this yields a feedback loop in which scaffolds are continually mutated and selected toward those that induce higher-reward trajectories, allowing per-task-category strategies to emerge automatically and driving sustained capability gains without hand-engineered harness design.
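The two-stage step described above can be sketched roughly as follows. This is a minimal illustration, not the actual training code: `two_stage_step`, `StepRecord`, and the three callables (`propose_scaffold`, `generate_solution`, `verify`) are hypothetical names, and storing the latest scaffold as the "previously used" one for the next step is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StepRecord:
    """Outcome of one two-stage step (hypothetical record format)."""
    task_id: str
    scaffold: str
    solution: str
    reward: float


def two_stage_step(
    task_id: str,
    task_description: str,
    scaffolds: Dict[str, str],
    propose_scaffold: Callable[[str, str], str],
    generate_solution: Callable[[str, str], str],
    verify: Callable[[str, str], float],
) -> StepRecord:
    # Scaffold previously used for this task (empty on the first visit).
    previous = scaffolds.get(task_id, "")
    # Stage 1: propose a refined scaffold from the task and the old scaffold.
    scaffold = propose_scaffold(task_description, previous)
    # Stage 2: generate a solution rollout from the scaffold and the task.
    solution = generate_solution(task_description, scaffold)
    # One scalar reward; the trainer would credit it to both stages.
    reward = verify(task_description, solution)
    # Assumption: the newest scaffold becomes the "previous" one next time.
    scaffolds[task_id] = scaffold
    return StepRecord(task_id, scaffold, solution, reward)
```

In an actual trainer, the returned reward would feed the policy-gradient update for the tokens of both the scaffold proposal and the solution rollout.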
Addressing Reward Hacking in Self-improvement
Allowing the model to author its own scaffold introduces a risk of reward hacking. A self-generated scaffold can learn to satisfy the verifier without performing the task, for example by reading the visible test files and hardcoding the expected artifacts (touching the checked-for file or writing the literal expected output), or by copying an oracle solution present in the environment.
We defend against this in three layers. First, we fix the outer trust boundary: the environment, the tool surface, and test isolation are immutable and outside the model's reach. The model therefore evolves only the inner policy scaffold, that is, its memory, error-handling, and orchestration logic.
Second, a deterministic monitor enforces that boundary at the level it can be specified exactly, flagging any attempt to read withheld paths, modify verification scripts, or invoke actions outside the sanctioned tool surface, and assigning such trajectories zero reward with exclusion from the advantage computation.
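A deterministic monitor of this kind might look like the sketch below. Everything here is illustrative: the `Action` record, the tool names, and the paths in `ALLOWED_TOOLS`, `WITHHELD_PREFIXES`, and `VERIFIER_PATHS` are hypothetical, and a real monitor would likely hook the sandbox rather than inspect a log after the fact.

```python
import posixpath
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Action:
    """One tool call in a trajectory (hypothetical record format)."""
    tool: str
    path: Optional[str] = None
    writes: bool = False


# Hypothetical policy values, chosen only for illustration.
ALLOWED_TOOLS = frozenset({"shell", "read_file", "write_file"})
WITHHELD_PREFIXES = ("/hidden_tests/", "/oracle/")
VERIFIER_PATHS = frozenset({"/workspace/run_tests.sh"})


def is_violation(action: Action) -> bool:
    # Any tool outside the sanctioned surface is a violation.
    if action.tool not in ALLOWED_TOOLS:
        return True
    if action.path is None:
        return False
    # Normalize so that ".." segments cannot disguise a withheld path.
    path = posixpath.normpath(action.path)
    if any((path + "/").startswith(prefix) for prefix in WITHHELD_PREFIXES):
        return True
    # Reading a verification script is allowed here; modifying it is not.
    return action.writes and path in VERIFIER_PATHS


def monitored_reward(
    actions: Sequence[Action], raw_reward: float
) -> Tuple[float, bool]:
    """Return (reward, include_in_advantage) for one trajectory."""
    if any(is_violation(a) for a in actions):
        return 0.0, False  # zero reward, excluded from the advantage
    return raw_reward, True
```

The second element of the returned pair lets the trainer drop flagged trajectories before computing group statistics, so that a zeroed reward does not shift the baseline for the remaining rollouts.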
Third, because intent-level gaming can occur entirely within the allowed tool surface, a frozen LLM judge acts as a veto on top of the verifier rather than serving as the primary reward.
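The difference between a veto and a primary reward can be made concrete with a small sketch. The function name `vetoed_reward` is hypothetical, and zeroing the reward on a veto is an assumption; the text above does not state what value a vetoed trajectory receives.

```python
def vetoed_reward(verifier_reward: float, judge_flags_gaming: bool) -> float:
    """Combine the verifier reward with a judge veto (illustrative only)."""
    # Assumption: a veto zeroes the reward. The judge can only withhold
    # reward the verifier granted; it never adds reward of its own.
    return 0.0 if judge_flags_gaming else verifier_reward
```

Under this arrangement a trajectory that fails verification gains nothing by pleasing the judge, which limits the incentive to optimize against the judge itself.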
Asynchronous RL Training
To address the off-policy problem that arises with long rollouts during RL training, Ornith-1.0 adopts the pipeline-RL strategy. To control the effect of earlier-generated off-policy tokens, we apply a staleness weight \(w(d_t)\) that downweights tokens according to their age \(d_t\) and drops them entirely once a threshold is exceeded:
\[ w(d_t)= \begin{cases} \!1, & \text{if } d_t \le K_1,\\ \!\exp\!\bigl(-\lambda(d_t-K_1)\bigr), & \text{if } K_1 < d_t \le K_2,\\ \!0, & \text{if } d_t > K_2. \end{cases} \]
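As a direct transcription of the piecewise definition above, the weight can be computed as in the following sketch. The function and parameter names (`staleness_weight`, `k1`, `k2`, `lam`) are our own, and the numbers in the usage note are arbitrary illustrative values, not the settings used in training.

```python
import math


def staleness_weight(d_t: float, k1: float, k2: float, lam: float) -> float:
    """Weight for a token of age d_t, following the piecewise rule w(d_t)."""
    if d_t <= k1:
        return 1.0  # fresh enough: full weight
    if d_t <= k2:
        return math.exp(-lam * (d_t - k1))  # exponential decay past K1
    return 0.0  # too stale: dropped entirely
```

For example, with the illustrative values `k1=2`, `k2=6`, and `lam=0.5`, a token of age 4 receives weight `math.exp(-1.0)`, while a token of age 7 receives weight 0.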
The token-level GRPO loss is weighted as follows:
\[ L_t=\min\!\bigl(r_t A_t,\; \mathrm{clip}(r_t,1-\epsilon^{-},1+\epsilon^{+})A_t\bigr)\cdot w(d_t), \]
where
\[ r_t= \frac{\pi_{\theta}(y_t \mid x, y_{<t})} {\pi_{\theta_t^{\mathrm{beh}}}(y_t \mid x, y_{<t})} \]
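The per-token term can be sketched in a few lines. This is an illustration under assumptions: the name `weighted_token_term` is hypothetical, the default clipping values of 0.2 are placeholders rather than the values used in training, and the staleness weight is passed in precomputed. As written, \(L_t\) has the form of a clipped surrogate that is usually maximized, so a trainer that minimizes would negate it; that sign convention is also an assumption.

```python
import math


def weighted_token_term(
    logp_current: float,
    logp_behavior: float,
    advantage: float,
    weight: float,
    eps_low: float = 0.2,   # placeholder value for epsilon^-
    eps_high: float = 0.2,  # placeholder value for epsilon^+
) -> float:
    """Clipped, staleness-weighted term L_t for a single token."""
    # Importance ratio r_t between the current and the behavior policy,
    # computed from log-probabilities for numerical stability.
    ratio = math.exp(logp_current - logp_behavior)
    # clip(r_t, 1 - eps_low, 1 + eps_high)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the raw and clipped terms, then scaled
    # by the staleness weight w(d_t).
    return min(ratio * advantage, clipped * advantage) * weight
```

With a weight of 0 the token contributes nothing, which is how tokens older than \(K_2\) drop out of the update.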