Ornith 1.0: Self-Scaffolding Open-Source AI Coding Model


DeepReinforce just dropped something that changes the game for open-source AI coding. Ornith 1.0 isn't just another model release — it's a new paradigm for how AI agents learn to write code.

The headline: a fully open-source family of models (9B to 397B parameters, all MIT licensed) that learns through reinforcement learning to write its own agent scaffolds. The largest variant edges past Claude Opus 4.7 on SWE-Bench Verified (82.4 vs 80.8). The smallest, at 9B, outperforms Gemma 4-31B, a model more than three times its size, on all three benchmarks reported for it.

Let's break down what makes this release different.

What Is Ornith 1.0?

Ornith 1.0 is a family of self-improving open-source models purpose-built for agentic coding tasks, developed by DeepReinforce. It spans four sizes:

Ornith 1.0 9B Dense — Edge-deployable, runs on consumer hardware
Ornith 1.0 31B Dense — Balanced performance for workstation deployment
Ornith 1.0 35B MoE — Mixture-of-experts for efficient inference
Ornith 1.0 397B MoE — Frontier-scale, matching closed-source leaders

Built on pretrained Gemma 4 and Qwen 3.5 checkpoints, these models achieve state-of-the-art results among open-source models of comparable size across the major coding benchmarks.

The Core Innovation: Self-Scaffolding

Here's where it gets interesting. Every agentic coding system — whether it's Claude Code, Cursor, or an open-source agent — relies on a scaffold: the orchestration logic that structures how the model interacts with tools, manages context, retries on failure, and delivers a final solution.

Until now, scaffolds were hand-designed by humans. You write the harness, you define the tool-use protocol, you structure the error recovery. The model just fills in the code.

Ornith 1.0 flips this. Its training framework jointly optimizes the scaffold AND the solution. Each RL step works in two stages:

Propose a refined scaffold — conditioned on the task and the scaffold previously used for it
Generate a solution rollout — conditioned on that scaffold and the task description

Reward from the rollout propagates to both stages. The model isn't just learning to write better answers — it's learning to author the orchestration that elicits those answers.
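To make the data flow concrete, here is a minimal sketch of one such step. Everything in it is an assumption for illustration: the names `StepRecord`, `rl_step`, and `credit_assignment` are hypothetical, and the three stub functions stand in for the model and the verifier. It only shows what each stage is conditioned on and how a single reward is shared, not how DeepReinforce actually implements training.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class StepRecord:
    task: str
    previous_scaffold: str
    scaffold: str    # stage 1 output
    solution: str    # stage 2 output
    reward: float    # one scalar, shared by both stages


def rl_step(
    task: str,
    previous_scaffold: str,
    propose_scaffold: Callable[[str, str], str],
    generate_solution: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> StepRecord:
    # Stage 1: refine the scaffold, conditioned on the task and the
    # scaffold previously used for it.
    scaffold = propose_scaffold(task, previous_scaffold)
    # Stage 2: roll out a solution, conditioned on that scaffold and the task.
    solution = generate_solution(task, scaffold)
    reward = score(task, solution)
    return StepRecord(task, previous_scaffold, scaffold, solution, reward)


def credit_assignment(record: StepRecord) -> Dict[str, float]:
    # The same rollout reward is credited to both stages.
    return {"scaffold": record.reward, "solution": record.reward}


# Hypothetical stand-ins for the model and the verifier.
def propose_scaffold(task: str, previous: str) -> str:
    return previous + " -> run tests before submitting"


def generate_solution(task: str, scaffold: str) -> str:
    return f"patch for {task!r} using [{scaffold}]"


def score(task: str, solution: str) -> float:
    return 1.0 if "run tests" in solution else 0.0


record = rl_step("fix off-by-one", "read files",
                 propose_scaffold, generate_solution, score)
credit = credit_assignment(record)
```

In a real trainer the shared reward would drive a policy-gradient update over the tokens of both stages; the sketch stops at the point where that reward is handed out.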

Figure: Self-scaffolding training framework. Ornith's dual-stage RL loop jointly optimizes scaffold proposal and solution generation, creating a feedback loop in which the model continually improves its own orchestration strategy.

Sam Witteveen's deep dive on Ornith 1.0 puts it well — this isn't an incremental improvement. It's a structural shift from "train the solver" to "train the scaffold + solver together."

Benchmark Performance: Punching Well Above Weight

The numbers speak for themselves. Let's look at how Ornith stacks up against the competition.

Frontier Scale (397B MoE)

Benchmark	Ornith 1.0 397B	Claude Opus 4.7	DeepSeek-V4-Pro	MiniMax M3
Terminal-Bench 2.1 (Terminus-2)	77.5	70.3	67.9	66.0
SWE-Bench Verified	82.4	80.8	80.6	80.5
SWE-Bench Pro	62.2	64.3	55.4	59.0
SWE-Bench Multilingual	78.9	—	76.2	—
NL2Repo	48.2	—	—	42.1

Ornith 1.0 397B beats Claude Opus 4.7 on both Terminal-Bench 2.1 and SWE-Bench Verified, and leads DeepSeek-V4-Pro and MiniMax M3 on every benchmark where they report a score. The one loss is SWE-Bench Pro, where Claude Opus 4.7 stays ahead (64.3 vs 62.2).

Figure: 397B evaluation results. Ornith 1.0 397B vs. leading frontier models: it leads on every agentic coding benchmark shown except SWE-Bench Pro.

Mid-Scale (35B MoE)

Benchmark	Ornith 1.0 35B	Qwen 3.5-35B	Qwen 3.6-35B	Gemma 4-31B
Terminal-Bench 2.1	64.2	41.4	52.5	42.1
SWE-Bench Verified	75.6	70.0	73.4	52.0
SWE-Bench Pro	50.4	44.6	49.5	35.7
NL2Repo	34.6	20.5	29.4	15.5

The 35B variant doesn't just beat similarly sized models. It also surpasses Qwen 3.5's 397B model on Terminal-Bench 2.1 (64.2 vs 53.5). That's a roughly 11x parameter disadvantage overcome by smarter training.

Figure: 35B evaluation results.

Edge Scale (9B Dense)

Benchmark	Ornith 1.0 9B	Qwen 3.5-9B	Gemma 4-12B	Gemma 4-31B
Terminal-Bench 2.1	43.1	21.3	21.0	42.1
SWE-Bench Verified	69.4	53.2	44.2	52.0
SWE-Bench Pro	42.9	31.3	27.6	35.7

A 9B model beating a 31B model on SWE-Bench Verified (69.4 vs 52.0)? That's the power of self-scaffolding training. For teams that need local, private, offline code agents, this is a watershed moment.

Figure: 9B evaluation results.

How It Works: The Self-Improving Training Framework

The technical architecture is worth understanding because it hints at where the entire field is heading.

The Feedback Loop

Traditional RL for coding uses a fixed harness. You define how the model interacts with the terminal, how it reads files, how it runs tests — and the model optimizes its code output within those constraints. The harness never changes.

Ornith treats the harness as a learnable object. Over training iterations:

The model proposes a scaffold for a given task category
It generates a solution using that scaffold
Reward from the solution propagates back to update both the solution policy AND the scaffold policy
Better scaffolds lead to better solutions, which further refine scaffolds

This creates an autonomous capability flywheel — one that doesn't require human engineers to manually redesign the agent loop every time the model improves.
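A toy version of the flywheel fits in a few lines. The sketch below is not Ornith's training code: it assumes two candidate scaffolds and two solution strategies, a made-up reward table, and plain REINFORCE with a running baseline. The point is only to show one reward signal updating a scaffold policy and a scaffold-conditioned solution policy at the same time.

```python
import math
import random

# REWARD[scaffold][strategy]: made-up numbers for illustration only.
REWARD = [[0.1, 0.3], [0.2, 0.9]]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]


def train(steps=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    scaffold_logits = [0.0, 0.0]
    # One row of solution logits per scaffold: the solution policy is
    # conditioned on which scaffold was proposed.
    solution_logits = [[0.0, 0.0], [0.0, 0.0]]
    baseline = 0.0
    for _ in range(steps):
        p_scaffold = softmax(scaffold_logits)
        s = sample(p_scaffold, rng)            # propose a scaffold
        p_solution = softmax(solution_logits[s])
        a = sample(p_solution, rng)            # solve using that scaffold
        advantage = REWARD[s][a] - baseline
        # The same advantage updates BOTH policies (REINFORCE gradient
        # of a softmax: indicator minus probability).
        for i in range(2):
            scaffold_logits[i] += lr * advantage * ((i == s) - p_scaffold[i])
            solution_logits[s][i] += lr * advantage * ((i == a) - p_solution[i])
        baseline += 0.05 * (REWARD[s][a] - baseline)
    return softmax(scaffold_logits), [softmax(row) for row in solution_logits]


p_scaffold, p_solution = train()
```

With this reward table, only the combination of scaffold 1 and strategy 1 scores well, so neither policy can find it alone: the scaffold policy learns to prefer scaffold 1 because the solution policy learns to exploit it, and vice versa.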

Defending Against Reward Hacking

Giving the model control over its own scaffold introduces an obvious risk: reward hacking. What stops it from learning to cheat the benchmarks rather than actually solving coding problems?

DeepReinforce implements a three-layer defense:

Layer 1: Fixed Trust Boundary. The environment, tool surface, and test isolation are immutable and outside the model's reach. The model can only evolve its inner policy scaffold — memory, error-handling, orchestration logic.

Layer 2: Deterministic Monitoring. A monitor enforces the boundary, flagging attempts to read withheld paths, modify verification scripts, or invoke actions outside the sanctioned tool surface. Zero reward for violations.
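A deterministic monitor of this kind can be very small. The sketch below is a guess at the shape, not DeepReinforce's implementation: the tool names, the `/workspace/...` paths, and the `ToolCall` type are all hypothetical, and it treats any access to a withheld path as a violation.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional, Sequence

# Hypothetical tool surface and protected locations.
ALLOWED_TOOLS = frozenset({"read_file", "write_file", "run_tests"})
WITHHELD_DIRS = (PurePosixPath("/workspace/hidden_tests"),)
VERIFIER_FILES = frozenset({PurePosixPath("/workspace/verify.sh")})


@dataclass(frozen=True)
class ToolCall:
    tool: str
    path: str = ""


def violation(call: ToolCall) -> Optional[str]:
    """Return a reason string if the call crosses the boundary, else None."""
    if call.tool not in ALLOWED_TOOLS:
        return "tool outside sanctioned surface"
    # Normalise first so "../" tricks cannot dodge the path checks.
    path = PurePosixPath(posixpath.normpath(call.path))
    if any(path == d or d in path.parents for d in WITHHELD_DIRS):
        return "access to withheld path"
    if call.tool == "write_file" and path in VERIFIER_FILES:
        return "modification of verification script"
    return None


def monitored_reward(trajectory: Sequence[ToolCall], verifier_reward: float) -> float:
    # Any single violation zeroes the reward for the whole trajectory.
    if any(violation(call) is not None for call in trajectory):
        return 0.0
    return verifier_reward


clean = [ToolCall("read_file", "/workspace/src/app.py"), ToolCall("run_tests")]
sneaky = ToolCall("read_file", "/workspace/src/../hidden_tests/test_a.py")
```

Because the checks are plain set and path comparisons, the monitor gives the same verdict every time for the same trajectory, which is what makes it safe to use inside the reward.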

Layer 3: Frozen LLM Judge. Because intent-level gaming can happen within allowed tool surfaces, a frozen LLM acts as a veto on top of the verifier. If the judge detects gaming behavior even within valid tool usage, the trajectory gets penalized.
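The veto semantics are worth spelling out: the judge can only take reward away, never grant it. Here is a minimal sketch under assumptions. `final_reward` and `keyword_judge` are hypothetical names, the penalty value is made up, and the keyword check is a trivial stand-in for a call to a frozen LLM.

```python
from typing import Callable, Sequence


def final_reward(
    trajectory: Sequence[str],
    verifier_reward: float,
    judge_flags_gaming: Callable[[Sequence[str]], bool],
    penalty: float = -1.0,  # assumed value; the source only says "penalized"
) -> float:
    # The judge sits on top of the verifier as a veto: it can override a
    # passing score, but it cannot rescue a failing one.
    if judge_flags_gaming(trajectory):
        return penalty
    return verifier_reward


def keyword_judge(trajectory: Sequence[str]) -> bool:
    # Stand-in for a frozen LLM judge; a real one would read the whole
    # trajectory and decide whether the intent was to game the tests.
    return any("hardcode expected output" in step for step in trajectory)


honest = ["read failing test", "fix loop bound", "run tests"]
gamed = ["read failing test", "hardcode expected output", "run tests"]
```

Keeping the judge frozen matters for the same reason the monitor is deterministic: if the judge were trained alongside the policy, the policy could learn to steer it.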

This three-layer approach is a reference architecture for anyone building self-improving agent systems.

Asynchronous RL at Scale

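Training used a pipeline-RL strategy to handle the off-policy problem created by long agentic rollouts: in pipeline-style RL the policy typically keeps updating while a rollout is still in flight, so the early tokens of a trajectory can come from older weights than the ones being trained. A staleness weight downweights tokens generated by older policy versions and drops them entirely once a staleness threshold is exceeded. This lets training scale to the long-horizon trajectories that agentic coding requires.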
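The post doesn't give the exact form of the staleness weight, so the sketch below assumes one: exponential decay in the token's lag (how many policy updates happened between generating the token and training on it), with a hard cutoff. The function names, the decay rate, and the threshold are all illustrative.

```python
from typing import Sequence


def staleness_weight(lag: int, decay: float = 0.9, max_lag: int = 4) -> float:
    """Weight for a token generated `lag` policy versions ago.

    Assumed form: decay**lag up to max_lag, then zero (token dropped).
    """
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag > max_lag:
        return 0.0
    return decay ** lag


def weighted_token_loss(
    token_losses: Sequence[float],
    token_lags: Sequence[int],
    decay: float = 0.9,
    max_lag: int = 4,
) -> float:
    # Staleness-weighted mean of per-token losses; dropped tokens
    # contribute nothing to either the numerator or the denominator.
    weights = [staleness_weight(lag, decay, max_lag) for lag in token_lags]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * loss for w, loss in zip(weights, token_losses)) / total


# Two tokens from one rollout: one fresh, one far past the threshold.
loss = weighted_token_loss([1.0, 3.0], [0, 10])
```

In the usage line, the second token is older than the threshold, so only the fresh token contributes and `loss` equals that token's loss.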

Why This Matters for Enterprise AI

Ornith 1.0 isn't just a research milestone — it has immediate practical implications.

1. Open Weights Change the Risk Calculus

All Ornith 1.0 checkpoints carry the MIT license. GGUF versions run in Ollama and Unsloth with no gatekeeping. For regulated industries (finance, healthcare, defense), this means:

Code never has to leave your infrastructure
You can audit and modify the agent behavior
No dependency on API pricing or availability
Custom fine-tuning for proprietary codebases is possible

2. The Workflow, Not Just the Model, Determines Outcomes

Ornith 1.0 proves that scaffold design is now a competitive differentiator. Two teams using the same base model can get wildly different results depending on their orchestration logic. The model that can evolve its own orchestration will pull ahead.

3. Capability Is Flowing Downstream

The 9B model's performance is arguably the most important signal here. It means agentic coding capability — once the domain of massive data center deployments — is becoming accessible on laptops and edge devices. Private, offline, real-time code assistance is now feasible.

4. The Open-Source Gap Is Closing

Benchmark	Claude Opus 4.7	Ornith 1.0 397B	Gap (Ornith minus Claude)
SWE-Bench Verified	80.8	82.4	+1.6
Terminal-Bench 2.1	70.3	77.5	+7.2
SWE-Bench Pro	64.3	62.2	-2.1

On agentic coding benchmarks, the gap between best-in-class closed-source and open-source models has effectively closed: Ornith 1.0 leads on two of these three benchmarks and trails by 2.1 points on the third.
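The gap figures in the table above are simply the Ornith score minus the Claude score. A quick check, using only the numbers from the table (the `SCORES` and `gaps` names are ours):

```python
# (Claude Opus 4.7, Ornith 1.0 397B) scores copied from the table above.
SCORES = {
    "SWE-Bench Verified": (80.8, 82.4),
    "Terminal-Bench 2.1": (70.3, 77.5),
    "SWE-Bench Pro": (64.3, 62.2),
}


def gaps(scores):
    # Positive means Ornith is ahead; round to the table's one decimal place.
    return {
        name: round(ornith - claude, 1)
        for name, (claude, ornith) in scores.items()
    }


result = gaps(SCORES)
ornith_leads = sum(1 for gap in result.values() if gap > 0)
```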

The Bottom Line

Ornith 1.0 is the most important open-source agentic coding release of 2026 so far. It validates a thesis that many in the AI community suspected but no one had proven at scale: jointly optimizing the scaffold and the solver produces better results than optimizing either in isolation.

For CTOs and engineering leaders evaluating their AI strategy, the implications are clear:

You can now run production-grade agentic coding entirely on your own infrastructure with open weights
The competitive advantage shifts from model access to orchestration design and custom tooling
Self-improving agents that evolve their own workflows are no longer theoretical — they're shipping now

At aratech, we're tracking this space closely. If you're evaluating how self-scaffolding models fit into your AI architecture or want to benchmark Ornith 1.0 against your private codebase, get in touch.

Watch Sam Witteveen's full breakdown of Ornith 1.0 on YouTube for a hands-on walkthrough of the models and their capabilities.