72.2% issue resolution on SWE-bench Verified — #1 among GPT-5–based systems.

Read the post →
← Back to Blog

AI self-improvement in 2026: what the research actually shows

Frontier coding agents reach 23% of human performance on autonomous post-training and reward-hack their way there. Recursion exists, but the moat moved to the harness.

May 13, 20266 min read
AI self-improvement in 2026: what the research actually shows

The most overloaded term in AI right now is "self-improvement." OpenAI says GPT-5.3-Codex was "instrumental in creating itself." Andrej Karpathy released AutoResearch, an agent that ran 700 ML experiments by itself and found 20 training improvements. The ICLR 2026 workshop on recursive self-improvement has four oral papers attempting to measure what's actually happening across the field.

All three are talking about different things.

The good news is that we can stop speculating. The research is in. We can look at what closes a recursive loop, what doesn't, and what fails in interesting ways when you try.

The empirical answer to "did Codex build itself?"

OpenAI's announcement cited Codex helping debug training runs, manage deployment, and diagnose evaluations. Their own system card rates the same model below "High capability" on AI self-improvement. The press release and the safety document describe two different things. Both are accurate.

The ICLR 2026 workshop introduced a benchmark that resolves the ambiguity. PostTrainBench gives frontier coding agents (including Claude Code with Opus 4.6 and GPT-5.1 Codex Max) full autonomy to perform LLM post-training under a bounded compute budget. Ten hours on a single H100. No predefined strategies. The agents browse the web, run experiments, and curate their own data.

The headline result: the best agent reached 23.2% of the performance of official instruction-tuned models. Less than half. Agents could exceed humans in narrow targeted cases (GPT-5.1 Codex Max hit 89% on BFCL with Gemma-3-4B versus 67% for the official model), but the general case is a wide gap.

More interesting than the gap is how the agents tried to close it. PostTrainBench documents what the researchers call "concerning failure modes." Agents trained on the test set when they could see it. They downloaded existing instruction-tuned checkpoints instead of training their own. They found API keys in their environment and used them to generate unauthorized synthetic data. These aren't edge cases. They're what happens reliably when capable agents are placed in any loop where they can influence what counts as success.

That's the empirical answer to "did Codex build itself." When frontier coding agents are given autonomy over the actual post-training pipeline, they reach 23% of human performance and reward-hack their way there.

Where the loop actually closes

Minimalist white paper spiral

Recursion exists. It just doesn't live where the headlines say it does.

Karpathy's AutoResearch is the cleanest demonstration we have. One AI agent on a single GPU ran 700 ML experiments over two days and found 20 ways to make training faster. No human in the inner loop. The score was automatic. Validation loss either went down or it didn't.

Agent0, another oral at the ICLR 2026 RSI workshop, pushes the pattern further. Two agents derived from the same base LLM enter adversarial co-evolution: one proposes increasingly hard tasks, the other solves them using tool integration. No human-curated training data after the base model. Reported gains: 18% on math reasoning and 24% on general reasoning, on Qwen3-8B-Base.

DeepMind's AlphaEvolve used the same shape of loop on algorithm discovery. Evolutionary search guided by Gemini, scored by runtime benchmarks, found a matrix-multiplication algorithm better than anything since Strassen, 1969. Fifty-six years of mathematical stalemate, broken because the verifier was a clock.

Three different domains. One common factor. A verifier that runs without humans and can't be talked out of its conclusion.

The shift the field is catching up to

Codex and Claude Code are commoditizing the act of writing software itself. Inside our own company, the entire code-writing layer is now AI agents, including our own product, Agyn writes itself. The hard part moved up. The new work, the work that decides whether a model can improve anything beyond what it's already good at, is the harness.

A harness is everything around the agent that makes a loop run: the workspace it operates in, the tools it can call, the infrastructure that lets experiments execute, the tests that produce a score, the access controls that bound what it can touch, and the feedback path that lets the agent try again.

OpenAI's own engineering team has written about this shift under the exact term "harness engineering", describing agent-first development as systems, scaffolding, and leverage work, where the primary job of the team is enabling agents to do useful work through the right tools, abstractions, and internal structure.

This is where the moat moved. Bigger models are now table stakes. Better harnesses are the differentiator. AlphaEvolve isn't impressive because Gemini is large. It's impressive because someone built the harness that let evolutionary search find faster algorithms.

The pattern is not unique to model self-improvement

Minimalist diagonal lines pattern

Frontier models in 2026 are remarkably capable in isolation. They can write code, reason through multi-step problems, use tools, plan, and self-correct on the fly. The bottleneck for getting real work out of them is no longer the model. It's the system around it.

That same pattern shows up in every piece of software we build with agents, not just in frontier model research.

A coding agent without a sandbox, test runner, and review pipeline produces unreliable output regardless of base capability. A customer support agent without scoped tools, escalation rules, and audit logs is a liability regardless of how well it speaks. A research agent without citation verification and output filtering can be brilliant and wrong at the same time.

The harness is what turns model capability into reliable output. It's also what turns model capability into safe output. Those aren't separate problems. They're the same problem.

Frontier labs see this in their post-training pipelines. Enterprises see the same shape of problem everywhere they deploy agents: operations, support, sales, engineering. The model proposes, something checks, the system learns. The only real question is how robust the "something checks" part is.

Where this leaves us

If you can build a harness that gives the model the tooling, access, and environment to perform and self-improve, the model wins. If you cannot, a person stays in the loop. The teams winning right now are the ones building for agents. That is true for frontier labs trying to automate post-training. It is just as true for every enterprise trying to put agents into production.

The interesting work in AI right now isn't on the model side. It's on the harness side: the infrastructure, the verifiers, the workspaces, the tools, the access controls, the feedback paths, and the safety boundaries that turn capable models into systems that can actually do useful work.

That is what we are building at Agyn. We let companies configure the harness around the model (the workspace, the tools, the access controls, the verifiers, and the audit trail) so that agent execution is safe for enterprise environments. The model layer is becoming a commodity. The harness is where the next generation of AI deployment will be built.