
We tested how an AI team improves issue resolution on SWE-bench Verified

We evaluated a team-based approach on SWE-bench Verified, showing top performance among systems using GPT-5–class models.

Feb 12, 2026 · 5 min read

Software engineering is a collaborative process. Work is split across roles, coordination happens through shared artifacts, and progress emerges through iteration and review. This led us to a simple question:

If we allow a team of AI agents to communicate, split responsibilities, and follow an explicit methodology, does issue resolution improve?

We tested this on SWE-bench Verified using our production system. The same prompts, role definitions, tools, and execution model we use in real workloads were applied directly, without benchmark-specific tuning or enhancements.

In a fully automated setting, the system resolved 72.2% of tasks, placing it among the top-performing systems evaluated with GPT-5–class models.

The full technical report is available in the arXiv paper.

Design constraints for autonomous execution

  • Isolated execution environments — Each agent runs in its own sandbox with shell access. Agents can modify files, install dependencies, and run tests, builds, and diagnostics independently. There is no shared filesystem, which prevents interference, lets each agent configure only the dependencies it needs, and makes failures easier to attribute.

  • Explicit role specification and enforcement — For every role, we define its LLM, reasoning level, tools, skills, and responsibilities. This keeps instructions from bleeding across roles and avoids collapsing everything into a single agent. In practice, it improves debuggability, lets agents gate one another’s execution (for example, the reviewer over the engineer), and enables flexible allocation of model capacity rather than using high-reasoning models everywhere. A configuration sketch follows this list.

  • Structured communication pattern and methodology — Agents follow a structured process for analysis, implementation, review, and iteration. Coordination is controlled by a manager agent, which allows the system to adapt dynamically rather than follow a fixed pipeline of actions.

  • Context optimization for long-running tasks — Accumulated context is summarized automatically. Large files and artifacts are persisted to the filesystem instead of being fed into the model context, and task completion is determined by system-level criteria such as green CI and reviewer approval. A sketch of this pattern also follows the list.
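
To make the role-specification constraint concrete, here is a minimal sketch of what such a declaration can look like. The schema, field names, tool identifiers, and sandbox image are illustrative assumptions rather than agyn's actual configuration format; the point is that a role's model, reasoning level, tools, responsibilities, and sandbox are declared explicitly instead of being implied by a shared prompt.

```python
from dataclasses import dataclass

# Hypothetical sketch of an explicit role specification. The schema, field
# names, tool identifiers, and sandbox image are illustrative assumptions,
# not agyn's actual configuration format.

@dataclass
class RoleSpec:
    name: str                 # e.g. "engineer"
    model: str                # which LLM backs this role
    reasoning: str            # reasoning effort requested from that model
    tools: tuple[str, ...]    # tools the role is allowed to call
    responsibilities: str     # prompt fragment defining the role's scope
    sandbox_image: str        # isolated per-agent execution environment

engineer = RoleSpec(
    name="engineer",
    model="gpt-5-codex",
    reasoning="medium",
    tools=("shell", "edit_file", "run_tests"),
    responsibilities="Implement changes and debug failures; do not review your own work.",
    sandbox_image="python:3.12",  # own sandbox, no filesystem shared with other agents
)
```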
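The context-management constraint can be sketched in a similar spirit: large outputs are written to disk and only a pointer stays in context, and the accumulated history is compacted once it exceeds a budget. The thresholds, paths, and summarize() stand-in below are assumptions for illustration, not the production implementation.

```python
from pathlib import Path

# Minimal sketch of the context-management pattern described above. The
# thresholds, artifact directory, and summarize() placeholder are assumptions
# for illustration, not the production implementation.

MAX_CONTEXT_CHARS = 40_000   # rough budget before the history gets compacted
LARGE_ITEM_CHARS = 4_000     # outputs above this size go to disk, not context
ARTIFACT_DIR = Path("artifacts")

def summarize(text: str) -> str:
    """Stand-in for an LLM summarization call."""
    return text[:2_000] + "\n[... earlier history summarized ...]"

def add_to_context(context: list[str], item: str) -> None:
    """Persist large outputs to disk and keep the running context compact."""
    if len(item) > LARGE_ITEM_CHARS:
        ARTIFACT_DIR.mkdir(exist_ok=True)
        path = ARTIFACT_DIR / f"artifact_{len(list(ARTIFACT_DIR.iterdir()))}.txt"
        path.write_text(item)
        item = f"[large output persisted to {path}]"   # only a pointer stays in context
    context.append(item)
    if sum(len(c) for c in context) > MAX_CONTEXT_CHARS:
        context[:] = [summarize("\n".join(context))]   # compact accumulated history
```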

Together, these constraints make it possible to run the system end-to-end in a fully automated setting without manual intervention.

About SWE-bench Verified

SWE-bench Verified is a curated subset of the SWE-bench benchmark designed to evaluate end-to-end issue resolution in real-world GitHub repositories. Each task requires understanding an issue, modifying the codebase, and producing a pull request that passes the project’s test suite. The benchmark is widely used to assess autonomous software engineering systems under realistic constraints. A detailed description of the benchmark and its verification process is available in OpenAI’s overview.

Evaluation setup

We evaluated a multi-agent system built on agyn. Once a task starts, the system runs end-to-end without human intervention. The same prompts, role definitions, tools, and execution model used in production were applied directly, without benchmark-specific tuning or enhancements.

Each task is handled by a team of agents with fixed, non-overlapping responsibilities:

  • Manager coordinates execution, communication, and termination
  • Researcher explores the repository and the problem, and gathers relevant context to define a specification
  • Engineer implements changes and debugs failures
  • Reviewer evaluates the pull request and enforces acceptance criteria

Different roles use different models, reflecting production constraints:

  • Manager, Researcher use GPT-5 (medium reasoning)
  • Engineer, Reviewer use GPT-5-Codex (medium reasoning)
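
Written as configuration, this assignment is just a small mapping from role to model and reasoning level; the dictionary below is an illustrative sketch, not agyn's actual config format.

```python
# Role-to-model assignment used in this evaluation, written as a plain mapping.
# The dictionary form and identifiers are an illustrative sketch.
ROLE_MODELS = {
    "manager":    ("gpt-5",       "medium"),
    "researcher": ("gpt-5",       "medium"),
    "engineer":   ("gpt-5-codex", "medium"),
    "reviewer":   ("gpt-5-codex", "medium"),
}
```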

There is no predefined number of steps. Coordination is controlled by the manager agent, allowing the system to adapt dynamically rather than follow a fixed pipeline, and execution continues until system-level completion criteria are met.
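
As a rough illustration of what this means in practice, the sketch below shows a manager-driven loop that keeps dispatching work to the researcher, engineer, and reviewer until the tests pass and the reviewer approves, or an attempt budget runs out. The agent interface (a run(prompt) method), the prompts, and the budget are hypothetical simplifications, not the production orchestration code.

```python
def resolve_issue(team: dict, issue: str, max_rounds: int = 10) -> str:
    """Sketch of manager-driven coordination. `team` maps role names to agent
    objects exposing a hypothetical run(prompt) -> str method."""
    spec = team["researcher"].run(f"Analyze the issue and write a specification:\n{issue}")
    feedback = ""
    for _ in range(max_rounds):
        prompt = spec if not feedback else f"{spec}\n\nReviewer feedback:\n{feedback}"
        team["engineer"].run(f"Implement and push changes for:\n{prompt}")
        ci_green = team["engineer"].run("Run the test suite; reply PASS or FAIL") == "PASS"
        review = team["reviewer"].run("Review the open pull request; reply APPROVE or give feedback")
        if ci_green and review == "APPROVE":              # system-level completion criteria
            return team["manager"].run("Finalize and report the resolved task")
        feedback = review                                 # iterate on reviewer comments
    return team["manager"].run("Attempt budget exhausted; report current status")
```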

For evaluation, all SWE-bench Verified repositories were forked into the agyn-sandbox GitHub organization. Each task was resolved via a pull request, and all agent interactions occurred through standard GitHub artifacts, including commits, PR descriptions, and review comments. These forks and pull requests are published as part of our contribution, allowing external researchers to inspect how agents communicate during issue resolution.

As an example, for the task sphinx-doc__sphinx-9320, the system produced the following pull request: agyn-sandbox/sphinx#81

The PR shows the full end-to-end process, including code changes, test execution, and review-driven iteration, with all coordination happening through standard GitHub workflows.

Results on SWE-bench Verified

System          Model(s)                                  Resolved (%)
agyn            GPT-5 / GPT-5-Codex (medium reasoning)    72.2
OpenHands       GPT-5 (high reasoning)                    71.8
mini-SWE-agent  GPT-5.2 (high reasoning)                  71.8
mini-SWE-agent  GPT-5 (medium reasoning)                  65.0

These results indicate that structuring autonomous software engineering as a coordinated team process can measurably improve issue resolution. The system outperformed the single-agent baseline by 7.2 percentage points (72.2% vs. 65.0%) without benchmark-specific tuning and within the same model class. It also outperformed mini-SWE-agent running the higher-reasoning GPT-5.2 configuration and OpenHands running GPT-5 at a high reasoning level, even though our setup used medium-reasoning configurations throughout.

Takeaway

The SWE-bench Verified results indicate that issue resolution improves when autonomous software engineering is structured as a coordinated team.

The gains come from a small set of design patterns: isolated execution environments, explicit role definitions and agent configuration, structured communication, and context management for long-running tasks.

Beyond aggregate performance, this work shows that team-based coordination among specialized agents is a practical and effective paradigm for autonomous software engineering. Modeling development as a collaborative process with distinct roles, responsibilities, and execution environments enables agents to iterate, review, and converge more reliably than monolithic or pipeline-based approaches.

From this perspective, progress in agent infrastructure and organizational design is as important as improvements in underlying models.

References