Benchmarks, honestly

AI Coding Agent Benchmarks

Benchmarks are a real signal of capability - and easy to misuse. This page explains how we measure autonomous coding agents, what the standard benchmarks do and do not capture, and how CodeCourier reports results: only numbers we can reproduce and stand behind.

Methodology first, no vanity scores · Reproducible or not published

How we benchmark

What we measure, and what it means

We evaluate against two public, respected benchmarks. Each tests something real about an agent's ability to do engineering work, and each has limits worth understanding before reading any score.

SWE-bench Verified

A human-validated set of real GitHub issues from open-source projects. An agent is given the issue and the repository and must produce a patch; the task counts as solved only if the change makes the project's own tests pass. It measures whether an agent can fix real bugs in real code, not whether it can ace a quiz - the closest thing the field has to a meaningful exam for an AI software engineer.
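The "solved only if the tests pass" rule can be sketched in a few lines. This is a minimal illustration, not the official SWE-bench harness: the `TaskOutcome` shape and the function names are hypothetical. It assumes each task records two groups of tests, those the fix is meant to turn green and those that must stay green.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    """Test results for one task after the agent's patch is applied (hypothetical shape)."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = passes now
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(outcome: TaskOutcome) -> bool:
    # A task with no target tests proves nothing, so it never counts as solved.
    if not outcome.fail_to_pass:
        return False
    # Solved only if every target test now passes and no existing test broke.
    return all(outcome.fail_to_pass.values()) and all(outcome.pass_to_pass.values())


def resolved_rate(outcomes: list[TaskOutcome]) -> float:
    # Fraction of tasks resolved; 0.0 for an empty run.
    if not outcomes:
        return 0.0
    return sum(is_resolved(o) for o in outcomes) / len(outcomes)
```

A patch that fixes the reported bug but breaks an existing test is not resolved under this rule, which is what separates this kind of benchmark from a quiz-style score.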

Terminal-Bench

A benchmark for agents that work through a terminal - running commands, inspecting output, and iterating toward a goal in a real shell environment. It complements SWE-bench by testing operational, tool-using competence rather than only patch generation, which matters for an agent that has to set up, build, and verify its own work.

Reproducible or it does not ship

Any figure we publish comes with the exact methodology, the date, and the scaffold used to produce it. If a result is not reproducible, it is not on this page.
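One way to enforce that rule mechanically is to refuse any result whose methodology record is incomplete. The sketch below is illustrative only: the `Methodology` record and its field names are assumptions based on the details this page says accompany every figure, not CodeCourier's actual schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Methodology:
    """What must accompany a published figure (hypothetical field names)."""
    benchmark: str   # e.g. "SWE-bench Verified"
    subset: str      # which slice of the benchmark was run
    run_date: str    # ISO 8601 date of the run
    model: str       # underlying model identifier
    scaffold: str    # agent scaffold and version


def missing_fields(m: Methodology) -> list[str]:
    # Names of fields left empty or whitespace-only, in declaration order.
    return [f.name for f in fields(m) if not getattr(m, f.name).strip()]


def is_publishable(m: Methodology) -> bool:
    # A figure ships only when every methodology field is filled in.
    return not missing_fields(m)
```

A completeness check like this cannot prove a run is reproducible, but it does block the most common failure: a number with no date or scaffold attached.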

The scaffold matters as much as the model

A score reflects a whole system - how the agent gathers context, plans, and iterates - not just the underlying model. Two products on the same model can score very differently, so a number describes a system, not a model.

Like-for-like, or not at all

Pass@1 versus multiple attempts, which subset, which date, what tooling - all change the number. We compare like with like and label the conditions, or we do not compare.
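The gap between pass@1 and multiple attempts can be made concrete with the widely used unbiased pass@k estimator. This is a sketch under stated assumptions, not CodeCourier's evaluation code: `n` is the number of attempts sampled for a task, `c` the number that succeeded, and `k` the attempt budget being reported.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds,
    given that c of n sampled attempts succeeded."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

As an illustrative calculation, not a measured result: for a task where 3 of 10 sampled attempts succeed, pass@1 is 0.30 while pass@5 is about 0.92. The same system produces both numbers, which is why the attempt budget has to be labeled before two scores are compared.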

CodeCourier's results

What we report, and when

We are deliberate about benchmarks. Rather than lead with a headline percentage, we publish verified, reproducible results here as our runs complete, each with its full methodology, date, and scaffold so anyone can check it. This section lists the categories we report against; a number is added only once it has been measured and can be independently verified.

We have chosen not to print a benchmark score we cannot yet reproduce and independently verify. Publishing an unverified number would be marketing, not measurement. The table below shows the categories we will report: every status reads "Pending independent verification", and nothing here is a measured result.

Benchmark | What it measures | Status
SWE-bench Verified | Resolving real GitHub issues so the project's own tests pass | Pending independent verification
Terminal-Bench | Operational, tool-using competence in a real shell environment | Pending independent verification
Real-codebase reliability | Closing tickets safely on private, messy repositories - the benchmark that matters most to you | Pending independent verification

This table is methodology, not measured results. Rows are the categories we report against; no row contains a score. Verified figures are added here as runs complete and can be independently reproduced.

The broader landscape

Where the field stands today

Capabilities move weekly and scores go stale fast, so we do not freeze competitor percentages into this page. For current standings, go to the source: the official leaderboard and each vendor's own dated, published figures.

The official SWE-bench leaderboard

The canonical public ranking of agents on SWE-bench and SWE-bench Verified, maintained by the benchmark's authors. It is the right place to see live, current standings rather than a number copied into an article that will be wrong next month.

Each vendor's own published results

For any specific agent - Devin, Claude Code, OpenAI Codex, Cursor, GitHub Copilot, OpenHands and others - check that vendor's own site for their latest, dated figures and methodology. We name competitors only to compare fairly, and we do not quote percentages we cannot verify.

As of June 2026. Treat any benchmark figure - here or anywhere - as a dated snapshot, and check the live leaderboard and each vendor's site for current numbers.

FAQ

Questions about agent benchmarks

Why does this page not show a CodeCourier benchmark score?
Because we will not publish a number we cannot yet reproduce and independently verify. A benchmark figure without disclosed methodology, a date, and a fixed scaffold is marketing, not measurement. We publish verified, reproducible results here as our runs complete - each with the full methodology so anyone can check it. Reliability and safety on your real codebase matter more to us than a leaderboard percentage.

What is SWE-bench Verified and why use it?
SWE-bench Verified is a human-validated subset of SWE-bench where engineers confirmed each task is well-specified and solvable. It exists because the original full set contained ambiguous or impossible tasks that made raw scores misleading. As of June 2026 it is the figure most vendors report, because it is the cleaner, fairer measure. Read our full explainer on the What Is SWE-bench page.

Why do you not quote competitors' SWE-bench scores?
Because they move fast and any specific percentage in an article goes stale quickly. The honest approach is to point you to each vendor's own current, dated figures and to the official leaderboard rather than fossilize a number that will be wrong next month. We name competitors only to compare fairly, never to imply endorsement.

Does a high benchmark score mean an agent is production-ready?
No. A high score is necessary but not sufficient. SWE-bench says nothing about isolation, failing safely, auditability, or fit with your team's workflow. A great score with poor isolation and no audit trail is not a production-ready agent. Treat the number as one input and weigh it against how the agent behaves on your own messy repository.

How will CodeCourier's published results be reproducible?
Every figure we publish will state the exact benchmark and subset, the date, the model, and the scaffold used to produce it, so an independent reader can rerun it under the same conditions. If a result cannot be reproduced that way, we do not publish it.
