SWE-bench is a benchmark that tests whether AI systems can resolve real software engineering issues. It takes 2,294 actual GitHub issues and their corresponding fixes from 12 open-source Python projects, hands the model the issue and the repository, and checks whether the model's proposed patch makes the project's tests pass. In short: it measures whether an AI can fix real bugs in real code, not whether it can ace a quiz. That makes it the closest thing the field has to a meaningful exam for an AI software engineer.
This guide explains what SWE-bench measures, what SWE-bench Verified is and why it exists, how to read the scores without being misled, and - just as importantly - what the benchmark does not capture. It is a foundational entry in our glossary and the conceptual basis for a future benchmarks page.
What SWE-bench actually measures
Older AI coding benchmarks tested isolated functions: "write a function that does X." Real engineering is not like that. Real engineering is "here is a bug report and a large existing codebase - go figure out where the problem is and fix it without breaking anything else."
SWE-bench was built to test exactly that harder thing. Each task gives the model:
- A real GitHub issue describing a problem.
- The full repository at the commit before the fix.
- The project's existing test suite, as it stood at that commit.
The model has to produce a patch. The patch is then evaluated by running the tests: a task counts as solved only if the tests written alongside the original fix (which the model never sees) now pass, and the tests that passed before still pass. This is why SWE-bench is respected - success is verified by actual test execution, not by a human guessing whether the answer looks plausible. It is the benchmark analog of the real issue-to-PR loop: read the issue, change the code, prove it with tests.
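The pass/fail rule above can be sketched in a few lines. SWE-bench's task data separates the tests the reference fix should turn from failing to passing (FAIL_TO_PASS) from the tests that should keep passing (PASS_TO_PASS). The class and function names below are hypothetical, and this is a minimal sketch of the rule under those assumptions, not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Test outcomes for one task after the model's patch is applied.

    Hypothetical structure for illustration: test name -> did it pass.
    """
    fail_to_pass: dict[str, bool]  # tests the reference fix turns green
    pass_to_pass: dict[str, bool]  # tests that already passed before the fix


def is_resolved(result: TaskResult) -> bool:
    """Solved only if the target tests pass and nothing else broke."""
    # Require at least one target test, so an empty group cannot count as a fix.
    fixed = bool(result.fail_to_pass) and all(result.fail_to_pass.values())
    nothing_broken = all(result.pass_to_pass.values())
    return fixed and nothing_broken


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved (the headline percentage, divided by 100)."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)
```

Note that a patch which fixes the reported bug but breaks one previously passing test scores exactly the same as no patch at all: the rule is all-or-nothing per task.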
SWE-bench Verified: the cleaner subset
The original SWE-bench had a problem. Some of its tasks were ambiguous, under-specified, or effectively impossible - the test would never pass no matter how good the fix was, or the issue did not contain enough information to solve it. That made raw scores misleading, because a model could be penalized for tasks no one could solve.
SWE-bench Verified was created to fix this. It is a curated, human-validated subset of 500 tasks, where engineers confirmed each problem is well-specified and genuinely solvable. As of June 2026, SWE-bench Verified is the figure most vendors and researchers report, precisely because it is the fairer, cleaner measure. When you see a "SWE-bench" percentage cited today, it almost always means SWE-bench Verified - and if it does not say which, that is a reason to ask.
There are also related and expanded benchmarks in the ecosystem (multimodal and multilingual variants, and other agent benchmarks like Terminal-Bench), reflecting how fast this evaluation space is evolving. The principle is the same: test agents on realistic, verifiable tasks.
How to read SWE-bench scores without being misled
This is the part that matters most, because benchmark numbers are easy to misuse.
- Scores go stale fast. Capabilities have risen sharply and keep rising. Any specific percentage in any article - including this one - is a snapshot. Treat published numbers as directional and check the vendor's own site for current figures.
- The scaffold matters as much as the model. A score depends not just on the underlying model but on the agent "scaffold" around it - how it gathers context, plans, and iterates. Two products using the same model can score very differently. So a SWE-bench number describes a system, not just a model.
- Conditions vary. Pass@1 versus multiple attempts, which subset, which date, what tooling - all change the number. Compare like with like, or do not compare.
- A high score is necessary, not sufficient. It tells you an agent can do real fixes. It does not tell you it will be reliable, safe, or auditable on your codebase.
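To make the "Pass@1 versus multiple attempts" point concrete, here is a sketch of the pass@k estimator commonly used in code-generation evaluation: given n recorded attempts on a task, of which c succeeded, it estimates the chance that at least one of k sampled attempts succeeds. The function name is our own, and whether any particular vendor computes its figure this way is an assumption you would need to check against their methodology.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled attempts succeeds),
    given that c of n recorded attempts on the task succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 2 successes in 10 attempts, `pass_at_k(10, 2, 1)` is 0.2 while `pass_at_k(10, 2, 5)` is about 0.78. Same agent, same task, very different headline, which is why the attempt budget has to be stated alongside the score.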
Because of all this, we deliberately do not quote specific competitor SWE-bench percentages in our content. The honest approach is to point you to each vendor's current, dated figures rather than fossilize a number that will be wrong next month. That is the same honesty principle we apply across our 15 best AI coding agents ranking.
What SWE-bench does not measure
A benchmark is a flashlight, not a floodlight. SWE-bench Verified illuminates one important thing - can the agent fix real issues - and leaves a lot in the dark. The things it does not capture are often what determine whether an agent is usable in production:
- Isolation and safety. Does the agent run in a disposable code sandbox, or against your live systems? SWE-bench says nothing about this.
- Failing safely. When the agent cannot solve a task, does it escalate honestly or confidently merge something wrong? This reliability property is invisible to a pass/fail score but critical in production.
- Auditability. Can a human see what the agent did and why? A leaderboard does not care; a team does.
- Workflow fit. Issue-driven intake, review gates, analytics, integration with your tracker and CI - none of this is benchmarked, yet all of it determines real-world value.
- Your codebase. SWE-bench draws on a fixed set of open-source Python projects. Your messy, private, idiosyncratic repository is the only benchmark that truly matters for you.
This is why a great SWE-bench score with poor isolation and no audit trail is not a production-ready agent. The score is one input; reliability, safety, and fit are the rest. CodeCourier's design leans hard on the parts benchmarks miss: every run is isolated in a sandbox, the agent fails safely and escalates, and the work is auditable through Issue Sessions and analytics.
How CodeCourier thinks about benchmarks
Our position is simple and, we think, the honest one. Benchmarks like SWE-bench Verified are a useful, real signal of capability, and we take them seriously as one input. But we will not turn them into a marketing trophy. Where we report figures, we will state the exact methodology, the date, and the scaffold so they are reproducible - and we will not publish numbers we have not measured ourselves or cannot stand behind. A dedicated, transparently-reported benchmarks page is on our roadmap, built on that principle.
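As an illustration of what "state the exact methodology, the date, and the scaffold" could look like in practice, here is a sketch of a record that keeps a score attached to the conditions it was measured under. Every name and field here is hypothetical, chosen for illustration; it is not a description of an existing CodeCourier format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkReport:
    """One reported score plus the conditions it was measured under."""
    benchmark: str      # which benchmark or subset, e.g. "SWE-bench Verified"
    model: str          # underlying model identifier
    scaffold: str       # the agent harness around the model
    attempts: int       # attempts allowed per task; 1 means pass@1
    measured_on: date   # when the run was made
    resolved: int       # tasks solved
    total: int          # tasks attempted

    def __post_init__(self) -> None:
        if self.total <= 0 or not 0 <= self.resolved <= self.total:
            raise ValueError("need total > 0 and 0 <= resolved <= total")

    @property
    def score(self) -> float:
        return self.resolved / self.total

    def comparable_to(self, other: "BenchmarkReport") -> bool:
        """Like-for-like only if subset, task count, and attempt budget match."""
        return (self.benchmark, self.total, self.attempts) == (
            other.benchmark, other.total, other.attempts)
```

The design choice is that the score is a derived property rather than a stored number, so it cannot be quoted without the subset, attempt budget, scaffold, and date travelling with it.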
The deeper point: what should matter to you is not where an agent sits on a leaderboard, but whether it reliably and safely closes the tickets you give it on your codebase. That is the bar we hold ourselves to.
To go further, see how the loop works in What Is an AI Software Engineer, the safety layer in What Is a Code Sandbox, and the full landscape in our 15 best AI coding agents ranking. To compare options, visit the comparison hub; when you are ready, see pricing.
FAQ: what is SWE-bench
What is SWE-bench?
SWE-bench is a benchmark that tests whether AI systems can resolve real software engineering issues. It draws 2,294 actual GitHub issues and their corresponding fixes from 12 open-source Python projects, gives the model the issue and the repository, and checks whether the model's patch makes the project's tests pass. It measures real bug-fixing ability, not multiple-choice trivia.
What is SWE-bench Verified?
SWE-bench Verified is a curated, human-validated subset of SWE-bench (500 tasks) where engineers confirmed that each problem is well-specified and solvable. It was created because the original full set contained some ambiguous or impossible tasks that made scores misleading. As of June 2026, SWE-bench Verified is the figure most vendors report, because it is the cleaner, fairer measure.
What is a good SWE-bench score in 2026?
Scores have risen quickly and vary by model, scaffold, and date, so any specific number goes stale fast. The honest answer is to treat published percentages as directional and check each vendor's own site for current figures rather than trusting a number in an article. More important than the headline percentage is how the score was produced and whether it reflects the agent's real reliability.
Are SWE-bench scores reliable for choosing an AI coding agent?
Partially. A higher SWE-bench Verified score is a positive signal that an agent can do real fixes, but the benchmark says nothing about isolation, security, auditability, team workflow, or reliability on your own codebase. Use it as one input among several, and weigh it against how the agent behaves on your actual messy repository. See our 15 best AI coding agents ranking for the fuller criteria.
What does SWE-bench not measure?
A lot. It does not measure whether the agent runs in an isolated sandbox, whether it fails safely when it cannot solve a task, whether its work is auditable, how it handles ambiguous tickets, or how it fits a team's workflow. It also covers only Python and only one task shape: resolving a reported issue in an existing repository. A great benchmark score with poor isolation and no audit trail is not a production-ready agent.
Does CodeCourier publish SWE-bench scores?
We treat benchmarks as one honest input, not a marketing trophy. Where we report figures we will state the exact methodology, the date, and the scaffold so they are reproducible, and we will not publish numbers we cannot stand behind or that we have not measured ourselves. Our position is that reliability and safety on your real codebase matter more than a leaderboard percentage. A dedicated benchmarks page is on our roadmap.