SWE-bench Verified
A human-validated subset of SWE-bench, built from real GitHub issues in open-source projects. An agent is given the issue and the repository and must produce a patch; the task counts as solved only if the patched code passes the project's own tests. It measures whether an agent can fix real bugs in real code, not whether it can ace a quiz, which makes it the closest thing the field has to a meaningful exam for an AI software engineer.
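
The pass/fail rule can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the names `is_resolved` and `resolve_rate` and the shape of the results dictionary are hypothetical, and the split into tests the fix should turn green and tests that must stay green is an assumption about how "passes the project's own tests" is checked.

```python
from typing import Mapping, Sequence


def is_resolved(
    results: Mapping[str, bool],
    must_start_passing: Sequence[str],
    must_keep_passing: Sequence[str],
) -> bool:
    """Return True only if every required test passed after the patch.

    results maps a test name to True (passed) or False (failed).
    A test missing from results is treated as a failure.
    """
    required = list(must_start_passing) + list(must_keep_passing)
    return all(results.get(name, False) for name in required)


def resolve_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks solved; 0.0 for an empty run."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


# Hypothetical task: one test reproduces the bug, one guards old behavior.
fixed = {"test_bug": True, "test_existing": True}
regressed = {"test_bug": True, "test_existing": False}

print(is_resolved(fixed, ["test_bug"], ["test_existing"]))      # True
print(is_resolved(regressed, ["test_bug"], ["test_existing"]))  # False
print(resolve_rate([True, False, True, True]))                  # 0.75
```

Scoring is all-or-nothing per task: a patch that fixes the reported bug but breaks an existing test does not count as solved.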