TL;DR: SWE-fficiency is a repository-level benchmark for performance optimization (not bug fixing). Each task ships:
- a full codebase,
- a targeted performance workload to speed up,
- and the subset of repo correctness tests that must remain green.
We evaluate patches by applying them, running the correctness suite, and measuring runtime speedups vs. the expert (human) PR, reporting Speedup Ratio (SR).
SWE-fficiency evaluates pass-to-pass performance engineering: start from a codebase and a slow workload, improve runtime, and don't break behavior. The focus is on investigation (profiling/localization) and correctness-preserving edits, mirroring how performance engineers work day-to-day.
- Real repos, real workloads: 498 tasks from 9 major Python libraries (numpy, scipy, pandas, scikit-learn, matplotlib, xarray, sympy, dask, astropy).
- Correctness-preserving: Edits must pass the repo's own unit/integration tests covering the changed code.
- Reproducible evaluation: Prebuilt, containerized environments; per-task CPU/memory pinning recommended (4 vCPUs, 16 GB RAM per worker).
- Metric: Speedup Ratio (SR) = (LM speedup) / (expert speedup), aggregated across tasks with the harmonic mean. SR > 1.0 means you beat the human baseline.
Performance improvements in widely used libraries have outsized impact. SWE-fficiency isolates the open-ended challenge: find bottlenecks, propose safe optimizations, and prove correctness against the repo's own tests, at repository scope.
We recommend Python 3.12 and a Linux host. The benchmark is also installable via pip in editable mode.
uv venv --python 3.12
source .venv/bin/activate
uv sync
# Alternatively, you can install directly via pip.
pip install -e .

Evaluating on SWE-fficiency is a multi-step process via our package's CLI.
For faithful reproduction of paper results, use a large VM (for identical leaderboard setup, use GCP n2-standard-64) and run the setup scripts to configure Docker and CPU pinning. We recommend using --num_workers 12 on this configuration, which allocates 4 vCPUs and 16 GB RAM per worker.
bash scripts/vm/setup_vm.sh
# IMPORTANT: This script pins the number of CPUs for the docker daemon,
# which is why it must be run with sudo privileges. This ensures image building
# and pulling overhead does not interfere with evaluation.
sudo scripts/vm/setup_docker.sh MEM_MAX MEM_HIGH

swefficiency eval --run_id my_eval --num_workers 12

This runs the expert (human) patches to establish baseline performance metrics. Results are stored in logs/run_evaluation/my_eval/gold/.
swefficiency eval --run_id my_eval --num_workers 12 --prediction_path predictions.jsonl

Your predictions file should be JSONL, with each line containing:
{"instance_id": "<id>", "model_patch": "<patch_text>", "model_name_or_path": "<model_name>"}Results are stored in logs/run_evaluation/my_eval/<model_name>/.
swefficiency report \
--gold_run logs/run_evaluation/my_eval/gold \
    --pred_run logs/run_evaluation/my_eval/<model_name>

This generates two output files in eval_reports/:
- eval_report_<model_name>.csv - Per-instance results
- eval_report_<model_name>.json - Summary metrics, including:
  - overall_score: Harmonic mean of speedup ratios
  - proportion_incorrect: Instances that failed correctness tests
  - proportion_correct_but_no_speedup: Correct but slower than baseline
  - proportion_human_speedup_or_better: Matched or beat expert performance
You can also point to arbitrary paths if your evaluation results are stored elsewhere:
swefficiency report \
--gold_run /path/to/gold/results \
--pred_run /path/to/model/results \
    --report_output my_reports
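If you want to post-process results programmatically, a short sketch along these lines should work, assuming the default eval_reports/ location and the summary keys listed above (the model name is a placeholder):

```python
import json

import pandas as pd

# Paths assume the default eval_reports/ output directory and a placeholder model name.
with open("eval_reports/eval_report_my-model.json") as f:
    summary = json.load(f)
per_instance = pd.read_csv("eval_reports/eval_report_my-model.csv")

print("Overall score (harmonic-mean SR):", summary["overall_score"])
print("Fraction failing correctness:", summary["proportion_incorrect"])
print("Instances evaluated:", len(per_instance))
```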
Task structure (per instance):
- Repo snapshot + diff metadata
- A performance workload script that exhibits a measurable speedup under the expert patch
- The set of repo tests whose coverage intersects the expert diff (the "guarding" tests)
The workloads are separate from correctness tests (as in real projects). The benchmark rejects instances whose speedups are not statistically significant in a controlled environment.
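To make the workload/test separation concrete, a hypothetical workload script could look like the sketch below; the operation, input size, and timing scheme are illustrative, not taken from any actual instance:

```python
import time

import numpy as np

# Illustrative workload: set up inputs once, then time only the hot operation.
rng = np.random.default_rng(0)
x = rng.standard_normal((4000, 4000))

start = time.perf_counter()
result = x @ x.T  # the operation whose runtime the benchmark would measure
elapsed = time.perf_counter() - start

print(f"workload runtime: {elapsed:.3f}s")
```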
For each instance:
- Let T_pre be the workload runtime before any edit.
- Let T_post_gold be the runtime after applying the expert patch.
- Let T_post_lm be the runtime after applying your model's patch.

Expert speedup = T_pre / T_post_gold
Model speedup = T_pre / T_post_lm
Speedup Ratio (SR) = Model speedup / Expert speedup

- We aggregate SR across tasks with the harmonic mean.
- If a patch fails correctness tests or doesn't apply, the instance is scored as if no LM edit were attempted (T_pre / T_post_lm = 1).
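In code, this scoring reduces to a few lines. The sketch below uses made-up runtimes to show how per-instance SRs combine via the harmonic mean and how a failed or non-applying patch contributes a model speedup of 1:

```python
from statistics import harmonic_mean


def speedup_ratio(t_pre, t_post_gold, t_post_lm, correct=True):
    """Per-instance SR = (model speedup) / (expert speedup)."""
    expert_speedup = t_pre / t_post_gold
    model_speedup = (t_pre / t_post_lm) if correct else 1.0  # failed patch => speedup 1
    return model_speedup / expert_speedup


# Made-up runtimes (seconds) for three instances; the last patch failed correctness.
srs = [
    speedup_ratio(10.0, 2.0, 4.0),                # model slower than the expert -> 0.5
    speedup_ratio(8.0, 4.0, 4.0),                 # model matches the expert    -> 1.0
    speedup_ratio(6.0, 3.0, 5.0, correct=False),  # scored as no edit           -> 0.5
]
print("Aggregate SR (harmonic mean):", harmonic_mean(srs))
```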
- Run Patch Evaluation: apply predicted patches, run the guarding correctness tests and the performance workload; store logs and raw measurements.
- Check Evaluation: aggregate JSON/CSV artifacts into final metrics (SR, pass rates, etc.).
See the Quick Start section above for CLI usage, or scripts/eval/README.md for advanced options.
We provide integration points for popular SWE agent harnesses like OpenHands and SWE-agent via prebuilt Docker containers.
We ship prebuilt Docker images for generation to match the evaluation environment and avoid dependency drift.
Recommended per-task limits (matching paper setup): 3 hours wall-clock, 100 max actions/turns; be generous with workload timeouts (since tests or workloads can be substantial).
Need a generalized way to prep instances, run your agent, and capture patches? See
scripts/inference/README.md for the cursor.py harness. It loads the
SWE-fficiency dataset directly from Hugging Face, runs prework/inference steps
defined in YAML specs (Cursor CLI example included), and writes git patches ready
for swefficiency eval.
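If you just want to inspect instances yourself, loading the dataset from Hugging Face might look like the sketch below; the dataset identifier and field name are assumptions, so check the hub page used by the harness for the actual values:

```python
from datasets import load_dataset

# Placeholder dataset id and split; substitute the actual SWE-fficiency dataset on the Hub.
ds = load_dataset("YOUR_ORG/swefficiency", split="test")

for instance in ds.select(range(3)):
    # Field name assumed to match the predictions format above.
    print(instance["instance_id"])
```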
- Use the provided container images (prebuilt for each instance).
- Pin CPU and memory per worker (4 vCPUs / 16 GB RAM). See the scripts in scripts/vm/ for more details.
- Prebuilt images include everything needed.
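For illustration only, per-container CPU/memory pinning with the docker-py SDK could look like the following sketch (the image tag and command are placeholders; the provided scripts handle this for you):

```python
import docker

client = docker.from_env()

# Placeholder image tag and command; the values below only illustrate the resource flags.
container = client.containers.run(
    "swefficiency-instance-image:latest",
    command="python run_workload.py",
    cpuset_cpus="0-3",   # pin to 4 vCPUs
    mem_limit="16g",     # cap memory at 16 GB
    detach=True,
)
container.wait()
print(container.logs().decode())
```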
We include reference results in the paper across several modern LMs using OpenHands/SWE-agent. Overall, agents today are far from expert parity (SR ≪ 1×) and frequently introduce correctness regressions when attempting optimizations. See the paper for full tables and analysis.
.
├── scripts/
│   ├── eval/            # evaluation runner + aggregator
│   └── vm/              # docker & VM pinning helpers
├── swefficiency/        # python package (cli, utils, loaders)
├── assets/figures/      # logos, diagrams
└── README.md
This codebase began as a fork of SWE-Gym's fork of SWE-bench (https://github.com/SWE-Gym/SWE-Bench-Fork). We updated repo-specific dependencies in the constants files, extended the data pipeline to filter performance-specific commits (as described in our paper), and updated the evaluation harness to validate our performance + correctness setting. We've also added several helper scripts and utilities to support evaluation and experiment analysis.
Copyright 2025 Google LLC
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
