SWE-fficiency Logo

Home · Data · Paper


SWE-fficiency: Can Language Models Optimize Real World Repositories on Real World Workloads?

TL;DR: SWE-fficiency is a repository-level benchmark for performance optimization (not bug fixing). Each task ships:

  • a full codebase,
  • a targeted performance workload to speed up,
  • and the subset of repo correctness tests that must remain green.

We evaluate patches by applying them, running the correctness suite, and measuring runtime speedups vs. the expert (human) PR, reporting Speedup Ratio (SR).


🚀 What is SWE-fficiency?

SWE-fficiency evaluates pass-to-pass performance engineering: start from a codebase and a slow workload, improve runtime, and don't break behavior. The focus is on investigation (profiling/localization) and correctness-preserving edits, mirroring how performance engineers work day-to-day.

Highlights

  • Real repos, real workloads: 498 tasks from 9 major Python libraries: numpy, scipy, pandas, scikit-learn, matplotlib, xarray, sympy, dask, astropy.
  • Correctness-preserving: Edits must pass the repo's own unit/integration tests covering the changed code.
  • Reproducible evaluation: Prebuilt, containerized environments; per-task CPU/memory pinning recommended (4 vCPUs, 16 GB RAM per worker).
  • Metric: Speedup Ratio (SR) = (LM speedup) / (expert speedup), aggregated across tasks with the harmonic mean. SR > 1.0 means the model beats the human baseline.

Why this matters

Performance improvements in widely used libraries have outsized impact. SWE-fficiency isolates the open-ended challenge: find bottlenecks, propose safe optimizations, and prove correctness against the repo's own tests, at repository scope.


📦 Install & Environment

We recommend Python 3.12 and a Linux host. The benchmark is also installable via pip in editable mode.

uv venv --python 3.12
source .venv/bin/activate
uv sync

# Alternatively, you can install directly via pip.
pip install -e .
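
To sanity-check the installation, a minimal check is to import the package from Python (this only assumes the package installs under the name swefficiency, as above):

# check_install.py: minimal sanity check after installation
import swefficiency

print("swefficiency imported from:", swefficiency.__file__)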

Quick Start

Evaluating on SWE-fficiency is a multi-step process driven by the package's CLI.

Step 0: VM / Container Setup (highly recommended for reproducibility)

For faithful reproduction of the paper results, use a large VM (a GCP n2-standard-64 matches the leaderboard setup) and run the setup scripts to configure Docker and CPU pinning. We recommend --num_workers 12 on this configuration, which allocates 4 vCPUs and 16 GB of RAM per worker.

bash scripts/vm/setup_vm.sh

# IMPORTANT: This script pins the CPUs available to the Docker daemon,
# which is why it must be run with sudo privileges. This keeps image
# building and pulling overhead from interfering with evaluation.
sudo scripts/vm/setup_docker.sh MEM_MAX MEM_HIGH

Step 1: Run gold baseline (establishes reference performance)

swefficiency eval --run_id my_eval --num_workers 12

This runs the expert (human) patches to establish baseline performance metrics. Results are stored in logs/run_evaluation/my_eval/gold/.

Step 2: Run your model predictions

swefficiency eval --run_id my_eval --num_workers 12 --prediction_path predictions.jsonl

Your predictions file should be JSONL with each line containing:

{"instance_id": "<id>", "model_patch": "<patch_text>", "model_name_or_path": "<model_name>"}

Results are stored in logs/run_evaluation/my_eval/<model_name>/.
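
As an illustration, here is a minimal sketch for producing a predictions file in this format; the instance ID, patch text, and model name below are placeholders:

# write_predictions.py: sketch for producing predictions.jsonl (field names as documented above)
import json

# Hypothetical mapping from instance ID to the unified diff your system produced.
patches = {
    "example__instance-1234": "diff --git a/foo.py b/foo.py\n...",
}

with open("predictions.jsonl", "w") as f:
    for instance_id, patch in patches.items():
        record = {
            "instance_id": instance_id,
            "model_patch": patch,
            "model_name_or_path": "my-model",
        }
        f.write(json.dumps(record) + "\n")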

Step 3: Generate evaluation report

swefficiency report \
    --gold_run logs/run_evaluation/my_eval/gold \
    --pred_run logs/run_evaluation/my_eval/<model_name>

This generates two output files in eval_reports/:

  • eval_report_<model_name>.csv - Per-instance results
  • eval_report_<model_name>.json - Summary metrics including:
    • overall_score: Harmonic mean of speedup ratios
    • proportion_incorrect: Instances that failed correctness tests
    • proportion_correct_but_no_speedup: Correct but no faster than the pre-edit baseline
    • proportion_human_speedup_or_better: Matched or beat expert performance
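
For downstream analysis, a short sketch for loading these artifacts (the file names follow the pattern above; "my-model" is a placeholder, and the CSV columns should be inspected before use):

# inspect_report.py: sketch for loading the report outputs described above
import json

import pandas as pd

with open("eval_reports/eval_report_my-model.json") as f:
    summary = json.load(f)
print("overall_score (harmonic mean of SRs):", summary["overall_score"])
print("proportion_incorrect:", summary["proportion_incorrect"])

per_instance = pd.read_csv("eval_reports/eval_report_my-model.csv")
print(per_instance.head())  # per-instance results; inspect columns before further analysis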

You can also point to arbitrary paths if your evaluation results are stored elsewhere:

swefficiency report \
    --gold_run /path/to/gold/results \
    --pred_run /path/to/model/results \
    --report_output my_reports

🧰 Dataset

  • Location: Hugging Face (swefficiency/swefficiency)

  • Task structure (per instance):

    • Repo snapshot + diff metadata
    • A performance workload script that exhibits a measurable speedup under the expert patch
    • The set of repo tests whose coverage intersects the expert diff (the "guarding" tests)

The workloads are separate from correctness tests (as in real projects). The benchmark rejects instances whose speedups are not statistically significant in a controlled environment.
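
To browse instances locally, a minimal sketch using the Hugging Face datasets library (the split name and field names are assumptions; check the dataset card for the exact schema):

# browse_dataset.py: sketch; dataset ID from above, split name is an assumption
from datasets import load_dataset

ds = load_dataset("swefficiency/swefficiency", split="test")
print(len(ds), "instances")
print(ds[0].keys())  # inspect the per-instance fields (repo metadata, workload, tests)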


📊 Evaluation

Metric: Speedup Ratio (SR)

For each instance:

  • Let T_pre be workload runtime pre-edit.
  • Let T_post_gold be runtime after applying the expert patch.
  • Let T_post_lm be runtime after applying your model's patch.

  • Expert speedup = T_pre / T_post_gold
  • Model speedup = T_pre / T_post_lm
  • Speedup Ratio (SR) = Model speedup / Expert speedup

  • We aggregate SR across tasks with the harmonic mean.
  • If a patch fails correctness tests or doesn't apply, the instance is scored as if no LM edit were attempted (model speedup = 1); see the sketch below.
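
The scoring rule, including the failure case above, can be summarized in a short illustrative sketch (the harness computes this for you; the runtimes below are made up):

# speedup_ratio.py: illustrative scoring sketch following the definitions above
from statistics import harmonic_mean

def speedup_ratio(t_pre, t_post_gold, t_post_lm, correct):
    expert_speedup = t_pre / t_post_gold
    # Failed or non-applying patches score as if no edit were made (model speedup = 1).
    model_speedup = t_pre / t_post_lm if correct else 1.0
    return model_speedup / expert_speedup

# Aggregate across instances with the harmonic mean.
srs = [
    speedup_ratio(10.0, 2.0, 4.0, correct=True),   # model 2.5x vs. expert 5x -> SR = 0.5
    speedup_ratio(10.0, 2.0, 4.0, correct=False),  # failed correctness -> SR = 0.2
]
print("per-instance SRs:", srs, "overall:", round(harmonic_mean(srs), 3))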

Two-stage evaluation pipeline

  1. Run Patch Evaluation โ€” Apply predicted patches, run guarding correctness tests, run the performance workload; store logs and raw measurements.
  2. Check Evaluation โ€” Aggregate JSON/CSV artifacts into final metrics (SR, pass rates, etc.).

See the Quick Start section above for CLI usage, or scripts/eval/README.md for advanced options.


๐Ÿ› ๏ธ Generation (Agents & Harness)

We provide integration points for popular SWE agent harnesses such as OpenHands and SWE-agent via containerized Docker environments.

We ship prebuilt Docker images for generation to match the evaluation environment and avoid dependency drift.

Recommended per-task limits (matching paper setup): 3 hours wall-clock, 100 max actions/turns; be generous with workload timeouts (since tests or workloads can be substantial).

Need a generalized way to prep instances, run your agent, and capture patches? See scripts/inference/README.md for the cursor.py harness. It loads the SWE-fficiency dataset directly from Hugging Face, runs prework/inference steps defined in YAML specs (Cursor CLI example included), and writes git patches ready for swefficiency eval.


🔬 Reproducibility Tips

  • Use the provided container images (prebuilt for each instance).
  • Pin CPU and memory per worker (4 vCPUs / 16 GB RAM); see the scripts under scripts/vm/ for details.
  • The prebuilt images include everything needed to run each instance.

📈 Baseline Snapshot

We include reference results in the paper across several modern LMs using OpenHands/SWE-agent. Overall, agents today are far from expert parity (SR ≪ 1×) and frequently introduce correctness regressions when attempting optimizations. See the paper for full tables and analysis.


🧭 Project Structure (high level)

.
├── scripts/
│   ├── eval/           # evaluation runner + aggregator
│   └── vm/             # docker & VM pinning helpers
├── swefficiency/       # python package (cli, utils, loaders)
├── assets/figures/     # logos, diagrams
└── README.md

Acknowledgements

This codebase began as a fork of SWE-Gym's fork of SWE-bench (https://github.com/SWE-Gym/SWE-Bench-Fork). We updated repo-specific dependencies in the constants files, extended the data pipeline to filter performance-specific commits (as described in our paper), and updated the evaluation harness to validate our combined performance + correctness setting. We also added several helper scripts and utilities to support evaluation and experiment analysis.

License

Copyright 2025 Google LLC

All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.

This is not an officially supported Google product. This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.
