
👉🏻 OmniGIRL 👈🏻

🌐 Website • 🤗 Hugging Face • 🐋 Env Docker Image • 📃 arXiv Paper • 📓 ISSTA 2025

✨ Key Features

  • 🚀 Convenient, Standardized Evaluation Environment

    Provides pre-built Docker images, significantly simplifying environment setup and guaranteeing consistent, reproducible evaluation runs.

  • 🕸 Extensive Programming Language Coverage

    Supports Python, Java, JavaScript, and TypeScript, enabling effective evaluation across these four major programming-language ecosystems.

  • 🗂️ Rich Multimodal Input Data

    Integrates diverse input modalities (text, web content, and images), requiring evaluated models to understand and combine information from all of these sources to resolve issues.

  • ⚒ Automatic Environment Setup & Dataset Construction Tool

    We introduce SWE-Factory, an automated pipeline for constructing issue-resolution benchmarks built on a multi-agent framework. For more information and the full source code, see SWE-Factory.


📦 Environment Setup

To get started, run the bash script below to set up the environment:

bash setup.sh
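
If setup fails, a common cause is an unreachable Docker daemon, since the evaluation environment ships as pre-built Docker images. The snippet below is a minimal, optional pre-flight check (a sketch, not part of the repository) that verifies the Docker CLI is installed and the daemon responds before you run setup.sh:

    import shutil
    import subprocess
    import sys

    def docker_ready() -> bool:
        """Return True if the Docker CLI exists and the daemon responds."""
        if shutil.which("docker") is None:
            print("Docker CLI not found on PATH.", file=sys.stderr)
            return False
        # `docker info` contacts the daemon; a non-zero exit code means it is unreachable.
        result = subprocess.run(["docker", "info"], capture_output=True, text=True)
        if result.returncode != 0:
            print("Docker daemon is not reachable:", result.stderr.strip(), file=sys.stderr)
            return False
        return True

    if __name__ == "__main__":
        sys.exit(0 if docker_ready() else 1)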

🚀 Running Evaluations

After setting up the environment, follow these steps to run an evaluation:

  1. Prepare a prediction file: a JSONL file of predicted patches (a programmatic sketch follows this list), where each entry contains:

    • model_name_or_path: the name of the model that produced the patch
    • instance_id: the ID of the task instance
    • model_patch: the predicted patch content

    Example:

    {
        "model_name_or_path": "agentless-v1",
        "instance_id": "prettier__prettier-12260",
        "model_patch": "diff --git ...."
    }
  2. Change into omnigirl/harness and run the evaluation with the following command:

    # required: run the evaluation from inside the harness directory
    cd omnigirl/harness

    python run_evaluation.py --predictions_path <path to your prediction file> \
                             --max_workers <number of workers> \
                             --run_id <unique identifier for this evaluation run>
  3. By default, your evaluation results will be generated in omnigirl/harness/reports.

  4. For a detailed evaluation tutorial, please refer to the omnigirl/harness directory.

  5. We recommend running evaluations on amd64 (x86-64) machines, consistent with the evaluation environment used in the paper.
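
As a concrete end-to-end sketch (not part of the repository), the snippet below writes a prediction file in the JSONL format from step 1 and then launches the harness as in step 2, assuming it is run from the repository root. The file name, model label, worker count, and run id are placeholders; the field names follow the example above.

    import json
    import os
    import subprocess

    # Hypothetical predictions; in practice these come from your own model or agent.
    predictions = [
        {
            "model_name_or_path": "my-model",           # placeholder model label
            "instance_id": "prettier__prettier-12260",  # instance id from the example above
            "model_patch": "diff --git ...",            # predicted patch as a unified diff
        },
    ]

    # Write one JSON object per line (JSONL), as the harness expects.
    pred_path = os.path.abspath("predictions.jsonl")
    with open(pred_path, "w", encoding="utf-8") as f:
        for item in predictions:
            f.write(json.dumps(item) + "\n")

    # Launch the evaluation exactly as in step 2 (run from the repository root).
    subprocess.run(
        ["python", "run_evaluation.py",
         "--predictions_path", pred_path,
         "--max_workers", "4",
         "--run_id", "demo-run"],
        cwd="omnigirl/harness",
        check=True,
    )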

📖 Citation

If you find OmniGIRL useful for your research and applications, feel free to give us a star ⭐ or cite us using:

@inproceedings{guo2025omnigirl,
  title={OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution},
  author={Guo, Lianghong and Tao, Wei and Jiang, Runhan and Wang, Yanlin and Chen, Jiachi and Liu, Xilin and Ma, Yuchi and Mao, Mingzhi and Zhang, Hongyu and Zheng, Zibin},
  booktitle={Proceedings of the 34th ACM SIGSOFT International Symposium on Software Testing and Analysis},
  year={2025},
  publisher={{ACM}},
}

πŸ™ Acknowledgements

  • We build on prior work (SWE-bench, Agentless, and AutoCodeRover), which laid the groundwork for this study.
  • We thank the EvalPlus leaderboard team for releasing the elegant page template that inspired this site.
  • Finally, we are grateful to the open-source developer community for their invaluable contributions.
