
A benchmark for evaluating LLMs on open-ended CS problems. Exploring the Next Frontier of Computer Science.


Frontier-CS Logo

Evolving Challenges for Evolving Intelligence

Website · Discord · DeepWiki · arXiv · Hugging Face · Research Problems · Algorithmic Problems

What is Frontier-CS?

Frontier-CS is an unsolved, open-ended, verifiable, and diverse benchmark for evaluating AI on challenging computer science problems.

Think of it as an "exam" for AI, but instead of easy textbook questions, we give problems that are genuinely difficult: ones that researchers struggle with, that have no known optimal solutions, or that require deep expertise to even attempt.

Why Frontier-CS?

Current benchmarks are becoming too easy. Models score 90%+ on many existing coding benchmarks, but that doesn't mean they can actually do useful research or solve real-world engineering challenges.

Frontier-CS is different:

|            | Traditional Benchmarks           | Frontier-CS                                            |
| ---------- | -------------------------------- | ------------------------------------------------------ |
| Difficulty | Often saturated as models evolve | Unsolved: no solution has achieved a perfect score      |
| Problems   | Textbook-style, known solutions  | Open-ended research & optimization challenges           |
| Evaluation | Binary pass-or-fail              | Verifiable continuous scoring, always room to improve   |
| Scope      | Usually one domain               | Diverse: systems, ML, algorithms, security, and more    |

🏆 Leaderboard Snapshot (01/29/2026)

Score@k = best score over k runs; Avg@k = average score over k runs; Elo is a Bradley–Terry fit to single-attempt, difficulty-normalized performance.
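For concreteness, here is a minimal Python sketch of the Score@k and Avg@k reductions (the Elo fit itself is not reproduced here, and the run scores below are toy values, not leaderboard data):

from statistics import mean

def score_at_k(run_scores):
    # Score@k: the best score achieved across the k runs
    return max(run_scores)

def avg_at_k(run_scores):
    # Avg@k: the mean score across the k runs
    return mean(run_scores)

# Toy example: five runs of one model on one problem
runs = [33.1, 28.4, 30.0, 25.7, 31.2]
print(f"Score@5 = {score_at_k(runs):.2f}, Avg@5 = {avg_at_k(runs):.2f}")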

Algorithmic Track (172 problems)

| Rank | Model            | Score@1 | Avg@5 | Score@5 | Elo  |
| ---- | ---------------- | ------- | ----- | ------- | ---- |
| 🥇   | Gemini 3.0 Pro   | 33.12   | 34.58 | 56.09   | 1265 |
| 🥈   | GPT 5.2 Thinking | 32.40   | 33.11 | 47.19   | 1242 |
| 🥉   | GPT 5 Thinking   | 23.10   | 22.58 | 39.73   | 1196 |
| 4    | DeepSeek 3.2     | 24.83   | 23.89 | 41.44   | 1193 |
| 5    | Grok 4           | 24.04   | 22.98 | 36.81   | 1174 |
| 6    | Gemini 2.5 Pro   | 20.34   | 19.32 | 36.65   | 1167 |
| 7    | GPT 5.1 Thinking | 20.64   | 21.49 | 34.76   | 1164 |

Human reference: 86.99 (Score@1).

Research Track (68 problems)

| Rank | Model            | Score@1 | Avg@5 | Score@5 | Elo  |
| ---- | ---------------- | ------- | ----- | ------- | ---- |
| 🥇   | Gemini 3.0 Pro   | 46.55   | 43.14 | 59.22   | 1283 |
| 🥈   | GPT 5 Thinking   | 30.91   | 34.94 | 55.25   | 1218 |
| 🥉   | GPT 5.1 Thinking | 32.12   | 33.70 | 56.79   | 1214 |
| 4    | GPT 5.2 Thinking | 30.29   | 34.09 | 58.90   | 1210 |
| 5    | Gemini 2.5 Pro   | 21.66   | 25.74 | 51.57   | 1180 |
| 6    | Grok 4           | 26.75   | 24.01 | 48.15   | 1149 |
| 7    | DeepSeek 3.2     | 21.51   | 21.76 | 44.41   | 1146 |

Getting Started

Installation

Requirements: Python 3.11+, Docker 24+ (for local evaluation)

git clone https://github.com/FrontierCS/Frontier-CS.git
cd Frontier-CS

# Install dependencies (using uv, recommended)
uv sync

# Or with pip:
pip install -e .

Try it yourself

Here's Algorithmic Problem 0: try to beat GPT-5!

# Run the example solution (Human Expert Solution)
frontier eval algorithmic 0 algorithmic/problems/0/examples/reference.cpp

# Run the example solution (GPT-5 Thinking Solution)
frontier eval algorithmic 0 algorithmic/problems/0/examples/gpt5.cpp

# Try your own solution!
frontier eval algorithmic 0 <your_solution.cpp>

Example Problem

Research Problems

# List all problems
frontier list research

# Evaluate (uses SkyPilot by default, requires `sky check`)
frontier eval research flash_attn <your_solution.py>

# Use Docker instead (no cloud setup needed)
frontier eval research flash_attn <your_solution.py> --backend docker

See research/README.md for full documentation.

Algorithmic Problems

# Evaluate (uses Docker by default)
frontier eval algorithmic 1 <your_solution.cpp>

# Use SkyPilot instead
frontier eval algorithmic 1 <your_solution.cpp> --backend skypilot

See algorithmic/README.md for full documentation.

Raw Score

Frontier-CS supports unbounded scoring, enabling open-ended evaluation compatible with algorithm evolution frameworks such as OpenEvolve.

# Get unbounded score (without clipping to 100)
frontier eval research flash_attn <your_solution.py> --unbounded
frontier eval algorithmic 1 <your_solution.cpp> --unbounded

Python API

from frontier_cs import SingleEvaluator

evaluator = SingleEvaluator()

# Evaluate a research problem
result = evaluator.evaluate("research", problem_id="flash_attn", code=my_code)
print(f"Score: {result.score}")

# Evaluate an algorithmic problem
result = evaluator.evaluate("algorithmic", problem_id=1, code=cpp_code)
print(f"Score: {result.score}")

# Get unbounded score for algorithmic problems
result = evaluator.evaluate("algorithmic", problem_id=1, code=cpp_code, unbounded=True)
print(f"Score (bounded): {result.score}")
print(f"Score (unbounded): {result.score_unbounded}")

Batch Evaluation

Batch evaluation lets you test your solutions at scale against the public test cases.

Solution directory structure:

{track}/solutions/
  {problem}/
    {model}.py          # variant 0
    {model}_1.py        # variant 1
    {model}_2.py        # variant 2

Example for research track:

research/solutions/
  flash_attn/
    gpt5.py
    claude4.5sonnet.py
  cross_entropy/
    gpt5.py
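
As an illustration of the naming convention above (not part of the CLI), here is a small sketch that walks a solutions directory and yields (problem, model, variant) triples; the regex simply mirrors the {model}.py / {model}_N.py pattern:

import re
from pathlib import Path

VARIANT_RE = re.compile(r"^(?P<model>.+?)(?:_(?P<variant>\d+))?\.py$")

def list_solutions(solutions_dir="research/solutions"):
    # {model}.py is variant 0, {model}_N.py is variant N
    for problem_dir in sorted(Path(solutions_dir).iterdir()):
        if not problem_dir.is_dir():
            continue
        for path in sorted(problem_dir.glob("*.py")):
            match = VARIANT_RE.match(path.name)
            if match:
                yield (problem_dir.name, match.group("model"),
                       int(match.group("variant") or 0))

for problem, model, variant in list_solutions():
    print(problem, model, variant)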

Basic usage:

# Evaluate all research solutions (uses SkyPilot by default)
frontier batch research

# Evaluate all algorithmic solutions (uses Docker by default)
frontier batch algorithmic

# Filter by model or problem
frontier batch research --model gpt5.1
frontier batch research --problem flash_attn

# Override default backend
frontier batch research --backend docker
frontier batch algorithmic --backend skypilot

Custom solutions directory: You can test solutions from a custom directory with the same structure:

frontier batch research --solutions-dir ./my_solutions

Results are saved to ./results/batch/{track}/ by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:

  • Resume interrupted evaluations automatically
  • Run it multiple times with different --solutions-dir values; results accumulate across runs

See --help for all options.

Note: For maintainers, ./scripts/run_eval.sh is used for full evaluation with private test cases.

Evaluating and Submitting Results

Reference solutions and full test cases are withheld. We release partial test cases so you can develop and debug locally. For the complete evaluation workflow (preparing solutions, running batch evaluation, viewing results, and submitting to the leaderboard), see SUBMIT.md and submit your solutions to qmang@berkeley.edu, wenhao.chai@princeton.edu, huanzhimao@berkeley.edu, or zhifei.li@berkeley.edu.

Questions? Join our Discord

Acknowledgments

Some problems are adapted from ALE-bench and AI-Driven Research for Systems (ADRS).

Citing Us

If you use Frontier-CS in your research, please cite:

@misc{mang2025frontiercsevolvingchallengesevolving,
      title={FrontierCS: Evolving Challenges for Evolving Intelligence},
      author = {Qiuyang Mang and Wenhao Chai and Zhifei Li and Huanzhi Mao and
                Shang Zhou and Alexander Du and Hanchen Li and Shu Liu and
                Edwin Chen and Yichuan Wang and Xieting Chu and Zerui Cheng and
                Yuan Xu and Tian Xia and Zirui Wang and Tianneng Shi and
                Jianzhu Yao and Yilong Zhao and Qizheng Zhang and Charlie Ruan and
                Zeyu Shen and Kaiyuan Liu and Runyuan He and Dong Xing and
                Zerui Li and Zirong Zeng and Yige Jiang and Lufeng Cheng and
                Ziyi Zhao and Youran Sun and Wesley Zheng and Meiyuwang Zhang and
                Ruyi Ji and Xuechang Tu and Zihan Zheng and Zexing Chen and
                Kangyang Zhou and Zhaozi Wang and Jingbang Chen and
                Aleksandra Korolova and Peter Henderson and Pramod Viswanath and
                Vijay Ganesh and Saining Xie and Zhuang Liu and Dawn Song and
                Sewon Min and Ion Stoica and Joseph E. Gonzalez and
                Jingbo Shang and Alvin Cheung},
      year={2025},
      eprint={2512.15699},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.15699},
}
