We are actively gathering feedback from the community to improve our benchmark. We welcome your input and encourage you to stay updated through our repository!
To add your own model to the leaderboard, please send an email to Yibin Wang, and we will help with the evaluation and update the leaderboard.
Please leave us a star if you find our benchmark helpful.
- [2026/02] GPT-4o-1.5, Seedream-4.5, and FLUX.2-(klein/pro/flex/max) are added to all Leaderboards.
- [2025/11] Nano Banana Pro, FLUX.2-dev, and Z-Image are added to all Leaderboards.
- [2025/11] We release the offline evaluation model UniGenBench-EvalModel-qwen3vl-32b-v1.
- [2025/10] We release the offline evaluation model UniGenBench-EvalModel-qwen-72b-v1, which achieves an average accuracy of 94% relative to evaluations by Gemini 2.5 Pro.
- [2025/09] Lumina-DiMOO, OmniGen2, Infinity, X-Omni, OneCAT, Echo-4o, and MMaDA are added to all Leaderboards.
- [2025/09] Seedream-4.0, Nano Banana, GPT-4o, Qwen-Image, and FLUX-Kontext-[Max/Pro] are added to all Leaderboards.
- [2025/09] We release the UniGenBench Leaderboard (Chinese), Leaderboard (English Long), and Leaderboard (Chinese Long), and will continue to update them regularly. The test prompts are provided in ./data.
- [2025/09] We release all generated images from the T2I models evaluated in UniGenBench at UniGenBench-Eval-Images. Feel free to use any evaluation model that is convenient and suitable for you to assess and compare the performance of your models.
- [2025/08] We release the paper, project page, and the UniGenBench Leaderboard (English).
We propose UniGenBench, a unified and versatile benchmark for image generation that integrates diverse prompt themes with a comprehensive suite of fine-grained evaluation criteria.
- Comprehensive and Fine-grained Evaluation: covers 10 primary dimensions and 27 sub-dimensions, enabling systematic and fine-grained assessment of diverse model capabilities.
- Rich Prompt Theme Coverage: organized into 5 primary themes and 20 sub-themes, comprehensively spanning both realistic and imaginative generation scenarios.
- Efficient yet Comprehensive: unlike other benchmarks, UniGenBench requires only 600 prompts, with each prompt targeting 1–10 specific testpoints, ensuring both coverage and efficiency.
- Streamlined MLLM Evaluation: each testpoint of a prompt is accompanied by a detailed description explaining how the testpoint is reflected in the prompt, helping MLLMs conduct precise evaluations.
- Bilingual and Length-Variant Prompt Support: provides both English and Chinese test prompts in short and long forms, together with evaluation pipelines for both languages, enabling fair and broad cross-lingual benchmarking.
- Reliable Evaluation Model for Offline Assessment: to facilitate community use, we train a robust evaluation model that supports offline assessment of T2I model outputs.
Each prompt in our benchmark is recorded as a row in a .csv file, combined with structured annotations for evaluation (see the loading sketch after the file table below):
- index
- prompt: The full English prompt to be tested
- sub_dims: A JSON-encoded field that organizes rich metadata, including:
  - Primary / Secondary Categories: prompt theme (e.g., Creative Divergence → Imaginative Thinking)
  - Subjects: the main entities involved in the prompt (e.g., Animal)
  - Sentence Structure: the linguistic form of the prompt (e.g., Descriptive)
  - Testpoints: key aspects to evaluate (e.g., Style, World Knowledge, Attribute - Quantity)
  - Testpoint Description: evaluation cues extracted from the prompt (e.g., classical ink painting, Egyptian pyramids, two pandas)
| Category | File | Description |
|---|---|---|
| English Short | `data/test_prompts_en.csv` | 600 short English prompts |
| English Long | `data/test_prompts_en_long.csv` | Long-form English prompts |
| Chinese Short | `data/test_prompts_zh.csv` | 600 short Chinese prompts |
| Chinese Long | `data/test_prompts_zh_long.csv` | Long-form Chinese prompts |
| Training | `data/train_prompt.txt` | Training prompts |
We provide reference code for multi-node inference based on FLUX.1-dev.

```bash
# English Prompt
bash inference/flux_en_dist_infer.sh

# Chinese Prompt
bash inference/flux_zh_dist_infer.sh
```

For each test prompt, 4 images are generated and stored in the following folder structure:
```
output_directory/
├── 0_0.png
├── 0_1.png
├── 0_2.png
├── 0_3.png
├── 1_0.png
├── 1_1.png
...
```

The file naming follows the pattern `{promptID}_{imageID}.png`.
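If you are not using the provided scripts, a rough single-GPU sketch of producing this layout with diffusers is shown below. This is not the repo's multi-node pipeline; the model ID, output paths, and the fixed seed per image index are illustrative assumptions.

```python
# Single-GPU sketch of producing the {promptID}_{imageID}.png layout with diffusers.
# The reference multi-node setup is inference/flux_en_dist_infer.sh.
import os
import pandas as pd
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

df = pd.read_csv("data/test_prompts_en.csv")
out_dir = "eval_data/en/FLUX.1-dev"
os.makedirs(out_dir, exist_ok=True)

for prompt_id, prompt in zip(df["index"], df["prompt"]):
    for image_id in range(4):  # 4 images per prompt
        generator = torch.Generator("cuda").manual_seed(image_id)
        image = pipe(prompt, generator=generator).images[0]
        image.save(os.path.join(out_dir, f"{prompt_id}_{image_id}.png"))
```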
The evaluation scripts expect generated images organized as follows:
```
eval_data/
├── en/
│   └── FLUX.1-dev/          # --model name
│       ├── 0_0.png
│       ├── 0_1.png
│       ├── ...
│       └── 599_3.png
├── en_long/
│   └── FLUX.1-dev/
├── zh/
│   └── FLUX.1-dev/
└── zh_long/
    └── FLUX.1-dev/
```

File naming: `{promptID}_{imageID}.png` (4 images per prompt by default).
You can customize the base directory via --eval_data_dir, images per prompt via --images_per_prompt, and file extension via --image_suffix.
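Before launching evaluation, a quick sanity check of the layout can save an aborted run. The snippet below is not part of the repo; it assumes the default 600 prompts and 4 images per prompt.

```python
# Hypothetical sanity check: confirm every prompt has 4 images named
# {promptID}_{imageID}.png under eval_data/<category>/<model>/.
import os
from collections import Counter

model, category = "FLUX.1-dev", "en"
root = os.path.join("eval_data", category, model)

counts = Counter(f.split("_")[0] for f in os.listdir(root) if f.endswith(".png"))
missing = [pid for pid in range(600) if counts.get(str(pid), 0) < 4]
print(f"{len(counts)} prompt IDs found; {len(missing)} prompts have fewer than 4 images")
```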
We use gemini-2.5-pro (GA, June 17, 2025) via an OpenAI-compatible API.
```bash
# Set API credentials (or pass via --api_key / --base_url)
export GEMINI_API_KEY="sk-xxxxxxx"
export GEMINI_BASE_URL="https://..."

# Evaluate English & Chinese short prompts
bash eval/eval_gemini.sh --model FLUX.1-dev --categories en zh

# Evaluate all categories (en, en_long, zh, zh_long)
bash eval/eval_gemini.sh --model FLUX.1-dev --categories all

# Resume from previous progress
bash eval/eval_gemini.sh --model FLUX.1-dev --categories en --resume
```

Available categories: en (English short), en_long (English long), zh (Chinese short), zh_long (Chinese long), all.
Run bash eval/eval_gemini.sh -h for all options (--num_processes, --images_per_prompt, etc.).
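Before launching a full run, you may want to sanity-check the endpoint and credentials. The sketch below uses the openai Python client against the same environment variables; the model identifier "gemini-2.5-pro" is an assumption and may differ for your provider.

```python
# Minimal connectivity check against the OpenAI-compatible endpoint used above.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url=os.environ["GEMINI_BASE_URL"],
)
resp = client.chat.completions.create(
    model="gemini-2.5-pro",  # assumed name; match your provider's model list
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(resp.choices[0].message.content)
```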
After evaluation, for each category:

- Scores across all dimensions are printed to the console
- A detailed CSV results file is saved to ./results/{model}_{category}.csv
- A JSON score summary is saved to ./results/{model}_{category}.json (see the loading sketch below)
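The JSON summary can be consumed programmatically; its exact keys depend on the evaluation dimensions, so this sketch simply loads and pretty-prints it.

```python
# Load and pretty-print the score summary for one model/category.
import json

with open("./results/FLUX.1-dev_en.json") as f:
    scores = json.load(f)

print(json.dumps(scores, indent=2, ensure_ascii=False))
```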
To recompute the score summary from an existing results CSV:

```bash
python eval/src/calculate_score.py --result_csv ./results/FLUX.1-dev_en.csv --json_path ./results/FLUX.1-dev_en.json
```

Install dependencies:
```bash
pip install "vllm>=0.11.0" qwen-vl-utils==0.0.14
```

Start the server:
```bash
# UniGenBench-EvalModel-qwen-72b-v1
vllm serve CodeGoat24/UniGenBench-EvalModel-qwen-72b-v1 \
    --host localhost --port 8080 \
    --served-model-name QwenVL \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --limit-mm-per-prompt.image 2

# UniGenBench-EvalModel-qwen3vl-32b-v1 (recommended; supports 8 GPUs)
vllm serve CodeGoat24/UniGenBench-EvalModel-qwen3vl-32b-v1 \
    --host localhost --port 8080 \
    --served-model-name QwenVL \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --limit-mm-per-prompt.image 2
```

Then run the evaluation:

```bash
# Evaluate English & Chinese short prompts
bash eval/eval_vllm.sh --model FLUX.1-dev --categories en zh

# Evaluate all categories
bash eval/eval_vllm.sh --model FLUX.1-dev --categories all

# Custom server URL and resume
bash eval/eval_vllm.sh --model FLUX.1-dev --categories en_long zh_long \
    --api_url http://gpu-server:8080 --resume
```

Run `bash eval/eval_vllm.sh -h` for all options.
Output is the same as for the Gemini evaluation: results are saved to ./results/{model}_{category}.csv and ./results/{model}_{category}.json.
To recompute the score summary from an existing results CSV:

```bash
python eval/src/calculate_score.py --result_csv ./results/FLUX.1-dev_en.csv --json_path ./results/FLUX.1-dev_en.json
```

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.
```bibtex
@article{UniGenBench++,
  title={UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Bu, Jiazi and Zhou, Yujie and Xin, Yi and He, Junjun and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and others},
  journal={arXiv preprint arXiv:2510.18701},
  year={2025}
}

@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}
```




