Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

UnifiedReward Team

Paper | Project Page | Hugging Face Spaces

🔥 News

Please leave us a star if you find this work helpful.


Pref-GRPO pipeline overview (figure: pref_grpo_pipeline)

🔧 Environment Setup

  1. Clone this repository and navigate to the folder:
git clone https://github.com/CodeGoat24/Pref-GRPO.git
cd Pref-GRPO
  2. Install the training package:
conda create -n PrefGRPO python=3.12
conda activate PrefGRPO

bash env_setup.sh fastvideo

git clone https://github.com/mlfoundations/open_clip
cd open_clip
pip install -e .
cd ..
  3. Install vLLM (for UnifiedReward-based rewards):
conda create -n vllm
conda activate vllm
pip install "vllm>=0.11.0"
pip install qwen-vl-utils==0.0.14
  4. Download the UnifiedReward models:
huggingface-cli download CodeGoat24/UnifiedReward-2.0-qwen3vl-8b
huggingface-cli download CodeGoat24/UnifiedReward-Think-qwen3vl-8b
huggingface-cli download CodeGoat24/UnifiedReward-Flex-qwen3vl-8b
huggingface-cli download CodeGoat24/UnifiedReward-Edit-qwen3vl-8b

💻 Training

Model-specific workflows (click to expand)

We use the training prompts from UniGenBench, provided in "./data/unigenbench_train_data.txt".

Image edit dataset format

Put jsonl files under data/{Image_Edit_Dataset_Name}/ (default examples use data/Image_Edit_data).

Each line is a JSON object. Recommended fields:

  • instruction: edit instruction
  • instruction_cn: optional Chinese instruction (used when USE_CN=1)
  • source_image or image: source image path (required)
  • target_image: optional target/reference edited image path

Instruction fallback order (applied in the loader sketch below):

  • instruction -> prompt -> caption -> text

Path rules:

  • absolute path: used directly
  • relative path: resolved against dataset root (input_path dir)
  • fallback: <dataset_root>/images/<relative_path>

Minimal jsonl example:

{"instruction":"replace the red car with a blue one","source_image":"images/0001_source.png","target_image":"images/0001_target.png"}
{"instruction_cn":"ζŠŠε€©η©Ίζ”Ήζˆζ™šιœž","source_image":"images/0002_source.jpg"}
FLUX.1-dev
Preprocess training data
bash fastvideo/data_preprocess/preprocess_flux_rl_embeddings.sh
Train (examples)
# Pref-GRPO
## UnifiedReward-Flex
bash scripts/full_train/ur_flex_prefgrpo_flux.sh
## UnifiedReward-Think
bash scripts/full_train/ur_think_prefgrpo_flux.sh


# UnifiedReward for Point Score-based GRPO
bash scripts/full_train/unifiedreward_flux.sh
FLUX.2-Klein (T2I, I2I)
Preprocess training data (T2I)
bash fastvideo/data_preprocess/preprocess_flux2_klein_rl_embeddings.sh
Preprocess training data (I2I)
# default: INPUT_PATH=data/Image_Edit_data, OUTPUT_DIR=data/flux2_klein_edit_embeddings
bash fastvideo/data_preprocess/preprocess_flux2_klein_edit.sh

# optional: use Chinese instruction when available
USE_CN=1 bash fastvideo/data_preprocess/preprocess_flux2_klein_edit.sh
Train (examples)
# Pref-GRPO (UnifiedReward-Flex as example)
bash scripts/lora/lora_ur_flex_prefgrpo_flux2_klein.sh

# Edit GRPO (UnifiedReward-Edit pointwise/prefgrpo reward example)
bash scripts/lora/lora_ur_edit_point_flux2_klein_edit.sh
bash scripts/lora/lora_ur_edit_prefgrpo_flux2_klein_edit.sh
FLUX.1-Kontext-dev
Preprocess training data (edit embeddings)
# default output: data/flux1_kontext_edit_embeddings
bash fastvideo/data_preprocess/preprocess_flux1_kontext_edit.sh

# optional: use Chinese instruction when available
USE_CN=1 bash fastvideo/data_preprocess/preprocess_flux1_kontext_edit.sh
Train (examples)
# start UnifiedReward-Edit server first
bash vllm_utils/vllm_server_UnifiedReward_Edit.sh

# Pref-GRPO with edit pairwise reward
bash scripts/lora/lora_ur_edit_prefgrpo_flux1_kontext_edit.sh
Qwen-Image
Preprocess training data
pip install diffusers==0.35.0 peft==0.17.0 transformers==4.56.0

bash fastvideo/data_preprocess/preprocess_qwen_image_rl_embeddings.sh
Train (examples)
## UnifiedReward-Think for Pref-GRPO
bash scripts/full_train/ur_think_prefgrpo_qwenimage.sh

## UnifiedReward for Point Score-based GRPO
bash scripts/full_train/unifiedreward_qwenimage.sh
Z-Image
Preprocess training data
bash fastvideo/data_preprocess/preprocess_z_image_rl_embeddings.sh
Train (examples)
## UnifiedReward-Flex for Pref-GRPO (full training)
bash scripts/full_train/ur_flex_prefgrpo_zimage.sh

## UnifiedReward-Flex for Pref-GRPO (LoRA)
bash scripts/lora/lora_ur_flex_prefgrpo_zimage.sh
Qwen-Image-Edit
Preprocess training data (edit embeddings)
# default output: data/qwen_image_edit_embeddings
bash fastvideo/data_preprocess/preprocess_qwen_image_edit.sh

# optional: use Chinese instruction when available
USE_CN=1 bash fastvideo/data_preprocess/preprocess_qwen_image_edit.sh
Train (examples)
# start UnifiedReward-Edit server first
bash vllm_utils/vllm_server_UnifiedReward_Edit.sh

# Pref-GRPO with edit pairwise reward
bash scripts/full_train/ur_edit_prefgrpo_qwen_image_edit.sh
Wan2.1
Preprocess training data
bash fastvideo/data_preprocess/preprocess_wan21_rl_embeddings.sh
Train (examples)
# Pref-GRPO
## UnifiedReward-Flex
bash scripts/lora/lora_ur_flex_prefgrpo_wan21.sh

## UnifiedReward-Think
bash scripts/lora/lora_ur_think_prefgrpo_wan21.sh
Wan2.2
Preprocess training data
bash fastvideo/data_preprocess/preprocess_wan22_rl_embeddings.sh
Train (examples)
# Pref-GRPO
## UnifiedReward-Flex
bash scripts/lora/lora_ur_flex_prefgrpo_wan22.sh

🧩 Reward Models & Usage

We support multiple reward models via the dispatcher in fastvideo/rewards/dispatcher.py. Reward model checkpoint paths are configured in fastvideo/rewards/reward_paths.py. Supported reward models (click to expand for setup details):

aesthetic

Set in fastvideo/rewards/reward_paths.py
aesthetic_ckpt: path to the Aesthetic MLP checkpoint (assets/sac+logos+ava1-l14-linearMSE.pth)
aesthetic_clip: HuggingFace CLIP model id (openai/clip-vit-large-patch14)
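
For illustration only, the two entries above might be set as follows; the actual structure of fastvideo/rewards/reward_paths.py may differ, and the values should point to your local files.

# Hypothetical excerpt of fastvideo/rewards/reward_paths.py -- only the key
# names come from this README; adjust the values to your setup.
aesthetic_ckpt = "assets/sac+logos+ava1-l14-linearMSE.pth"  # Aesthetic MLP checkpoint
aesthetic_clip = "openai/clip-vit-large-patch14"            # HuggingFace CLIP model id
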

clip

Download weights

wget https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378/resolve/main/open_clip_pytorch_model.bin

Set in fastvideo/rewards/reward_paths.py
clip_pretrained: path to OpenCLIP weights (used by CLIP reward)

hpsv2

Set in fastvideo/rewards/reward_paths.py
hpsv2_ckpt: path to HPS_v2.1_compressed.pt
clip_pretrained: path to OpenCLIP weights (required by HPSv2)

hpsv3

Set in fastvideo/rewards/reward_paths.py
hpsv3_ckpt: path to HPSv3 checkpoint

pickscore

Set in fastvideo/rewards/reward_paths.py
pickscore_processor: HuggingFace processor id (CLIP-ViT-H-14-laion2B-s32B-b79K)
pickscore_model: HuggingFace model id (Pickscore_v1)

unifiedreward (alignment / style / coherence)

Start server
Targets: unifiedreward_alignment, unifiedreward_style, unifiedreward_coherence

bash vllm_utils/vllm_server_UnifiedReward.sh  
unifiedreward_think

Start server
Target: unifiedreward_think

bash vllm_utils/vllm_server_UnifiedReward_Think.sh  
unifiedreward_flex

Start server
Target: unifiedreward_flex

bash vllm_utils/vllm_server_UnifiedReward_Flex.sh  
unifiedreward_edit

Start server (UnifiedReward-Edit)
Targets:

  • unifiedreward_edit_pairwise
  • unifiedreward_edit_pointwise_image_quality
  • unifiedreward_edit_pointwise_instruction_following
bash vllm_utils/vllm_server_UnifiedReward_Edit.sh

Scope
Edit rewards are image-only (modality=image) and expect edit-specific inputs:

  • pairwise: source image + two edited candidates + instruction
  • pointwise image quality: edited image only
  • pointwise instruction following: source image + edited image + instruction

Optional weighting via env vars (see the sketch below)
For unifiedreward_edit_pointwise_image_quality:

  • EDIT_QUALITY_WEIGHT_NATURALNESS (default 1.0)
  • EDIT_QUALITY_WEIGHT_ARTIFACTS (default 1.0)

For unifiedreward_edit_pointwise_instruction_following:

  • EDIT_IF_WEIGHT_SUCCESS (default 1.0)
  • EDIT_IF_WEIGHT_OVEREDIT (default 1.0)
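
How these weights are combined is not spelled out here; as a rough sketch (assuming a simple weighted sum over the two sub-scores named by the env vars), they could be consumed like this:

import os

# Sketch only, assuming a weighted sum; the repository's actual aggregation
# may differ.
def edit_quality_reward(naturalness, artifacts):
    w_nat = float(os.environ.get("EDIT_QUALITY_WEIGHT_NATURALNESS", "1.0"))
    w_art = float(os.environ.get("EDIT_QUALITY_WEIGHT_ARTIFACTS", "1.0"))
    return w_nat * naturalness + w_art * artifacts

def edit_if_reward(success, overedit):
    w_succ = float(os.environ.get("EDIT_IF_WEIGHT_SUCCESS", "1.0"))
    w_over = float(os.environ.get("EDIT_IF_WEIGHT_OVEREDIT", "1.0"))
    return w_succ * success + w_over * overedit
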
videoalign

Set in fastvideo/rewards/reward_paths.py
videoalign_ckpt: path to VideoAlign checkpoint directory

Set rewards in your training/eval scripts

Use --reward_spec to choose which rewards to compute and (optionally) their weights.

Examples:

# Use a list of rewards (all weights = 1.0)
--reward_spec "unifiedreward_think,clip,hpsv3"

# Weighted mix
--reward_spec "unifiedreward_alignment:0.5,unifiedreward_style:1.0,unifiedreward_coherence:0.5"

# Edit reward examples
--reward_spec '{"unifiedreward_edit_pointwise_image_quality":0.5,"unifiedreward_edit_pointwise_instruction_following":0.5}'
--reward_spec '{"unifiedreward_edit_pairwise":1.0}'

# JSON formats are also supported
--reward_spec '{"clip":0.5,"aesthetic":1.0,"hpsv2":0.5}'
--reward_spec '["clip","aesthetic","hpsv2"]'

🚀 Inference and Evaluation

We use the test prompts from UniGenBench, provided in "./data/unigenbench_test_data.csv".

FLUX.1-dev
bash inference/flux_dist_infer.sh
Qwen-Image
bash inference/qwen_image_dist_infer.sh
FLUX.2-Klein
bash inference/flux2_klein_dist_infer.sh
Wan2.1
bash inference/wan21_dist_infer.sh
bash inference/wan21_eval_vbench.sh
Wan2.2
bash inference/wan22_dist_infer.sh
bash inference/wan22_eval_vbench.sh

Then, evaluate the outputs following UniGenBench.

📊 Reward-based Image Scoring (UniGenBench)

We provide a script to score a folder of generated images on UniGenBench using supported reward models.

GPU_NUM=8 bash tools/eval_quality.sh

Edit tools/eval_quality.sh to set:

  • --image_dir: path to your UniGenBench generated images
  • --prompt_csv: prompt file (default: data/unigenbench_test_data.csv)
  • --reward_spec: the reward models (and weights) to use
  • --api_url: UnifiedReward server endpoint (if using UnifiedReward-based rewards)
  • --output_json: output file for scores

📧 Contact

If you have any comments or questions, please open an issue or contact Yibin Wang.

🤗 Acknowledgments

Our training code is based on DanceGRPO, Flow-GRPO, and FastVideo.

We also use UniGenBench for T2I model semantic consistency evaluation.

Thanks to all the contributors!

⭐ Citation

@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}
