[Paper] • [GitHub] • [Twitter] • [Rednote]
Figure 1: An overview of the DAC-style inference and reward assignments in training, illustrated with a case study.
What this repo does: it trains LLMs to think in a divide-and-conquer (DAC) way via an end-to-end RL pipeline.
Core idea: instead of only learning sequential chain-of-thought (CoT), the policy learns to:
- Divide a hard problem into structured subproblems, then
- Conquer: solve subproblems and finally solve the original problem conditioned on those solutions.
Why it matters: CoT is strictly sequential and can hit a ceiling on very hard problems. DAC offers stronger test-time scalability by enabling structured exploration through decomposition.
- [2026/02/02] DAC-RL paper and repo are released.
At a high level, DAC-RL alternates between two roles (not necessarily two separate models):
- Divide stage: generate subproblems with a strict tag format.
- Conquer stage: solve subproblems sequentially and then solve the original problem, outputting a final boxed answer.
The repo implements this as a single training loop that creates two rollout batches per iteration (divide and conquer) and assigns rewards to both.
```mermaid
flowchart TD
    Q["Original question"] --> D["Divide prompt: propose subproblems"]
    D --> S["Parse & validate subproblems"]
    S --> C["Conquer prompt: solve subproblems, then the original"]
    C --> R["Rule-based verifier: answer correctness"]
    R --> RC["Conquer reward"]
    RC --> RD["Divide reward (derived from conquer outcomes)"]
    RC --> PPO["PPO/GRPO-style policy update"]
    RD --> PPO
```
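To make the loop concrete, here is a minimal Python sketch of one iteration. All helper names (`generate`, `update`, `verify`, `divide_prompt`, `conquer_prompt`, `parse_subproblems`, `derive_divide_reward`) are placeholders for illustration, not the repo's APIs; the actual implementation is the `RayDACTrainer` class described in the training section below.

```python
# Minimal sketch of one DAC-RL iteration. All helper names here are
# placeholders for illustration; the actual training loop is the
# RayDACTrainer class in verl/trainer/ppo/ray_trainer.py.
from typing import Callable, List, Sequence

def dac_rl_iteration(
    generate: Callable[[str], str],                     # policy rollout: prompt -> completion
    update: Callable[[List[str], List[float]], None],   # PPO/GRPO-style policy update
    verify: Callable[[str, str], float],                # rule-based verifier: (completion, question) -> reward
    divide_prompt: Callable[[str], str],                # builds the divide-stage prompt
    conquer_prompt: Callable[[str, List[str]], str],    # builds the conquer-stage prompt
    parse_subproblems: Callable[[str], List[str]],      # extracts tagged subproblems
    derive_divide_reward: Callable[[List[str], float], float],  # divide reward from conquer outcome
    questions: Sequence[str],
) -> None:
    # Divide rollouts: propose structured subproblems for each question.
    divide_outs = [generate(divide_prompt(q)) for q in questions]
    subproblems = [parse_subproblems(o) for o in divide_outs]

    # Conquer rollouts: solve the subproblems, then the original question.
    conquer_outs = [generate(conquer_prompt(q, s)) for q, s in zip(questions, subproblems)]

    # Conquer rewards come from answer correctness; divide rewards are
    # derived from the conquer outcomes (see the flowchart above).
    conquer_rewards = [verify(o, q) for o, q in zip(conquer_outs, questions)]
    divide_rewards = [derive_divide_reward(s, r) for s, r in zip(subproblems, conquer_rewards)]

    # Both rollout batches contribute to the policy update in the same iteration.
    update(divide_outs, divide_rewards)
    update(conquer_outs, conquer_rewards)
```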
Experiments on Qwen3-4B-Instruct-2507
| Model | AIME 2024 | AIME 2025 | Beyond-AIME | HMMT 2025 | Average |
|---|---|---|---|---|---|
| Init-CoT (Pass@1/32) | 62.6 / 90.0 | 45.7 / 76.7 | 32.1 / 65.0 | 30.3 / 56.7 | 42.7 / 72.1 |
| Init-DAC (Pass@1/32) | 59.6 / 90.0 | 43.2 / 73.3 | 29.6 / 61.0 | 28.2 / 63.3 | 40.2 / 71.9 |
| RL-CoT (Pass@1/32) | 45.9 / 85.8 | 52.1 / 77.4 | 30.4 / 58.1 | 21.8 / 54.4 | 37.5 / 69.0 |
| RL-DAC (Pass@1/32) | 63.9 / 87.7 | 54.2 / 78.8 | 34.6 / 67.9 | 31.9 / 66.6 | 46.1 / 75.3 |
We recommend using Conda to manage your environment. We use vLLM (0.10.1.1) to accelerate inference. Run the following commands to set up your environment:

```bash
conda create -n svs python=3.10.16
conda activate svs
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu128  # CUDA 12.8 wheels, for example
pip install -r requirements.txt
```

The provided benchmark is `data/dac-rl-benchmarks.jsonl`.
Each line is a JSON object with the following keys:
- `question`: the problem statement
- `ref_answer`: the reference final answer
- `data_source`: benchmark name (e.g., `AIME-2024-30`, `AIME-2025-30`, ...)
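To sanity-check the file, here is a minimal loader that uses only the keys listed above:

```python
# Inspect data/dac-rl-benchmarks.jsonl: one JSON object per line with the
# keys question, ref_answer, and data_source.
import json
from collections import Counter

with open("data/dac-rl-benchmarks.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print("total examples:", len(examples))
print("per benchmark:", Counter(ex["data_source"] for ex in examples))
print("sample question:", examples[0]["question"][:200], "...")
print("reference answer:", examples[0]["ref_answer"])
```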
The training script expects a parquet like data/DAPO-Math-17k.parquet. One simple way to download & save as parquet:
```python
from datasets import load_dataset

# Download the DAPO-Math-17k dataset and save it as a parquet file.
ds = load_dataset('BytedTsinghua-SIA/DAPO-Math-17k')
# Adjust split names if needed.
train = ds['train'].to_pandas()
train.to_parquet('data/DAPO-Math-17k.parquet', index=False)
print('saved:', 'data/DAPO-Math-17k.parquet', 'rows:', len(train))
```

We provide evaluation scripts for both CoT and DAC inference. To use them, configure `model_name_or_path` (default: `Qwen/Qwen3-4B-Instruct-2507`) and `data_path` (by default, AIME 24, AIME 25, Beyond-AIME, and HMMT-25 are used for evaluation, as described in the paper) in `scripts/eval_cot.sh` and `scripts/eval_dac.sh`, and then run the following commands:
```bash
bash scripts/eval_cot.sh  # Evaluate model performance using chain-of-thought prompting
bash scripts/eval_dac.sh  # Evaluate model performance using divide-and-conquer style reasoning
```

DAC evaluation performs two-stage generation (see the sketch after this list):
- divide prompt → generate subproblems
- conquer prompt → solve the subproblems and answer the original problem
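For reference, here is a minimal Python sketch of that two-stage loop using vLLM. The `<subproblem>` tag and the prompt wording are illustrative placeholders rather than the repo's actual templates (those live in the evaluation scripts), and a real run would also apply the model's chat template:

```python
# Two-stage DAC inference sketch with vLLM. Tag names and prompt wording are
# placeholders; see scripts/eval_dac.sh for the actual evaluation pipeline.
import re
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-4B-Instruct-2507")
params = SamplingParams(temperature=0.7, max_tokens=8192)

question = "Find the number of positive integers n < 1000 such that ..."

# Stage 1 (divide): ask the model to propose tagged subproblems.
divide_prompt = (
    "Break the following problem into subproblems, one per <subproblem> tag.\n\n"
    + question
)
divide_out = llm.generate([divide_prompt], params)[0].outputs[0].text
subproblems = re.findall(r"<subproblem>(.*?)</subproblem>", divide_out, flags=re.DOTALL)

# Stage 2 (conquer): solve the subproblems in order, then the original
# problem, reporting the final answer in \boxed{...}.
numbered = "\n".join(f"{i + 1}. {s.strip()}" for i, s in enumerate(subproblems))
conquer_prompt = (
    "Solve these subproblems in order, then solve the original problem.\n"
    f"Subproblems:\n{numbered}\n\n"
    f"Original problem: {question}\n"
    "Put the final answer in \\boxed{}."
)
conquer_out = llm.generate([conquer_prompt], params)[0].outputs[0].text
match = re.search(r"\\boxed\{([^}]*)\}", conquer_out)
print(match.group(1) if match else conquer_out[-200:])
```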
We also provide our complete training scripts; the core implementation is the `RayDACTrainer` class in `verl/trainer/ppo/ray_trainer.py`. The training data used in our paper is provided in `data`.
For example, to train the Qwen3-4B-Instruct-2507 model, run the following command:
```bash
bash scripts/run_dac_training.sh
```

Which file implements DAC training?
- `verl/trainer/ppo/ray_trainer.py` → `RayDACTrainer`.
DAC-specific reward controls: these parameters directly encode the paper's training-time alignment with DAC inference.
| Config key | Default (in config) | Used in | What it does | Practical notes |
|---|---|---|---|---|
| `data.divide_reward_setting` | `"format"` | training | How to reward the divide stage. Options implemented: `format`, `any_accuracy`, `average_accuracy` | `format` rewards producing valid subproblem structure; `any_accuracy` is stricter/competition-aware; `average_accuracy` gives a graded signal but can be noisier. |
| `data.conquer_reward_setting` | `"answer"` | training | How to reward the conquer stage. Options: `answer`, `answer_and_format` | `answer_and_format` additionally enforces that the response follows subproblem order/structure, which stabilizes DAC behavior but may reduce reward early. |
| `data.stop_divide` | `false` | training | If true, only train on conquer samples (ignore divide samples) | Useful for ablations; for full DAC-RL keep `false`. |
| `data.max_prompt_length` | `4096` | training | Max tokens for prompts (after chat template) | Must accommodate divide/conquer prompts; overlong samples are filtered. |
| `data.max_response_length` | `8192` | training | Max generation tokens during rollouts | DAC often benefits from longer outputs (especially the conquer stage). |
If you find this repository helpful, please consider citing our paper:
```bibtex
@misc{liang2026trainingllmsdivideandconquerreasoning,
title={Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability},
author={Xiao Liang and Zhong-Zhi Li and Zhenghao Lin and Eric Hancheng Jiang and Hengyuan Zhang and Yelong Shen and Kai-Wei Chang and Ying Nian Wu and Yeyun Gong and Weizhu Chen},
year={2026},
eprint={2602.02477},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.02477},
}
```

This project is released under the Apache License 2.0. See LICENSE.