Evaluation#
AReaL supports distributed inference using the same controller infrastructure as training. This allows you to leverage existing workflows and schedulers to scale evaluation across multiple GPUs and nodes.
Note: AReaL provides distributed inference for your trained model, not a complete evaluation pipeline with dataset retrieval and metrics computation. You can use third-party evaluation frameworks with AReaL checkpoints directly — no conversion required since AReaL saves HuggingFace-compatible checkpoints.
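For example, a saved checkpoint directory loads with the standard HuggingFace API. A minimal sketch (the path is a placeholder):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: any checkpoint directory saved by AReaL works here
ckpt_dir = "/path/to/checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)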
Quick Start#
Run evaluation on GSM8K:
python3 examples/math/gsm8k_eval.py \
    --config examples/math/gsm8k_grpo.yaml \
    scheduler.type=local \
    actor.path=/path/to/checkpoint
For distributed evaluation:
# With Ray (3 nodes, 12 GPUs)
python3 examples/math/gsm8k_eval.py \
    --config examples/math/gsm8k_grpo.yaml \
    scheduler.type=ray \
    allocation_mode=sglang:d12p1t1 \
    cluster.n_nodes=3

# With Slurm (12 nodes, 96 GPUs)
python3 examples/math/gsm8k_eval.py \
    --config examples/math/gsm8k_grpo.yaml \
    scheduler.type=slurm \
    allocation_mode=sglang:d96p1t1 \
    cluster.n_nodes=12
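In the allocation_mode string, d, p, and t set the data-, pipeline-, and tensor-parallel degrees of the inference servers; in both examples above their product (12 and 96) matches the total GPU count.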
Evaluation Metrics#
Select an appropriate dataset and metrics for your task, then integrate the evaluation logic as a workflow. See the Agentic RL guide for details.
Example with an agentic math evaluator (the evaluation code is independent of AReaL):
from agents import Agent, OpenAIProvider, RunConfig, SQLiteSession, function_tool
from agents import Runner as OpenAIRunner
from math_verify import parse, verify
from openai import AsyncOpenAI


@function_tool
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b


@function_tool
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b


def math_reward_fn(completions: str, answer: str) -> float:
    return float(verify(parse(completions), parse(answer)))


class MathAgent:
    async def run(self, data, **extra_kwargs):
        http_client = extra_kwargs.get("http_client")
        base_url = extra_kwargs.get("base_url")
        client = AsyncOpenAI(base_url=base_url, http_client=http_client, max_retries=0)
        run_config = RunConfig(
            model_provider=OpenAIProvider(openai_client=client),
            model="default",
            tracing_disabled=True,
        )
        agent = Agent(
            name="RLVR Math with Calculator",
            instructions="Answer math questions using the calculator tools.",
            tools=[add, multiply],
        )
        result = await OpenAIRunner.run(
            agent,
            input=data["messages"][-1]["content"],
            session=SQLiteSession("math"),
            run_config=run_config,
        )
        return math_reward_fn(result.final_output, data["answer"])
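The evaluation harness is expected to supply base_url and http_client, but the same class can be exercised standalone against any running OpenAI-compatible endpoint. A rough sketch, where the server address and sample data are hypothetical:

import asyncio

import httpx


async def main():
    # Hypothetical GSM8K-style sample
    data = {
        "messages": [{"role": "user", "content": "What is 3 + 4 * 5?"}],
        "answer": "23",
    }
    async with httpx.AsyncClient() as http_client:
        reward = await MathAgent().run(
            data,
            base_url="http://localhost:30000/v1",  # assumed server address
            http_client=http_client,
        )
    print(f"reward: {reward}")


asyncio.run(main())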
Architecture#
Evaluation uses a single-controller architecture without training workers:
Controller Process
│
└─> Inference Engine Controller (SGLang/vLLM)
      ├─> Scheduler creates inference workers
      ├─> Submits evaluation tasks with workflow
      └─> Collects results and computes metrics
The controller orchestrates evaluation from a CPU process while inference workers run on GPUs.
Implementation#
See examples/math/gsm8k_eval.py for a complete example. The key pattern:
from areal.api.alloc_mode import AllocationMode
from areal.api.cli_args import GRPOConfig, SGLangConfig, load_expr_config, vLLMConfig
from areal.engine.sglang_remote import RemoteSGLangEngine
from areal.engine.vllm_remote import RemotevLLMEngine
from areal.infra import LocalScheduler, RayScheduler, SlurmScheduler

# Load config and parse allocation mode
config, _ = load_expr_config(args, GRPOConfig)
allocation_mode = AllocationMode.from_str(config.allocation_mode)

# Initialize scheduler based on config
if config.scheduler.type == "local":
    scheduler = LocalScheduler(exp_config=config)
elif config.scheduler.type == "ray":
    scheduler = RayScheduler(exp_config=config)
elif config.scheduler.type == "slurm":
    scheduler = SlurmScheduler(exp_config=config)

# Select inference engine and build server args
if allocation_mode.gen_backend == "sglang":
    engine_cls = RemoteSGLangEngine
    server_args = SGLangConfig.build_args(
        sglang_config=config.sglang,
        tp_size=allocation_mode.gen.tp_size,
        base_gpu_id=0,
    )
elif allocation_mode.gen_backend == "vllm":
    engine_cls = RemotevLLMEngine
    server_args = vLLMConfig.build_args(
        vllm_config=config.vllm,
        tp_size=allocation_mode.gen.tp_size,
        pp_size=allocation_mode.gen.pp_size,
    )

# Create controller and initialize
eval_rollout = engine_cls.as_controller(config.rollout, scheduler)
eval_rollout.initialize(
    role="eval-rollout",
    alloc_mode=allocation_mode,
    server_args=server_args,
)

# Define workflow and its configuration
workflow = "areal.workflow.rlvr.RLVRWorkflow"
workflow_kwargs = dict(
    reward_fn="areal.reward.gsm8k.gsm8k_reward_fn",
    gconfig=config.gconfig,
    tokenizer=config.tokenizer_path,
    enable_thinking=False,
)

# Submit evaluation tasks
cnt = 0
for data in valid_dataloader:
    for item in data:
        eval_rollout.submit(
            item,
            workflow=workflow,
            workflow_kwargs=workflow_kwargs,
            group_size=config.gconfig.n_samples,
        )
        cnt += 1

# Wait for completion and collect results
eval_rollout.wait(cnt, timeout=None)
eval_stats = eval_rollout.export_stats()
This follows the same controller pattern as training but without training components.
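When the loop finishes, the inference workers can be shut down. Assuming the controller exposes the same destroy() method used in AReaL training scripts:

# Assumed teardown call, mirroring training scripts
eval_rollout.destroy()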
Configuration#
Evaluation reuses the same config structure as training. You can use an existing training config directly with the evaluation script.
experiment_name: gsm8k-eval
trial_name: eval0
seed: 1

allocation_mode: sglang:d4p1t1  # Inference-only allocation

scheduler:
  type: local  # or 'ray', 'slurm'

rollout:
  max_concurrent_rollouts: 256
  # max_head_offpolicyness is set to 1e12 internally for eval

gconfig:
  n_samples: 8
  temperature: 1.0
  max_new_tokens: 1024

actor:
  path: Qwen/Qwen2.5-1.5B-Instruct
  dtype: bfloat16
  scheduling_spec:
    - task_type: worker
      port_count: 2
      gpu: 1
      cmd: python3 -m areal.infra.rpc.rpc_server

valid_dataset:
  name: gsm8k
  split: test
  batch_size: 32
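To double-check an allocation string, you can parse it the same way the script does. A small sketch; the dp_size attribute is an assumption alongside the tp_size/pp_size fields used earlier:

from areal.api.alloc_mode import AllocationMode

mode = AllocationMode.from_str("sglang:d4p1t1")
print(mode.gen_backend)  # "sglang"
print(mode.gen.pp_size)  # 1: pipeline-parallel degree per server
print(mode.gen.tp_size)  # 1: tensor-parallel degree per server
print(mode.gen.dp_size)  # 4 server replicas (assumed attribute name)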
Logging Results#
Use tabulate_stats to format evaluation metrics:
from areal.utils.printing import tabulate_stats
eval_stats = eval_rollout.export_stats()
logger.info(f"Evaluation Results: {tabulate_stats(eval_stats)}")
Custom Workflows#
Reuse training workflows or create custom ones. See the Agentic RL tutorial and Customization: Rollout Workflows for complete guides.