
Inside the Agentic RL Training Loop


The algorithm is well-understood — GRPO with binary reward. The real complexity lives in everything else: how a multi-turn trajectory becomes a training sample, how loss masks determine which tokens receive gradients, how Docker containers are orchestrated at scale, and why training a single agent works for multi-agent deployment.

Over the past year, I’ve been digging into native agentic RL — the kind of RL that bakes agentic capabilities (tool use, multi-step planning, error recovery) directly into model weights, rather than relying on external scaffolding and prompt engineering. The recent release of GLM-5 — one of the strongest open-source models for agentic tasks (77.8% on SWE-Bench Verified), trained with Zhipu AI’s Slime framework — is a good occasion to write up what I’ve learned. This post documents the full pipeline, step by step: from what happens inside a single rollout to how hundreds of rollouts are orchestrated in parallel, from how a 25-turn trajectory is flattened into a single training sample to how GRPO extracts learning signal from groups of trajectories.

The reference framework throughout is Slime (Zhipu AI / THUDM), with SWE-Bench as the training benchmark. The ideas are framework-agnostic — the same pipeline structure appears in DeepSWE, SWE-RL, Agent-R1, and every other native agentic RL system I’ve studied — but Slime is a natural starting point given the GLM5 results.


1. What Is Native Agentic RL

The Two Ways to Build an Agent

Before 2024, building an LLM-based coding agent meant taking a capable base model and wrapping it in external scaffolding — ReAct prompting, orchestration frameworks like LangChain or AutoGPT, hand-crafted system prompts that spell out tool-use conventions, and multi-step pipelines that manage the agent loop. The model itself didn’t know how to be an agent; the framework did the thinking about when to search, when to edit, when to test.

Native Agentic RL flips this. Instead of engineering agentic behavior from the outside, we train it directly into the model’s weights. Through RL with environment interaction, the model learns to use tools, plan multi-step solutions, recover from errors, and manage context — natively, without relying on external orchestration. The “native” qualifier is about where the agentic capability lives: in the weights, not in the scaffold.

| Approach | Agentic Capability Lives In | Pros | Cons |
| --- | --- | --- | --- |
| Scaffold-based | External code (prompts, orchestration) | Quick to prototype, model-agnostic | Brittle, limited by prompt engineering ceiling, can’t improve through experience |
| Native Agentic RL | Model weights (trained via RL) | Robust, learns from experience, discovers novel strategies | Expensive to train, requires environment infrastructure |

How It’s Trained: RLVR in Agentic Environments

The training methodology is RL with Verifiable Rewards (RLVR) — the same paradigm used for math reasoning, but applied to agent-environment interaction. The core idea: drop the model into a real task environment — a code repository with a bug, a terminal, a test suite — let it try to solve the task through multi-turn interaction, observe whether it succeeded, and use that outcome as the reward signal. No expert trajectories to imitate. No human labelers to score outputs. Just: did the tests pass?

| Training Paradigm | Training Signal | What the Model Learns |
| --- | --- | --- |
| SFT | Expert demonstrations | “How the expert solves this” |
| RLHF | Human preference labels | “What humans prefer to see” |
| RLVR (math/reasoning) | Verifiable answer correctness | “How to reason correctly” |
| Native Agentic RL | Task outcome in real environment | “How to use tools and complete tasks” |

Native Agentic RL combines two ideas: the RLVR training methodology (grounded reward from verifiable task outcomes) and the goal of producing natively agentic models (models that inherently know how to interact with environments). The reward on SWE-Bench is literally 1.0 if the agent’s patch passes all tests, 0.0 otherwise. No proxy, no learned reward model to be hacked. The signal is grounded in reality.

Why does this matter? Because SFT has a ceiling effect — the model can never exceed the quality of its demonstrations. Scaffold-based agents are limited by the quality of their prompts and orchestration. And RLHF’s reward model can be gamed. Native Agentic RL, in principle, has none of these limitations. The model discovers strategies through exploration, the reward is incorruptible (assuming your test suite is sound), and the resulting capabilities are baked into the weights — no scaffold required at inference time.

The Landscape

As of 2025, several frameworks have emerged for native agentic RL:

| Framework | Organization | Key Design Choice | Model Size | SWE-Bench Verified |
| --- | --- | --- | --- | --- |
| DeepSWE | Agentica | Simplified pipeline + RepoGraph | 32B (Qwen3) | ~50.8% |
| SWE-RL | Meta FAIR | First RL-for-SWE paper, clean baseline | 70B (Llama3) | ~41% |
| Agent-R1 | USTC | R1-style reasoning for agents | 7B+ | |
| VerlTool | Tiger AI Lab (on VERL) | Multi-tool RL integration | 7B+ | |

A few things stand out. GRPO dominates — almost every framework uses it, because eliminating the value model simplifies infrastructure dramatically. Open-weight RL-trained agents have closed the gap with proprietary models — DeepSWE’s 32B model (~50.8%) matches Claude 3.5 Sonnet (~49%), a striking validation of the approach.

But the most impressive results come from production-scale systems training much larger models:

| System | Organization | Model | SWE-Bench Verified | Key Innovation |
| --- | --- | --- | --- | --- |
| Slime | Zhipu AI | GLM-5 (744B / 40B active MoE) | 77.8% | Async training (integrates APRIL), 3-layer architecture |
| MiniMax Forge | MiniMax | M2.5 (230B / ~10B active MoE) | 80.2% | CISPO algorithm, tree-structured merging, process reward |
| VERL | ByteDance | General-purpose (tested at 671B) | | 3D-HybridEngine, AgentLoopBase + VerlTool for agentic RL |
| Poolside Model Factory | Poolside | Proprietary | | 10K H200 GPUs, Saucer (800K+ repos), end-to-end automation |

MiniMax Forge is particularly notable for its black-box agent integration — the framework intercepts the agent’s API calls without requiring any modification to the agent code itself — and its CISPO (Clipped Importance Sampling Policy Optimization) algorithm that clips importance weights while preserving gradients. Its tree-structured merging yields ~40x training speedup over naive sequential rollouts. Poolside’s system is the most complete end-to-end infrastructure publicly documented, with GPU-Direct NCCL transferring Llama 405B weights (810 GB) in ~2 seconds. For a deeper comparison of these infrastructure frameworks, see the companion post on RL infrastructure.

Worth noting: Zhipu AI (THUDM) maintains two complementary frameworks — Slime and AgentRL. Slime is the general-purpose RL training infrastructure (SGLang + Megatron) used to train the full GLM model family (GLM-4.5 through GLM-5), handling both reasoning and agentic capabilities. AgentRL is a higher-level framework specifically for multi-turn, multi-task agent training — it tackles challenges like heterogeneous environment orchestration, cross-policy sampling, and task advantage normalization, and was adopted to build AutoGLM. Where Slime answers “how to scale RL training efficiently across thousands of GPUs,” AgentRL answers “how to train agents that interact with diverse real-world environments (web, OS, APIs) across multi-turn episodes.” AgentRL’s benchmarks (AgentBench) are broader than SWE-bench, covering web browsing, OS operations, and other interactive tasks beyond coding.

What Makes Agentic RL Structurally Different from Math RLVR

If you’ve worked with RLVR for math or reasoning, agentic RL might look superficially similar — same algorithms, same training loop pattern. But the engineering challenges are qualitatively different:

| Dimension | Math / Reasoning RLVR | Agentic Coding RL |
| --- | --- | --- |
| Environment | Stateless (verify the answer) | Stateful (Docker container with repo, deps, tests) |
| Episode length | Single generation, < 4K tokens | Multi-turn, 2K-128K tokens |
| Reward latency | Milliseconds | Minutes (run test suite) |
| Action space | Natural language | Tool calls (file_edit, terminal, search) |
| Parallelism | Trivial | Complex (container orchestration) |
| Credit assignment | Direct (output → reward) | Ambiguous (which of 15 turns caused success?) |

The stateful environment requirement is the defining challenge. For math RLVR, you can generate 1024 completions in a batched inference call and verify them in microseconds. For coding RL, each completion requires launching a Docker container, checking out a repository, running a multi-turn interaction loop, and executing a test suite. Wall-clock time per trajectory: 3-10 minutes, not 3-10 seconds.


2. The Framework: Slime’s Three-Layer Architecture

Most RL training frameworks were designed for single-turn RLVR: generate one completion, score it, update the policy. Agentic RL breaks this model — a single trajectory might take 5 minutes of wall-clock time across 10-25 interaction turns. If the training engine waits for all trajectories to complete, GPUs sit idle for minutes between gradient steps. (For a broader survey of how different organizations — Poolside, MiniMax, ByteDance, and others — approach RL training infrastructure, see my companion post on RL Infrastructure for Large-Scale Agentic Training.)

Slime’s answer is a three-layer architecture that separates concerns: Training, DataBuffer, and Rollout.

[Figure: Slime three-layer architecture]

Training Layer manages the policy update — receive processed batches, compute GRPO loss, run backward pass, push updated weights. This is the simplest layer. It supports GRPO, PPO, RLOO, and GRPO++ (with length normalization to combat verbose trajectories).

DataBuffer Layer is the transformation pipeline: receive raw trajectories from workers, compute rewards, group by prompt for GRPO, tokenize multi-turn conversations into flat sequences, compute loss masks and advantages, filter degenerate trajectories, and assemble training batches. In async mode (APRIL), it continuously accumulates data and provides batches to Training as they become ready.

Rollout Layer is where the model interacts with the world. 32-128+ parallel workers, each managing a Docker container, running the agent loop (generate → execute → observe), collecting trajectories. This is the most complex layer and the one that consumes ~80% of wall-clock time.

Synchronous vs. Asynchronous (APRIL)

Slime integrates APRIL (Active Partial Rollouts in Reinforcement Learning), a system-level optimization from AMD/CMU/LMSYS/UCLA that addresses the long-tail generation bottleneck. The choice between sync and async is the most impactful architectural decision:

Synchronous: all workers complete → DataBuffer processes → Training updates → repeat. Clean and strictly on-policy, but GPU utilization is terrible — typically 5-10%, because training GPUs idle during the multi-minute rollout phase.

Asynchronous (APRIL): Rollout, DataBuffer, and Training all run concurrently. Workers continuously generate trajectories. Training continuously consumes batches. Result: ~2x throughput improvement.

The tradeoff is staleness — some trajectories in a batch were generated by an older policy version. APRIL mitigates this through importance sampling correction, staleness budgets (discard trajectories more than K steps old), and policy version tracking for precise ratio computation. In practice, the staleness is manageable because the KL penalty in GRPO keeps consecutive policy versions close.
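
As a rough sketch of the staleness budget (the metadata key and threshold here are assumptions, not Slime’s actual API):

def filter_stale(trajectories, current_version, max_staleness=2):
    """Drop trajectories generated more than `max_staleness` policy versions ago.
    Assumes each trajectory's metadata records the policy version that generated it."""
    return [t for t in trajectories
            if current_version - t.metadata["policy_version"] <= max_staleness]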

Eight Extension Points

Slime provides 8 customizable function hooks covering the full pipeline — from generation format (custom_generation_fn) to reward computation (custom_reward_fn) to trajectory filtering (custom_filter_fn). This makes it possible to adapt the framework to new tasks without modifying core code:

config = SlimeConfig(
    model=ModelConfig(model_name_or_path="Qwen/Qwen2.5-Coder-7B-Instruct"),
    training=TrainingConfig(algorithm="grpo", group_size=8, kl_coeff=0.01),
    rollout=RolloutConfig(num_workers=32, max_turns=25, timeout_seconds=300),
    reward=RewardConfig(reward_type="binary"),
    async_mode=True,  # Enable APRIL
)

trainer = SlimeTrainer(config)
trainer.register_reward_fn(custom_reward_fn)
trainer.register_env_fn(custom_env_fn)
trainer.register_filter_fn(custom_filter_fn)
trainer.train()

3. The End-to-End Training Pipeline

Let’s walk through a complete training iteration, from task sampling through policy update.

Phase 0: Data Preparation

Before training begins, we need a task pool and execution environments.

SWE-Bench provides GitHub issues with ground-truth test patches. After filtering out repositories that take >10 minutes to build, issues requiring external services, and tasks with flaky tests, a typical training pool contains 300-500 tasks.

The critical preparation step is pre-building Docker images — for each (repository, commit) pair, a Docker image with the full repo, all dependencies, and a configured test framework:

| Repository @ Commit | Docker Image | Size |
| --- | --- | --- |
| django/django@abc123 | swebench-django-abc123:latest | ~2-4 GB |
| flask/flask@def456 | swebench-flask-def456:latest | ~1-2 GB |

Pre-building matters enormously: installing dependencies from scratch takes 5-30 minutes per repository. With pre-built images, container startup drops to 2-3 seconds. The initial image-building phase takes 2-4 hours with parallel builds, but only happens once.
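
A minimal sketch of that pre-build step, assuming one build context per (repository, commit) pair and the image-naming scheme from the table above; the helper itself is hypothetical:

import subprocess

def prebuild_image(repo: str, commit: str, build_context_dir: str) -> str:
    """Build and tag the Docker image for one (repo, commit) pair (illustrative)."""
    tag = f"swebench-{repo.split('/')[-1]}-{commit[:7]}:latest"
    subprocess.run(["docker", "build", "-t", tag, build_context_dir], check=True)
    return tag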

Phase 1: Trajectory Generation (Rollout)

This is the bottleneck — ~80% of wall-clock time.

Sample a batch: With GRPO group_size G=8 and batch_size=64, we sample 8 unique tasks and generate 8 independent rollouts each.

Launch containers: Each of the 64 task-rollout pairs gets its own Docker container, checked out to the correct commit. With a pre-warmed pool: 2-3 seconds. Without: 30-60 seconds.

Run the agent loop (up to 25 turns per trajectory):

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": task.issue_text},
]

max_turns = 25
for turn in range(max_turns):
    response = model.generate(messages, temperature=0.7)
    action = parse_action(response)
    observation = container.execute(action)

    messages.append({"role": "assistant", "content": response})
    messages.append({"role": "tool", "content": observation})

    if action.type == "submit":
        break

A concrete example — fixing a Django bug where QuerySet.only() crashes with select_related():

| Turn | Agent (Thinking + Action) | Environment |
| --- | --- | --- |
| 1 | <think>Need to find relevant code</think> <action>search: QuerySet.only</action> | query.py:1205 |
| 2 | <think>Read the implementation</think> <action>file_read: query.py:1200-1250</action> | def only(self, *fields): ... |
| 3 | <think>Found it — deferred_loading ignores related fields</think> <action>file_edit: compiler.py ...</action> | File edited successfully |
| 4 | <think>Verify the fix</think> <action>terminal: python -m pytest tests/</action> | 5 passed, 0 failed |
| 5 | <action>submit</action> | |

Compute reward: Apply the ground-truth test patch to the container, run pytest, check results. Binary: 1.0 if all pass, 0.0 otherwise.
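
Schematically, the reward check looks something like the sketch below. The env.exec helper and its (exit_code, output) return value are assumptions; the real harness applies the patch and parses pytest output inside the container.

def compute_binary_reward(env, test_patch_path="/tmp/test_patch.diff"):
    """Apply the ground-truth test patch and run the suite (illustrative)."""
    env.exec(f"git apply {test_patch_path}")        # patch is copied into the container beforehand
    exit_code, _output = env.exec("python -m pytest -x -q")
    return 1.0 if exit_code == 0 else 0.0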

Package and submit: The trajectory (all turns + reward + metadata) is sent to the DataBuffer. With 32 workers in parallel, the full batch of 64 trajectories completes in roughly (64 / 32) × 5min ≈ 10 minutes.

Phase 2: Data Processing (DataBuffer)

Trajectories go through a transformation pipeline to become training-ready batches:

Group by prompt: GRPO needs all G=8 trajectories for the same task grouped together.

Tokenize: Each multi-turn trajectory is flattened into a single token sequence — system prompt, user prompt, and all (assistant, tool) turn pairs concatenated.

Compute loss masks: Binary mask over the token sequence — 1 for model-generated tokens, 0 for everything else. More on this in the next section.

Compute advantages: For each prompt group, GRPO normalizes rewards:

rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # G=8
mean_r, std_r = 0.375, 0.484
advantages = [(r - mean_r) / std_r for r in rewards]
# [+1.29, -0.77, +1.29, -0.77, -0.77, +1.29, -0.77, -0.77]

Successful trajectories get positive advantages (reinforced). Failed ones get negative advantages (suppressed).

Filter degenerates: Discard trajectories where all G rewards are identical (no signal), loops (>20 turns, reward=0), or extreme lengths (>50K tokens).

Assemble batch: Package input_ids, loss_mask, advantages, old_log_probs, ref_log_probs into a training batch.

Phase 3: Policy Update (Training)

The actual gradient computation is straightforward:

# Forward pass
logits = model(batch["input_ids"])
log_probs = compute_log_probs(logits, batch["input_ids"])
advantages = batch["advantages"]
loss_mask = batch["loss_mask"]

# GRPO loss
ratio = exp(log_probs - batch["old_log_probs"])
clipped_ratio = clip(ratio, 1 - 0.2, 1 + 0.2)
policy_loss = -min(ratio * advantages, clipped_ratio * advantages)

# KL penalty
kl_penalty = 0.01 * (log_probs - batch["ref_log_probs"])

# Apply mask and compute total loss
loss = ((policy_loss + kl_penalty) * loss_mask).sum() / loss_mask.sum()

loss.backward()
clip_grad_norm(model.parameters(), max_norm=1.0)
optimizer.step()

A single training step takes ~20-40 seconds. Compare this to the 10-minute rollout phase — the 20:1 ratio is why generation efficiency is the primary optimization target.

Timeline

[Figure: Synchronous vs. asynchronous training timeline]

A full training run (3 epochs × 500 tasks, with 8 unique tasks per 64-trajectory batch) comes to ~188 iterations. Synchronous: ~34 hours. With APRIL: ~17 hours.


4. How Trajectories Become Training Data

This is the conceptually trickiest part of agentic RL. In single-turn RLVR, training data is simple: one prompt, one completion, one reward. The model produced one block of text, compute the policy gradient over it.

Agentic trajectories are messier. The model produces multiple blocks of text (one per turn), interleaved with environment responses it didn’t generate. The reward arrives only at the very end. How does the gradient actually flow?

One Trajectory = One Training Sample

A trajectory is not split into separate training samples. The entire multi-turn sequence is concatenated into a single token sequence:

[Figure: Token sequence with loss mask]

The model processes the full sequence in one forward pass. Loss masking determines which tokens receive gradients: only model-generated tokens (assistant turns) are trained on. Environment-generated tokens (system, user, tool) are in the sequence for context — the model sees them during the forward pass, which affects hidden states and therefore gradients on the assistant tokens — but they don’t directly contribute to the loss.

Loss Masking: The Core Mechanism

| Token Type | In Loss? | Reason |
| --- | --- | --- |
| System prompt | No (mask) | Not model-generated — environment setup |
| User prompt | No (mask) | Not model-generated — task description |
| Assistant response (each turn) | Yes (train) | Model-generated — this is the policy |
| Tool observation (each turn) | No (mask) | Not model-generated — environment feedback |

All assistant turns are trained simultaneously. The masking is not progressive or per-turn by default. Turn 1’s assistant tokens, turn 5’s assistant tokens, and turn 15’s assistant tokens all receive the same advantage signal and are all updated in the same backward pass.

This is analogous to teacher forcing in seq2seq models: the model sees real observations (not its own predictions about what the environment would return) and learns what to do given those observations.

The Trajectory Data Structure

Before looking at examples, let’s see how a trajectory is represented in code. The data structures make the token-sequence-with-masking concept concrete:

from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class Turn:
    """One turn in the agent interaction loop."""
    turn_index: int
    assistant_message: str      # Model output: thinking + action
    observation_message: str    # Environment response
    tokens_generated: int       # Tokens the model produced

@dataclass
class Trajectory:
    """A complete agent trajectory for one task."""
    task_id: str
    system_prompt: str
    user_prompt: str
    turns: List[Turn]
    reward: float
    metadata: Dict[str, Any]

    def to_token_sequence(self):
        """Convert to flat sequence with trainability labels."""
        segments = []

        # System prompt — NOT trainable
        segments.append({"role": "system", "content": self.system_prompt,
                         "trainable": False})
        # User prompt — NOT trainable
        segments.append({"role": "user", "content": self.user_prompt,
                         "trainable": False})

        # Each turn: assistant is trainable, tool is not
        for turn in self.turns:
            segments.append({"role": "assistant", "content": turn.assistant_message,
                             "trainable": True})    # <-- gradients flow here
            segments.append({"role": "tool", "content": turn.observation_message,
                             "trainable": False})   # <-- no gradients
        return segments

The to_token_sequence() method is the critical transformation: it converts a structured multi-turn conversation into a flat list of segments, each labeled as trainable or not. The tokenizer then converts this into input_ids and loss_mask tensors aligned token-for-token.
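
A rough sketch of that last step, assuming a Hugging Face-style tokenizer and plain role headers instead of the model's real chat template:

def build_training_tensors(trajectory, tokenizer):
    """Tokenize segments into input_ids plus a token-aligned loss_mask (illustrative)."""
    input_ids, loss_mask = [], []
    for seg in trajectory.to_token_sequence():
        # A production pipeline would use the model's chat template; plain headers here.
        text = f"<|{seg['role']}|>\n{seg['content']}\n"
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        loss_mask.extend([1 if seg["trainable"] else 0] * len(ids))
    return input_ids, loss_mask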

A Worked Example

3-turn trajectory for fixing an IndexError:

| Turn | Role | Content |
| --- | --- | --- |
| 1 (Search) | Assistant | Think: Need to find the file. Action: search: IndexError utils.py |
| | Tool | utils.py:42: items[len(items)] # IndexError |
| 2 (Fix) | Assistant | Think: Off-by-one. Should be len-1. Action: file_edit: utils.py ... |
| | Tool | File edited successfully. |
| 3 (Verify) | Assistant | Think: Run tests to verify. Action: terminal: pytest tests/ |
| | Tool | 5 passed, 0 failed |
| | Reward | 1.0 |

The loss mask:

| Position | [0..50] | [51..80] | [81..130] | [131..200] | [201..240] | [241..270] | [271..310] | [311..330] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Content | SYS | USER | ASST_T1 | TOOL_T1 | ASST_T2 | TOOL_T2 | ASST_T3 | TOOL_T3 |
| Mask | 000… | 000… | 111… | 000… | 111… | 000… | 111… | 000… |

With GRPO (G=4 for this task):

| Trajectory | Outcome | Reward | Advantage |
| --- | --- | --- | --- |
| Traj 1 (above) | Tests pass | 1.0 | +1.0 |
| Traj 2 | Edited wrong file | 0.0 | -1.0 |
| Traj 3 | Tests pass | 1.0 | +1.0 |
| Traj 4 | Timeout | 0.0 | -1.0 |

Training effect: the model increases the probability of search → fix → verify patterns (Traj 1 & 3) and decreases the probability of misidentification (Traj 2) and looping (Traj 4).

The Credit Assignment Problem

The fundamental challenge: in a 15-turn trajectory that succeeds, which turns caused the success? Was it the search in turn 3? The fix in turn 7? The debugging cycle in turns 10-13?

GRPO’s approach is deliberately coarse: assign the same advantage to every assistant turn. If the trajectory succeeded, every turn gets a positive advantage. If it failed, every turn gets a negative advantage.

This seems wasteful. But it works for two reasons:

  1. Signal averages out across iterations. Good strategies (search first, then fix, then test) appear in successful trajectories more often than in failed ones, so they get reinforced — just less strongly than the key actions.

  2. The alternative is expensive. Per-turn credit assignment requires either a trained critic model (PPO/GAE, needing ~84 GB extra memory for a 7B model) or Shapley value computation (exponentially expensive). GRPO’s uniform assignment trades precision for simplicity.

For the research frontier: iStar estimates per-turn contributions using hindsight analysis. Progressive masking (training on only the last turn initially, then expanding) provides a curriculum-based middle ground. But the standard full-assistant-mask approach is what ships in production.

Five Masking Strategies

While the standard approach (full assistant mask) dominates, it’s worth knowing what alternatives exist — they may become important as the field matures:

| Strategy | What’s Trained | Use Case |
| --- | --- | --- |
| Full Assistant Mask | All assistant tokens across all turns | Default. Simple, stable. Used by Slime, SWE-RL, DeepSWE. |
| Action-Only Mask | Only <action>...</action> tokens, not <think> | Focuses gradient on decisions. Risk: model doesn’t learn to reason. |
| Progressive Mask | Epoch 1: last turn only. Epoch 2: last 3 turns. Epoch 3: all. | Curriculum learning for long trajectories (50+ turns). |
| Outcome-Weighted | All assistant tokens, but with continuous weights (0.1–2.0) based on reward | Similar to GRPO advantage, redundant in practice. |
| Turn-Selective | Only turns scoring above a quality threshold | Requires per-turn scoring (critic or heuristic). Finest granularity. |

In code, the standard full assistant mask is trivial:

def full_assistant_mask(tokens):
    """1 for assistant tokens, 0 for everything else."""
    return [1 if t.role == "assistant" else 0 for t in tokens]

The action-only mask is more selective:

def action_only_mask(tokens):
    """1 only for tokens inside <action> tags."""
    return [1 if t.role == "assistant" and t.is_action else 0 for t in tokens]

The progressive mask implements curriculum learning:

def progressive_mask(tokens, current_epoch, total_epochs, max_turns):
    """Gradually unmask more turns as training progresses."""
    progress = current_epoch / max(total_epochs - 1, 1)
    turns_to_unmask = max(1, int(progress * max_turns) + 1)
    max_turn = max(t.turn_index for t in tokens if t.turn_index >= 0)
    threshold = max_turn - turns_to_unmask + 1

    return [1 if t.role == "assistant" and t.turn_index >= threshold else 0
            for t in tokens]

The All-Same Problem

When all G trajectories for a task get the same reward (all succeed or all fail), advantages are zero — no gradient signal. Solutions:

  • DAPO’s dynamic sampling: filter out zero-variance prompt groups entirely (sketched after this list)
  • Curriculum design: maintain a mix of task difficulties
  • Larger G: with G=16, the probability of all-same outcomes drops significantly
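
A minimal sketch of the first option, assuming trajectories arrive already grouped by prompt:

def drop_zero_variance_groups(groups):
    """DAPO-style dynamic sampling (illustrative): keep only prompt groups whose
    rewards differ, so every surviving group produces nonzero advantages."""
    return [group for group in groups
            if len({t.reward for t in group}) > 1]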

5. Environment Orchestration — The Hidden Bottleneck

If you’ve worked with math RLVR, environment orchestration isn’t something you’ve had to think about. The “environment” is a function that checks if 2 + 3 = 5. Stateless, instant, trivially parallelizable.

Agentic coding RL is different. The environment is a full software development setup — a cloned repository, installed dependencies, a working test suite, tool endpoints. Every trajectory requires its own isolated environment, and each takes seconds to minutes to set up. At scale, environment management becomes the primary engineering challenge.

The Docker Architecture

Every task-rollout pair gets its own container. This is non-negotiable:

[Figure: Per-rollout Docker container]

Isolation: Agent A’s file_edit must not affect Agent B’s environment. With G=8, eight containers run the same task concurrently, each accumulating different file edits.

Reproducibility: GRPO requires all G rollouts for a task to start from identical states.

Cleanup: After trajectory completion, the container is destroyed. No state leaks.

The Container Lifecycle

[Figure: Container lifecycle]

Container reuse optimization: If consecutive tasks come from the same repo, the worker resets the container (git checkout + git clean) instead of destroying and recreating it — saving 30-60 seconds per lifecycle.
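
A sketch of that reset, assuming an env.exec helper that runs shell commands inside the container:

def reset_container(env, base_commit):
    """Restore the repo to a pristine state instead of destroying the container."""
    env.exec(f"git checkout -f {base_commit}")  # restore tracked files to the base commit
    env.exec("git clean -fd")                   # remove untracked files left by the agent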

The Rollout Worker in Code

Here’s the skeleton of what each worker does — the core loop that runs for each (task, rollout) pair:

import time

class RolloutWorker:
    def __init__(self, worker_id, max_turns=25, timeout_seconds=300):
        self.worker_id = worker_id
        self.max_turns = max_turns
        self.timeout = timeout_seconds

    def collect_trajectory(self, task):
        """Main entry point: collect one trajectory."""
        start_time = time.time()

        # Step 1: Set up environment
        env = DockerEnvironment(task["repo"], task["base_commit"])
        env.start()  # 2-3s with warm pool, 30-60s without

        # Step 2: Prepare conversation
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": task["issue_description"]},
        ]

        # Step 3: Agent interaction loop
        turns = []
        try:
            for turn_idx in range(self.max_turns):
                if time.time() - start_time > self.timeout:
                    break  # Hard timeout

                # Model generates thinking + action
                response = model_server.generate(messages, temperature=0.7)
                action = parse_action(response)

                # Execute in container
                observation = env.execute_tool(action.tool, action.args)

                turns.append(Turn(
                    turn_index=turn_idx,
                    assistant_message=response,
                    observation_message=observation,
                    tokens_generated=len(tokenizer.encode(response)),
                ))

                messages.append({"role": "assistant", "content": response})
                messages.append({"role": "tool", "content": observation})

                if action.type == "submit":
                    break

            # Step 4: Run tests and compute reward
            test_results = env.run_tests(task["test_patch"])
            reward = 1.0 if test_results["all_passed"] else 0.0
        finally:
            env.stop()  # Always clean up

        # Step 5: Package trajectory
        return Trajectory(
            task_id=task["task_id"],
            system_prompt=SYSTEM_PROMPT,
            user_prompt=task["issue_description"],
            turns=turns,
            reward=reward,
            metadata={"worker_id": self.worker_id,
                      "elapsed_s": time.time() - start_time,
                      "num_turns": len(turns)},
        )

For GRPO with G=8, the RolloutManager calls collect_trajectory() 8 times for each task, distributing across the worker pool. With 32 workers, 8 trajectories for one task complete in parallel in 3-8 minutes.
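
A minimal sketch of that fan-out, with a thread pool standing in for the real RolloutManager scheduling:

from concurrent.futures import ThreadPoolExecutor

def collect_group(task, workers, group_size=8):
    """Run G independent rollouts of the same task and return the GRPO group."""
    with ThreadPoolExecutor(max_workers=group_size) as pool:
        futures = [pool.submit(workers[i % len(workers)].collect_trajectory, task)
                   for i in range(group_size)]
        return [f.result() for f in futures]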

Production Infrastructure

[Figure: Production rollout infrastructure]

Rollout Manager: Maintains the task queue, assigns tasks to workers, handles failures and retries.

Worker Pool: 32-128+ workers, each handling one trajectory at a time. CPU-bound (container management), not GPU-bound.

Container Pool: Pre-warmed containers ready for immediate assignment.

Model Server: vLLM/SGLang instances serving the policy model. This is the GPU-intensive component. (For a deeper dive into how model serving, weight synchronization, and GPU orchestration work at scale across different frameworks, see RL Infrastructure for Large-Scale Agentic Training.)

Resource Budget

| Component | Resources | Notes |
| --- | --- | --- |
| Training GPUs | 8× H100 80GB | Policy + reference model + GRPO |
| Rollout Inference | 4× H100 | vLLM for agent generation |
| Container Pool | 32-64 containers | ~4 GB RAM each, CPU nodes |
| Training Duration | ~3 days (3 epochs) | 500 SWE-Bench tasks, G=8 |
| Per-Trajectory | 2-10 minutes | Depends on task complexity |

The Long-Tail Problem

The defining operational challenge: trajectory durations follow a heavy-tailed distribution.

[Figure: Trajectory duration distribution]

In synchronous training, the batch completes when the slowest trajectory finishes. If 63 out of 64 finish in 3 minutes but one takes 25 minutes, GPU utilization for the fast trajectories: 12%. This is why async training (APRIL), hard timeouts, and max-turn limits are essential.

Common Failure Modes

| Failure | Mitigation |
| --- | --- |
| Agent stuck in a loop | Max-turn limit + loop detection + custom_filter_fn |
| Container OOM | Per-container resource limits, memory monitoring |
| Network timeout to model server | Request timeout + retry |
| Zombie containers | Periodic cleanup, container TTL |
| Worker crash mid-trajectory | Heartbeat monitoring, auto-restart |

The most insidious failure is the loop: the agent tries a fix, it partially fails, it tries a slight variation, partially fails again, and repeats indefinitely. Without detection, this burns compute for zero training signal. Slime’s custom_filter_fn catches these: trajectories with >20 turns and zero reward are likely loops and should be discarded.
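
A sketch of such a filter, using the thresholds mentioned above (the hook signature is an assumption):

def custom_filter_fn(trajectory) -> bool:
    """Return True to keep the trajectory, False to discard it."""
    total_tokens = sum(t.tokens_generated for t in trajectory.turns)
    looks_like_loop = len(trajectory.turns) > 20 and trajectory.reward == 0.0
    too_long = total_tokens > 50_000
    return not (looks_like_loop or too_long)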


6. The GRPO Algorithm in Detail

GRPO (Group Relative Policy Optimization) is the algorithm behind virtually every native agentic RL system. Understanding it in detail — not just the intuition, but the actual loss computation — is essential for reasoning about training behavior.

Core Idea: No Value Model, Just Comparison

Classical RL algorithms like PPO need a value function to estimate how good a state is, so they can compute advantages (how much better or worse an action was compared to expectation). Training this value function requires a separate neural network with its own forward/backward passes — for a 7B policy, that’s another ~84 GB of GPU memory.

GRPO’s insight: replace the value function with group comparison. Generate G trajectories for the same prompt, compare their rewards, and use the group statistics as the baseline:

A_i = \frac{R_i - \text{mean}(R_{1..G})}{\text{std}(R_{1..G})}

If G=8 and 3 out of 8 trajectories succeed (reward=1.0), the successful ones get positive advantage, the failed ones get negative advantage. No learned baseline needed — the group is the baseline.

The Loss Function

The pseudocode is deceptively simple:

def grpo_loss(log_probs_new, log_probs_old, advantages, loss_mask, ref_log_probs):
    # Importance ratio
    ratio = exp(log_probs_new - log_probs_old)

    # Clipped objective (prevents too-large updates)
    surr1 = ratio * advantages
    surr2 = clip(ratio, 1 - eps, 1 + eps) * advantages
    policy_loss = -min(surr1, surr2)

    # KL penalty (keeps policy close to reference model)
    kl_loss = beta * (log_probs_new - ref_log_probs)

    # Apply mask: only compute on assistant tokens
    loss = ((policy_loss + kl_loss) * loss_mask).sum() / loss_mask.sum()
    return loss

Three components:

  1. Policy ratio exp(log_new - log_old): how much more likely the current policy makes these tokens compared to the policy that generated them. Ratio > 1 means the current policy prefers these tokens more; < 1 means less.

  2. Clipping: prevents catastrophically large updates. If the ratio exceeds [1-eps, 1+eps], the gradient is zeroed — the policy has already changed enough from this sample.

  3. KL penalty: prevents the policy from drifting too far from the reference model (the initial SFT checkpoint). Without this, the policy can degenerate into repetitive, reward-hacking behavior.

A Complete GRPO Training Step in PyTorch

Here’s the full implementation — advantage computation through gradient update — so you can see how the pieces fit together:

import torch
import torch.nn.functional as F

def compute_group_advantages(rewards, group_size, clip_value=5.0):
    """
    Compute group-relative advantages.

    Args:
        rewards: (batch_size,) where batch_size = num_prompts * G
        group_size: G, number of trajectories per prompt
    Returns:
        advantages: (batch_size,) normalized within each group
    """
    G = group_size
    num_prompts = rewards.shape[0] // G

    # Reshape to (num_prompts, G)
    grouped = rewards.view(num_prompts, G)

    # Normalize within each group
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True).clamp(min=1e-8)
    advantages = (grouped - mean) / std

    # Clip extreme values (GRPO++)
    advantages = advantages.clamp(-clip_value, clip_value)
    return advantages.view(-1)


def grpo_training_step(model, ref_model, batch, optimizer,
                       group_size=8, clip_ratio=0.2, kl_coeff=0.01):
    """Complete GRPO training step."""
    input_ids = batch["input_ids"]           # (B, L)
    attention_mask = batch["attention_mask"]  # (B, L)
    loss_mask = batch["loss_mask"]            # (B, L) — 1 for assistant tokens
    old_log_probs = batch["old_log_probs"]   # (B, L)
    rewards = batch["rewards"]               # (B,)

    # Step 1: Group-relative advantages
    advantages = compute_group_advantages(rewards, group_size)

    # Step 2: Current policy forward pass
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = loss_mask[:, 1:]
    shift_old = old_log_probs[:, 1:]

    log_probs = F.log_softmax(shift_logits, dim=-1)
    token_log_probs = log_probs.gather(2, shift_labels.unsqueeze(2)).squeeze(2)

    # Step 3: Reference model log-probs (for KL)
    with torch.no_grad():
        ref_logits = ref_model(input_ids=input_ids,
                               attention_mask=attention_mask).logits[:, :-1, :]
        ref_log_probs = F.log_softmax(ref_logits, dim=-1)
        ref_token_log_probs = ref_log_probs.gather(
            2, shift_labels.unsqueeze(2)).squeeze(2)

    # Step 4: GRPO loss
    ratio = torch.exp(token_log_probs - shift_old)
    token_advantages = advantages.unsqueeze(1).expand_as(token_log_probs)

    surr1 = ratio * token_advantages
    surr2 = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * token_advantages
    policy_loss = -torch.min(surr1, surr2)

    kl_loss = kl_coeff * (token_log_probs - ref_token_log_probs)
    total_loss = ((policy_loss + kl_loss) * shift_mask).sum() / shift_mask.sum()

    # Step 5: Update
    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    return {
        "loss": total_loss.item(),
        "mean_advantage": advantages.mean().item(),
        "mean_ratio": ratio.mean().item(),
        "kl": (token_log_probs - ref_token_log_probs).mean().item(),
    }

The key detail to notice: token_advantages is a scalar per trajectory, broadcast across all token positions. Every assistant token in a successful trajectory gets the same positive advantage; every token in a failed trajectory gets the same negative advantage. The loss mask ensures only assistant tokens receive gradients — the shift_mask.sum() normalization in the denominator means the loss is averaged over trainable tokens, not all tokens.

GRPO++ : Length Normalization

Standard GRPO has a bias: longer trajectories contain more tokens, so they contribute more to the total loss. Since successful trajectories on hard tasks tend to be longer (more turns of exploration), the model learns to generate verbose responses even on easy tasks — a phenomenon called length hacking.

GRPO++ adjusts rewards by trajectory length:

adjusted_reward = reward * (target_length / actual_length) ** alpha
# alpha = 0.5, target_length = 2000 tokens

Short successful trajectories get amplified rewards, discouraging unnecessary verbosity. This is particularly important in agentic tasks where trajectory lengths vary from 500 to 50,000 tokens.

Advantage Estimation Methods Compared

| Method | Value Model? | Granularity | Variance | Cost |
| --- | --- | --- | --- | --- |
| GRPO | No | Per-trajectory | Medium | Low |
| PPO/GAE | Yes | Per-step | Low | High (~84 GB) |
| RLOO | No | Per-trajectory | Medium | Low |
| iStar | No | Per-step | Low | Medium |

GRPO dominates for agentic RL because it’s cheap (no value model), effective (group comparison provides a good baseline), and simple (the implementation fits in ~50 lines of PyTorch).


7. From Single-Agent Training to Multi-Agent Deployment

The Train-Deploy Gap

Here’s a disconnect that bothered me when I first studied agentic RL: we train a single agent to solve tasks, but deploy it as part of a multi-agent system. In production, a coding agent might have a Lead that decomposes tasks, Experts that write code, and Reviewers that check work. During training? A single agent tries everything alone.

Shouldn’t we train the way we deploy?

The short answer: we should, but we mostly can’t yet. Multi-agent training is an active research frontier with promising but preliminary results.

Why Single-Agent Training Dominates

Search space explosion: A single agent’s trajectory already has an astronomical space of possible behaviors. Two interacting agents roughly doubles the dimensionality — exponentially more samples needed for convergence.

Credit assignment compounds: With one agent, GRPO’s coarse assignment (same advantage for all turns) works. With two agents, we need to determine which agent’s actions caused success. If the Lead delegated poorly but the Expert solved it anyway, the Lead gets undeserved credit.

Infrastructure cost multiplies: Multi-agent rollouts require multiple inference calls per step, coordinated container access, and more complex trajectory recording. The already-dominant rollout phase gets even longer.

“Train Single, Deploy Multi” — Why It Works

The same base model serves all roles, differentiated only by system prompt:

[Figure: Train single, deploy multi]

This works better than you might expect because the capabilities of a strong single agent naturally transfer to multi-agent roles:

| Single-Agent Capability | Multi-Agent Application |
| --- | --- |
| Tool use | Inter-agent communication |
| Step-by-step planning | Task delegation |
| Error recovery | Handling sub-agent failures |
| Context management | Working with delegated sub-tasks |
| Instruction following | Role adaptation via system prompt |

Three Approaches to Narrowing the Gap

Approach A: Diverse training prompts. Train with system prompts that simulate different roles — some rollouts use a “lead agent” prompt, some use an “expert” prompt, some use a “reviewer” prompt. The model learns role-adaptive behavior. Simplest approach, no infrastructure changes needed.

Approach B: Self-play. During rollout, the same model plays multiple roles in a simulated multi-agent interaction, with different system prompts. The trajectory captures the full team conversation. Reward is based on the team outcome. More expensive (3× inference calls per step) but produces more realistic training data.

Approach C: Sequential specialization. First train a strong base via single-agent RL. Then fine-tune role-specific variants — Lead, Expert, Reviewer — on role-appropriate trajectories. Creates specialized models from the same base.
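
A minimal sketch of Approach A; the role prompts and the BASE_SYSTEM_PROMPT constant are hypothetical:

import random

ROLE_PROMPTS = {
    "lead": "You coordinate the work: decompose the issue and delegate sub-tasks.",
    "expert": "You implement the fix: search, edit, and test the code.",
    "reviewer": "You review proposed patches and point out problems.",
}

def sample_system_prompt():
    """Pick a role-flavored system prompt for this rollout.
    BASE_SYSTEM_PROMPT is the shared tool-use prompt (assumed defined elsewhere)."""
    role = random.choice(list(ROLE_PROMPTS))
    return role, ROLE_PROMPTS[role] + "\n\n" + BASE_SYSTEM_PROMPT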

The Research Frontier: True Multi-Agent RL

M-GRPO (Ant Group / Imperial College London): Decomposes team reward into per-agent contributions using Shapley values, then computes separate GRPO advantages for each role:

Multi-Agent Rollout example: Lead assigns “Fix the auth bug” → Expert1 searches, reads, edits → Expert2 runs tests, reports → Lead reviews, accepts. Team Reward: 1.0

| Role | Shapley Value | GRPO Advantage Formula |
| --- | --- | --- |
| Lead | 0.3 | A = (0.3 - \mu_{\text{lead}}) / \sigma_{\text{lead}} |
| Expert1 | 0.5 | A = (0.5 - \mu_{\text{exp1}}) / \sigma_{\text{exp1}} |
| Expert2 | 0.2 | A = (0.2 - \mu_{\text{exp2}}) / \sigma_{\text{exp2}} |

Each role gets its own gradient signal. The Lead learns what makes delegation effective, independent of the Expert’s coding quality.

The challenge: exact Shapley computation evaluates all 2^N agent subsets — exponential. For 3 agents, that’s 8 evaluations. For 10 agents, over 1000. Here’s the exact computation:

from itertools import combinations
from math import factorial

def shapley_values(agents, value_fn):
    """
    Exact Shapley values for each agent.

    phi_i = sum_{S ⊂ N\{i}} |S|!(|N|-|S|-1)!/|N|! × [v(S∪{i}) - v(S)]
    """
    n = len(agents)
    shapley = {agent: 0.0 for agent in agents}

    for agent in agents:
        others = set(agents) - {agent}
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                marginal = value_fn(coalition | {agent}) - value_fn(coalition)
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                shapley[agent] += weight * marginal
    return shapley

# Example: 3-agent team solving a coding task
# value_fn simulates team performance:
#   Lead alone: 0.2 (can plan but not code)
#   Coder alone: 0.4 (can code but unguided)
#   Lead + Coder: 0.7 (guided coding)
#   All three: 1.0 (full team)

# Result: Lead=0.283, Coder=0.483, Reviewer=0.233
# → Coder contributes most, but Lead's coordination adds synergy

Monte Carlo approximation helps at scale — sample random agent orderings, compute marginal contributions — but adds noise to an already noisy RL signal.
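
A sketch of that approximation, reusing the value_fn interface from the exact version above:

import random

def shapley_monte_carlo(agents, value_fn, num_samples=200):
    """Estimate Shapley values by averaging marginal contributions over random orderings."""
    estimates = {agent: 0.0 for agent in agents}
    for _ in range(num_samples):
        order = random.sample(list(agents), len(agents))
        coalition = frozenset()
        for agent in order:
            marginal = value_fn(coalition | {agent}) - value_fn(coalition)
            estimates[agent] += marginal / num_samples
            coalition = coalition | {agent}
    return estimates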

SHARP (Shapley-based Hierarchical Attribution): Exploits the hierarchical structure of multi-agent systems. Instead of computing Shapley over all agents simultaneously:

  1. Level 1: Manager vs. workers-as-a-group (a cheap 2-player game). Manager’s Shapley value: 0.5 × (v_manager - v_empty) + 0.5 × (v_all - v_workers).
  2. Level 2: Decompose the workers’ share among individual workers using standard Shapley within the worker group.

This reduces cost from O(2^N) to O(2^{N_{\text{workers}}}) while preserving fairness properties.
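
One way to read that two-level scheme in code, reusing shapley_values from above; the rescaling of the worker share is my assumption, not the paper's specification:

def sharp_attribution(manager, workers, value_fn):
    """Hierarchical attribution sketch: 2-player split, then Shapley among workers."""
    all_agents = frozenset([manager, *workers])
    worker_set = frozenset(workers)

    # Level 1: manager vs. workers-as-a-group (2-player game)
    phi_manager = 0.5 * (value_fn(frozenset([manager])) - value_fn(frozenset())) \
                + 0.5 * (value_fn(all_agents) - value_fn(worker_set))
    group_share = value_fn(all_agents) - value_fn(frozenset()) - phi_manager

    # Level 2: standard Shapley within the worker group, scaled to the group share
    within = shapley_values(list(workers), value_fn)
    total = sum(within.values()) or 1.0
    return {manager: phi_manager,
            **{w: group_share * within[w] / total for w in workers}}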

Honest Assessment

What works now: Single-agent training + role-specific system prompts. This is the production standard.

What’s promising: M-GRPO with 2-3 agents. Self-play for generating realistic trajectories.

What doesn’t work yet: Training 5+ agents jointly. Learning emergent role specialization. Multi-agent training without 3-10× compute overhead.

Where the field is heading: Constitutional AI for defining role behaviors as rules. Modular RL with pluggable skill modules. Curriculum approaches starting single-agent and gradually introducing collaboration.


8. Reward Design: Beyond Binary

While binary reward (1.0 if tests pass, 0.0 otherwise) is the baseline that works, the reward function is one of the most important design decisions in agentic RL. Let’s survey the options.

Binary Reward

def binary_reward(trajectory):
    return 1.0 if trajectory.test_results["all_passed"] else 0.0

Pros: Clear signal, no reward hacking, simple to implement. Cons: Extremely sparse — in early training, most trajectories get 0.0, providing no gradient signal beyond “don’t do this.”

Composite Reward

Weighted combination of multiple signals:

def composite_reward(trajectory):
    test_pass = 1.0 if all_tests_pass else 0.0
    partial = passed_tests / total_tests        # Partial credit
    format_ok = 1.0 if valid_patch else 0.0     # Format compliance
    efficiency = max(0, 1.0 - turns / 25)       # Fewer turns = better

    return 0.6 * test_pass + 0.2 * partial + 0.1 * format_ok + 0.1 * efficiency

Pros: Denser signal, especially early in training. Cons: Weights require tuning, and components can be gamed (model optimizes for format compliance at the expense of correctness).

Staged Reward

Partial credit for progress milestones:

| Stage | Milestone | Reward |
| --- | --- | --- |
| 0 | No meaningful action | 0.0 |
| 1 | Found the relevant file | 0.1 |
| 2 | Made a code change | 0.3 |
| 3 | Code compiles | 0.5 |
| 4 | Some tests pass | 0.7 |
| 5 | All tests pass | 1.0 |

Pros: Provides gradient signal even for trajectories that didn’t fully succeed — the model learns that finding the right file is better than random exploration. Cons: Stage detection adds implementation complexity, and intermediate stages can become local optima.
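
A sketch of a staged reward using these milestones; the stage-detection fields on the trajectory are assumptions about what the harness records:

def staged_reward(trajectory):
    """Return the reward of the highest milestone reached (illustrative)."""
    results = trajectory.test_results    # assumed populated by the evaluation harness
    meta = trajectory.metadata
    if results.get("all_passed"):
        return 1.0
    if results.get("num_passed", 0) > 0:
        return 0.7
    if meta.get("code_compiles"):
        return 0.5
    if meta.get("made_code_change"):
        return 0.3
    if meta.get("found_relevant_file"):
        return 0.1
    return 0.0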

Length Penalty

GRPO++ applies a cosine-based length penalty:

from math import cos, pi

def length_penalized_reward(base_reward, num_tokens, target=2000, max_tokens=8000):
    if num_tokens <= target:
        return base_reward  # No penalty for concise solutions
    progress = (num_tokens - target) / (max_tokens - target)
    penalty = 0.5 * (1 + cos(pi * min(progress, 1.0)))
    return base_reward * penalty

Critical for agentic tasks where the model can learn to pad trajectories with unnecessary exploration to look like it’s “trying hard.”

Practical Recommendation

Start with binary reward. It’s the simplest and least likely to introduce reward hacking. If training stalls because the signal is too sparse (success rate stays near 0%), add staged rewards to provide gradient signal for partial progress. Add length penalty if the model starts generating excessively long trajectories. Composite rewards are useful for fine-tuning but require careful weight tuning.


9. Putting It All Together — Key Takeaways

After working through the full agentic RL training pipeline, a few themes emerge:

The Algorithm Is Settled, The Engineering Is Not

GRPO with binary reward is the algorithm that works. Every major framework uses it or a close variant. The research frontier is not about inventing new loss functions — it’s about making the training loop run faster, more reliably, and at larger scale.

Environments Are the Bottleneck

In math RLVR, generation is 80-90% of wall-clock time. In agentic RL, the environment adds another layer: container startup, tool execution, test suite running. The total rollout phase can be 95%+ of training time. Every optimization in the infrastructure — container pre-warming, async training, loop detection, over-provisioning — translates directly into faster experiments and more efficient research iteration. I covered these infrastructure challenges in depth in RL Infrastructure for Large-Scale Agentic Training, which complements this post’s focus on the training algorithm and data pipeline.

Loss Masking Is Subtle But Critical

The decision to train on all assistant tokens (while masking environment tokens) is conceptually simple but has deep implications. It means the model learns a unified strategy across all turns, with coarse credit assignment. More sophisticated masking strategies (action-only, progressive, turn-selective) exist but haven’t demonstrated clear advantages in practice.

Single-Agent Training Works for Multi-Agent Deployment

The “train single, deploy multi” paradigm is pragmatic and effective. A strong single agent naturally acquires capabilities (tool use, planning, error recovery) that transfer to multi-agent roles via system prompt adaptation. True multi-agent training (M-GRPO, SHARP) is the research frontier — promising but not yet production-ready.

The Numbers That Matter

| Parameter | Value | Notes |
| --- | --- | --- |
| Model size | 7B-32B | Common range for open-weight RL agents |
| Group size | G = 4-8 | Balance signal quality vs. compute |
| Max turns | 25 | Cap worst-case trajectory length |
| Trajectory time | 3-8 minutes | Dominated by multi-turn interaction |
| Training iteration | 10 min sync, 5 min async | |
| Full training run | ~17-34 hours | 3 epochs, 500 tasks |
| Target metric | SWE-Bench Verified | Resolve rate |
| Current SOTA (open-weight, 7B-32B range) | ~50.8% | DeepSWE (32B) |

References

RL Training Infrastructure

Agentic RL Frameworks

Algorithms

Benchmarks

  • SWE-Bench Verified: GitHub | 500 human-validated tasks