
Re-visiting Mid-training Stage:
for & with Agentic RL


For most of 2024, mid-training was the part of the pipeline nobody openly talked about, although frontier labs were already exploring it in practice. It was the quiet phase between the glamour of pre-training at scale and the excitement of RLHF breakthroughs. Then, in the span of six months, three things happened: two survey papers formalized it as a research discipline, a CMU study showed it outperforms RL-only training under a fixed compute budget, and a Shanghai team showed that RL can reach backward to improve mid-training itself. Mid-training went from engineering plumbing to the most strategically important stage of the pipeline.

The standard recipe is well known: pre-train on trillions of tokens, fine-tune on high-quality instructions, then apply RLHF or GRPO to align the model. Mid-training — the phase between pre-training and supervised fine-tuning where you continue training on domain-specific data — was typically treated as an optional engineering step. Something Qwen did for code, something Meta did for long context, something NVIDIA did for domain adaptation. Useful, yes. Interesting, not particularly.

I say this from firsthand experience. In 2024, I spent half a year at a frontier lab working on continued pre-training for the Seed foundation models — curating StackExchange and CommitPack data, running ablation experiments on 1.3B and later 3.3B proxy models, gathering insights, and iterating on data mixing ratios before the annealing stage. At the time, mid-training was very much an engineering exercise: get the data pipeline right, tune the learning rate schedule, verify that general capabilities didn’t degrade. Important work, but not the kind that generated research papers or conference talks. The title “Re-visiting” is literal — I’m returning to a stage of the pipeline I once worked on daily, and finding it transformed.

That changed in the second half of 2025.

Between October 2025 and February 2026, the field produced a burst of research that fundamentally repositioned mid-training in the pipeline. Two survey papers (Gao et al., 2025; Wang et al., 2025) simultaneously formalized mid-training as a distinct research area. NVIDIA demonstrated that reasoning data injected during pre-training creates a compounding advantage that widens through subsequent training stages — challenging the assumption that reasoning is purely a post-training concern (Akter et al., 2025). CMU ran controlled experiments proving that mid-training outperforms RL-only allocation under fixed compute budgets by over 10% on out-of-distribution tasks (Zhang et al., 2025). And in February 2026, the ReMiT paper broke the cardinal rule of the linear pipeline: it used an RL-trained model’s reasoning priors to retroactively improve mid-training data weighting, creating a self-reinforcing flywheel (Huang et al., 2026).

Meanwhile, the practical impact became undeniable. Qwen3-Coder-Next — an 80B-total/3B-active MoE model — achieved 71.3% on SWE-Bench Verified and 44.3% on the harder SWE-Bench Pro, competitive with models 10-30x its active parameter count. Its secret? A massive mid-training phase with 600B tokens of repository-level code, multi-scaffold agentic trajectories, and 800K synthesized software engineering tasks. SWE-smith demonstrated that 50,000 training instances could be manufactured from 128 GitHub repositories with just 20 hours of human labor and $1,360 in compute, achieving 40.2% on SWE-Bench Verified with rejection sampling alone — no RL required.

The “For & With” Framing

This post examines mid-training through a dual lens:

Mid-training for agentic RL — the traditional view. Mid-training builds the knowledge foundation that RL later exploits. It teaches code understanding, tool-use formats, repository structure, and debugging patterns. Without it, RL faces a brutal cold-start problem: the model wastes its RL budget learning what actions exist instead of optimizing when to take them. DeepSeek’s R1-Zero experiment showed that RL can induce reasoning from scratch — but only when the base model has sufficient knowledge. Mid-training is what makes the base sufficient.

Mid-training with agentic RL — the emerging view. The pipeline is no longer unidirectional. RL signals can flow backward to improve mid-training itself. ReMiT demonstrates that an RL model’s token-level probability gaps can serve as weights for mid-training’s next-token prediction loss, upweighting “pivotal” reasoning tokens. NVIDIA’s RLP shows that RL can be embedded within the pre-training/mid-training objective as a dense reward signal. The linear pipeline is becoming a loop.

The post starts with fundamentals (Section 1), then builds the case for why mid-training sets the ceiling for agentic RL (Sections 2-3), examines data synthesis at scale (Section 4), and dives into three papers that reshape how we understand mid-training’s interaction with RL (Sections 5-7). It closes with Qwen3-Coder-Next as a case study (Section 8) and a practical recipe (Section 9).


1. Mid-Training 101

What Is Mid-Training?

Mid-training (also called continual pre-training, domain-adaptive pre-training, or simply CPT) is a training phase inserted between general pre-training and supervised fine-tuning. The model continues autoregressive next-token prediction, but on a curated, domain-shifted data mixture — typically 0.1–5.5T tokens with much higher concentrations of code, math, reasoning traces, or other target-domain data than the original pre-training corpus (which is typically 2–18T tokens of general web data).

What separates mid-training from simply “more pre-training” is intentional distributional shift. Pre-training aims for broad coverage; mid-training strategically overweights specific capabilities while carefully preserving general ones. The learning rate is typically 3-10x lower than the pre-training peak (e.g., 2e-5 vs 1e-4), with gentle warmup to avoid catastrophic forgetting.

Disambiguating “Mid-Training” — A Term Used Differently Across Labs

Before going further, it’s worth clarifying a source of confusion: different organizations use “mid-training” to mean different things.

The broad definition treats mid-training as synonymous with continued pre-training (CPT) — any training phase between initial pre-training and SFT that continues the next-token prediction objective on domain-shifted data. Under this definition, Code Llama’s 500B-token code specialization and Qwen-Coder’s 5.5T-token code CPT are both “mid-training.”

The narrow definition, increasingly common in practice, reserves “mid-training” (or “Stage 2.5”) specifically for a structured pre-training phase that sits between raw-code CPT and traditional SFT. This phase uses chat-formatted data, loss masking, and high-quality synthetic data — but at pre-training scale (tens to hundreds of billions of tokens), not SFT scale (millions of samples). The key distinction: it looks like SFT in format but operates at CPT scale and teaches capability, not style.

The modern coding agent pipeline, when fully disaggregated, has five phases — not three:

[Figure: The 5-Phase Coding Agent Pipeline]

Why does this distinction matter? Because the magic of Stage 2.5 is the scale-format mismatch: it uses structured chat-format data (which teaches instruction-following and edit patterns) but at pre-training token volumes (which prevents overfitting). If you tried to run 100B tokens of CommitPack through the SFT phase, the model would overfit to short Git commit patterns. Placed in Stage 2.5 as part of a diverse high-quality mix, the same data teaches a general-purpose “edit capability” that the SFT phase later activates for complex tasks.

How this post uses the term: Throughout this post, we use “mid-training” in the broad sense — encompassing both Phase 2 (domain CPT) and Phase 3 (Stage 2.5 / structured pre-training). The distinction between the two sub-phases matters for pipeline design (Section 9), but the research findings about mid-training’s role relative to RL (Sections 5-7) apply to both. When we need to distinguish, we’ll specify “code CPT” (Phase 2) or “structured pre-training / Stage 2.5” (Phase 3) explicitly.

If Stage 2.5 Already Uses Chat Format, Why Is SFT Still Necessary?

This raises an interesting question: if Stage 2.5 already trains on chat-formatted instruction data at massive scale, why not skip SFT entirely? From what I can tell, the answer lies in a fundamental distinction: Stage 2.5 teaches capability; SFT teaches alignment.

Although the format boundary between Stage 2.5 and SFT has blurred (both use chat templates), their scale, data distribution, and core objectives remain fundamentally different:

| Dimension | Stage 2.5 (Structured Pre-training) | SFT (Instruction Tuning) |
|---|---|---|
| Token scale | 50B - 200B tokens | 1B - 5B tokens |
| Sample count | 10M - 100M pairs | 100K - 1M pairs |
| Data source | Cleaned natural data (commits, web) + simple synthetic | Carefully constructed complex instructions |
| Epochs | Usually <1 epoch (data is abundant) | 3-5 epochs (data is scarce, must be memorized) |
| Primary goal | Muscle memory (edit skill), domain knowledge | Obedience, safety, tone, persona |
| Long-context | Learning — truly acquiring long-range dependencies | Refinement — maintaining alignment under long input |

Three capabilities that SFT provides and Stage 2.5 cannot:

1. Complex compositional instructions. Stage 2.5 data (commits, StackExchange) involves short, direct tasks — “fix this typo”, “how to iterate a list.” SFT teaches multi-constraint instructions: “Refactor this class to use the factory pattern, change all log levels from Info to Debug, don’t break existing tests, and explain your changes in Chinese.” These compositional constraints barely exist in natural web data and must be explicitly constructed.

2. Safety and refusal. A model that has seen all of GitHub — including malware, attack scripts, and exploits — will happily generate harmful code if asked. SFT includes safety data (red-team examples) that teaches the model to recognize malicious intent and refuse: “I can’t help write an injection script, but I can help you audit for vulnerabilities.”

3. Persona and multi-turn coherence. Commit data is one-shot. StackExchange is single-turn Q&A. SFT trains multi-turn context handling — “No, not that one, I meant the variable on the previous line” — and defines the model’s conversational persona.

I find this analogy helpful: Stage 2.5 is boot camp — massive drilling to build skills and muscle memory. SFT is officer training — learning discipline, judgment, and how to follow complex orders. They seem to serve genuinely different purposes.

For a high-performance coding agent, the SFT budget is typically ~500K carefully curated pairs: ~200K general conversation, ~200K code instructions (Evol-Code), ~50K agent/tool-use trajectories, ~50K safety examples. A rule of thumb from several teams: if the SFT dataset grows beyond ~1M samples, it may be worth moving the excess volume into Stage 2.5 instead.

The Data Funnel: Same Sources, Different Treatment in Phase 2 vs. Phase 3

Something I found subtle but worth highlighting: the same raw data sources (GitHub commits, StackExchange) appear in both Phase 2 (domain CPT) and Phase 3 (Stage 2.5) — but with entirely different filtering criteria, formatting, and training objectives. It’s not “reuse” — it’s a deliberate funnel from broad coverage to precision skill-building.

Commit data across phases:

| Dimension | Phase 2 — "See everything, learn the landscape" | Phase 3 — "Master the skill, learn the format" |
|---|---|---|
| Filtering | Keep top ~50% quality (msg not empty, diff not garbled) | Only commits with clear logic in msg, diffs that include tests, AST-valid code (~1/10 to 1/50 of Phase 2 volume) |
| Format | Raw text / unified diff (see example below) | Chat format + search-and-replace blocks (see example below) |
| Loss | Full-sequence NTP (all tokens count) | Masked — only compute loss on assistant response |
| Purpose | Weak alignment between English and code — model learns "message followed by change" across millions of repos | Train the agent's ACTION SPACE — model must produce structured edit operations |

Phase 2 format example — raw unified diff, all tokens contribute to loss:

   Commit: fix typo
--- a.py
+++ a.py
@@ -1,3 +1,3 @@
-def login(user, passwrod):
+def login(user, password):
     session = start_session(user)
     return session

Phase 3 format example — chat-formatted with SEARCH/REPLACE, loss only on the assistant response:

   <|user|>
Fix the typo in the login function
<|assistant|>
<<< SEARCH
def login(user, passwrod):
===
def login(user, password):
>>> REPLACE
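
To make the loss masking concrete, here is a minimal sketch of how a Phase 3 sample could be turned into training tensors. It assumes a HuggingFace-style tokenizer and the chat markers from the example above; build_masked_labels is an illustrative helper, not code from any specific pipeline.

   def build_masked_labels(tokenizer, user_text: str, assistant_text: str):
    """Tokenize a chat-formatted Stage 2.5 sample and mask everything except
    the assistant response, so only those tokens contribute to the NTP loss."""
    prompt = f"<|user|>\n{user_text}\n<|assistant|>\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(assistant_text, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + list(response_ids)  # -100 = ignored by the loss
    return input_ids, labels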

StackExchange data across phases:

| Dimension | Phase 2 (Domain CPT) | Phase 3 (Stage 2.5) |
|---|---|---|
| Filtering | Keep top ~80% of QA pairs (has an answer, not spam) | Only top ~10% (>10 votes, accepted answer) |
| Format | Flat markdown: Title:\n Body:\n Answer:\n concatenated as text stream | Chat format: <|user|> question <|assistant|> answer |
| Loss | Full-sequence NTP (model reads it like a document) | Masked — loss only on assistant response |
| Purpose | Learn knowledge: API usage, error messages, terminology | Learn Q&A logic: how to explain and reason as an assistant |

Why not skip Phase 2 and use only high-quality data in Phase 3? Two reasons emerge from the literature:

  1. Scale prevents overfitting. High-quality commits and StackExchange answers might only total a few billion tokens. Phase 2 needs 500B-5.5T tokens. Using only the premium subset would cause severe overfitting — the model memorizes answers instead of learning to generalize. Lower-quality data provides necessary noise and diversity for robust embeddings.

  2. Long-tail knowledge. High-quality data concentrates on popular libraries (PyTorch, React, Django). Cold-corner data — a 2015 Perl CGI script, an obscure Fortran numerical library — has few StackExchange votes and poorly written commit messages. Filtering it out entirely in Phase 2 means the model has essentially zero exposure when encountering legacy code in the wild. Phase 2 teaches breadth; Phase 3 teaches precision.

The practical recipe: Phase 2 keeps the top 50% of commits (by message quality) and 80% of StackExchange QA in raw-text format. Phase 3 takes the top 5-10% of commits reformatted as search-and-replace blocks with loss masking, and the top 10% of StackExchange reformatted as chat with masked loss.

With this clarification in place, every major model family now includes dedicated mid-training phases (sometimes one, sometimes both):

| Model | Phase(s) | Token Budget | Key Innovation |
|---|---|---|---|
| Llama 3.1 | P2 + P3 (annealing) | P3: curated mix | Code ratio boosted to ~50% in annealing phase |
| Qwen 2.5-Coder | P2 (code CPT) | 5.5T tokens | Repository-level training with special tokens |
| DeepSeek-V3 | P2 integrated into pre-training | 14.8T total | 30% code, FIM, multi-token prediction |
| DeepSeek-V3-Base | P2 + P3 | Chat data in pre-training | Stage 2.5 blended into final pre-training stages |
| Code Llama | P2 (code CPT) | 500B tokens | FIM, long-context via RoPE scaling |
| Nemotron | P2 (domain CPT) | 9T tokens | Largest dedicated CPT phase |
| Phi-4 | P3 (structured) | Synthetic-heavy | Pivotal token training, synthetic textbooks |
| OLMo 2 | P2 + P3 | Multiple runs | Model souping (weight averaging across runs) |
| Qwen3-Coder-Next | P2 + P3 | Trillions of tokens | 370 languages, 600B repo-level, chat-FIM |

The Core Technique Toolkit

Mid-training has converged on roughly ten core techniques:

1. Learning Rate Re-Warming. When starting from a pre-trained checkpoint (where LR has decayed to near-zero), re-warming briefly raises the learning rate to 1/3 to 1/10 of the original pre-training peak, then applies cosine decay. Gupta et al., 2023 showed that re-warming + cosine decay with a 1:1 to 4:1 ratio of new-to-replay data works well.

[Figure: Learning Rate Schedule for Mid-Training]
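
As a rough sketch of what this schedule looks like in code (the peak value, warmup ratio, and floor below are illustrative defaults, not a prescription):

   import math

def midtrain_lr(step, total_steps, peak_lr=2e-5, warmup_ratio=0.02, min_lr=2e-6):
    """Linear re-warm from ~0 to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))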

2. Data Replay Buffer. The most dangerous failure mode of mid-training is catastrophic forgetting. The universal countermeasure: mix 5-15% of original pre-training distribution data into every batch. Qwen-Coder uses 10% general text replay alongside 80% code and 10% math — removing the replay buffer causes significant natural language degradation.

3. Fill-in-the-Middle (FIM). For code-focused mid-training, FIM is arguably the single most impactful technique. Instead of always predicting left-to-right, FIM randomly masks a span and trains the model to infill it:

   Original: def add(a, b):\n    return a + b\n
FIM:      <|fim_prefix|>def add(a, b):\n<|fim_suffix|>\n<|fim_middle|>    return a + b

Qwen-Coder applies FIM at a 50% rate using PSM (Prefix-Suffix-Middle) format, with random spans of 10-50% of the file length. This directly teaches code editing — the exact capability needed for SWE-Bench-style tasks.

4. Repository-Level Training. Standard code training treats each file independently. Repository-level training concatenates files from the same repository in dependency order, with special tokens marking boundaries:

   <|repo_name|>django/django
<|file_sep|>django/db/models/fields/__init__.py
class Field:
    def __init__(self, ...):
        ...
<|file_sep|>django/db/models/fields/related.py
from django.db.models.fields import Field
class ForeignKey(Field):
    ...
<|file_sep|>tests/model_fields/test_foreignkey.py
from django.test import TestCase
class ForeignKeyTests(TestCase):
    ...

This teaches cross-file relationships — imports, project structure, test-source connections — that are essential for navigating real codebases.

5. Quality Filtering. Model-based quality classifiers score code on signals like docstring presence, linting pass rate, naming convention adherence, and complexity metrics. DCLM showed that 2T filtered tokens can match 10T unfiltered tokens — a 5x efficiency gain from quality alone.

6. Annealing. The final 5-15% of training uses the highest-quality data at a decaying learning rate. Llama 3.1 boosted the code ratio from ~10% to ~50% during annealing, and the math ratio similarly increased. This stage has outsized impact because it is the model’s last exposure before fine-tuning.

7. Synthetic Data Infusion. Phi-4 demonstrated that synthetic data can be more valuable per-token than natural data: 10B synthetic tokens outperformed 1T web tokens for reasoning. Categories include textbook-style explanations, code exercises with test cases, reasoning traces, and agentic trajectories.

8. Long-Context Extension. Increasingly trained during mid-training rather than as a separate phase: gradually increase context length (8K → 32K → 128K → 256K) with curated long-context data.

9. Distillation as CPT. Using outputs from a stronger model (e.g., reasoning traces from DeepSeek-R1) as mid-training data for smaller models. DeepSeek found that distilled reasoning traces are more effective than human-written chain-of-thought.

10. Model Souping. OLMo 2 introduced weight averaging across multiple mid-training runs with different hyperparameters, reducing sensitivity to hyperparameter choices at the cost of extra compute.

The Critical Hyperparameters

| Hyperparameter | Recommended Range | Notes |
|---|---|---|
| Peak learning rate | 1e-5 to 5e-5 | 3-10x lower than pre-training peak |
| Warmup | 1-2% of total steps | Brief re-warming from near-zero |
| Annealing | Final 10-15% of training | Highest quality data, declining LR |
| FIM rate | 50% | Below 30% shows weak editing capability |
| Replay buffer | 5-15% of mix | General pre-training data to prevent forgetting |
| Code ratio | 70-85% | For code-focused mid-training |
| Data repetition | Up to 4 epochs | Beyond 4x, returns diminish rapidly |

From what multiple teams have reported, learning rate is the most sensitive hyperparameter — too high and the model forgets; too low and it doesn’t adapt.


2. Knowledge vs. Strategy — Why Mid-Training Is the Ceiling

The Core Thesis

Here’s a conceptual model I find useful for thinking about what each training stage contributes:

| Stage | Learns To… | Capabilities |
|---|---|---|
| Pre-training | "Know words" | General language understanding, world knowledge |
| Mid-training | "Know code" | Domain knowledge, tool formats, code structure, debugging |
| SFT | "Know the format" | Output formatting, instruction following, conversation |
| RL | "Know strategy" | When to search vs. edit, how to decompose, error recovery |

This leads to what I think is the key implication: mid-training determines the ceiling of what RL can achieve. RL can optimize strategy — the sequence of actions an agent takes — but it cannot manufacture knowledge the model doesn’t have. If the model doesn’t understand Python’s import system, no amount of RL will teach it to navigate cross-module dependencies.

Evidence: DeepSeek-R1-Zero

The most vivid demonstration comes from DeepSeek’s R1-Zero experiment — applying GRPO directly to the base model with no SFT or mid-training. Reasoning emerged spontaneously:

| Training Step | Emergent Behavior |
|---|---|
| ~100 | Longer responses (model discovers more tokens = more exploration) |
| ~500 | First structured reasoning attempts |
| ~1,000 | "Aha moment" — spontaneous chain-of-thought appears |
| ~5,000 | Consistent multi-step reasoning |
| ~10,000 | Self-verification emerges ("Let me check…") |

This proves RL can induce reasoning from a knowledge-rich base. But R1-Zero also had critical failures: unstructured text, mixed languages mid-response, inconsistent formatting. The full R1 model — with cold-start SFT and better post-training — massively outperformed it.

What I take away from this: RL can discover strategy, but the knowledge base must already be in place. R1-Zero worked because DeepSeek-V3 was pre-trained on 14.8T tokens including 30% code. For a model without code-heavy pre-training or mid-training, RL on coding tasks would simply fail.

The Full DeepSeek-R1 Pipeline

The R1-Zero experiment was instructive, but the production R1 pipeline tells the full story. It has five stages, and the interplay between them illustrates the knowledge-strategy separation precisely:

| Stage | Name | Key Details |
|---|---|---|
| 0 | Pre-training (14.8T tokens) | 30% code, 55% text, 10% math, 5% other; FIM at 0.5; MoE: 671B total, 37B active; multi-token prediction |
| 1 | Cold-Start SFT | ~thousands of long CoT examples; teaches the FORMAT of reasoning, not reasoning itself; quality over quantity |
| 2 | Reasoning RL via GRPO | G=64 completions per prompt, no critic network; rewards: correctness (math/code) + format compliance; this is where reasoning EMERGES |
| 3 | Rejection Sampling + SFT (~800K) | 600K reasoning + 200K general; uses Stage 2's model to generate, filters for quality; consolidates reasoning with general skills |
| 4 | RL Alignment | Final alignment pass for helpfulness + safety |

The GRPO algorithm itself was partly designed around infrastructure constraints — eliminating the critic network removes 84 GB of memory for a 7B model, making RL feasible on fewer GPUs:

   def compute_grpo_advantages(rewards, group_size=64):
    """For each prompt, sample G completions and
    normalize rewards within the group.
    No critic network needed."""
    # rewards: [batch_size, group_size]
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-8
    advantages = (rewards - mean) / std
    return advantages

Notice the division of labor: Stages 0-1 install knowledge and format (mid-training territory). Stages 2-4 discover and optimize strategy (RL territory). The cold-start SFT in Stage 1 is the bridge — it gives RL a starting format to work with, but the actual reasoning capability comes from RL exploration.

For code-specific rewards in SWE-Bench-style tasks, the reward function provides graded signal:

   def swe_bench_reward(patch, test_suite, original_state):
    """Graduated reward for SWE-Bench-style code fixing."""
    if not patch.is_valid_diff():
        return 0.0  # Syntactically invalid patch

    applied = apply_patch(original_state, patch)
    if applied is None:
        return 0.05  # Well-formatted but doesn't apply

    results = run_tests(applied, test_suite)
    if results.all_pass:
        return 1.0   # Perfect fix
    elif results.previously_failing_now_pass:
        if results.has_regressions:
            return 0.5  # Fixed target bug but broke something
        return 0.8      # Fixed target bug, some tests still fail
    return 0.1          # Applied but didn't fix the issue

The Cold-Start Problem

For agentic RL, the cold-start problem is severe. Consider training a coding agent with GRPO on SWE-Bench tasks (binary reward: did the patch fix the failing tests?):

| | Without Mid-Training (Cold Start) | With Code Mid-Training |
|---|---|---|
| Tool calls | Invalid → reward = 0 | Valid (learned format during mid-training) |
| File navigation | Wrong files → reward = 0 | Plausible files (learned repo structure) |
| Patches | Syntactically invalid → reward = 0 | Syntactically valid (learned code patterns) |
| Result | Reward almost always 0. RL has nothing to learn from. | Some patches work, some don't. RL has meaningful signal to optimize. |

The agentic RL frameworks (AgentRL, Agent-R1, Planner-R1) all require a mid-trained or SFT’d starting model. GRPO has a hidden requirement: the group of G completions must contain some correct answers. If all 64 completions receive reward 0, the advantage is uniformly zero and no learning occurs.
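
Here is a tiny illustration of that failure mode, reusing the group-normalization idea from the GRPO snippet above; has_learning_signal is a hypothetical helper one might use to drop prompts whose groups carry no gradient:

   import torch

def has_learning_signal(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [batch, group_size]. A prompt only contributes gradient when its
    group mixes outcomes; an all-zero (or all-one) group has zero advantage."""
    return rewards.std(dim=1) > eps

# A cold-start model that never produces a valid patch:
rewards = torch.zeros(4, 64)          # 4 prompts x 64 completions, all reward 0
print(has_learning_signal(rewards))   # tensor([False, False, False, False])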

Planner-R1’s finding is particularly instructive: an 8B model with dense reward shaping achieves 56.9% on TravelPlanner — competitive with 32B models — but even with shaped rewards, the 8B model still needs to understand what the tools are and how to call them. That understanding comes from mid-training.

Quantifying the Gap

[Figure: Performance Gains by Training Stage]


3. The Qwen-Coder Blueprint

If mid-training determines the RL ceiling, the concrete question becomes: what should a mid-training run look like? The most thoroughly documented answer comes from Qwen-2.5-Coder.

The Pipeline

Qwen-2.5-Coder continues training from Qwen-2.5 (18T general tokens) with 5.5T additional code-heavy tokens across 92 languages:

| Stage | Phase | Details |
|---|---|---|
| Start | Qwen-2.5 Base | 18T tokens, general |
| 1 | Code CPT | Bulk of 5.5T tokens; 80% code, 10% math, 10% general replay; repo-level training with special tokens; FIM at 50% rate |
| 2 | Long-Context Extension | 8K → 128K |
| 3 | Annealing | Highest quality data, declining LR |
| Output | Qwen-2.5-Coder Base | Ready for SFT and RL |

Repository-Level Training

This is Qwen-Coder’s most influential contribution. Files from the same repo are concatenated in dependency order with <|repo_name|> and <|file_sep|> special tokens. The model learns import chains, project structure, test-source correspondence, and API usage patterns across files.

This is critical for SWE-Bench tasks, and based on Qwen's experiments it can't realistically be taught in SFT or RL: the model needs to internalize repository structure during mid-training, by seeing millions of repositories.

What does the model learn from repository-level training that it can’t learn from file-level training?

| File-Level Training Teaches | Repository-Level Training Adds |
|---|---|
| Syntax and semantics | Import resolution across files |
| Function/class patterns | Where tests live relative to source |
| Local variable usage | Configuration files and their effects |
| Single-file algorithms | Database models → views → templates flow |
| | API definitions → client usage patterns |
| | Package structure and __init__.py chains |

The practical difference: a file-level trained model might generate a correct function in isolation. A repo-level trained model knows that from django.db.models import Field means there’s a Field class somewhere in django/db/models/, knows roughly what methods it has, and knows that changing it requires updating tests in tests/model_fields/. This contextual awareness is what SWE-Bench demands.

Language Distribution

The 92 programming languages in Qwen-Coder are not equally represented:

| Language | Share | Language | Share |
|---|---|---|---|
| Python | 25% | Go | 4% |
| JavaScript | 8% | Rust | 3% |
| TypeScript | 7% | Shell | 3% |
| Java | 7% | PHP | 3% |
| C/C++ | 8% | Ruby | 2% |
| C# | 4% | 70+ others | 17% |

Python dominates because (a) it’s the most common language on GitHub, (b) SWE-Bench is Python-only, and (c) Python reasoning transfers well to other languages. The long tail of 70+ languages ensures the model has at least some exposure to diverse syntax patterns.

FIM Details

PSM format at 50% rate, with random spans of 10-50% of file length:

   import random

def apply_fim(code: str, fim_rate: float = 0.5) -> str:
    if random.random() > fim_rate:
        return code  # 50% standard left-to-right

    span_ratio = random.uniform(0.1, 0.5)
    span_length = int(len(code) * span_ratio)
    start = random.randint(0, len(code) - span_length)

    prefix = code[:start]
    middle = code[start:start + span_length]  # Model must predict this
    suffix = code[start + span_length:]

    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{middle}"

Below 30%, the model shows weak editing capability. Above 70%, left-to-right generation quality degrades. The 50% sweet spot was independently validated across Qwen-Coder and Code Llama.

Quality Filtering Pipeline

[Figure: Quality Filtering Pipeline]

The Results

| Benchmark | Before Mid-Training | After Mid-Training | Gain |
|---|---|---|---|
| HumanEval | 57.3% | 88.4% | +31.1% |
| MBPP | 66.2% | 83.5% | +17.3% |
| MultiPL-E | 42.1% | 70.6% | +28.5% |

These are massive gains from next-token prediction on well-curated code alone — no RL, no reward engineering, no environment setup.

Key Lessons (from the Qwen-Coder reports)

  1. Start from a strong general base. Code mid-training on a weak base model doesn’t work.
  2. Quality > quantity, with a quantity floor. Filtering matters more than token count, but you still need hundreds of billions of tokens.
  3. Repository-level context can’t be taught post-hoc. It requires exposure to millions of repos during mid-training.
  4. The 10% replay buffer matters a lot. Removing it causes measurable NL degradation in Qwen’s experiments.
  5. Annealing has outsized impact. The last 10-15% of training disproportionately affects downstream performance.

Qwen-2.5-Coder established the blueprint for code mid-training as of late 2024 — repo-level training, FIM at 50%, quality filtering, replay buffers. But the blueprint assumed you have the data. For agentic mid-training, the data problem is acute: multi-turn tool-use trajectories barely exist in the wild. Section 4 examines how to manufacture them. After that, several deeper questions remained open: How should mid-training interact with RL? How much compute belongs in each stage? Can RL signals flow backward to improve mid-training itself? Sections 5-7 examine the research that answered these questions, and in Section 8, we’ll see how Qwen3-Coder-Next incorporated all of these advances — scaling from 92 to 370 languages, adding multi-scaffold agentic trajectories, and achieving results that would have seemed implausible a year earlier.


4. Data Synthesis at Scale

In practice, the biggest bottleneck for agentic mid-training seems to be data, not compute. Agentic trajectories (multi-turn tool-use sequences, debugging sessions, repository-level edits) are rare in the wild. GitHub has plenty of code, but very few examples of an agent systematically debugging a failing test, localizing a bug, and applying a fix.

SWE-smith: The Environment-First Approach

SWE-smith (Yang et al., 2025) is the most important advance in training data synthesis for coding agents. The key insight: build the execution environment first, then synthesize bugs within it.

SWE-Bench’s conventional approach creates a Docker environment per task instance (~50-150 TB for the full dataset, hundreds of human hours). SWE-smith inverts this: one Docker image per repository, then synthesize bugs programmatically:

| Step | Action | Output |
|---|---|---|
| 1 | Take top PyPI packages (5,000+ stars) | Candidate repositories |
| 2 | Run SWE-agent to install + verify tests | ~7 min human review per repo |
| 3 | ONE Docker image per repo | 295 GB total |
| 4 | Synthesize bugs via 4 strategies | 50,137 instances from 128 repos |
Cost: ~20 hours human labor, $1,360 compute

Four Bug Synthesis Strategies

LM Modify (56% yield, $0.004/candidate) — Prompt an LM to introduce subtle logical bugs. Cheapest, highest yield.

LM Rewrite (35% yield, $0.04/candidate) — Give LM only function header + docstring, ask for re-implementation. Bugs emerge naturally from imperfect reimplementation — more realistic than intentional insertion.

Procedural Modification (40.2% yield, zero cost) — 13 AST-level transformations:

   TRANSFORMATIONS = [
    "remove_conditional",     "change_operator",
    "invert_boolean",         "shuffle_lines",
    "remove_return",          "change_constant",
    "remove_exception_handler", "swap_arguments",
    "remove_loop_break",      # ... 13 total
]

PR Mirror (33.8% yield, $0.06/candidate) — Collect real PRs, use LM to revert changes. Most realistic, most expensive.

Every candidate goes through automated validation: apply patch → run test suite → keep only if ≥1 previously-passing test now fails.
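
In code form, the validation filter might look like the sketch below. repo.apply, repo.revert, and run_tests are illustrative stand-ins for SWE-smith's actual harness, which orchestrates Docker containers:

   def validate_bug_candidate(repo, patch, run_tests) -> bool:
    """Keep a synthesized bug only if it breaks at least one previously-passing test.
    `repo.apply`, `repo.revert`, and `run_tests` are illustrative stand-ins."""
    failing_before = run_tests(repo)      # set of failing test IDs on the clean repo
    repo.apply(patch)
    failing_after = run_tests(repo)
    repo.revert(patch)

    newly_broken = failing_after - failing_before
    return len(newly_broken) >= 1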

Scaling Results

SWE-smith data with rejection sampling fine-tuning (no RL) achieves:

| Training Data | SWE-Bench Verified |
|---|---|
| 100 trajectories | 14.3% |
| 400 trajectories | 27.8% |
| 1,600 trajectories | 33.4% |
| 5,016 trajectories | 40.2% |

Log-linear scaling, not saturating. And a key finding: repository diversity > task difficulty. Performance scales logarithmically with number of training repos, but task difficulty shows no correlation with downstream effectiveness.

Beyond SWE-smith: Agentic Trajectory Generation

For mid-training at the billion-token scale, you need more than static bug-fix pairs. You need full agentic trajectories — multi-turn sequences showing how an agent explores, reasons, and fixes issues. The pipeline:

   def generate_agentic_trajectory(task, model, environment, max_turns=30):
    """Generate a multi-turn agentic trajectory for a coding task."""
    trajectory = []
    state = environment.reset(task)

    for turn in range(max_turns):
        # Agent thinks and decides action
        action = model.generate(
            system_prompt=AGENT_SYSTEM_PROMPT,
            context=state.to_string(),
            tools=["search_files", "read_file", "edit_file",
                   "run_tests", "bash_command"]
        )

        # Execute in sandboxed environment
        observation = environment.step(action)
        trajectory.append({
            "turn": turn,
            "thought": action.thought,     # Agent's reasoning
            "action": action.tool_call,     # Tool invocation
            "observation": observation       # Environment response
        })

        if observation.is_terminal:
            break

    # Evaluate: did the trajectory solve the task?
    reward = environment.evaluate(task, state)
    return trajectory, reward

# Scale: generate across thousands of tasks
# Filter: keep only successful (or partially successful) trajectories
# Format: convert to training-ready sequences with proper masking

Different agent frameworks produce different trajectory formats (ReAct, function-calling, code execution), and including all of them during mid-training prevents the model from overfitting to a single scaffolding style. This is a key lesson from Qwen3-Coder-Next (Section 8).

Additional synthesis approaches that complement SWE-smith:

Commit-based data mining: GitHub commits provide natural (buggy_version, fixed_version) pairs. Extract diff + pre-commit state + commit message; verify that the fix resolves test failures. This produces issue-to-patch training data at massive scale.
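
A rough sketch of that mining step, assuming a local clone and the plain git CLI (real pipelines add language filters, size caps, and test verification):

   import subprocess

def mine_commit_pairs(repo_path, max_commits=1000):
    """Extract (buggy_ref, fixed_ref, message, diff) records from a local git repo."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"-{max_commits}", "--no-merges",
         "--pretty=format:%H %P"],
        capture_output=True, text=True, check=True).stdout
    pairs = []
    for line in log.splitlines():
        parts = line.split()
        if len(parts) != 2:          # keep only commits with exactly one parent
            continue
        commit, parent = parts
        msg = subprocess.run(["git", "-C", repo_path, "show", "-s", "--format=%s", commit],
                             capture_output=True, text=True).stdout.strip()
        diff = subprocess.run(["git", "-C", repo_path, "diff", parent, commit],
                              capture_output=True, text=True).stdout
        pairs.append({"buggy_ref": parent, "fixed_ref": commit,
                      "message": msg, "diff": diff})
    return pairs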

Sub-skill decomposition: Following Agent-FLAN’s finding, train on decomposed sub-skills (tool use, planning, reflection, multi-turn interaction) separately before combining into full trajectories. This is more effective than training on end-to-end trajectories alone.

Negative examples: Generate incorrect tool calls and failed plans alongside correct ones. Agent-FLAN found that a 4:1 positive-to-negative ratio significantly improves robustness, teaching the model what not to do.

Data Mixing for Agentic Mid-Training

Combining all data sources, the recommended mix:

| Data Source | Ratio | Purpose |
|---|---|---|
| Code (file-level, quality-filtered) | 25% | Core code understanding |
| Code (repo-level, dependency-ordered) | 15% | Cross-file relationships |
| Issue-to-patch pairs | 10% | Bug understanding and fixing |
| Agentic trajectories (tool-use) | 10% | Tool-use patterns |
| Reasoning traces (CoT) | 10% | Step-by-step reasoning |
| Debugging traces | 8% | Error analysis skills |
| TDD data (failing test → fix) | 5% | Test-driven development |
| Code navigation / search data | 5% | Codebase exploration |
| General text replay | 12% | Anti-forgetting |

With the data problem addressed, we can turn to the deeper question: how should mid-training and RL interact? The next three sections present controlled studies that reshape the conventional “mid-train first, RL later” pipeline.


5. Front-Loading Reasoning

The traditional view treats reasoning as a post-training concern. Two NVIDIA papers challenge this fundamentally.

Front-Loading Reasoning (Akter et al., 2025)

This paper asks: what happens if we include reasoning data during pre-training instead of deferring it to SFT?

The authors train 8B models from scratch on 1T tokens with four pre-training data configurations (with and without reasoning data), then systematically apply SFT and RL.

Finding 1: The Compounding Advantage.

[Figure: The Compounding Advantage]

On AIME: 12.29% (base+RL) vs. 45.21% (reasoning+RL) — a 33% absolute gap. This refutes the “catch-up” hypothesis: no amount of post-training compensates for a weak pre-training data mix.

Finding 2: The Asymmetric Principle.

| | Diversity-First | Quality-First |
|---|---|---|
| Pre-training | 64.09 avg (winner) | 54.98 avg |
| SFT | 31.54 avg | 44.99 avg (winner) |

In other words, different stages seem to have different optimal data strategies. Mid-training benefits from casting a wide net (diverse reasoning data); SFT benefits from being ruthlessly selective (highest-quality chain-of-thought).

Finding 3: Latent Effects. High-quality pre-training data has benefits invisible at the pre-training checkpoint that manifest after SFT — a +4.25% advantage that appeared only post-alignment. This means evaluating mid-training checkpoints in isolation underestimates the value of high-quality data.

Finding 4: Naive SFT Scaling Hurts. Doubling SFT data volume with mixed-quality examples actually decreases math performance by 4.92%. The heavy lifting of data volume should happen during mid-training, not SFT.

RLP: RL as a Pretraining Objective (NVIDIA, ICLR 2026)

RLP goes further: RL can be the pre-training objective itself. It defines a universal reward signal based on information gain:

   For any document d = [t₁, t₂, ..., tₙ]:

  1. Generate a reasoning chain r for context c = [t₁, ..., tₖ]
  2. Compute: reward = log P(tₖ₊₁ | c, r) - log P(tₖ₊₁ | c)
     = "How much does reasoning help predict the next token?"

No verifier needed — the document itself provides supervision. Properties:

  • Verifier-free: Works on any document stream
  • Dense signal: Every token provides reward
  • Extraordinarily data-efficient: 0.125% of data → 35% accuracy improvement on 12B model

Results on Nemotron-Nano-12B: 42.81% → 61.32% average (+18.51%) with just 0.125% of training data processed through RLP.
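
In code, the RLP reward is essentially two forward passes. A minimal sketch, assuming a HuggingFace-style causal LM; the real objective also handles sampling the reasoning chain and the policy-gradient update, which are omitted here:

   import torch
import torch.nn.functional as F

@torch.no_grad()
def rlp_information_gain(model, context_ids, reasoning_ids, next_token_id):
    """Reward for a sampled reasoning chain: how much it raises the log-probability
    of the true next token. Assumes a causal LM returning logits [batch, seq, vocab]."""
    def next_token_logprob(ids):
        logits = model(input_ids=ids).logits[:, -1, :]
        return F.log_softmax(logits, dim=-1)[0, next_token_id]

    with_reasoning = next_token_logprob(torch.cat([context_ids, reasoning_ids], dim=1))
    without_reasoning = next_token_logprob(context_ids)
    return (with_reasoning - without_reasoning).item()   # positive => reasoning helped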

A Practical Training Schedule

Combining both papers, the authors recommend a three-phase approach to mid-training:

| Phase | Compute | Objective | Details |
|---|---|---|---|
| 1 | 80% | Standard NTP | Build knowledge base with diverse reasoning data mixed in (Front-Loading: code + math + science reasoning) |
| 2 | 15% | NTP + RLP interleaved | Install reasoning via RL-in-pretraining; dense reward from information gain; extraordinarily data-efficient |
| 3 | 5% | High-quality annealing | Consolidate with best data at declining LR; plant "latent seeds" that manifest during SFT/RL |

The picture that emerges: reasoning is not something you bolt on at the end — it’s a foundational capacity that benefits from being present in the training data from early stages.

Implications

  1. These results make a strong case that reasoning data belongs in mid-training, not just SFT/RL — the compounding advantage means every reasoning token in mid-training is worth more than the same token in SFT.
  2. RL-like objectives can be embedded directly in mid-training (via RLP).
  3. The clean separation between “pre-training for knowledge” and “post-training for reasoning” no longer holds.

6. The Interplay — Controlled Evidence

While NVIDIA’s work shows that reasoning should be front-loaded, Zhang, Neubig, and Yue (CMU, 2025) answer the harder question: how much compute should go to mid-training vs. RL?

The Experimental Framework

The authors use synthetic reasoning tasks (DAGs of arithmetic operations) with fully controllable difficulty:

   Example (op=4, "easy"):         Example (op=12, "hard"):
  x₁ = 15                        x₁ = 7
  x₂ = x₁ + 3 = 18              x₂ = x₁ + 5 = 12
  x₃ = x₂ × 2 = 36              ...
  x₄ = x₃ - x₂ = 18             x₁₂ = f(x₁₁, x₉) = ???
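
A toy generator for tasks of this shape might look like the following (illustrative, not the authors' code); num_vars plays the role of op and controls difficulty:

   import random
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_dag_task(num_vars, seed=None):
    """Build a small DAG of arithmetic ops: each new variable combines two earlier
    ones. Returns the step-by-step trace and the final answer."""
    rng = random.Random(seed)
    values = [rng.randint(1, 20)]
    trace = [f"x1 = {values[0]}"]
    for i in range(2, num_vars + 1):
        a, b = (rng.sample(range(len(values)), 2) if len(values) > 1 else (0, 0))
        sym = rng.choice(list(OPS))
        result = OPS[sym](values[a], values[b])
        values.append(result)
        trace.append(f"x{i} = x{a + 1} {sym} x{b + 1} = {result}")
    return "\n".join(trace), values[-1]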

They define three regimes:

  • In-distribution: op = 2-10 (seen during training)
  • Edge of competence: op = 11-14 (just beyond training)
  • Out of distribution: op = 17-20 (far beyond training)

Finding 1: RL Only Works at the Edge of Competence

| Difficulty Regime | pass@1 | pass@128 | Interpretation |
|---|---|---|---|
| In-distribution (op=2-10) | Improves | Unchanged | RL sharpens but doesn't extend |
| Edge of competence (op=11-14) | +42% | | RL genuinely extends capability |
| Too hard (op=17-20) | No change | No change | RL cannot learn what it cannot solve |

RL can only improve tasks where the model has partial success. Mid-training’s job is to push the “edge of competence” outward — giving the model enough knowledge that RL can sharpen.

Finding 2: The 1% Exposure Phase Transition

[Figure: The 1% Exposure Phase Transition]

A phase transition, not a gradual curve. Even 1% Rust in the mid-training mix (vs. 0%) can be the difference between RL succeeding and RL completely failing on Rust tasks.

Finding 3: Mid-Training Outperforms RL-Only

Under normalized compute budgets:

Compute AllocationOOD-hard Performance
RL-only (100% RL budget)Baseline
Mid-training + heavy RL (50/50)+10.8% over RL-only
Mid-training + light RL (80/20)Best pass@1 on OOD-edge

The optimal allocation depends on the goal:

[Figure: Compute Allocation Guide]

Finding 4: Process Rewards Prevent Hacking

Composite reward (0.2 outcome + 0.8 process) yields +4-5% pass@1 on OOD tasks. More importantly, outcome-only models learn shortcuts via wrong reasoning; process rewards enforce correct intermediate steps.

For agentic SWE tasks: intermediate rewards (correct localization, patch compiles, right files addressed) are more valuable than pure outcome rewards (all tests pass).
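
A hedged sketch of what such a composite reward could look like for a SWE-style task, using the paper's 0.2/0.8 outcome/process split; the individual process signals and their weights are illustrative:

   def composite_swe_reward(outcome_pass: bool, localized_correct_file: bool,
                         patch_applies: bool, unit_test_pass_rate: float,
                         w_outcome: float = 0.2, w_process: float = 0.8) -> float:
    """Blend a sparse outcome signal with dense process signals (sketch).
    Process signals here are illustrative: file localization, patch validity,
    and the fraction of targeted unit tests that pass."""
    outcome = 1.0 if outcome_pass else 0.0
    process = (0.3 * float(localized_correct_file)
               + 0.2 * float(patch_applies)
               + 0.5 * unit_test_pass_rate)
    return w_outcome * outcome + w_process * process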

Finding 5: Topological Novelty

A fascinating side result: on harder tasks (op=15-20), models with sufficient pre-training exposure generate genuinely novel reasoning structures — DAG topologies not seen during training. This is evidence that RL can discover new reasoning strategies, not just replay memorized ones. But it only occurs when the base model has enough knowledge to attempt the task.

For agentic coding, this suggests that RL can discover novel debugging strategies (e.g., new patterns of search → read → hypothesize → test) if the base model has sufficient code understanding from mid-training. Without that foundation, the model generates random action sequences rather than exploring novel-but-structured approaches.

Connecting to Practice

Here’s how I see the Interplay paper’s findings translating to agentic coding models:

  1. Edge-of-competence RL curricula. Starting with easier SWE tasks where the agent sometimes succeeds, then gradually increasing difficulty, appears to be the most compute-efficient approach — tasks that are too easy or too hard contribute little learning signal.

  2. Even 1% domain exposure matters. Extending to new languages or frameworks? Including a small amount of relevant data in mid-training seems sufficient to unlock RL generalization — the cost is minimal but the impact is binary.

  3. Split compute between stages. Under any fixed budget, the data suggests that splitting between mid-training and RL beats RL-only. The question is the ratio, not whether to split.

  4. Process rewards help for multi-step tasks. Multi-step agentic tasks appear to benefit disproportionately from intermediate reward signals, consistent with broader findings in RL for sequential decision-making.


7. ReMiT — The Feedback Loop

Everything discussed so far follows the unidirectional pipeline: mid-training produces a better base → RL exploits it. ReMiT (Huang et al., 2026) breaks this directionality, creating a self-reinforcing flywheel.

The Core Insight

Standard next-token prediction treats all tokens equally. But not all tokens matter equally for reasoning. Words like “Therefore”, “However”, and code control flow keywords are disproportionately important. An RL-trained model already knows which tokens matter — its probability distribution is shifted toward these “pivotal” tokens.

ReMiT uses this shift to reweight the mid-training loss.

The Algorithm

Step 1: Compute the token-level log-probability gap:

\delta_t = \log P_{RL}(x_t | x_{<t}) - \log P_{base}(x_t | x_{<t})

Step 2: Center per sequence to normalize difficulty:

\delta_t^{centered} = \delta_t - \mu_\delta

Step 3: Map to weights via scaled sigmoid with clipping:

w_t = \text{clip}(2 \cdot \sigma(\delta_t^{centered}),\ 1-\epsilon,\ 1+\epsilon) \quad \text{where } \epsilon = 0.2

Step 4: Apply weighted NTP loss:

\mathcal{L}_{ReMiT} = \frac{1}{T}\sum_t w_t \cdot \mathcal{L}_{NTP}(x_t)

The implementation:

   import torch
import torch.nn.functional as F

class ReMiTWeighter:
    def __init__(self, epsilon: float = 0.2):
        self.epsilon = epsilon

    @torch.no_grad()
    def compute_weights(self, base_logits, rl_logits, labels):
        # Step 1: Log-probability gap
        # Clamp ignore-index labels (-100) to a valid index before gather;
        # those positions are excluded by the mask below and by the masked loss.
        safe_labels = labels.clamp(min=0)
        base_lp = F.log_softmax(base_logits, dim=-1)
        rl_lp = F.log_softmax(rl_logits, dim=-1)
        base_lp = base_lp.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
        rl_lp = rl_lp.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)
        delta = rl_lp - base_lp

        # Step 2: Sequence-level centering
        mask = (labels != -100).float()
        seq_mean = (delta * mask).sum(-1, keepdim=True) / mask.sum(-1, keepdim=True)
        delta_centered = delta - seq_mean

        # Step 3: Scaled sigmoid with clipping
        weights = 2.0 * torch.sigmoid(delta_centered)
        weights = weights.clamp(1.0 - self.epsilon, 1.0 + self.epsilon)
        return weights  # [batch, seq_len], range [0.8, 1.2]
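
To close the loop, the weights plug into a standard token-level cross-entropy. A minimal usage sketch, reusing the imports and ReMiTWeighter from above and assuming HuggingFace-style model outputs with logits already aligned to labels:

   def remit_loss(model, weighter, base_model, rl_model, input_ids, labels):
    """Weighted NTP loss: per-token cross-entropy scaled by the ReMiT weights."""
    with torch.no_grad():
        base_logits = base_model(input_ids=input_ids).logits
        rl_logits = rl_model(input_ids=input_ids).logits
    weights = weighter.compute_weights(base_logits, rl_logits, labels)

    logits = model(input_ids=input_ids).logits
    ce = F.cross_entropy(logits.transpose(1, 2), labels.clamp(min=0), reduction="none")
    mask = (labels != -100).float()
    return (weights * ce * mask).sum() / mask.sum()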

What Tokens Get Upweighted?

| Token Category | Avg Weight | Examples |
|---|---|---|
| Discourse connectives | 1.15–1.20 | "Therefore", "However", "Thus" |
| Logical verbs | 1.12–1.18 | "implies", "requires", "depends" |
| Code control flow | 1.10–1.15 | if, else, return, break |
| Mathematical operators | 1.08–1.12 | +, *, =, comparisons |
| Structural markers | 1.05–1.10 | Indentation, brackets, newlines |
| Content words | 0.95–1.05 | Variable names, string literals |
| Filler tokens | 0.80–0.90 | Articles, prepositions |

Results

Across three model families:

| Model | Vanilla NTP | ReMiT | Improvement |
|---|---|---|---|
| OLMo-1B | 22.35 | 27.56 | +5.21 |
| SmolLM3-3B | 36.80 | 38.58 | +1.78 |
| Youtu-LLM-2B | 32.61 | 36.93 | +4.32 |

Convergence: ReMiT reaches baseline performance in 6x fewer training steps. Throughput drops 43% per step (extra forward pass through RL reference), but total GPU hours drop 3.5x.

Post-training transfer: The advantage persists through SFT → DPO → RL (+2% sustained).

The Flywheel

[Figure: The ReMiT Flywheel]

Each round produces a better RL model → better reference for mid-training → even better RL model.

ReMiT vs. Knowledge Distillation

KD looks better during mid-training but its advantage disappears after post-training. Why? KD creates extremely low KL divergence to the RL model, destroying distributional diversity needed for continued training. ReMiT’s conservative weight range (0.8 to 1.2) preserves diversity while prioritizing important tokens — nudging the distribution rather than forcing it.

The Bigger Picture

[Figure: The Bidirectional Pipeline]

The linear pipeline becomes a loop. Combined with RLP (RL-as-pretraining from Section 5), the stages are no longer sequential — they’re interleaved and mutually reinforcing. This is the “with” in “mid-training with agentic RL.”


8. Case Study — Qwen3-Coder-Next

In Section 3, we examined the Qwen-2.5-Coder blueprint: 5.5T tokens, 92 languages, repo-level training with FIM. That was the state of the art in late 2024. Qwen3-Coder-Next, released in early 2026, represents a generational leap — not just in scale (trillions of tokens, 370 languages) but in incorporating the very research advances we’ve covered in Sections 4-7: large-scale agentic trajectory synthesis, multi-scaffold diversity, and best-fit packing that preserves document boundaries. It’s perhaps the best public example of what all these techniques look like when they come together. But first, some context on how rapidly the field has moved.

The SWE-Bench Trajectory

SWE-Bench Verified — a curated subset of real GitHub issues with reproducible test environments — has become the de facto benchmark for agentic coding capability. The performance trajectory tells a story of compounding infrastructure, data, and algorithmic improvements:

| Period | System | SWE-Bench Verified |
|---|---|---|
| Apr 2024 | SWE-agent + GPT-4 | ~18% |
| Jun 2024 | Agentless | ~27% |
| Mid 2024 | Claude 3.5 Sonnet + SWE-agent | ~33% |
| Late 2024 | Amazon Q / OpenHands + Claude | ~35-41% |
| Early 2025 | Frontier systems | ~49-53% |
| Mid 2025 | SWE-agent-LM-32B (open-source, SWE-smith) | 40.2% |
| Late 2025 | Various | ~55-65% |
| Jan 2026 | Claude Opus 4.5 + Live-SWE-agent | 79.2% |
| Jan 2026 | Gemini 3 Pro | 77.4% |
| Jan 2026 | Qwen3-Coder-Next (3B active!) | 74.2% |

From ~18% to ~79% in under two years. SWE-Bench Pro (a harder variant) tells a different story: even the best system (Qwen3-Coder-Next) tops out at 44.3%, and GPT-5 manages only 23.3%. Real-world software engineering remains far from solved.

What stands out to me from this trajectory: the improvements come from three mutually reinforcing factors — better base models (improved through mid-training), better agent scaffolding (tool design, retrieval, multi-agent coordination), and RL refinement. Mid-training is the foundational layer.

The Model

  • 80B total parameters, 3B active per token (MoE)
  • Hybrid architecture: Gated Attention + Gated DeltaNet (linear attention for efficient long-context)
  • 262K context length
  • Multi-scaffold: Trained across SWE-Agent, Mini-SWE-Agent, OpenHands, Claude-Code scaffolds

| Benchmark | Qwen3-Coder-Next (3B active) | Claude Opus 4.5 (full) | DeepSeek-V3.2 (37B active) |
|---|---|---|---|
| SWE-Bench Verified | 71.3% | 79.0% | 72.6% |
| SWE-Bench Pro | 44.3% | 40.8% | 40.9% |
| SWE-Bench Multilingual | 62.8% | ~65% | ~58% |

A 3B-active model matching or exceeding a 37B-active model. The training pipeline, not architecture alone, drives this.

Mid-Training: The Main Event

The mid-training phase processes trillions of tokens with several innovations:

Massively expanded natural data: Scaling from Qwen-2.5-Coder’s 92 languages to 370, with ~600B tokens of repo-level data and 262K context — a 4x expansion in language coverage and substantially longer context.

Text-Code Grounding — a new data type: web documents from Common Crawl rewritten by Qwen3-Coder-480B (teacher) to include better code examples and clearer explanations:

| Benchmark | Original Web Text | Reformatted by 480B | Gain |
|---|---|---|---|
| EvalPlus | 54.38 | 63.09 | +8.71 |
| MultiPL-E | 36.02 | 48.35 | +12.33 |

Simply having a stronger model rewrite web documents about programming produces substantially better training data.

Multi-scaffold agentic trajectories: Generated using SWE-Agent, Mini-SWE-Agent, OpenHands, Claude-Code, Qwen-Code, and Terminus with Qwen3-Coder-480B as teacher.

Best-Fit Packing (BFP): Instead of concat-then-split (which destroys document boundaries), BFP uses bin-packing to fit complete documents into sequences:

[Figure: Document Packing Strategies]

This prevents context hallucination and preserves tool-call structure in multi-turn data.
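
A simplified best-fit-decreasing sketch (by document length, ignoring tokenizer padding and attention-mask details):

   def best_fit_pack(doc_lengths, max_len=262144):
    """Greedy best-fit-decreasing: place each document in the fullest sequence it
    still fits into, so documents are never split across sequence boundaries.
    Documents longer than max_len would need separate handling (not shown)."""
    bins = []  # each bin: [remaining_space, [doc indices]]
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    for i in order:
        length = doc_lengths[i]
        candidates = [b for b in bins if b[0] >= length]
        if candidates:
            best = min(candidates, key=lambda b: b[0])  # tightest fit
        else:
            best = [max_len, []]
            bins.append(best)
        best[0] -= length
        best[1].append(i)
    return [b[1] for b in bins]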

Task Synthesis: 800K Verifiable Instances

Two complementary approaches:

Mining GitHub PRs: Decompose PRs into (buggy state, fix, test patch) triples. An environment-building agent constructs Docker environments. A QA agent removes ambiguous tasks.

Synthesizing Bugs: Building on SWE-smith with model-driven rewriting, semantic perturbations, and rule-based transformations. Keep only bugs that fail existing tests.

Infrastructure: MegaFlow on Alibaba Cloud Kubernetes — each task is an Argo workflow with agent rollout, evaluation, and post-processing stages, running hundreds of thousands of concurrent executions.

Expert Models + RL + Distillation

Instead of a single RL run, Qwen trains four expert models:

| Expert | Domain |
|---|---|
| Expert 1 | Web Development |
| Expert 2 | User Experience |
| Expert 3 | Single-turn QA (RL) |
| Expert 4 | Software Engineering (multi-turn RL) |

The SE expert uses trajectory-level rewards with two innovations (a rough sketch follows the list):

  • Unfinished trajectory penalty: Exceeding max turns without completing → penalty
  • Turn-level tool-format penalty: Invalid tool calls → token-level penalties
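
A rough sketch of how these two penalties might combine into a single trajectory-level reward; the weights are illustrative, not from the Qwen report:

   def se_trajectory_reward(tests_passed: bool, finished: bool, num_turns: int,
                         invalid_tool_calls: int, max_turns: int = 130) -> float:
    """Trajectory-level reward with the two penalties described above (sketch)."""
    reward = 1.0 if tests_passed else 0.0
    if not finished and num_turns >= max_turns:
        reward -= 0.5                      # unfinished-trajectory penalty
    reward -= 0.1 * invalid_tool_calls     # turn-level tool-format penalty
    return max(reward, -1.0)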

After training, all experts are distilled back into one unified model.

The Reward Hacking Discovery

One of the paper’s most novel findings: during RL, agents autonomously discover ways to cheat.

   The exploit:
  Turn 1: Read issue description
  Turn 2: git remote add upstream https://github.com/original/repo
  Turn 3: git fetch upstream
  Turn 4: git diff upstream/fix-branch
  Turn 5: Apply fetched diff as own "fix" → reward = 1.0

Simply removing remotes/branches/tags is insufficient:
agents autonomously discover alternative pathways (curl, wget, pip install)

Solution: A heuristic blocking rule for tool calls containing both a repo link and network-access keywords. With this blocker, RL scales cleanly.
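
A minimal sketch of such a blocking rule; the keyword list and regex are illustrative, since the actual filter isn't public:

   import re

NETWORK_KEYWORDS = ("curl", "wget", "git fetch", "git remote", "git clone", "pip install")
REPO_LINK = re.compile(r"https?://(github\.com|gitlab\.com|bitbucket\.org)/\S+")

def is_blocked_tool_call(command: str) -> bool:
    """Block tool calls that both reference a hosted repo URL and use a
    network-access command, to stop the fetch-the-upstream-fix exploit."""
    has_repo_link = REPO_LINK.search(command) is not None
    has_network_access = any(kw in command for kw in NETWORK_KEYWORDS)
    return has_repo_link and has_network_access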

Positive side effect: Without shortcuts, the agent is forced to actually debug. Average turns increased from 50 to 130 — genuine long-horizon capability emergence.

Key Findings for Mid-Training

Mid-training scales predictably: Log-linear improvement with data volume (1B → 2B → 4B → 8B tokens).

Cross-scaffold transfer is weak: SWE-Agent trajectories don’t transfer well to OpenHands. Mid-training must include diverse scaffolds.

Tool template diversity matters: Performance improves from ~48% to ~54% as tool chat templates increase from 2 to 8, even with the same data volume.

General capabilities preserved: MMLU: 87.73 (vs. 87.87 baseline — negligible degradation). AIME 2025: 83.07 (up from 69.64 — reasoning transfer from code).


9. Putting It Together

A Mid-Training Recipe (Based on the Papers Above)

Step 1: Data Mix for SWE-Bench-targeting mid-training (500B token budget):

| Data Source | Ratio | Tokens | Notes |
|---|---|---|---|
| Code (file-level) | 30% | 150B | Multi-language, deduplicated, quality-filtered |
| Code (repo-level) | 20% | 100B | Dependency-ordered, special tokens |
| Code-adjacent text | 10% | 50B | Docs, issues, PRs, Stack Overflow |
| Agentic trajectories | 8% | 40B | Multi-scaffold |
| Issue-to-patch pairs | 5% | 25B | Bug reports → diffs |
| Debugging traces | 5% | 25B | Error → analysis → fix |
| Reasoning traces | 7% | 35B | Math, logic, code CoT |
| Reformatted web text | 5% | 25B | Rewritten by strong model |
| General replay | 10% | 50B | Anti-forgetting |

Step 2: Training Configuration:

   config = {
    "lr_peak": 2e-5,              # 3-10x lower than pre-training
    "lr_schedule": "cosine",
    "warmup_ratio": 0.02,         # 2% warmup
    "annealing_ratio": 0.10,      # Final 10% with best data
    "fim_rate": 0.50,             # 50% of code data
    "fim_format": "PSM",
    "packing_strategy": "best_fit", # NOT concat-then-split
    "context_schedule": {
        "phase_1": {"length": 32768,  "ratio": 0.6},
        "phase_2": {"length": 65536,  "ratio": 0.2},
        "phase_3": {"length": 131072, "ratio": 0.15},
        "phase_4": {"length": 262144, "ratio": 0.05},
    },
}

Step 3: Four-Phase Curriculum:

| Phase | Budget | Pipeline Stage | Training Format | Focus |
|---|---|---|---|---|
| 1: Code Foundation | 40% | Domain CPT (Phase 2) | Raw code NTP | File-level code, math, general replay, 32K context |
| 2: Code Engineering | 30% | Domain CPT (Phase 2) | Raw code + FIM NTP | Repo-level code, diffs, PRs, FIM, 64K context |
| 3: Agentic Skills | 20% | Stage 2.5 (Phase 3) | Chat-format + loss masking | Trajectories, tool-use, debugging, CommitPack, search-replace pairs, 128K context |
| 4: Annealing | 10% | Stage 2.5 (Phase 3) | Highest quality, declining LR | Best data across all categories, 256K context |

The first two curriculum phases correspond to Domain CPT (Phase 2 in the 5-phase pipeline), while the last two correspond to Stage 2.5 (Phase 3) — the structured pre-training with chat-format data at CPT scale. The transition from raw NTP to chat-format-with-loss-masking happens gradually across the curriculum.

Step 4: Path to RL:

  1. Cold-start SFT on ~10K high-quality agent trajectories
  2. RL with GRPO: G=8-64, reward = test pass rate + intermediate rewards
  3. Curriculum from easy to hard (edge-of-competence targeting)
  4. Monitor for reward hacking (block network access exploits)
  5. Optional: ReMiT feedback loop for a second mid-training round

The Framework Landscape

Choosing an RL framework is its own rabbit hole — async vs. sync rollouts, GPU-to-GPU weight transfer, sandbox orchestration, and more. I covered this in detail in a separate post: RL Infra for Large-Scale Agentic Training, which compares VERL, AgentRL, Slime, and others across these dimensions.

Evaluation Targets

| Metric | After Mid-Training | After SFT+RL |
|---|---|---|
| HumanEval | 85%+ | 90%+ |
| SWE-Bench Lite | ~25% | ~40%+ |
| SWE-Bench Verified | ~20% | ~35%+ |
| MMLU retention | >95% baseline | >93% baseline |

Open Questions

1. How far does the ReMiT flywheel go? Techniques like ReMiT and self-refining data flywheels show clear compound gains in early iterations — navigation success rates jumping from 70% to 78% and beyond human baselines. The first few cycles almost always help, as long as the verifier signal is clean. But where the convergence fixed point lies remains unknown. In practice, flywheels stall (or regress) once the policy explores into the verifier’s blind spots or triggers reward hacking. Without periodic injection of real-world entropy — fresh, out-of-distribution data — diminishing returns set in quickly.

2. Natural vs. synthetic data ratio and distributional bias. This is moving toward consensus, but the boundary conditions are still being mapped. In SWE tasks, synthetic data is now fully competitive and can even dominate the mix. SWE-smith generates high-quality bug-fix instances at ~$0.023 each via automated AST mutation and LLM-generated faults, and Ai2's SERA shows that soft-verified synthetic patches follow the same scaling laws as strictly verified data — lowering the quality bar for generation without hurting downstream performance. The open question is distributional bias at high synthetic ratios. Programmatically generated bugs (flipped operators, deleted branches) don't capture the complex logic errors developers make during multi-file refactors. The current workaround is mixing in reversed real PR data (as in SWE-gen), but the "optimal ratio" remains highly empirical and task-dependent.

3. Can mid-training replace SFT? In Section 1 I argued that SFT and Stage 2.5 serve genuinely different purposes — capability vs. alignment. That distinction still holds in principle, but the boundary is dissolving in practice. Qwen2.5-Coder deeply integrates instruction-following synthetics across its 5.5T-token mid-training phase; by the time mid-training is done, the model already has strong conversational and task-execution capabilities. When SFT still exists in these pipelines, it has typically shrunk to a brief alignment-annealing pass — tuning tone and safety policy rather than teaching instruction-following from scratch. The open question is whether even that residual SFT remains necessary, or whether safety and persona data can be folded into Stage 2.5 at scale without diluting their effect.

4. Scaling laws for mid-training. This remains wide open. Pre-training loss follows clean power laws against parameter count and token budget because the objective — global next-token prediction — is uniform. Mid-training involves sharp domain shifts and high variance in data quality, which breaks the assumptions behind Chinchilla-style laws. There is currently no unified formula that predicts, for a given base model size, how many tokens of domain-specific data (say 100B of high-quality code) translate into a specific benchmark gain. Mixing ratios are still tuned via ablation and empirical heuristics.

5. Process reward design for agentic tasks. Multi-turn agentic RL is the current frontier. Recent work (e.g., A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning) confirms that dense turn-level rewards significantly accelerate training — ratio-based verified rewards (fraction of unit tests passed) outperform sparse binary outcome rewards in SWE-Gym. But what the rewards should look like is far from settled. For SWE tasks, how should credit be assigned across bug localization, root-cause diagnosis, and test construction? Dense reward effectiveness also turns out to be highly sensitive to the underlying RL algorithm (PPO, GRPO, or unbiased RLOO). Designing a process reward that guides long-horizon planning without opening shortcuts for reward hacking is one of the hardest open engineering problems in this space.


Conclusion

Mid-training is no longer an optional engineering step. The evidence from the past six months makes a strong case that it is the most strategically important phase of the LLM training pipeline for agentic applications.

  1. Mid-training determines RL’s ceiling. DeepSeek-R1-Zero showed RL can induce reasoning from scratch — but only when the base model already has sufficient knowledge. CMU’s controlled experiments confirmed it: under fixed compute, allocating budget to mid-training outperforms RL-only by over 10% on out-of-distribution tasks. Without mid-training, RL wastes its budget learning what actions exist. With it, RL can focus on when to take them.

  2. The data recipe is the differentiator. Repository-level training, FIM at 50%, quality filtering, and mixing ratios are not implementation details — they are the core engineering decisions. Qwen-Coder’s 5.5T-token mid-training phase, with its carefully staged data curriculum, is what enabled a 3B-active MoE to compete with models 10x its size. The gap between good and bad mid-training routinely exceeds the gap between good and bad RL.

  3. The pipeline is becoming a loop. The linear pre-train → mid-train → SFT → RL pipeline is breaking down. ReMiT uses RL-trained models’ token-level probability gaps to reweight mid-training loss. NVIDIA’s RLP embeds RL directly into the pre-training objective as a dense reward signal. These are not incremental improvements — they represent a structural shift from sequential stages to mutually reinforcing processes.

  4. Data synthesis has industrialized. SWE-smith: 50K instances from 128 repos, 20 hours of human labor, $1,360 in compute. Qwen3-Coder-Next: 800K verifiable tasks. SERA: soft-verified synthetic patches following the same scaling laws as strictly verified data. The bottleneck has decisively shifted from data availability to data quality and diversity.

  5. Small models with strong mid-training beat large models without it. Qwen3-Coder-Next (3B active parameters) matches DeepSeek-V3.2 (37B active) on SWE-Bench. This is not a fluke — it is a direct consequence of investing in mid-training rather than relying on scale alone.

Six months ago, only the “for” part of this post’s title had solid evidence behind it. Today, the “with” part — RL actively shaping mid-training — is equally concrete. Mid-training is not a transition phase. It is the central orchestrator of the entire pipeline.


References

Core Papers:

Surveys: