
Building a Complete SFT Data Pipeline: Exceeding Qwen2.5-Coder-Instruct


🎯 Key Result: Using the same Qwen2.5-Coder-7B-Base model, our SFT data curation pipeline produces an instruct model that outperforms the official Qwen2.5-Coder-7B-Instruct across Generic, Coding, and Math benchmarks:

  • Generic: MMLU 68.75 vs 65.15, C-Eval 66.12 vs 61.59
  • Coding: Arena Hard 50.49 vs 36.47 (+14🔥), LiveCode Bench 39.13 vs 34.50
  • Math: MATH 70.90 vs 68.28, GSM8K 90.90 vs 88.17, AMC 2023 52.50 vs 41.75 (+10.8🔥)

1. Methodologies

Given the open-source Qwen2.5-Coder base model, we collect and curate its instruction fine-tuning data.

1-1. Data Collection & Preprocess

| Dataset | HuggingFace Source | More Info |
| --- | --- | --- |
| LIMA | tulu-v2-sft-mixture-lima | Paper: arxiv, Size: 1,018 |
| lmsys-chat-1m | lmsys-chat-1m | Paper: arxiv, Size: 1,000,000. 1M real conversations (210K users with 25 different SOTA LLMs), 154 languages, avg 2 turns per conversation |
| WizardLM v1 | WizardLM_evol_instruct_70k | Paper: arxiv, v1: 70,000 (single-turn) |
| WizardLM v2 | WizardLM_evol_instruct_V2_196k | v2: 143,000 (includes multi-turn) |
| WebInstructFull | WebInstructFull | Paper: arxiv, Size: 11,621,594 (5B tokens). Mined 10M from CC, rewritten with Mixtral & Qwen. 20%+ improvement on MATH/GSM8K, ~10% on MBPP/ArenaHard |
| InfinityInstruct | Infinity-Instruct | GitHub |

1-2. Data Analysis

1-2-1. Infinity Instruct

To construct a ten-million-scale high-quality instruction dataset, we collect a large amount of open-source data as seed data and iterate on the dataset using two strategies: instruction selection and instruction evolution. We recommend applying the Foundational Dataset, which contains millions of instructions selected from open-source datasets, to improve the performance of models on challenging downstream tasks (e.g., code, math). We recommend applying the Chat Dataset, which contains about 1M instructions evolved from a small subset of high-quality seed data, to further improve the instruction-following ability of models in real conversation scenarios.

InfinityInstruct Versions

Subjective dataset components:

InfinityInstruct Category Distribution
| Raw Dataset | Rows | HuggingFace URL | Paper URL |
| --- | --- | --- | --- |
| Alpaca GPT4 data | 13,490 | alpaca-gpt4-data | N/A |
| Alpaca GPT4 data zh | 32,589 | alpaca-gpt4-data-zh | N/A |
| Baize | 14,906 | baize-v2-13b | arxiv |
| BELLE Generated Chat | 43,775 | generated_chat_0.4M | GitHub |
| BELLE Multiturn Chat | 210,685 | multiturn_chat_0.8M | |
| BELLE 3.5M CN | 312,598 | train_3.5M_CN | |
| BELLE School Math | 38,329 | school_math_0.25M | |
| databricks-dolly-15K | 10,307 | databricks-dolly-15k | N/A |
| LIMA-sft | 712 | tulu-v2-sft-mixture-lima | arxiv |
| CodeContest | 523 | code_contests | arxiv |
| LongForm | 3,290 | LongForm | arxiv |
| ShareGPT-Chinese-English-90k | 8,919 | ShareGPT-Chinese-English-90k | N/A |
| UltraChat | 276,345 | ultrachat_200k | arxiv |
| Wizard evol instruct zh | 44,738 | EvolInstruct_zh_DeepseekAPI | arxiv |
| Wizard evol instruct 196K | 88,681 | - | arxiv |
| Code Alpaca 20K | 13,296 | - | GitHub |
| WildChat | 61,873 | WildChat-1M | arxiv |
| COIG-CQIA | 45,793 | COIG-CQIA | arxiv |
| BAGEL | 55,193 | code_bagel | N/A |
| DEITA | 10,000 | deita-10k-v0 | arxiv |
| Math | 320,130 | - | N/A |
| Summary | 1,362,000 | | |

1-2-2. WebInstructFull

WebInstruct Website Distribution WebInstruct Domain Distribution

Existing SFT Datasets:

| Dataset | #Pairs | Domain | Format | Dataset Source |
| --- | --- | --- | --- | --- |
| FLAN V2 | 100K | General | SFT | NLP data + Human CoT |
| Self-Instruct | 82K | General | SFT | Generated by GPT3 |
| GPT4-Alpaca | 52K | General | SFT | Generated by GPT4 |
| SuperNI | 96K | General | SFT | NLP Datasets |
| Tora | 16K | Math | SFT | GPT4 GSM+MATH Synthesis |
| WizardMath | 96K | Math | SFT | GPT4 GSM+MATH Synthesis |
| MathInstruct | 262K | Math | SFT | GPT4 Math datasets Synthesis |
| MetaMathQA | 395K | Math | SFT | GPT-3.5-Turbo GSM+MATH Synthesis |
| XwinMath | 1.4M | Math | SFT | GPT4 GSM+MATH Synthesis |
| OpenMathInstruct | 1.8M | Math | SFT | Mixtral GSM+MATH Synthesis |

Existing CT (Continue Training) Datasets:

| Dataset | #Tokens | Domain | Format | Dataset Source |
| --- | --- | --- | --- | --- |
| OpenWebMath | 12B | Math | LM | Filtered from Web |
| MathPile | 10B | Math | LM | Filtered from Web |
| Cosmopedia | 25B | General | LM | Synthesized by Mixtral |
| MINERVA | 38B | Math | LM | Filtered from Web |
| Proof-Pile-2 | 55B | Math | LM | OWM+Arxiv+Code |
| Galactica | 106B | Math & Sci. | LM | Filtered from Web |
| DeepseekMath | 120B | Math | LM | Recalled from Web |
| WebInstruct (10M) | 5B | Math & Sci. | SFT | Recalled and Extracted from Web |

The SFT datasets are mostly derived from NLP datasets or synthesized entirely by GPT-4. The CT datasets are much larger because they are filtered or recalled from the web, but their content contains a lot of noise. WebInstruct is the first dataset to combine these two approaches to build a high-quality yet large-scale SFT dataset.

1-3. Data Pipelines

1-3-1. Open-source Evol & Synthesis

Infinity Instruct is a typical example of using open-source SFT data as seeds and performing instruction evolution (Evol-Instruct-style) as data augmentation.

Infinity Instruct Pipeline

1. High-Quality Open Source Instruction Collection and Tag System

We start by collecting high-quality open-source instruction sets. We assign each instruction in the collection a set of tags that describe the abilities and knowledge necessary to complete the instruction.

  • Instruction collection: We systematically reviewed available open-source instruction sets and included sets created by humans and advanced LLMs.
  • Tag System with two levels:
    • First-level tags: describe the specific knowledge and abilities required to complete each instruction (e.g., Arithmetic Calculation, Knowledge of Biology). These tags are automatically generated by an LLM.
    • Second-level tags: macro categories such as “Natural Language Processing” and “Math Reasoning,” 25 categories in total.

2. Informative Instruction Selection

This step aims to select the most informative instructions from the whole collection, to enhance the LLM's performance and improve user experience.

  • [Complexity] Instructions demand multiple kinds of abilities or multiple domains of knowledge
  • [Diversity] Instructions with long-tailed ability or knowledge
  • [Difficulty] Instructions with high following difficulty

3. Instruction Generation by Data Evolution Strategy

We expand the seed instructions along the dimensions of breadth, depth, difficulty, and complexity with a method built on the Evol-Instruct approach.

  • Validate the evolved data, and use AI assistants to eliminate data that failed to evolve from the perspective of instruction compliance
  • Use the evolved instructions as the initial input, and use an AI assistant to play different roles to generate 2 to 4 rounds of dialogue for each instruction

4. Instruction Generation by Model Ability Deficient Diagnosis

This step automatically identifies weaknesses in the model's capabilities to guide targeted data synthesis.

  • Model performance evaluation system
  • Automatic ability deficient diagnosis
  • Targeted data synthesis

1-3-2. Web Crawling, Extracting & Refining

WebInstructFull is a typical example of crawling data from the web, extracting QA pairs out of it, and refining the responses.

Method Comparison WebInstruct Three Stages

Stages: (1) high-quality data recall from the web corpus, (2) Q-A pair extraction and (3) Q-A pair refinement.

1. Recall from Common Crawl

To ensure diversity in our training data across various disciplines like math, science, and engineering, we propose crawling exam problems from educational websites such as stemez.com, homeworkstudy.com, and khanacademy.org. We collected 100K diverse seed examples and randomly selected 100K negative documents from Common Crawl (CC) for training a fastText model.

In the initial stage, the trained fastText model recalls the top 100B documents from CC, categorizing them by domain (root URL). We employ GPT-4 to identify domains likely to contain instructional content. Subsequently, we sample additional documents from these selected domains as positive examples and use documents from non-selected domains and the general CC as negative examples to refine the fastText classifier. The updated classifier then recalls the top 18M documents for further processing.
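
For illustration, here is a minimal sketch of how such a fastText recall classifier could be trained and applied. The training-file path, hyperparameters, and confidence threshold are assumptions for the sketch, not values from the WebInstruct paper.

```python
import fasttext

# Training file (hypothetical path): one document per line, prefixed with
# "__label__pos" (seed exam problems / docs from GPT-4-selected domains) or
# "__label__neg" (random Common Crawl documents).
model = fasttext.train_supervised(
    input="recall_train.txt",
    epoch=5,
    lr=0.1,
    wordNgrams=2,
    dim=100,
)

def is_instructional(doc_text: str, threshold: float = 0.8) -> bool:
    """Keep a CC document only if the classifier confidently labels it positive."""
    labels, probs = model.predict(doc_text.replace("\n", " "), k=1)
    return labels[0] == "__label__pos" and probs[0] >= threshold
```

The same classifier can then be retrained on the GPT-4-refined positive/negative domain split before the second, larger recall pass.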

2. Q-A Pair Extraction

Recalled documents contain diverse content from forums, homework, quizzes, and exams. Despite noise like ads and HTML, they contain valuable Q&A pairs. We preprocess by parsing HTML to remove unrelated info. We then use Mixtral-8×7B to identify Q&A pairs, resulting in 5M candidates.
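
As a rough illustration of this preprocessing step, below is a minimal sketch using BeautifulSoup; the exact parsing rules used for WebInstruct are not published, so the set of stripped tags is an assumption.

```python
from bs4 import BeautifulSoup

def clean_document(html: str) -> str:
    """Strip boilerplate (scripts, styles, navigation, ads) and return plain text
    that can be passed to the LLM extractor for Q&A identification."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "header", "footer", "aside", "iframe"]):
        tag.decompose()  # drop elements unrelated to the Q&A body
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)  # collapse empty lines
```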

3. Q-A Pair Refinement

To further improve the extracted Q-A pair candidates, we prompt Mixtral-8×7B and Qwen-72B to reformat the extracted Q-A pairs. If the answer does not contain any explanation, we prompt the LLMs to complete the intermediate reasoning steps leading to the answer. We adopt two models to increase diversity. Eventually, we harvest 10M Q-A pairs as our final instruction-tuning dataset, WebInstruct.

1-3-3. Collecting Real Conversations between Human and LLMs

LMSYS-Chat-1M is a typical example of this genre.

The dataset contains 1 million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210k unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website, from April to August 2023.

Arena-Hard-200, a set of the 200 most challenging, high-quality user prompts, is selected and curated from this LMSYS-Chat-1M dataset.

The data-collection website contains three types of chat interfaces: (1) single model, (2) chatbot arena (battle), and (3) chatbot arena (side-by-side).

Conversation Dataset Stats LMSYS Model Distribution LMSYS Language Distribution LMSYS Topic Distribution

1-4. Our Pipeline

Given the two different sources of data mentioned above, we further process the large collection of data with our own pipeline:

Data Curation Pipeline

Merge General SFT data

Number of Rows: 18,922,281

Exact Dedup & Decontamination

Exact dedup: 18,922,281 → 18,629,484
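
For illustration, a minimal sketch of hash-based exact deduplication (the field names and normalization rule are assumptions); decontamination can reuse the same fingerprinting over benchmark prompts and drop any training example that collides with them.

```python
import hashlib

def fingerprint(example: dict) -> str:
    # Normalize case and whitespace so trivially different copies collide.
    text = " ".join((example["prompt"] + " " + example["response"]).lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def exact_dedup(examples: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each (prompt, response) fingerprint."""
    seen, kept = set(), []
    for ex in examples:
        fp = fingerprint(ex)
        if fp not in seen:
            seen.add(fp)
            kept.append(ex)
    return kept
```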

Quality Filter

Let our LLM-scorer assign each prompt a rating between 1-10.

For this comparatively easy task, we can use a smaller in-house or open-source model. We query for the result in a required format, in this “reasons first, score last” order.

   {
	"reasons": "point out the issues and your reasons for the rating",
	"score": "<integer>"
}

Insufficient prompting, where scoring guidelines are absent and only the output space (e.g., 0-100) is provided, results in inconsistent and misaligned evaluations.

Therefore, we follow an LLM-as-evaluator prompt template:

Evaluation Prompt Template

Generic Criteria:

  • A good prompt for benchmarking
  • A greater score represents a greater potential to evaluate the LLMs in problem-solving, creativity, and truthfulness
  • Trivial or ambiguous user prompts should get lower scores

Domain-based Templates:

Code Template Example
   You are an excellent user question qualifier. You are responsible for evaluating the quality of programming questions submitted by users, including various types such as QA questions, multiple-choice, debugging tasks, code explanations, and more. Your goal is to ensure they meet the standards for precision, clarity, and solvability.

Steps:
Think and Understand: Start by thoroughly understanding the question's intent. Consider whether the question meets the qualifying principles listed below.
Analysis: Based on your understanding, explain whether the question satisfies each qualifying principle.
Score: Assign a quality score based on your analysis.

Scoring System:
1: The question is clear, precise, solvable and ready for use by the programming community.
0: The question is ambiguous, unclear or unsolvable and needs further clarification before it can be answered.

Qualifying Principles:
1. **Clear and Concise Problem Statement**:
   - Clarity: The problem should be stated clearly. The reader should immediately understand the task after reading the problem.

2. **Relevant Background and Context**:
   - Allow simpler or high-level questions to meet the clarity requirement without needing extensive detail. If the question is understandable, it should be rated positively.

3. **Reasonable Assumption**
   - Assumption: The question should avoid unnecessary details, allowing the reader to make minor, reasonable assumptions to solve the problem without changing the intent of the question.

4. **Non-Trivial and Achievable Problem**:
   - Difficulty: The question should present a problem that is solvable given the tools and constraints available.

5. **No Further Detail is Asked**:
    - The question should not explicitly ask for more details or context to be provided by the responder. If the question itself is framed as a request for further clarification or additional information, it should be considered incomplete and not valid for evaluation.

Instruction Following:
    - Please adhere strictly to the provided output format in the few-shot examples.
    - Your response should consist of three essential sections: Thinking Steps, Analysis, Json Output.
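
Putting the pieces above together, here is a minimal sketch of how the LLM scorer could be queried and its “reasons first, score last” JSON parsed. It assumes an OpenAI-compatible client (the openai Python SDK); the model name and the acceptance threshold are illustrative, not the pipeline's actual settings.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible scoring endpoint is configured

SCORER_SYSTEM_PROMPT = "..."  # the evaluation prompt template described above

def score_prompt(user_prompt: str, model: str = "gpt-4o-mini") -> dict:
    """Return {"reasons": ..., "score": int} for one user prompt."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SCORER_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},  # keep the output parseable
        temperature=0.0,
    )
    result = json.loads(resp.choices[0].message.content)
    result["score"] = int(result["score"])
    return result

def passes_quality_filter(user_prompt: str, min_score: int = 6) -> bool:
    return score_prompt(user_prompt)["score"] >= min_score  # threshold is illustrative
```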

Complexity Filter

It has been recommended that we consider a prompt to be challenging if it requires integrating various knowledge and skills to derive appropriate responses. This will require both knowledge tagging & skill tagging. But this has some overlap with diversity.

Therefore, as a first resort, we simply approximate complexity here as the number of explicit instructions in a user query.

Instruction Counting: Count the number of specific instructions requested by the user.

PE for general instruction counting
   You are an expert in analyzing user queries. Your task is to identify and enumerate all the specific explicit instructions present in a given user query.

---

**Requirements:**

Please note that there is no response provided. Your focus should be solely on the user's query.

Please list all the specific explicit instructions found in the user's query.

---

**Output Format:**

Please provide your output in the following JSON format:

{
  "instructions": [
    "Instruction 1",
    "Instruction 2",
    "Instruction 3"
  ],
  "instruction_count": X
}

---

**User Query:**
{user_prompt}
PE for instruction-following analysis (with response)
   You are an expert in evaluating how well responses follow user instructions. Your task is to analyze the given user query and the corresponding response, identify all the specific explicit instructions in the user query, and assess how well the response fulfills each instruction.

---

**Requirements:**

1. **Instruction Identification:**
   - List all the specific explicit instructions found in the user's query.

2. **Response Analysis:**
   - For each instruction, analyze whether the response satisfies it completely, partially, or not at all.
   - Provide reasons for your assessment.

3. **Scoring:**
   - Assign a score between 0 and 10 based on how well the response follows the instructions, where:
     - **0** means the response is completely unrelated to the user's instructions.
     - **10** means the response fully satisfies all the instructions in the user's query.

---

**Output Format:**

Please provide your output in the following JSON format:

{
  "instructions": [
    "Instruction 1",
    "Instruction 2",
    "Instruction 3"
  ],
  "analysis": {
    "Instruction 1": "Analysis of how well the response fulfills Instruction 1.",
    "Instruction 2": "Analysis of how well the response fulfills Instruction 2.",
    "Instruction 3": "Analysis of how well the response fulfills Instruction 3."
  },
  "score": X
}

Diversity: Intention Tagging & Reweighting

In addition, intention tagging is important; this tagging also covers both knowledge and skills.

InsTag for diversity and complexity.

This algorithm adopts a “complexity-first” strategy: it prioritizes queries with more tags (i.e., more required knowledge and skills) and checks whether each addition to the sub-dataset increases tag diversity. By doing so, it ensures that the final sampled sub-dataset not only meets the size requirement N but also maintains a high level of tag diversity. This approach helps the sampled subset represent as many tag categories from the original dataset as possible, even with a limited sample size.
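
To make the selection rule concrete, below is a minimal sketch of a complexity-first, diversity-aware greedy sampler in this spirit. The field names and the reading of k as a per-tag cap are our assumptions for illustration; this is not the exact InsTag or in-house implementation.

```python
def complexity_first_diverse_sample(pool: list[dict], budget: int, k: int = 1) -> list[dict]:
    """Greedily select up to `budget` examples, visiting the most complex
    (most-tagged) queries first and accepting an example only if it still adds
    tag diversity or its tags are not yet over-represented.

    pool: list of dicts, each with a "tags" field (intention/knowledge/skill tags).
    k: per-tag cap before an example must introduce a brand-new tag (assumed semantics).
    """
    # Complexity proxy: number of distinct tags attached to the query.
    ranked = sorted(pool, key=lambda ex: len(set(ex["tags"])), reverse=True)

    covered: set[str] = set()        # tags already represented in the sample
    tag_counts: dict[str, int] = {}  # per-tag counts, to cap over-represented tags
    sampled: list[dict] = []

    for ex in ranked:
        if len(sampled) >= budget:
            break
        tags = set(ex["tags"])
        adds_new_tag = bool(tags - covered)
        under_cap = all(tag_counts.get(t, 0) < k for t in tags)
        if adds_new_tag or under_cap:
            sampled.append(ex)
            covered |= tags
            for t in tags:
                tag_counts[t] = tag_counts.get(t, 0) + 1
    return sampled
```

With k=1 this reduces to the strict “only add if it increases tag diversity” rule; larger k admits more examples per tag combination, which is one way to read the k=5 and k=20 runs reported later.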

InsTag Algorithm InsTag Diversity
PE for annotating intention tags

PE backbone:

   You are an excellent user query tagging expert. Given the following example tags described below, for a given user query, you are responsible for providing one or more tags that fully cover the intentions of the user. Your goal is to ensure user prompts are tagged properly for user intention analysis.

**Example tags:**
{examples}

The example tags are not exhaustive, you can provide other finer tags as you see appropriate.

---

**Output Format:**

Please provide the evaluation for the given query in the following JSON format:

{
  "explanation": "Explain what the tag is about",
  "reasons": "Point out your reasons in assigning the tags to this query",
  "tag": "list<string>"
}

---

**Query:**

Example tags include: Tagging System, Instruction Classification, Format Specification, Intent Analysis, Content Generation, Information Retrieval, Data Extraction, Summarization, Translation, Sentiment Analysis, Error Correction, Advice Seeking, Educational Instruction, Task Automation, Content Moderation, Code Generation, Emotion Detection, Personalization, Knowledge Retrieval, Data Analysis, Opinion Generation, Scheduling, Problem Solving, Hypothetical Scenario, Comparative Analysis, Definition Request, Paraphrasing, Trend Analysis, Formatting, Simulation, Clarification Request, Example Generation, Step-by-Step Explanation, Creative Writing, Role-playing, Algorithm Explanation, Metadata Extraction, Pattern Recognition, Policy Compliance Check, Benchmarking, Contextual Understanding, Conversational Continuation, Error Diagnosis, Code Refactoring, Language Learning Assistance, Voice Tone Analysis, Mood Setting, Personal Development, Mind Mapping, Goal Setting, Proofreading, Fact Checking, Joke Telling, Storytelling, Songwriting, Poem Composition, Historical Contextualization, Cultural Explanation, Mathematical Calculation, Scientific Explanation, Logical Reasoning, Analogy Creation, Visual Description, User Feedback Analysis, Resource Recommendation, Time Management, Memory Recall, Event Planning, Feedback Provision, Stress Testing, Hypothesis Formation, Data Visualization, Conflict Resolution, Etiquette Guidance, Idea Brainstorming, Priority Setting, Budget Planning, Negotiation Strategy, Product Review, Risk Assessment, Language Style Conversion, Protocol Simulation, Statistical Analysis, Energy Conservation Tips, Environmental Impact Assessment, Health and Wellness Guidance, Ethical Dilemma Discussion, Memory Enhancement Techniques, etc.

Response Enhancement

After filtering for high-quality instructions, we also improve the response quality through two strategies:

(1) GPT-4o response replacement: We regenerate responses using GPT-4o to ensure higher quality and more consistent formatting across the dataset.

(2) Strategy tag for response selection: For instructions with multiple candidate responses, we use strategy tags to select the most appropriate response based on the instruction’s domain and complexity.
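
For strategy (2), here is a minimal sketch of how candidate responses might be ranked once each candidate carries a strategy tag and a quality score. The preference table below is an invented example for illustration, not the production configuration.

```python
# Preferred response strategies per domain (illustrative values only).
STRATEGY_PREFERENCE = {
    "code": ["step_by_step_with_code", "code_only", "explanation_only"],
    "math": ["chain_of_thought", "final_answer_only"],
    "general": ["detailed", "concise"],
}

def select_response(domain: str, candidates: list[dict]) -> dict:
    """Pick the candidate whose strategy tag ranks highest for this domain,
    breaking ties with the response quality score."""
    order = STRATEGY_PREFERENCE.get(domain, [])

    def rank(candidate: dict) -> tuple:
        tag = candidate.get("strategy_tag")
        strategy_rank = order.index(tag) if tag in order else len(order)
        return (strategy_rank, -candidate.get("quality_score", 0))

    return min(candidates, key=rank)
```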

Attribute Analysis

With all the filtering and enhancement steps complete, we now analyze the processed dataset to understand its characteristics. We examine the statistical distributions of various attributes, identify correlations between features, and ensure the final dataset maintains high quality and diversity.

Outlier Detection and Elimination

To ensure data quality, we identify and remove statistical outliers that could negatively impact model training. The following tables show the dataset statistics before and after outlier removal.

Before pruning the outliers:

Outlier Statistics Before Pruning Outlier Data Details

After pruning: 2,407,758 rows left. [v1]

Outlier Statistics After Pruning V1

2,334,645 rows left. [v2]

Outlier Statistics After Pruning V2
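
As an illustration, a minimal sketch of one common pruning rule (per-attribute IQR filtering with pandas); the column names and the 1.5×IQR whiskers are assumptions, since the exact thresholds are not stated.

```python
import pandas as pd

def prune_outliers(df: pd.DataFrame, columns: list[str], whisker: float = 1.5) -> pd.DataFrame:
    """Drop rows outside [Q1 - whisker*IQR, Q3 + whisker*IQR] on any listed attribute."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask &= df[col].between(q1 - whisker * iqr, q3 + whisker * iqr)
    return df[mask]

# e.g. pruned = prune_outliers(df, ["prompt_len", "response_len", "attribute_total_quality_score"])
```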

Distributions

Based on the distribution below, we filter the data accordingly.

Attribute Distributions Attribute Boxplots

Covariance:

Correlation Heatmap
  • High Correlations: attribute_total_quality_score_recalculate has a very high correlation with attribute_coh_score (0.97), attribute_inf_score (0.92), attribute_flu_score (0.93), and attribute_rel_score (0.90). This suggests that these four attributes heavily influence the overall quality score.
  • Moderate Correlations: attribute_rel_score shows a decent correlation with attribute_inf_score (0.84) and attribute_flu_score (0.76).
  • Low Correlations: attribute_fact_score has relatively low correlations with most other attributes.
  • Insignificant Correlations: Attributes like attribute_intent_tag_ct and attribute_inst_ct have very low correlations with others.
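
The correlation matrix above can be reproduced with a few lines of pandas and matplotlib; in this sketch, the parquet path is a hypothetical placeholder for the attribute table produced by the previous steps.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet("sft_attributes.parquet")  # hypothetical attribute table
attribute_cols = [c for c in df.columns if c.startswith("attribute_")]

corr = df[attribute_cols].corr()  # Pearson correlation between attribute scores

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(corr.values, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(attribute_cols)))
ax.set_xticklabels(attribute_cols, rotation=90)
ax.set_yticks(range(len(attribute_cols)))
ax.set_yticklabels(attribute_cols)
fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.savefig("attribute_correlation_heatmap.png")
```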

Intention Analysis

Understanding the intention behind each instruction helps us better categorize and balance the dataset. We use the InsTag algorithm to automatically assign intention tags to each instruction, enabling fine-grained analysis of what users are trying to accomplish. Some examples of intention tags are listed below:

Intent Tags Sample Top 30 Intent Tags by Count

Common Tag Combinations:

Top 30 Tag Pairs by Co-occurrence

Quality Analysis

We analyze the quality scores across different data sources to understand which datasets contribute higher-quality instructions. This helps inform our sampling strategy and identify potential areas for improvement.

Group analysis on Source:

Quality Scores Table by Source Average Quality Score by Source

We can see that on average, the instructions from WebInstructFull have the highest overall quality score. Instructions from lmsys-chat-1m have the lowest overall quality score, which is reasonable because these instructions are casual human inputs to LLMs.

Diversity Analysis

Diversity in the training data is crucial for building robust models that can handle a wide range of tasks. We measure diversity using two metrics: the number of unique intention tags and Shannon Entropy (which accounts for the balance of tag distribution).

Interestingly, although lmsys-chat-1m has a much lower number of unique intent tags than Infinity-Instruct-7M, its tags are quite balanced (as we can see from the word clouds). Therefore, when tag diversity is measured by Shannon entropy, lmsys-chat-1m ranks first.
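
For reference, a minimal sketch of the two diversity metrics (unique-tag count and Shannon entropy over the tag frequency distribution):

```python
import math
from collections import Counter

def tag_diversity(tag_lists) -> dict:
    """tag_lists: iterable of per-example intention-tag lists for one source."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {"unique_tags": len(counts), "shannon_entropy": entropy}
```

A source with fewer unique tags but a flatter tag distribution (such as lmsys-chat-1m) can therefore score higher entropy than a source with many rare, highly skewed tags.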

Unique Intent Tags by Source Diversity Entropy

Word Clouds by Source:

WordCloud Infinity-Instruct WordCloud LMSYS WordCloud WebInstruct WordCloud WizardLM

2. Experiments

We perform SFT experiments and evaluate models with OpenBenchmarks. Based on the open-sourced Qwen2.5-Coder-7B-Base model, we train a baseline instruct model with a pre-released collection of open-source SFT data. We then aim to beat this baseline with an in-house curated version of the SFT data. Ultimately, we aim to beat Qwen2.5-Coder-7B-Instruct.

2-1. Baseline Setup

We choose Qwen2.5-Coder-7B-Base as our foundation model due to its strong coding capabilities and open-source availability. This section describes the base model’s characteristics and establishes baseline performance metrics.

2-1-1. Base Model

Based on the open-sourced Qwen2.5-Coder-7B-Base model, we train a baseline instruct model.

💻 Code More: Qwen2.5-Coder builds on the strong Qwen2.5 and continues training on a larger scale of code data, including source code, text-code grounding data, and synthetic data, totaling 5.5 trillion tokens.

📚 Learn More: While enhancing coding abilities, we aimed to retain strengths in math and general capabilities from the base model. Therefore, Qwen2.5-Coder incorporates additional data on mathematics and general abilities.

  • ✨ Supporting long context understanding and generation with the context length of 128K tokens
  • ✨ Supporting 92 coding languages
  • ✨ Retain strengths in math and general capabilities from base model

Special tokens:

   {
	"<|fim_prefix|>": 151659,
	"<|fim_middle|>": 151660,
	"<|fim_suffix|>": 151661,
	"<|fim_pad|>": 151662,
	"<|repo_name|>": 151663,
	"<|file_sep|>": 151664,
	"<|im_start|>": 151644,
	"<|im_end|>": 151645
}
Qwen Coder Architecture Qwen Coder Benchmarks

If we want to compare our model against Qwen2.5-Coder-instruct on popular benchmarks, we mainly focus on these benchmarks: MMLU, ARC-Challenge, TruthfulQA, WinoGrande, HellaSwag.

Qwen Instruct Comparison Benchmark Metrics

Additionally, if we want to compare general language abilities, including English, Chinese and Multilingual, we may want to compare with the Qwen 7B non-coder instruct model as well.

Qwen Multilingual

2-1-2. Data Candidates

For the baseline instruct model, we look for the best pre-released collection of open-source SFT data.

The WebInstructFull dataset helps build a baseline with strong reasoning capabilities, which suits our coding LLM. Consisting of 10 million instruction-response pairs, this dataset helps LLMs improve by 10%+ on MBPP and ArenaHard and by 20%+ on MATH and GSM8K.

Recommended finetuning hyper-parameters for Qwen2-7B on InfinityInstruct & WebInstruct:

   epoch: 3
lr: 1e-5
min_lr: 0
lr_warmup_steps: 40
lr_decay_style: cosine
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.95
global_batch_size: 528
clip_grad: 1.0

Benchmark Descriptions:

Natural Language, Knowledge & Reasoning

  • MMLU: massive multitask language understanding. Performance has begun to plateau.
  • MMLU-Pro: includes more complex and challenging reasoning questions.
  • C-Eval: knowledge and reasoning in a Chinese context.

Commonsense:

  • HellaSwag: commonsense natural language inference.

Instruction Following:

  • IFEval: focuses on a set of “verifiable instructions” such as “write in more than 400 words” and “mention the keyword of AI at least 3 times” (a toy checker for these two constraints is sketched below).
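
A toy sketch of how such verifiable constraints can be checked programmatically (not the official IFEval implementation):

```python
def check_min_words(response: str, min_words: int = 400) -> bool:
    """'Write in more than 400 words' is verifiable by simple word counting."""
    return len(response.split()) > min_words

def check_keyword_frequency(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    """'Mention the keyword of AI at least 3 times' is verifiable by counting occurrences."""
    return response.lower().count(keyword.lower()) >= min_count
```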

Science Knowledge:

  • GPQA: questions on which PhD-level experts reach about 65% accuracy, while highly skilled non-expert validators reach only about 34% even after spending 30 minutes searching for answers online (hence “Google-Proof”).

Baseline Results:

| Model | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Base (official): Our Eval | 63.75 | 22.95 | 34.2 | 70.36 | 25.25 | 21.72 |
| Qwen2.5-Coder-7B-Instruct (official): Paper | 68.7 | 45.6 | 58.6 | 61.4 | 35.6 | - |
| Qwen2.5-Coder-7B-Instruct (official): Our Eval | 65.15 | 47.75 | 60.81 | 61.59 | 32.78 | 78.07 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (LIMA) | 63.1 | 35.4 | 29.76 | 62.85 | 29.8 | 62.07 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (WebInsFull) | 5.15 | 25.75 | 39.93 | 56.91 | 4.04 | 38.97 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (Infinity-Gen) | 65.3 | 45.1 | 63.03 | 61.66 | 31.82 | 70.31 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (MergeAll_sampled_2.5m) | 36 | 34.55 | 40.48 | 60.55 | 16.16 | 78.89 |

2-2. Curation Ablation

To understand the contribution of each component in our data curation pipeline, we conduct systematic ablation studies. We compare models trained on data processed through different stages of our pipeline to quantify the impact of each filtering and enhancement step.

Based on the open-sourced Qwen2.5-Coder-7B-Base model, and our curated SFT data, we aim to train a SOTA instruct model.

Ablation with Original Responses:

First, we test our curation pipeline while keeping the original responses from the source datasets:

| Procedure | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Base (official) | N/A | 63.75 | 22.95 | 34.2 | 70.36 | 25.25 | 21.72 |
| Qwen2.5-Coder-7B-Instruct (official) | N/A | 65.15 | 47.75 | 60.81 | 61.59 | 32.78 | 78.07 |
| 0. dedup+decontam+sample | 2.5m | 36 | 34.55 | 40.48 | 60.55 | 16.16 | 78.89 |
| 1. instruction quality filter | 1.7m | 29.75 | 33.45 | 39.56 | 61.96 | 20.2 | 67.94 |

Ablation with GPT-4o Responses:

Next, we replace the original responses with GPT-4o generated responses, which significantly improves model performance across all metrics:

| Procedure | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1. dedup+decontam+sample (gpt4o response) | 2.2m | 67.95 | 51.75 | 58.6 | 67.98 | 30.81 | 82.16 |
| 1. instruction quality filter (gpt4o response) | 1.7m | 67.5 | 50.5 | 54.71 | 68.57 | 28.28 | 81.05 |
| 2.1 complexity-only filter (gpt4o response) | 152K | 55.15 | 45.85 | 48.8 | 63.22 | 30.3 | 74.74 |
| 2.2 complexity-first diversity filter (gpt4o response) | 152K | 60.95 | 44.65 | 49.72 | 63.74 | 34.34 | 79.57 |

Comparison at Same Data Size (152K):

To isolate the effect of our sampling strategy, we compare different sampling methods using the same data budget of 152K samples:

| Procedure | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2.0 random sampling (gpt4o response) | 152K | 54.4 | 43.5 | 50.46 | 64.93 | 25.25 | 78.5 |
| 2.1 complexity-only filter (gpt4o response) | 152K | 55.15 | 45.85 | 48.8 | 63.22 | 30.3 | 74.74 |
| 2.2 complexity-first diversity filter (gpt4o response) | 152K | 60.95 | 44.65 | 49.72 | 63.74 | 34.34 | 79.57 |

Final Results with Scaled Data:

Finally, we scale up our best sampling strategy (complexity-first diversity) to larger data sizes and compare against baselines:

| Procedure / Dataset | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Base (official) | N/A | 63.75 | 22.95 | 34.2 | 70.36 | 25.25 | 21.72 |
| Qwen2.5-Coder-7B-Instruct (official): Paper | N/A | 68.7 | 45.6 | 58.6 | 61.4 | 35.6 | - |
| Qwen2.5-Coder-7B-Instruct (official): Our Eval | N/A | 65.15 | 47.75 | 60.81 | 61.59 | 32.78 | 78.07 |
| complexity-first diversity (v1, k=1) | 389k | 65.3 | 49.8 | 53.23 | 67.9 | 26.77 | 79.97 |
| complexity-first diversity (v1, k=5) | 703k | 68.15 | 50.85 | 56.01 | 67.53 | 29.29 | 80.47 |
| complexity-first diversity (v2, k=1) | 1.8m | 69.45 | 52.1 | 56.93 | 65.75 | 31.31 | 81.11 |
| complexity-first diversity (v2, k=5) | 2m | 69.05 | 52.6 | 60.07 | 67.83 | 31.82 | 82.25 |
| Baseline: Infinity-Gen | 1.4m | 65.3 | 45.1 | 63.03 | 61.66 | 31.82 | 70.31 |

2-3. Full Benchmark Comparison

Our best model (General-Only) vs Qwen2.5-Coder-7B-Instruct (official)

Generic Benchmarks

| Benchmark | Qwen2.5-Coder-7B-Instruct | Ours (General-Only) | Delta |
| --- | --- | --- | --- |
| MMLU | 65.15 | 68.75 | +3.60 |
| MMLU-Pro | 47.75 | 52.05 | +4.30 |
| C-Eval | 61.59 | 66.12 | +4.53 |
| HellaSwag | 78.07 | 81.16 | +3.09 |
| IFEval | 60.81 | 61.18 | +0.37 |
| GPQA | 32.78 | 36.87 | +4.09 |

Coding Benchmarks

| Benchmark | Qwen2.5-Coder-7B-Instruct | Ours (General-Only) | Delta |
| --- | --- | --- | --- |
| AutoEval v7 | 43.77 | 46.96 | +3.19 |
| Arena Hard | 36.47 | 50.49 | +14.02 🔥 |
| HumanEval | 85.37 | 85.37 | 0 |
| MBPP | 80.40 | 75.80 | -4.60 |
| BigCode Bench | 46.32 | 48.25 | +1.93 |
| Aider Bench | 49.62 | 42.11 | -7.51 |
| LiveCode Bench | 34.50 | 39.13 | +4.63 |

Math Benchmarks

| Benchmark | Qwen2.5-Coder-7B-Instruct | Ours (General-Only) | Delta |
| --- | --- | --- | --- |
| MATH | 68.28 | 70.90 | +2.62 |
| GSM8K | 88.17 | 90.90 | +2.73 |
| Olympiad Bench | 32.89 | 39.11 | +6.22 |
| AMC 2023 | 41.75 | 52.50 | +10.75 🔥 |
| AIME 2024 | 7.33 | 6.67 | -0.66 |

Summary: Our model matches or outperforms the official Qwen2.5-Coder-7B-Instruct on 15 out of 18 benchmarks, with particularly strong gains on Arena Hard (+14.02) and AMC 2023 (+10.75).

2-4. Analysis

In this section, we analyze the key factors contributing to our model’s performance improvements and derive insights that can guide future data curation efforts.

Diversity matters

Our experiments reveal a clear hierarchy in sampling strategies. Given the same sampling budget (a fixed number of sampled prompt-response pairs), Complexity-first diverse sampling > complexity-first sampling > random sampling.

| Procedure / Dataset | Data Quantity | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2.0 random sampling (gpt4o response) | 150K | 54.4 | 43.5 | 64.93 | 78.5 | 50.46 | 25.25 |
| 2.1 complexity-only filter | 150K | 55.15 | 45.85 | 63.22 | 74.74 | 48.8 | 30.3 |
| 2.2 complexity-first diversity | 150K | 62.91 | 46.66 | 63.74 | 73.57 | 49.72 | 34.34 |

Scaling with complexity-first diverse sampling

We also investigate how performance scales with data quantity under our complexity-first diverse sampling strategy. The results show that more sampling budget generally leads to better performance, but not necessarily on all metrics:

| Data Quantity (sampled from 9.8m) | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
| --- | --- | --- | --- | --- | --- | --- |
| 380k (v1, k=5) | 65.3 | 48.67 | 67.89 | 78.87 | 53.23 | 26.77 |
| 700k (v1, k=20) | 68.15 | 50.85 | 67.53 | 80.47 | 56.01 | 29.29 |
| 1.8m (v2, k=1) | 69.45 | 52.1 | 65.75 | 81.11 | 56.93 | 31.82 |
| 2m (v2, k=5, bound=2m) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |

| Model | Data Quantity | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder-7B-Instruct | N/A | 65.15 | 47.75 | 61.59 | 78.07 | 50.27 | 32.67 |
| Ours (complexity-first diversity) | 380k (v1, k=5) | 68.15 | 50.85 | 67.53 | 80.47 | 56.01 | 29.29 |
| Ours (complexity-first diversity) | 700k (v1, k=20) | 68.75 | 51.58 | 66.12 | 81.16 | 61.18 | 36.87 |
| Ours (complexity-first diversity) | 2m (v2, k=5, bound=2m) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |

Furthermore, we discover a sampling scaling law: given the same sampling budget, the larger the source pool we sample from, the better performance we generally can achieve (though not necessarily on all metrics):

complexity-first diversity:

| Sample From | Sampled Quantity | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| sample from 2.5m | 150K (v2, k=5) | 60.86 | 44.05 | 66.74 | 79.55 | 49.72 | 34.34 |
| sample from 9.8m | 380K (v1, k=5) | 65.3 | 48.67 | 67.89 | 78.87 | 53.23 | 26.77 |
| sample from 9.8m | 1md (v1, k=5, bound = 2m) | 68.55 | 52.67 | 68.2 | 82.07 | 65.06 | 32.32 |
| sample from 18m | 1md (v1, k=5, bound = 2m) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |

Stable Tag Frequency Matters

An important finding is that directly mixing data obtained through diverse sampling with other datasets (regardless of their origin) tends to result in incompatibilities. This often leads to some metrics showing improvement while others decrease—likely due to a significant shift in the frequency distribution of certain tags. The table below illustrates this phenomenon:

| Generic | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-coder | 65.15 | 47.75 | 61.59 | 78.07 | 50.27 | 32.67 |
| Our-SFT-Curation-1.8m | 68.75 | 51.58 | 66.12 | 81.16 | 61.18 | 36.87 |
| +inst_follow_2k(yi) | 68.8 | 53.4 | 65.70 | 63.27 | 63.32 | 41.85 |
| +inst_follow_23k | 69.55 | 53.05 | 65.9 | 81.42 | 52.5 | 36.87 |
| +inst_follow_5k | 68.95 | 51.86 | 65.82 | 81.52 | 62.66 | 36.36 |
| +inst_follow_0.5k/qa1 | 68.55 | 52.15 | 65.82 | 81.27 | 60.07 | 33.84 |
| infinity-Gen (gpt4 response) | 65.45 | 51.66 | 70.31 | 83.01 | 31.82 | - |

What tags help with GPQA and IFEval?

We further analyze which intention tags contribute to improvements on specific benchmarks. By comparing models trained on data from different source pools, we can identify the tags that are responsible for performance gains:

| Sample from | Sampled | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 9.8m | 2m (v2, k=5) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |
| 18.5m | 1.8m* (v2, k=5) | 68.75 | 52.05 | 66.12 | 81.16 | 61.18 | 36.87 |

*1.8m: This run was originally capped at 2m as well, but after filtering out multi-round conversations, only 1.8m rows were left. The 2m sampled from the 9.8m pool did not suffer this reduction because that pool already had multi-round conversations pruned out.

Analysis: Based on the tag-frequency analysis, the improvement on the GPQA metric is mainly due to the frequency uplift of science-related tags, including the following:

Tags contributing to GPQA improvement: Geographical Location, Result Rounding, Financial Data Analysis, Word Limit Specification, Kinematics Problem, Interpretation of Results, Literary Explanation, Chemical Reaction Explanation, Game Theory, Matrix Analysis, Psychological Explanation, Physical Science Problem, Entity Relationship Analysis, Electrostatics Problem, Polynomial Root Finding, Production Analysis, Graphical Illustration, Vector-Valued Function

The newly added tags, however, have fairly low counts and are not the main factor contributing to the performance improvement.

3. Conclusions

Based on our extensive experiments, we summarize the key findings from this SFT data curation study:

  • The effectiveness of complexity-first diverse sampling has been validated.
  • A diversified sampling method based on a fine-grained open-label system (rather than a top-down, comprehensive label system built on prior knowledge) can also achieve good results.
  • Properly sampled data is superior to using the full dataset.
  • Complexity-first diverse sampling > complexity-first sampling > random sampling.
  • Through ablation experiments, a sampling scaling law was discovered: under the settings of this method, the larger the total source dataset, the better the performance of the sampled data of the same size.
  • Under the premise of ensuring diversity, the complexity-first method inherently prioritizes code and mathematical elements (sampling results: code of any form/mathematical calculations/others = 26.74%/35.23%/38.03%).

References

  1. DEITA: What Makes Good Data for Alignment
  2. Judging LLM-as-a-Judge
  3. Self-Rewarding Language Models
  4. WebInstruct: MAmmoTH2
  5. Infinity-Instruct
  6. LMSYS-Chat-1M
  7. Qwen2.5-Coder
  8. InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models