🎯 Key Result: Using the same Qwen2.5-Coder-7B-Base model, our SFT data curation pipeline produces an instruct model that outperforms the official Qwen2.5-Coder-7B-Instruct across Generic, Coding, and Math benchmarks:
- Generic: MMLU 68.75 vs 65.15, C-Eval 66.12 vs 61.59
- Coding: Arena Hard 50.49 vs 36.47 (+14🔥), LiveCode Bench 39.13 vs 34.50
- Math: MATH 70.90 vs 68.28, GSM8K 90.90 vs 88.17, AMC 2023 52.50 vs 41.75 (+10.8🔥)
1. Methodologies
Given the open-source Qwen2.5-Coder base model, we collect and curate its instruction fine-tuning data.
1-1. Data Collection & Preprocess
| Dataset | HuggingFace Source | More Info |
|---|---|---|
| LIMA | tulu-v2-sft-mixture-lima | Paper: arxiv, Size: 1,018 |
| lmsys-chat-1m | lmsys-chat-1m | Paper: arxiv, Size: 1,000,000. 1M real conversations (210K users with 25 different SOTA LLMs), 154 languages, avg 2 turns per conversation |
| WizardLM v1 | WizardLM_evol_instruct_70k | Paper: arxiv, v1: 70,000 (single-turn) |
| WizardLM v2 | WizardLM_evol_instruct_V2_196k | v2: 143,000 (includes multi-turn) |
| WebInstructFull | WebInstructFull | Paper: arxiv, Size: 11,621,594 (5B tokens). Mined 10M from CC, rewritten with Mixtral & Qwen. 20%+ improvement on MATH/GSM8K, ~10% on MBPP/ArenaHard |
| InfinityInstruct | Infinity-Instruct | GitHub |
1-2. Data Analysis
1-2-1. Infinity Instruct
To construct a ten-million high-quality instruction dataset, we collect a large amount of open-source data as seed and iterate the dataset using two strategies: instruction selection and instruction evolution. We recommend applying the Foundational Dataset, which contains millions of instructions selected from open-source datasets, to improve the performance of models on challenging downstream tasks (e.g., code, math). We recommend applying the Chat Dataset, which contains about 1M instructions evolved from a small subset of high-quality seed data, to further improve the instruction-following ability of models in real conversation scenarios.
Subjective dataset components:
| Raw Dataset | Rows | HuggingFace URL | Paper URL |
|---|---|---|---|
| Alpaca GPT4 data | 13,490 | alpaca-gpt4-data | N/A |
| Alpaca GPT4 data zh | 32,589 | alpaca-gpt4-data-zh | N/A |
| Baize | 14,906 | baize-v2-13b | arxiv |
| BELLE Generated Chat | 43,775 | generated_chat_0.4M | GitHub |
| BELLE Multiturn Chat | 210,685 | multiturn_chat_0.8M | |
| BELLE 3.5M CN | 312,598 | train_3.5M_CN | |
| BELLE School Math | 38,329 | school_math_0.25M | |
| databricks-dolly-15K | 10,307 | databricks-dolly-15k | N/A |
| LIMA-sft | 712 | tulu-v2-sft-mixture-lima | arxiv |
| CodeContest | 523 | code_contests | arxiv |
| LongForm | 3,290 | LongForm | arxiv |
| ShareGPT-Chinese-English-90k | 8,919 | ShareGPT-Chinese-English-90k | N/A |
| UltraChat | 276,345 | ultrachat_200k | arxiv |
| Wizard evol instruct zh | 44,738 | EvolInstruct_zh_DeepseekAPI | arxiv |
| Wizard evol instruct 196K | 88,681 | - | arxiv |
| Code Alpaca 20K | 13,296 | - | GitHub |
| WildChat | 61,873 | WildChat-1M | arxiv |
| COIG-CQIA | 45,793 | COIG-CQIA | arxiv |
| BAGEL | 55,193 | code_bagel | N/A |
| DEITA | 10,000 | deita-10k-v0 | arxiv |
| Math | 320,130 | - | N/A |
| Summary | 1,362,000 | | |
1-2-2. WebInstructFull
Existing SFT Datasets:
| Dataset | #Pairs | Domain | Format | Dataset Source |
|---|---|---|---|---|
| FLAN V2 | 100K | General | SFT | NLP data + Human CoT |
| Self-Instruct | 82K | General | SFT | Generated by GPT3 |
| GPT4-Alpaca | 52K | General | SFT | Generated by GPT4 |
| SuperNI | 96K | General | SFT | NLP Datasets |
| Tora | 16K | Math | SFT | GPT4 GSM+MATH Synthesis |
| WizardMath | 96K | Math | SFT | GPT4 GSM+MATH Synthesis |
| MathInstruct | 262K | Math | SFT | GPT4 Math datasets Synthesis |
| MetaMathQA | 395K | Math | SFT | GPT-3.5-Turbo GSM+MATH Synthesis |
| XwinMath | 1.4M | Math | SFT | GPT4 GSM+MATH Synthesis |
| OpenMathInstruct | 1.8M | Math | SFT | Mixtral GSM+MATH Synthesis |
Existing CT (Continue Training) Datasets:
| Dataset | #Tokens | Domain | Format | Dataset Source |
|---|---|---|---|---|
| OpenWebMath | 12B | Math | LM | Filtered from Web |
| MathPile | 10B | Math | LM | Filtered from Web |
| Cosmopedia | 25B | General | LM | Synthesized by Mixtral |
| MINERVA | 38B | Math | LM | Filtered from Web |
| Proof-Pile-2 | 55B | Math | LM | OWM+Arxiv+Code |
| Galactica | 106B | Math & Sci. | LM | Filtered from Web |
| DeepseekMath | 120B | Math | LM | Recalled from Web |
| WebInstruct | (10M) 5B | Math & Sci. | SFT | Recall and Extracted from Web |
The SFT datasets are mostly derived from existing NLP datasets or synthesized entirely by GPT-4. The CT datasets are much larger because they are filtered or recalled from the web, but their content contains considerable noise. WebInstruct is the first to combine the two approaches, producing a high-quality yet large-scale SFT dataset.
1-3. Data Pipelines
1-3-1. Open-source Evol & Synthesis
Infinity Instruct is a typical example of using open-source SFT data as seeds and applying instruction evolution for data augmentation.
1. High-Quality Open Source Instruction Collection and Tag System
We start by collecting high-quality open-source instruction sets. We assign each instruction in the collection a set of tags that describe the abilities and knowledge necessary to complete the instruction.
- Instruction collection: We systematically reviewed available open-source instruction sets and included sets created by humans and advanced LLMs.
- Tag System with two levels:
- First-level tags: Describe the specific knowledge and abilities required to complete each instruction (e.g., Arithmetic Calculation, Knowledge of Biology). These tags are generated automatically by an LLM.
- Second-level tags: Macro categories such as “Natural Language Processing” and “Math Reasoning”; there are 25 categories in total.
2. Informative Instruction Selection
This step aims to select the most informative instructions from the whole collection to enhance LLM performance and improve user experience.
- [Complexity] Instructions demand multiple kinds of abilities or multiple domains of knowledge
- [Diversity] Instructions with long-tailed ability or knowledge
- [Difficulty] Instructions with high following difficulty
3. Instruction Generation by Data Evolution Strategy
We expand the seed instructions along the dimensions of breadth, depth, difficulty, and complexity, using a method built on Evol-Instruct.
- Validate the evolved data, and use AI assistants to eliminate data that failed to evolve from the perspective of instruction compliance
- Use the evolved instructions as the initial input, and use an AI assistant to play different roles to generate 2 to 4 rounds of dialogue for each instruction
4. Instruction Generation by Model Ability Deficiency Diagnosis
This stage automatically identifies weaknesses in the model’s capabilities to guide targeted data synthesis.
- Model performance evaluation system
- Automatic ability deficiency diagnosis
- Targeted data synthesis
1-3-2. Web Crawling, Extracting & Refining
WebInstructFull is a typical example of crawling data from the web, extracting QA pairs out of it, and refining the responses.
Stages: (1) high-quality data recall from the web corpus, (2) Q-A pair extraction and (3) Q-A pair refinement.
1. Recall from Common Crawl
To ensure diversity in our training data across various disciplines like math, science, and engineering, we propose crawling exam problems from educational websites such as stemez.com, homeworkstudy.com, and khanacademy.org. We collected 100K diverse seed examples and randomly selected 100K negative documents from Common Crawl (CC) for training a fastText model.
In the initial stage, the trained fastText model recalls the top 100B documents from CC, categorizing them by domain (root URL). We employ GPT-4 to identify domains likely to contain instructional content. Subsequently, we sample additional documents from these selected domains as positive examples and use documents from non-selected domains and the general CC as negative examples to refine the fastText classifier. The updated classifier then recalls the top 18M documents for further processing.
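For illustration, here is a minimal sketch of this fastText-based recall step; the training-file layout, hyper-parameters, and score threshold are assumptions rather than the actual configuration.

```python
# Minimal sketch of the fastText recall step (file names and threshold are hypothetical).
import fasttext

# train.txt: one document per line, prefixed with __label__pos (seed exam/QA pages)
# or __label__neg (random Common Crawl documents).
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,
    dim=100,
)

def recall_score(document_text: str) -> float:
    """Return the classifier's probability that a document is instructional."""
    text = document_text.replace("\n", " ")
    labels, probs = model.predict(text, k=2)
    return dict(zip(labels, probs)).get("__label__pos", 0.0)

# Keep documents above a (hypothetical) threshold for the next pipeline stage.
kept = [doc for doc in ["example document text ..."] if recall_score(doc) > 0.5]
```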
2. Q-A Pair Extraction
Recalled documents contain diverse content from forums, homework, quizzes, and exams. Despite noise like ads and HTML, they contain valuable Q&A pairs. We preprocess by parsing HTML to remove unrelated info. We then use Mixtral-8×7B to identify Q&A pairs, resulting in 5M candidates.
3. Q-A Pair Refinement
To further improve the extracted Q-A pair candidates, we prompt Mixtral-8×7B and Qwen-72B to reformat them. If an answer does not contain any explanation, we prompt the LLMs to complete the intermediate reasoning steps leading to the answer. We adopt two models to increase diversity. Eventually, we harvest 10M Q-A pairs as the final instruction-tuning dataset, WebInstruct.
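As a rough illustration of this refinement step, the sketch below sends an extracted Q-A pair to an OpenAI-compatible chat endpoint; the endpoint URL, model name, prompt wording, and decoding settings are placeholders, since the original work used locally hosted Mixtral-8×7B and Qwen-72B.

```python
# Sketch of Q-A pair refinement via an OpenAI-compatible endpoint.
# The base_url and model name are placeholders for a locally hosted model server.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

REFINE_PROMPT = """Reformat the following question-answer pair extracted from a web page.
If the answer lacks an explanation, add the intermediate reasoning steps leading to it.
Return JSON: {{"question": "...", "answer": "..."}}

Question: {question}
Answer: {answer}"""

def refine_pair(question: str, answer: str) -> dict:
    response = client.chat.completions.create(
        model="mixtral-8x7b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": REFINE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.2,                # assumed decoding setting
    )
    # Assumes the model returns valid JSON; production code would add retries/validation.
    return json.loads(response.choices[0].message.content)
```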
1-3-3. Collecting Real Conversations between Human and LLMs
LMSYS-Chat-1M is a typical example of this genre.
The dataset contains 1 million real-world conversations with 25 state-of-the-art LLMs. It is collected from 210k unique IP addresses in the wild on the Vicuna demo and Chatbot Arena website, from April to August 2023.
Arena-Hard-200, a set of the 200 most challenging, high-quality user prompts, is selected and curated from the LMSYS-Chat-1M dataset.
The data-collection website contains three types of chat interfaces: (1) single model, (2) chatbot arena (battle), and (3) chatbot arena (side-by-side).
1-4. Our Pipeline
Given the two different sources of data mentioned above, we further process the large collection of data with our own pipeline:
Merge General SFT data
Number of Rows: 18,922,281
Exact Dedup & Decontamination
Exact dedup: 18,922,281 → 18,629,484
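A minimal sketch of the exact-dedup step, assuming each sample is a dict with prompt and response fields; the normalization and hashing choices are illustrative, not the exact rules used above.

```python
# Exact dedup sketch: drop samples whose normalized (prompt, response) text repeats.
import hashlib

def sample_key(sample: dict) -> str:
    text = (sample["prompt"].strip() + "\n" + sample["response"].strip()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def exact_dedup(samples: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for sample in samples:
        key = sample_key(sample)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept
```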
Quality Filter
We let an LLM scorer assign each prompt a rating from 1 to 10.
For this comparatively easy task, a smaller in-house or open-source model suffices. We require the result in a fixed format, in “reasons first, score last” order:
{
"reasons": "point out the issues and your reasons for the rating",
"score": "<integer>"
}
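On the consuming side, the scorer output can be parsed along these lines; the regex extraction and the None fallback for malformed outputs are assumptions.

```python
# Parse the "reasons first, score last" JSON emitted by the LLM scorer.
import json
import re

def parse_quality_score(raw_output: str) -> int | None:
    """Extract the integer score; return None if the output is malformed."""
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
        return int(payload["score"])
    except (json.JSONDecodeError, KeyError, ValueError):
        return None
```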
With insufficient prompting, where scoring guidelines are absent and only the output space (e.g., 0-100) is provided, the evaluations tend to be inconsistent and misaligned.
Therefore, we follow an LLM-as-evaluator prompt template:
Generic Criteria:
- A good prompt for benchmarking
- A greater score represents a greater potential to evaluate the LLMs in problem-solving, creativity, and truthfulness
- Trivial or ambiguous user prompts should get lower scores
Domain-based Templates:
Code Template Example
You are an excellent user question qualifier. You are responsible for evaluating the quality of programming questions submitted by users, including various types such as QA questions, multiple-choice, debugging tasks, code explanations, and more. Your goal is to ensure they meet the standards for precision, clarity, and solvability.
Steps:
Think and Understand: Start by thoroughly understanding the question's intent. Consider whether the question meets the qualifying principles listed below.
Analysis: Based on your understanding, explain whether the question satisfies each qualifying principle.
Score: Assign a quality score based on your analysis.
Scoring System:
1: The question is clear, precise, solvable and ready for use by the programming community.
0: The question is ambiguous, unclear or unsolvable and needs further clarification before it can be answered.
Qualifying Principles:
1. **Clear and Concise Problem Statement**:
- Clarity: The problem should be stated clearly. The reader should immediately understand the task after reading the problem.
2. **Relevant Background and Context**:
- Allow simpler or high-level questions to meet the clarity requirement without needing extensive detail. If the question is understandable, it should be rated positively.
3. **Reasonable Assumption**
- Assumption: The question should avoid unnecessary details, allowing the reader to make minor, reasonable assumptions to solve the problem without changing the intent of the question.
4. **Non-Trivial and Achievable Problem**:
- Difficulty: The question should present a problem that is solvable given the tools and constraints available.
5. **No Further Detail is Asked**:
- The question should not explicitly ask for more details or context to be provided by the responder. If the question itself is framed as a request for further clarification or additional information, it should be considered incomplete and not valid for evaluation.
Instruction Following:
- Please adhere strictly to the provided output format in the few-shot examples.
- Your response should consist of three essential sections: Thinking Steps, Analysis, Json Output. Complexity Filter
A prompt can be considered challenging if it requires integrating various kinds of knowledge and skills to derive an appropriate response. Measuring this directly would require both knowledge tagging and skill tagging, which overlaps with our diversity measure.
Therefore, as a first resort, we simply approximate complexity as the number of explicit instructions in a user query.
Instruction Counting: Count the number of specific instructions requested by the user.
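The resulting count can then serve as a simple complexity gate, as sketched below; `llm_count_instructions` is a hypothetical wrapper around the counting prompt that follows, and the threshold of 2 is illustrative.

```python
# Complexity filter sketch: keep prompts with enough explicit instructions.
# `llm_count_instructions` stands in for an LLM call using the counting prompt
# below, returning the parsed "instruction_count" field.
from typing import Callable

def complexity_filter(
    prompts: list[str],
    llm_count_instructions: Callable[[str], int],
    min_instructions: int = 2,  # threshold is illustrative, not the production value
) -> list[str]:
    return [p for p in prompts if llm_count_instructions(p) >= min_instructions]
```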
PE for general instruction counting
You are an expert in analyzing user queries. Your task is to identify and enumerate all the specific explicit instructions present in a given user query.
---
**Requirements:**
Please note that there is no response provided. Your focus should be solely on the user's query.
Please list all the specific explicit instructions found in the user's query.
---
**Output Format:**
Please provide your output in the following JSON format:
{
"instructions": [
"Instruction 1",
"Instruction 2",
"Instruction 3"
],
"instruction_count": X
}
---
**User Query:**
{user_prompt}

PE for instruction-following analysis (with response)
You are an expert in evaluating how well responses follow user instructions. Your task is to analyze the given user query and the corresponding response, identify all the specific explicit instructions in the user query, and assess how well the response fulfills each instruction.
---
**Requirements:**
1. **Instruction Identification:**
- List all the specific explicit instructions found in the user's query.
2. **Response Analysis:**
- For each instruction, analyze whether the response satisfies it completely, partially, or not at all.
- Provide reasons for your assessment.
3. **Scoring:**
- Assign a score between 0 and 10 based on how well the response follows the instructions, where:
- **0** means the response is completely unrelated to the user's instructions.
- **10** means the response fully satisfies all the instructions in the user's query.
---
**Output Format:**
Please provide your output in the following JSON format:
{
"instructions": [
"Instruction 1",
"Instruction 2",
"Instruction 3"
],
"analysis": {
"Instruction 1": "Analysis of how well the response fulfills Instruction 1.",
"Instruction 2": "Analysis of how well the response fulfills Instruction 2.",
"Instruction 3": "Analysis of how well the response fulfills Instruction 3."
},
"score": X
}

Diversity: Intention Tagging & Reweighting
Plus, intention tagging is important. This tagging will also cover both knowledge and skill.
InsTag for diversity and complexity.
This algorithm adopts a “complexity-first” strategy: it prioritizes queries with more tags (i.e., more required knowledge and skills) and checks whether each addition to the sub-dataset increases tag diversity. By doing so, it ensures that the final sampled sub-dataset not only meets the size requirement N but also maintains a high level of tag diversity. This approach helps the sampled subset represent as many tag categories from the original dataset as possible, even with a limited sample size.
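A minimal sketch of this complexity-first, diversity-aware selection is shown below, assuming each sample already carries its intention tags; the tie-breaking and back-fill rules are assumptions and may differ from the actual implementation.

```python
# Complexity-first diverse sampling sketch: sort by tag count (complexity proxy),
# then greedily keep samples that add unseen tags until the budget N is reached.
def complexity_first_diverse_sample(samples: list[dict], n: int) -> list[dict]:
    """Each sample is assumed to look like {"prompt": ..., "tags": ["...", ...]}."""
    ranked = sorted(samples, key=lambda s: len(s["tags"]), reverse=True)
    selected, covered = [], set()
    # First pass: keep a sample only if it contributes at least one unseen tag.
    for sample in ranked:
        if len(selected) >= n:
            break
        new_tags = set(sample["tags"]) - covered
        if new_tags:
            selected.append(sample)
            covered |= new_tags
    # Second pass (assumed rule): if the budget is not filled, back-fill with the
    # most complex remaining samples.
    if len(selected) < n:
        chosen = {id(s) for s in selected}
        for sample in ranked:
            if len(selected) >= n:
                break
            if id(sample) not in chosen:
                selected.append(sample)
    return selected
```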
PE for annotating intention tags
PE backbone:
You are an excellent user query tagging expert. Given the following example tags described below, for a given user query, you are responsible for providing one or more tags that fully cover the intentions of the user. Your goal is to ensure user prompts are tagged properly for user intention analysis.
**Example tags:**
{examples}
The example tags are not exhaustive, you can provide other finer tags as you see appropriate.
---
**Output Format:**
Please provide the evaluation for the given query in the following JSON format:
{
"explanation": "Explain what the tag is about",
"reasons": "Point out your reasons in assigning the tags to this query",
"tag": "list<string>"
}
---
**Query:**

Example tags include: Tagging System, Instruction Classification, Format Specification, Intent Analysis, Content Generation, Information Retrieval, Data Extraction, Summarization, Translation, Sentiment Analysis, Error Correction, Advice Seeking, Educational Instruction, Task Automation, Content Moderation, Code Generation, Emotion Detection, Personalization, Knowledge Retrieval, Data Analysis, Opinion Generation, Scheduling, Problem Solving, Hypothetical Scenario, Comparative Analysis, Definition Request, Paraphrasing, Trend Analysis, Formatting, Simulation, Clarification Request, Example Generation, Step-by-Step Explanation, Creative Writing, Role-playing, Algorithm Explanation, Metadata Extraction, Pattern Recognition, Policy Compliance Check, Benchmarking, Contextual Understanding, Conversational Continuation, Error Diagnosis, Code Refactoring, Language Learning Assistance, Voice Tone Analysis, Mood Setting, Personal Development, Mind Mapping, Goal Setting, Proofreading, Fact Checking, Joke Telling, Storytelling, Songwriting, Poem Composition, Historical Contextualization, Cultural Explanation, Mathematical Calculation, Scientific Explanation, Logical Reasoning, Analogy Creation, Visual Description, User Feedback Analysis, Resource Recommendation, Time Management, Memory Recall, Event Planning, Feedback Provision, Stress Testing, Hypothesis Formation, Data Visualization, Conflict Resolution, Etiquette Guidance, Idea Brainstorming, Priority Setting, Budget Planning, Negotiation Strategy, Product Review, Risk Assessment, Language Style Conversion, Protocol Simulation, Statistical Analysis, Energy Conservation Tips, Environmental Impact Assessment, Health and Wellness Guidance, Ethical Dilemma Discussion, Memory Enhancement Techniques, etc.
Response Enhancement
After filtering for high-quality instructions, we also improve the response quality through two strategies:
(1) GPT-4o response replacement: We regenerate responses using GPT-4o to ensure higher quality and more consistent formatting across the dataset (see the sketch after this list).
(2) Strategy tag for response selection: For instructions with multiple candidate responses, we use strategy tags to select the most appropriate response based on the instruction’s domain and complexity.
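A hedged sketch of the response-replacement step in (1) using the OpenAI API is given below; the system prompt and temperature are assumptions, not the exact settings used here.

```python
# Sketch of GPT-4o response replacement for a filtered instruction (strategy 1).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def regenerate_response(instruction: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},  # assumed system prompt
            {"role": "user", "content": instruction},
        ],
        temperature=0.7,  # assumed decoding setting
    )
    return completion.choices[0].message.content
```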
Attribute Analysis
With all the filtering and enhancement steps complete, we now analyze the processed dataset to understand its characteristics. We examine the statistical distributions of various attributes, identify correlations between features, and ensure the final dataset maintains high quality and diversity.
Outlier Detection and Elimination
To ensure data quality, we identify and remove statistical outliers that could negatively impact model training. The following tables show the dataset statistics before and after outlier removal.
Before pruning the outliers:
After pruning: 2,407,758 rows remain [v1]; 2,334,645 rows remain [v2].
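One possible way to implement such pruning is an IQR rule over the numeric attribute columns, as sketched below; the rule, the multiplier, and the column names are assumptions.

```python
# IQR-based outlier pruning sketch over numeric attribute columns (rule is illustrative).
import pandas as pd

def prune_outliers(df: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Example call (column names are hypothetical):
# pruned = prune_outliers(df, ["attribute_total_quality_score_recalculate", "attribute_inst_ct"])
```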
Distributions
Based on the distribution below, we filter the data accordingly.
Correlation analysis (see the sketch after this list):
- High Correlations: attribute_total_quality_score_recalculate has a very high correlation with attribute_coh_score (0.97), attribute_inf_score (0.92), attribute_flu_score (0.93), and attribute_rel_score (0.90). This suggests that these four attributes heavily influence the overall quality score.
- Moderate Correlations: attribute_rel_score shows a decent correlation with attribute_inf_score (0.84) and attribute_flu_score (0.76).
- Low Correlations: attribute_fact_score has relatively low correlations with most other attributes.
- Insignificant Correlations: Attributes like attribute_intent_tag_ct and attribute_inst_ct have very low correlations with others.
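For reference, correlations of this kind can be computed directly from the attribute columns, as sketched below; `df` stands for the processed dataset and the column names follow the attributes listed above.

```python
# Sketch: Pearson correlation matrix over attribute columns, returning the strongest pairs.
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, columns: list[str], k: int = 5) -> pd.Series:
    """Return the k strongest absolute pairwise correlations among the given columns."""
    corr = df[columns].corr(method="pearson")
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle, no diagonal
    pairs = corr.where(mask).stack().abs().sort_values(ascending=False)
    return pairs.head(k)

# Example call (column names as discussed above):
# top_correlations(df, ["attribute_total_quality_score_recalculate", "attribute_coh_score",
#                       "attribute_inf_score", "attribute_flu_score", "attribute_rel_score"])
```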
Intention Analysis
Understanding the intention behind each instruction helps us better categorize and balance the dataset. We use the InsTag algorithm to automatically assign intention tags to each instruction, enabling fine-grained analysis of what users are trying to accomplish. Some examples of intention tags are listed below:
Common Tag Combinations:
Quality Analysis
We analyze the quality scores across different data sources to understand which datasets contribute higher-quality instructions. This helps inform our sampling strategy and identify potential areas for improvement.
Group analysis on Source:
We can see that on average, the instructions from WebInstructFull have the highest overall quality score. Instructions from lmsys-chat-1m have the lowest overall quality score, which is reasonable because these instructions are casual human inputs to LLMs.
Diversity Analysis
Diversity in the training data is crucial for building robust models that can handle a wide range of tasks. We measure diversity using two metrics: the number of unique intention tags and Shannon Entropy (which accounts for the balance of tag distribution).
Interestingly, although lmsys-chat-1m has far fewer unique intent tags than Infinity-Instruct-7M, its tags are quite balanced (as the word clouds show). Therefore, when tag diversity is measured by Shannon entropy, lmsys-chat-1m ranks first.
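Both diversity metrics are straightforward to compute from per-sample tag lists, as sketched below; using the natural log for Shannon entropy is a convention choice.

```python
# Diversity metrics sketch: number of unique intention tags and Shannon entropy
# of the tag frequency distribution (natural log; base is a convention choice).
import math
from collections import Counter

def tag_diversity(tag_lists: list[list[str]]) -> tuple[int, float]:
    counts = Counter(tag for tags in tag_lists for tag in tags)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts), entropy

# Example: a source with fewer but more balanced tags can still score higher entropy.
n_unique, entropy = tag_diversity([["Code Generation"], ["Summarization"], ["Code Generation"]])
```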
Word Clouds by Source:
2. Experiments
We perform SFT experiments and evaluate models with OpenBenchmarks. Based on the open-sourced Qwen2.5-Coder-7B-Base model, we train a baseline instruct model with a pre-released collection of open-sourced SFT data. We then aim to beat the baseline with an in-house curated version of the SFT data. Ultimately, we aim to beat Qwen2.5-Coder-7B-Instruct.
2-1. Baseline Setup
We choose Qwen2.5-Coder-7B-Base as our foundation model due to its strong coding capabilities and open-source availability. This section describes the base model’s characteristics and establishes baseline performance metrics.
2-1-1. Base Model
Based on the open-sourced Qwen2.5-Coder-7B-Base model, we train a baseline instruct model.
💻 Code More: Qwen2.5-Coder builds on the strong Qwen2.5 and continues training on a larger scale of code data, including source code, text-code grounding data, and synthetic data, totaling 5.5 trillion tokens.
📚 Learn More: While enhancing coding abilities, we aimed to retain strengths in math and general capabilities from the base model. Therefore, Qwen2.5-Coder incorporates additional data on mathematics and general abilities.
- ✨ Supporting long context understanding and generation with the context length of 128K tokens
- ✨ Supporting 92 coding languages
- ✨ Retaining strengths in math and general capabilities from the base model
Special tokens:
{
"<|fim_prefix|>": 151659,
"<|fim_middle|>": 151660,
"<|fim_suffix|>": 151661,
"<|fim_pad|>": 151662,
"<|repo_name|>": 151663,
"<|file_sep|>": 151664,
"<|im_start|>": 151644,
"<|im_end|>": 151645
}
If we want to compare our model against Qwen2.5-Coder-instruct on popular benchmarks, we mainly focus on these benchmarks: MMLU, ARC-Challenge, TruthfulQA, WinoGrande, HellaSwag.
Additionally, if we want to compare general language abilities, including English, Chinese and Multilingual, we may want to compare with the Qwen 7B non-coder instruct model as well.
2-1-2. Data Candidates
For the baseline instruct model, we look for the best pre-released collection of open-sourced SFT data.
The WebInstructFull dataset helps build a baseline with strong reasoning capabilities, which suits our coding LLM. Consisting of 10 million instruction-response pairs, it improves LLM performance by 10%+ on MBPP and Arena Hard, and by 20%+ on MATH and GSM8K.
Recommended finetuning hyper-parameters for Qwen2-7B on InfinityInstruct & WebInstruct:
epoch: 3
lr: 1e-5
min_lr: 0
lr_warmup_steps: 40
lr_decay_style: cosine
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.95
global_batch_size: 528
clip_grad: 1.0
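For concreteness, the learning-rate schedule implied by these hyper-parameters (linear warmup followed by cosine decay to min_lr) can be sketched as below; `total_steps` is hypothetical and depends on dataset size, epochs, and global batch size.

```python
# Cosine LR schedule with linear warmup, matching the hyper-parameters above
# (lr=1e-5, min_lr=0, 40 warmup steps); total_steps is hypothetical.
import math

def learning_rate(step: int, total_steps: int, lr: float = 1e-5,
                  min_lr: float = 0.0, warmup_steps: int = 40) -> float:
    if step < warmup_steps:
        return lr * (step + 1) / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: LR at a few points of a hypothetical 10,000-step run.
schedule = [learning_rate(s, total_steps=10_000) for s in (0, 40, 5_000, 10_000)]
```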
Benchmark Descriptions:
Natural Language, Knowledge & Reasoning
- MMLU: massive multitask language understanding. Performance has begun to plateau.
- MMLU-Pro: includes more complex and challenging reasoning questions.
- C-Eval: knowledge and reasoning in a Chinese context.
Commonsense:
- HellaSwag: commonsense natural language inference.
Instruction Following:
- IFEval: focuses on a set of “verifiable instructions” such as “write in more than 400 words” and “mention the keyword of AI at least 3 times”.
Science Knowledge:
- GPQA: “Google-Proof” questions on which PhD experts reach 65% accuracy, while highly skilled non-expert validators reach only 34% even after spending about 30 minutes searching for answers online.
Baseline Results:
| Model | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
|---|---|---|---|---|---|---|
| Qwen2.5-Coder-7B-Base (official): Our Eval | 63.75 | 22.95 | 34.2 | 70.36 | 25.25 | 21.72 |
| Qwen2.5-Coder-7B-Instruct (official): Paper | 68.7 | 45.6 | 58.6 | 61.4 | 35.6 | - |
| Qwen2.5-Coder-7B-Instruct (official): Our Eval | 65.15 | 47.75 | 60.81 | 61.59 | 32.78 | 78.07 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (LIMA) | 63.1 | 35.4 | 29.76 | 62.85 | 29.8 | 62.07 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (WebInsFull) | 5.15 | 25.75 | 39.93 | 56.91 | 4.04 | 38.97 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (Infinity-Gen) | 65.3 | 45.1 | 63.03 | 61.66 | 31.82 | 70.31 |
| Qwen2.5-Coder-7B-Base-SFT-baseline (MergeAll_sampled_2.5m) | 36 | 34.55 | 40.48 | 60.55 | 16.16 | 78.89 |
2-2. Curation Ablation
To understand the contribution of each component in our data curation pipeline, we conduct systematic ablation studies. We compare models trained on data processed through different stages of our pipeline to quantify the impact of each filtering and enhancement step.
Based on the open-sourced Qwen2.5-Coder-7B-Base model, and our curated SFT data, we aim to train a SOTA instruct model.
Ablation with Original Responses:
First, we test our curation pipeline while keeping the original responses from the source datasets:
| Procedure | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder-7B-Base (official) | N/A | 63.75 | 22.95 | 34.2 | 70.36 | 25.25 | 21.72 |
| Qwen2.5-Coder-7B-Instruct (official) | N/A | 65.15 | 47.75 | 60.81 | 61.59 | 32.78 | 78.07 |
| 0. dedup+decontam+sample | 2.5m | 36 | 34.55 | 40.48 | 60.55 | 16.16 | 78.89 |
| 1. instruction quality filter | 1.7m | 29.75 | 33.45 | 39.56 | 61.96 | 20.2 | 67.94 |
Ablation with GPT-4o Responses:
Next, we replace the original responses with GPT-4o generated responses, which significantly improves model performance across all metrics:
| Procedure | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
|---|---|---|---|---|---|---|---|
| | 2.2m | 67.95 | 51.75 | 58.6 | 67.98 | 30.81 | 82.16 |
| | 1.7m | 67.5 | 50.5 | 54.71 | 68.57 | 28.28 | 81.05 |
| 2.1 complexity-only filter (gpt4o response) | 152K | 55.15 | 45.85 | 48.8 | 63.22 | 30.3 | 74.74 |
| 2.2 complexity-first diversity filter (gpt4o response) | 152K | 60.95 | 44.65 | 49.72 | 63.74 | 34.34 | 79.57 |
Comparison at Same Data Size (152K):
To isolate the effect of our sampling strategy, we compare different sampling methods using the same data budget of 152K samples:
| Procedure | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
|---|---|---|---|---|---|---|---|
| 2.0 random sampling (gpt4o response) | 152K | 54.4 | 43.5 | 50.46 | 64.93 | 25.25 | 78.5 |
| 2.1 complexity-only filter (gpt4o response) | 152K | 55.15 | 45.85 | 48.8 | 63.22 | 30.3 | 74.74 |
| 2.2 complexity-first diversity filter (gpt4o response) | 152K | 60.95 | 44.65 | 49.72 | 63.74 | 34.34 | 79.57 |
Final Results with Scaled Data:
Finally, we scale up our best sampling strategy (complexity-first diversity) to larger data sizes and compare against baselines:
| Procedure / Dataset | Data Quantity | MMLU | MMLU-Pro | IFEval | C-Eval | GPQA | HellaSwag |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder-7B-Base (official) | N/A | 63.75 | 22.95 | 34.2 | 70.36 | 25.25 | 21.72 |
| Qwen2.5-Coder-7B-Instruct (official): Paper | N/A | 68.7 | 45.6 | 58.6 | 61.4 | 35.6 | - |
| Qwen2.5-Coder-7B-Instruct (official): Our Eval | N/A | 65.15 | 47.75 | 60.81 | 61.59 | 32.78 | 78.07 |
| complexity-first diversity (v1, k=1) | 389k | 65.3 | 49.8 | 53.23 | 67.9 | 26.77 | 79.97 |
| complexity-first diversity (v1, k=5) | 703k | 68.15 | 50.85 | 56.01 | 67.53 | 29.29 | 80.47 |
| complexity-first diversity (v2, k=1) | 1.8m | 69.45 | 52.1 | 56.93 | 65.75 | 31.31 | 81.11 |
| complexity-first diversity (v2, k=5) | 2m | 69.05 | 52.6 | 60.07 | 67.83 | 31.82 | 82.25 |
| Baseline: Infinity-Gen | 1.4m | 65.3 | 45.1 | 63.03 | 61.66 | 31.82 | 70.31 |
2-3. Full Benchmark Comparison
Our best model (General-Only) vs Qwen2.5-Coder-7B-Instruct (official)
Generic Benchmarks
| Benchmark | Qwen2.5-Coder-7B-Instruct | Ours (General-Only) | Delta |
|---|---|---|---|
| MMLU | 65.15 | 68.75 | +3.60 |
| MMLU-Pro | 47.75 | 52.05 | +4.30 |
| C-Eval | 61.59 | 66.12 | +4.53 |
| HellaSwag | 78.07 | 81.16 | +3.09 |
| IFEval | 60.81 | 61.18 | +0.37 |
| GPQA | 32.78 | 36.87 | +4.09 |
Coding Benchmarks
| Benchmark | Qwen2.5-Coder-7B-Instruct | Ours (General-Only) | Delta |
|---|---|---|---|
| AutoEval v7 | 43.77 | 46.96 | +3.19 |
| Arena Hard | 36.47 | 50.49 | +14.02 🔥 |
| HumanEval | 85.37 | 85.37 | 0 |
| MBPP | 80.40 | 75.80 | -4.60 |
| BigCode Bench | 46.32 | 48.25 | +1.93 |
| Aider Bench | 49.62 | 42.11 | -7.51 |
| LiveCode Bench | 34.50 | 39.13 | +4.63 |
Math Benchmarks
| Benchmark | Qwen2.5-Coder-7B-Instruct | Ours (General-Only) | Delta |
|---|---|---|---|
| MATH | 68.28 | 70.90 | +2.62 |
| GSM8K | 88.17 | 90.90 | +2.73 |
| Olympiad Bench | 32.89 | 39.11 | +6.22 |
| AMC 2023 | 41.75 | 52.50 | +10.75 🔥 |
| AIME 2024 | 7.33 | 6.67 | -0.66 |
Summary: Our model outperforms or matches the official Qwen2.5-Coder-7B-Instruct on 15 out of 18 benchmarks, with particularly strong gains on Arena Hard (+14.02) and AMC 2023 (+10.75).
2-4. Analysis
In this section, we analyze the key factors contributing to our model’s performance improvements and derive insights that can guide future data curation efforts.
Diversity matters
Our experiments reveal a clear hierarchy among sampling strategies. Given the same sampling budget (a fixed number of sampled prompt-response pairs), complexity-first diverse sampling > complexity-first sampling > random sampling.
| Procedure / Dataset | Data Quantity | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
|---|---|---|---|---|---|---|---|
| 2.0 random sampling (gpt4o response) | 150K | 54.4 | 43.5 | 64.93 | 78.5 | 50.46 | 25.25 |
| 2.1 complexity-only filter | 150K | 55.15 | 45.85 | 63.22 | 74.74 | 48.8 | 30.3 |
| 2.2 complexity-first diversity | 150K | 62.91 | 46.66 | 63.74 | 73.57 | 49.72 | 34.34 |
Scaling with complexity-first diverse sampling
We also investigate how performance scales with data quantity under our complexity-first diverse sampling strategy. The results show that more sampling budget generally leads to better performance, but not necessarily on all metrics:
| Data Quantity (sampled from 9.8m) | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
|---|---|---|---|---|---|---|
| 380k (v1, k=5) | 65.3 | 48.67 | 67.89 | 78.87 | 53.23 | 26.77 |
| 700k (v1, k=20) | 68.15 | 50.85 | 67.53 | 80.47 | 56.01 | 29.29 |
| 1.8m (v2, k=1) | 69.45 | 52.1 | 65.75 | 81.11 | 56.93 | 31.82 |
| 2m (v2, k=5, bound=2m) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |
| Model | Data Quantity | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder-7B-Instruct | N/A | 65.15 | 47.75 | 61.59 | 78.07 | 50.27 | 32.67 |
| Ours (complexity-first diversity) | 380k (v1, k=5) | 68.15 | 50.85 | 67.53 | 80.47 | 56.01 | 29.29 |
| | 700k (v1, k=20) | 68.75 | 51.58 | 66.12 | 81.16 | 61.18 | 36.87 |
| | 2m (v2, k=5, bound=2m) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |
Furthermore, we discover a sampling scaling law: given the same sampling budget, the larger the source pool we sample from, the better performance we generally can achieve (though not necessarily on all metrics):
| complexity-first diversity | Sampled Quantity | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
|---|---|---|---|---|---|---|---|
| sample from 2.5m | 150K (v2, k=5) | 60.86 | 44.05 | 66.74 | 79.55 | 49.72 | 34.34 |
| sample from 9.8m | 380K (v1, k=5) | 65.3 | 48.67 | 67.89 | 78.87 | 53.23 | 26.77 |
| sample from 9.8m | 1md (v1, k=5, bound = 2m) | 68.55 | 52.67 | 68.2 | 82.07 | 65.06 | 32.32 |
| sample from 18m | 1md (v1, k=5, bound = 2m) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |
Stable Tag Frequency Matters
An important finding is that directly mixing data obtained through diverse sampling with other datasets (regardless of their origin) tends to result in incompatibilities. This often leads to some metrics showing improvement while others decrease—likely due to a significant shift in the frequency distribution of certain tags. The table below illustrates this phenomenon:
| Generic | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
|---|---|---|---|---|---|---|
| Qwen-2.5-coder | 65.15 | 47.75 | 61.59 | 78.07 | 50.27 | 32.67 |
| Our-SFT-Curation-1.8m | 68.75 | 51.58 | 66.12 | 81.16 | 61.18 | 36.87 |
| +inst_follow_2k(yi) | 68.8 | 53.4 | 65.70 | 63.27 | 63.32 | 41.85 |
| +inst_follow_23k | 69.55 | 53.05 | 65.9 | 81.42 | 52.5 | 36.87 |
| +inst_follow_5k | 68.95 | 51.86 | 65.82 | 81.52 | 62.66 | 36.36 |
| +inst_follow_0.5k/qa1 | 68.55 | 52.15 | 65.82 | 81.27 | 60.07 | 33.84 |
| Infinity-Gen (gpt4 response) | 65.45 | 51.66 | 70.31 | 83.01 | 31.82 | - |
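One way to quantify the kind of frequency shift seen in the table above is to compare the intention-tag distributions before and after mixing, e.g. with Jensen-Shannon distance as sketched below; the metric choice is an assumption.

```python
# Sketch: measure how much mixing extra data shifts the intention-tag frequency
# distribution, using Jensen-Shannon distance (the metric choice is an assumption).
from collections import Counter
from scipy.spatial.distance import jensenshannon

def tag_distribution(tag_lists: list[list[str]], vocab: list[str]) -> list[float]:
    counts = Counter(tag for tags in tag_lists for tag in tags)
    total = sum(counts.values()) or 1
    return [counts.get(tag, 0) / total for tag in vocab]

def tag_shift(base_tags: list[list[str]], mixed_tags: list[list[str]]) -> float:
    vocab = sorted({t for tags in base_tags + mixed_tags for t in tags})
    return jensenshannon(tag_distribution(base_tags, vocab),
                         tag_distribution(mixed_tags, vocab))
```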
What tags help with GPQA and IFEval?
We further analyze which intention tags contribute to improvements on specific benchmarks. By comparing models trained on data from different source pools, we can identify the tags that are responsible for performance gains:
| Sample from | Sampled | MMLU | MMLU-Pro | C-Eval | HellaSwag | IFEval | GPQA |
|---|---|---|---|---|---|---|---|
| 9.8m | 2m (v2, k=5) | 69.05 | 52.6 | 67.83 | 82.25 | 60.07 | 31.82 |
| 18.5m | 1.8m* (v2, k=5) | 68.75 | 52.05 | 66.12 | 81.16 | 61.18 | 36.87 |
*1.8m: This subset was also originally capped at 2m, but after filtering out multi-turn conversations only 1.8m remained. The 2m subset was sampled from the 9.8m pool, which already had multi-turn conversations pruned out, so it did not suffer a quantity reduction.
Analysis: According to the tag-frequency comparison, the improvement on GPQA is mainly due to the increased frequency of science-related tags, including the following:
Tags contributing to GPQA improvement: Geographical Location, Result Rounding, Financial Data Analysis, Word Limit Specification, Kinematics Problem, Interpretation of Results, Literary Explanation, Chemical Reaction Explanation, Game Theory, Matrix Analysis, Psychological Explanation, Physical Science Problem, Entity Relationship Analysis, Electrostatics Problem, Polynomial Root Finding, Production Analysis, Graphical Illustration, Vector-Valued Function
The newly added tags, however, appear only at low frequencies and are not the main factor contributing to the performance improvement.
3. Conclusions
Based on our extensive experiments, we summarize the key findings from this SFT data curation study:
- The effectiveness of complexity-first diverse sampling has been validated.
- A diversified sampling method based on a fine-grained open-label system (rather than a top-down, comprehensive label system built on prior knowledge) can also achieve good results.
- Properly sampled data is superior to using the full dataset.
- Complexity-first diverse sampling > complexity-first sampling > random sampling.
- Through ablation experiments, a sampling scaling law was discovered: under the settings of this method, the larger the total source dataset, the better the performance of the sampled data of the same size.
- Under the premise of ensuring diversity, the complexity-first method inherently prioritizes code and mathematical elements (sampling results: code of any form/mathematical calculations/others = 26.74%/35.23%/38.03%).







