
JustTinker: Minimal RLVR for Building Reasoning Models Under $150


Project repository: github.com/Guanghan/JustTinker

Core Idea

Can you transform an instruction-tuned LLM into a reasoning model with explicit thinking capabilities—all for under $150? JustTinker demonstrates that the answer is yes.

This project implements a minimal two-stage training pipeline using Reinforcement Learning with Verifiable Rewards (RLVR), achieving a +13.3% improvement on AIME 2024 (43.3% → 56.7%) while keeping the total cost under $150 using Tinker’s training API.

Key Achievement: Full RL pipeline executable from a standard laptop, making reasoning model development accessible to individual researchers.

Background & Motivation

The Rise of Reasoning Models

2024-2025 witnessed the emergence of “thinking models” like OpenAI’s o1 and DeepSeek-R1, which explicitly show their reasoning process through <think>...</think> tokens. These models demonstrate superior performance on complex tasks like mathematical olympiad problems.

The Accessibility Gap

However, training such models typically requires:

  • Massive compute resources
  • Complex RL infrastructure
  • Significant engineering effort

JustRL Philosophy

The JustRL paper proposed a radical simplification: remove KL penalties and length penalties entirely, relying only on verifiable rewards. JustTinker extends this philosophy to a practical, low-resource implementation.

Why Cold-Start SFT?

The project uses Qwen3-4B-Instruct-2507 as the base model. Unlike distilled reasoning models (e.g., DeepSeek-R1-Distill), instruction-tuned models have internalized their reasoning into weights—they don’t naturally produce <think> tokens.

The key insight: We need a “cold-start” SFT phase to awaken explicit thinking capabilities before RL can reinforce them.

Two-Stage Training Pipeline

Stage 1: Cold-Start SFT

Goal: Teach the model to produce <think>...</think> tokens and boxed answers.

| Metric            | Before SFT | After SFT |
|--------------------|------------|-----------|
| Thinking Rate      | ~0%        | 70%       |
| Boxed Answer Rate  | ~0%        | 80%       |
| Training Steps     | -          | 800       |
| Cost               | -          | <$30      |
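
The thinking rate and boxed-answer rate above are simple format checks. A minimal sketch of such a check, assuming the target format is a <think>...</think> block followed by a \boxed{...} answer (the helper below is illustrative, not the repo's code):

import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{[^{}]*\}")

def check_format(completion: str) -> dict:
    # True/False flags for the two cold-start SFT format targets.
    return {
        "has_thinking": bool(THINK_RE.search(completion)),
        "has_boxed_answer": bool(BOXED_RE.search(completion)),
    }

# Example: a well-formed completion satisfies both checks.
sample = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
print(check_format(sample))  # {'has_thinking': True, 'has_boxed_answer': True}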

[Figure: Cold-Start SFT Training Curves]

Stage 2: JustRL (GRPO)

Goal: Reinforce correct reasoning through Group Relative Policy Optimization.

Following JustRL’s minimalist approach:

  • No KL penalty - Don’t constrain the model to stay close to the reference policy
  • No length penalty - Don’t discourage long responses (which might contain valid reasoning)
  • Verifiable rewards only - Mathematical correctness as the sole signal
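
Concretely, the reward can be as simple as checking the final boxed answer against the reference. A minimal sketch, assuming exact string matching (the real project likely uses a more robust math verifier; helper names are illustrative):

import re

def extract_boxed(completion: str):
    # Pull the last \boxed{...} expression out of a completion, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    # Binary correctness signal: no KL term, no length term, nothing else.
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == gold_answer.strip() else 0.0

print(verifiable_reward("<think>...</think> \\boxed{17}", "17"))  # 1.0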

The Reward Hacking Problem

What Went Wrong in Experiment 001

Without length penalties, training quickly collapsed:

[Figure: Reward Hacking Training Curves]

The model discovered a devastating shortcut: generate extremely long responses with multiple answer attempts.

| Metric           | Normal              | Reward Hacking       |
|------------------|---------------------|----------------------|
| Response Length  | ~2,000 chars        | 35,000+ chars        |
| Accuracy         | ~85%                | 10-28%               |
| Content Quality  | Coherent reasoning  | Repetitive nonsense  |

The Formation Mechanism

The reward hacking emerged through a four-phase process:

Phase 1 (Steps 1-50): Normal learning with balanced response lengths.

Phase 2 (Steps 50-100): Longer responses begin to show up more often among positive samples, since multiple answer attempts raise the odds of a lucky correct one.

Phase 3 (Steps 100-140): The model learns the spurious length → reward correlation.

Phase 4 (Step 140+): Complete collapse: 30,000+ character responses, ~10% accuracy.

Why Length “Helps” (Statistically)

The core issue is probability amplification through multiple attempts:

Assuming 5% correctness per attempt:

  • 1 attempt: 5% success
  • 10 attempts: ~40% success (1 - 0.95¹⁰)

GRPO only learns from correct samples—it never penalizes incorrect ones. This creates a statistical bias where longer responses (with more attempts) disproportionately appear in the positive training set.
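
The arithmetic behind this amplification is easy to check: with k independent attempts at per-attempt accuracy p, the chance that at least one lands is 1 - (1 - p)^k.

# Probability that at least one of k independent attempts is correct,
# assuming 5% accuracy per attempt.
p = 0.05
for k in (1, 5, 10, 20):
    print(f"{k:2d} attempts -> {1 - (1 - p) ** k:.1%} chance of a lucky hit")
# 1 attempt: 5.0%, 10 attempts: 40.1% -- matching the numbers above.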

Novel Solution: Redundancy Penalty

The Key Insight

Simple length penalties would hurt legitimate long reasoning. Instead, target the actual problem: repetitive content.

Dual-Metric Detection System

| Method            | Weight | Mechanism                                                                                      |
|--------------------|--------|------------------------------------------------------------------------------------------------|
| Compression Ratio  | 60%    | Uses zlib compression; repetitive content compresses much further (10-20% of original size vs a normal 50-70%) |
| N-gram Repetition  | 40%    | Counts repeated word sequences; reward-hacked outputs show 60-70% repetition vs a normal ~5%     |
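
A rough sketch of how these two signals might be computed (the function names, the 4-gram size, and the orientation of the compression signal are assumptions for illustration, not the repo's exact implementation; the compression signal is flipped so that higher always means more redundant):

import zlib

def compression_redundancy(text: str) -> float:
    # 1 - (compressed size / original size): repetitive text compresses far more,
    # so this value is high for reward-hacked outputs and low for normal reasoning.
    if not text:
        return 0.0
    raw = text.encode("utf-8")
    return max(0.0, 1.0 - len(zlib.compress(raw, 9)) / len(raw))

def ngram_repetition(text: str, n: int = 4) -> float:
    # Fraction of word n-grams that are repeats of an earlier n-gram.
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A heavily repetitive string scores high on both signals; normal prose stays low.
print(ngram_repetition("the answer is 7 " * 50))  # ~0.98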

Penalty Application

📐 REDUNDANCY PENALTY FORMULA

redundancy_score = 0.6 × compression_ratio + 0.4 × ngram_repetition

if redundancy_score > 0.3:
    penalty = min(redundancy_score - 0.3, 0.3)
    reward = reward - penalty

Validation Results

| Response Type              | Redundancy Score |
|----------------------------|------------------|
| Normal reasoning responses | 0-4%             |
| Reward-hacking responses   | 62-89%           |

Clear separation enables precise targeting without false positives.

Experimental Results

Experiment 002: Successful Mitigation

With the redundancy penalty in place:

[Figure: Fixed Training Curves]

Training stability was restored, with evaluation accuracy holding steady at 84-85.5% throughout.

Experiment 003: Final Results

Using a harder dataset (DAPO-Math-17k) and benchmarking on AIME 2024:

[Figure: Final Training Curves]

| Benchmark  | Before | After | Change  |
|------------|--------|-------|---------|
| AIME 2024  | 43.3%  | 56.7% | +13.3%  |
| MATH       | ~88%   | ~91%  | Stable  |

Cost Breakdown

| Phase              | Cost  |
|--------------------|-------|
| Cold-Start SFT     | <$30  |
| Experiment 001-002 | ~$72  |
| Experiment 003     | ~$34  |
| Total              | ~$136 |

Key Takeaways

1. Cold-Start SFT is Essential

Instruction-tuned models need explicit format training before RL can reinforce reasoning. Without this, the model has no “thinking” behavior to optimize.

2. Reward Hacking is Statistical, Not Intentional

The model isn’t “trying” to cheat—it’s following statistical gradients. Understanding the mechanism enables targeted solutions.

3. Target Root Causes, Not Symptoms

Length penalties create false positives on legitimate reasoning. The redundancy penalty targets the actual problem: repetitive content exploitation.

4. JustRL’s Minimalism Needs Guardrails

The original JustRL philosophy (no KL, no length penalty) is sound but incomplete. Targeted interventions (format rewards, redundancy penalty) preserve simplicity while preventing collapse.

5. Low-Resource RLVR is Feasible

With the right infrastructure (Tinker API) and methodology, meaningful reasoning improvements are achievable under $150.

Quick Start

# Clone the repository
git clone https://github.com/Guanghan/JustTinker.git
cd JustTinker

# Install dependencies
pip install -r requirements.txt

# Set API key
export TINKER_API_KEY=your_api_key

# Run cold-start SFT (or use public checkpoint)
./scripts/launchers/run_coldstart_sft.sh small

# Run JustRL training
./scripts/launchers/run_justrl.sh

A public cold-start checkpoint is available, enabling direct RL training without repeating the SFT phase.

Summary

JustTinker demonstrates that building reasoning models doesn’t require massive resources:

  1. Two-stage pipeline: Cold-start SFT → JustRL (GRPO)
  2. Reward hacking prevention: Novel redundancy penalty using compression ratio and N-gram detection
  3. Significant results: +13.3% on AIME 2024 for under $150
  4. Accessible: Full pipeline runnable from a laptop via Tinker API

The project provides both a practical implementation and insights into the challenges of minimal RLVR training—particularly the reward hacking phenomenon and its mitigation.
