
JustTinker: Minimal RLVR for Building Reasoning Models Under $150


Project repository: github.com/Guanghan/JustTinker

Core Idea

Can you transform an instruction-tuned LLM into a reasoning model with explicit thinking capabilities—all for under $150? JustTinker demonstrates that the answer is yes.

This project implements a minimal two-stage training pipeline using Reinforcement Learning with Verifiable Rewards (RLVR), achieving a +13.3% improvement on AIME 2024 (43.3% → 56.7%) while keeping the total cost under $150 using Tinker’s training API.

Key Achievement: Full RL pipeline executable from a standard laptop, making reasoning model development accessible to individual researchers.

Background & Motivation

The Rise of Reasoning Models

2024-2025 witnessed the emergence of “thinking models” like OpenAI’s o1 and DeepSeek-R1, which explicitly show their reasoning process through <think>...</think> tokens. These models demonstrate superior performance on complex tasks like mathematical olympiad problems.

The Accessibility Gap

However, training such models typically requires:

  • Massive compute resources
  • Complex RL infrastructure
  • Significant engineering effort

JustRL Philosophy

The JustRL paper proposed a radical simplification: remove KL penalties and length penalties entirely, relying only on verifiable rewards. JustTinker extends this philosophy to a practical, low-resource implementation.

Why Cold-Start SFT?

The project uses Qwen3-4B-Instruct-2507 as the base model. Unlike distilled reasoning models (e.g., DeepSeek-R1-Distill), instruction-tuned models have internalized their reasoning into weights—they don’t naturally produce <think> tokens.

The key insight: We need a “cold-start” SFT phase to awaken explicit thinking capabilities before RL can reinforce them.

Two-Stage Training Pipeline

Stage 1: Cold-Start SFT

Goal: Teach the model to produce <think>...</think> tokens and boxed answers.

| Metric            | Before SFT | After SFT |
|--------------------|------------|-----------|
| Thinking Rate      | ~0%        | 70%       |
| Boxed Answer Rate  | ~0%        | 80%       |
| Training Steps     | -          | 800       |
| Cost               | -          | <$30      |
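
The thinking rate and boxed-answer rate above are simple format checks. A minimal sketch of such a check, assuming the target format is a <think>...</think> block followed by a \boxed{...} answer (the helper below is illustrative, not the repo's code):

import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{[^{}]*\}")

def check_format(completion: str) -> dict:
    # True/False flags for the two cold-start SFT format targets.
    return {
        "has_thinking": bool(THINK_RE.search(completion)),
        "has_boxed_answer": bool(BOXED_RE.search(completion)),
    }

# Example: a well-formed completion satisfies both checks.
sample = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
print(check_format(sample))  # {'has_thinking': True, 'has_boxed_answer': True}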

[Figure: Cold-Start SFT Training Curves]

Stage 2: JustRL (GRPO)

Goal: Reinforce correct reasoning through Group Relative Policy Optimization.

Following JustRL’s minimalist approach:

  • No KL penalty - Don’t constrain the model to stay close to the reference policy
  • No length penalty - Don’t discourage long responses (which might contain valid reasoning)
  • Verifiable rewards only - Mathematical correctness as the sole signal
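
Concretely, the reward can be as simple as checking the final boxed answer against the reference. A minimal sketch, assuming exact string matching (the real project likely uses a more robust math verifier; helper names are illustrative):

import re

def extract_boxed(completion: str):
    # Pull the last \boxed{...} expression out of a completion, if any.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    # Binary correctness signal: no KL term, no length term, nothing else.
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == gold_answer.strip() else 0.0

print(verifiable_reward("<think>...</think> \\boxed{17}", "17"))  # 1.0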

The Reward Hacking Problem

What Went Wrong in Experiment 001

Without length penalties, training quickly collapsed:

[Figure: Reward Hacking Training Curves]

The model discovered a devastating shortcut: generate extremely long responses with multiple answer attempts.

| Metric           | Normal              | Reward Hacking       |
|------------------|---------------------|----------------------|
| Response Length  | ~2,000 chars        | 35,000+ chars        |
| Accuracy         | ~85%                | 10-28%               |
| Content Quality  | Coherent reasoning  | Repetitive nonsense  |

The Formation Mechanism

The reward hacking emerged through a four-phase process:

Phase 1 (Steps 1-50): Normal learning with balanced response lengths.

Phase 2 (Steps 50-100): Longer responses begin to show up more often among positive samples, since multiple answer attempts raise the odds of a lucky correct one.

Phase 3 (Steps 100-140): The model learns the spurious length → reward correlation.

Phase 4 (Step 140+): Complete collapse: 30,000+ character responses, ~10% accuracy.

Why Length “Helps” (Statistically)

The core issue is probability amplification through multiple attempts:

Assuming 5% correctness per attempt:

  • 1 attempt: 5% success
  • 10 attempts: ~40% success (1 - 0.95¹⁰)

GRPO only learns from correct samples—it never penalizes incorrect ones. This creates a statistical bias where longer responses (with more attempts) disproportionately appear in the positive training set.
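
The arithmetic behind this amplification is easy to check: with k independent attempts at per-attempt accuracy p, the chance that at least one lands is 1 - (1 - p)^k.

# Probability that at least one of k independent attempts is correct,
# assuming 5% accuracy per attempt.
p = 0.05
for k in (1, 5, 10, 20):
    print(f"{k:2d} attempts -> {1 - (1 - p) ** k:.1%} chance of a lucky hit")
# 1 attempt: 5.0%, 10 attempts: 40.1% -- matching the numbers above.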

Novel Solution: Redundancy Penalty

The Key Insight

Simple length penalties would hurt legitimate long reasoning. Instead, target the actual problem: repetitive content.

Dual-Metric Detection System

| Method            | Weight | Mechanism                                                                                      |
|--------------------|--------|------------------------------------------------------------------------------------------------|
| Compression Ratio  | 60%    | Uses zlib compression; repetitive content compresses much further (10-20% of original size vs a normal 50-70%) |
| N-gram Repetition  | 40%    | Counts repeated word sequences; reward-hacked outputs show 60-70% repetition vs a normal ~5%     |
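
A rough sketch of how these two signals might be computed (the function names, the 4-gram size, and the orientation of the compression signal are assumptions for illustration, not the repo's exact implementation; the compression signal is flipped so that higher always means more redundant):

import zlib

def compression_redundancy(text: str) -> float:
    # 1 - (compressed size / original size): repetitive text compresses far more,
    # so this value is high for reward-hacked outputs and low for normal reasoning.
    if not text:
        return 0.0
    raw = text.encode("utf-8")
    return max(0.0, 1.0 - len(zlib.compress(raw, 9)) / len(raw))

def ngram_repetition(text: str, n: int = 4) -> float:
    # Fraction of word n-grams that are repeats of an earlier n-gram.
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A heavily repetitive string scores high on both signals; normal prose stays low.
print(ngram_repetition("the answer is 7 " * 50))  # ~0.98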

Penalty Application

📐 REDUNDANCY PENALTY FORMULA

redundancy_score = 0.6 × compression_ratio + 0.4 × ngram_repetition

if redundancy_score > 0.3:
    penalty = min(redundancy_score - 0.3, 0.3)
    reward = reward - penalty

Validation Results

| Response Type              | Redundancy Score |
|----------------------------|------------------|
| Normal reasoning responses | 0-4%             |
| Reward-hacking responses   | 62-89%           |

Clear separation enables precise targeting without false positives.

Experimental Results

Experiment 002: Successful Mitigation

With the redundancy penalty in place:

[Figure: Fixed Training Curves]

Training stability was restored, with evaluation accuracy holding steady at 84-85.5% throughout.

Experiment 003: Final Results

Using a harder dataset (DAPO-Math-17k) and benchmarking on AIME 2024:

[Figure: Final Training Curves]

| Benchmark  | Before | After | Change  |
|------------|--------|-------|---------|
| AIME 2024  | 43.3%  | 56.7% | +13.3%  |
| MATH       | ~88%   | ~91%  | Stable  |

Cost Breakdown

| Phase              | Cost  |
|--------------------|-------|
| Cold-Start SFT     | <$30  |
| Experiment 001-002 | ~$72  |
| Experiment 003     | ~$34  |
| Total              | ~$136 |

Key Takeaways

1. Cold-Start SFT is Essential

Instruction-tuned models need explicit format training before RL can reinforce reasoning. Without this, the model has no “thinking” behavior to optimize.

2. Reward Hacking is Statistical, Not Intentional

The model isn’t “trying” to cheat—it’s following statistical gradients. Understanding the mechanism enables targeted solutions.

3. Target Root Causes, Not Symptoms

Length penalties create false positives on legitimate reasoning. The redundancy penalty targets the actual problem: repetitive content exploitation.

4. JustRL’s Minimalism Needs Guardrails

The original JustRL philosophy (no KL, no length penalty) is sound but incomplete. Targeted interventions (format rewards, redundancy penalty) preserve simplicity while preventing collapse.

5. Low-Resource RLVR is Feasible

With the right infrastructure (Tinker API) and methodology, meaningful reasoning improvements are achievable under $150.

Quick Start

# Clone the repository
git clone https://github.com/Guanghan/JustTinker.git
cd JustTinker

# Install dependencies
pip install -r requirements.txt

# Set API key
export TINKER_API_KEY=your_api_key

# Run cold-start SFT (or use public checkpoint)
./scripts/launchers/run_coldstart_sft.sh small

# Run JustRL training
./scripts/launchers/run_justrl.sh

A public cold-start checkpoint is available, enabling direct RL training without repeating the SFT phase.

Summary

JustTinker demonstrates that building reasoning models doesn’t require massive resources:

  1. Two-stage pipeline: Cold-start SFT → JustRL (GRPO)
  2. Reward hacking prevention: Novel redundancy penalty using compression ratio and N-gram detection
  3. Significant results: +13.3% on AIME 2024 for under $150
  4. Accessible: Full pipeline runnable from a laptop via Tinker API

The project provides both a practical implementation and insights into the challenges of minimal RLVR training—particularly the reward hacking phenomenon and its mitigation.
