New DeepSeek Research - The Future Is Here!

Two Minute Papers · 2026-05-21 · via captions

TL;DR

DeepSeek released an 80-page paper providing what may be the first fully open, reproducible recipe for creating ChatGPT-level AI. Five key techniques, from teacher-free training to distillation, together produce small, cheap models that dramatically outperform older frontier models, and the entire framework is free and public.

Key Concepts

GRPO (Group Relative Policy Optimization)

Training method where the model generates multiple answers to one question and the answers are graded against each other — no second "teacher" model required

Reinforcement Learning from scratch

Training AI purely by self-play against rules, with no human examples provided

Cold start / flashlight nudge

Providing just a few examples at the start of RL training to stabilize language and direction without constraining reasoning ability

Distillation

Using a large, capable model to generate a training dataset (800K examples) that teaches smaller models to replicate its reasoning style

Emergent "aha" moment

AI spontaneously learning to pause, reconsider, and think longer — without being explicitly taught to do so

Notes

§The Openness Problem with Existing Labs

OpenAI's GPT-4 technical report explicitly states it contains "no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar"
DeepSeek's new paper is 80 pages (up from 20 pages a year prior) and contains substantive, reproducible methodology
The presenter frames this as a meaningful step toward open, reproducible science

§Technique 1 — Group Relative Policy Optimization (GRPO)

Traditional training uses a second equally large AI as a "teacher" to grade every sentence → expensive and slow
GRPO eliminates the teacher model entirely
Instead: model generates 16 different answers to one question
Answers are graded against each other using objective criteria (e.g., did the code run? was the answer correct?)
Answers that score above the group average are reinforced; answers that score below it are penalized
Cheap enough to run at massive scale
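
The idea of grading each answer against the rest of its group can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the function names `correctness_reward` and `group_relative_advantages`, the toy question, and the use of the population standard deviation are all choices made for this example.

```python
from statistics import mean, pstdev

def correctness_reward(answer: str, expected: str) -> float:
    """Rule-based check (hypothetical): 1.0 if the answer matches, else 0.0."""
    return 1.0 if answer.strip() == expected.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Score each answer relative to the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; sample std also works
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group: 16 sampled answers to "What is 12 * 12?", 4 of them correct.
sampled = ["144"] * 4 + ["124"] * 12
rewards = [correctness_reward(a, "144") for a in sampled]
advantages = group_relative_advantages(rewards)

# Correct answers end up with a positive advantage, wrong ones with a negative one.
for answer, adv in zip(sampled[:1] + sampled[-1:], advantages[:1] + advantages[-1:]):
    print(answer, round(adv, 3))
```

No second model appears anywhere in the sketch: the only inputs are the group's own rewards, which is why the approach is cheap. If every answer in a group gets the same reward, all advantages are zero and that group provides no learning signal.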

§Technique 2 — Emergent Thinking (The "Aha Moment")

Model was observed spontaneously generating phrases like "wait" or "let me recalculate"
Over time it learned that longer thinking → higher scores, so it extended its own reasoning without being told to
No human explicitly taught this behavior — it emerged from the training process
Presented in the video as the first documented case of an AI learning on its own to think before responding

§Technique 3 — Pure Reinforcement Learning (No Human Examples)

Analogy: learning chess by playing millions of games vs. reading a textbook — self-play has no ceiling
DeepSeek proved you can train reasoning ability with zero human demonstration examples
Model started at ~15% success rate on competition math; reached ~80% on its own
Discovered novel strategies not present in any training data
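
To make "no human examples" concrete, here is a toy sketch in Python of learning from outcome rewards alone. It is not the paper's training setup: the three "strategies", their success rates, the learning rate, and the `train` function are all hypothetical, and a three-way softmax policy stands in for a language model. The policy only ever sees whether each attempt passed a rule-based check.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(success_prob, steps=500, group_size=16, lr=0.5, seed=0):
    """Policy-gradient updates using group-relative advantages only."""
    rng = random.Random(seed)
    logits = [0.0] * len(success_prob)  # start with no preference at all
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a group of attempts from the current policy.
        actions = rng.choices(range(len(probs)), weights=probs, k=group_size)
        # Rule-based outcome: each attempt either passes the check or fails.
        rewards = [1.0 if rng.random() < success_prob[a] else 0.0 for a in actions]
        mu = sum(rewards) / group_size
        sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / group_size)
        if sigma == 0.0:
            continue  # every attempt scored the same: nothing to learn from
        for a, r in zip(actions, rewards):
            adv = (r - mu) / sigma
            for k in range(len(logits)):
                # Gradient of log softmax probability of the chosen action.
                grad = (1.0 if k == a else 0.0) - probs[k]
                logits[k] += lr * adv * grad / group_size
    return softmax(logits)

# Hypothetical strategies: guess, estimate, check the work step by step.
final_probs = train([0.1, 0.3, 0.9])
print([round(p, 3) for p in final_probs])
```

With these made-up success rates, the probability mass should drift toward the third strategy even though the policy was never shown an example of it being used.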

§Technique 4 — The Flashlight Nudge (Cold Start)

Starting from zero is possible but causes instability: gibberish output, random language switching
Providing a small number of examples at the start fixes direction immediately
Effect on math: minimal improvement (only ~2 percentage points, sometimes negative) — math is language-agnostic
Effect on natural language tasks (AlpacaEval): more than tripled performance — language coherence matters here
Takeaway: a small guided start is high value for language quality, low cost to reasoning quality

§Technique 5 — Distillation (Learn from Giants)

DeepSeek R1 (large model) was used to generate 800,000 reasoning examples — essentially a textbook
Small, cheap models are then trained on this textbook
Result: a 7-billion-parameter model scores nearly 6× higher than the older GPT-4o on competition math
7B models can run on many laptops today; likely on phones within a few years
According to the video, the older frontier models required billions of dollars to train; the distilled version is free
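
The "textbook" step can be sketched as a small data-building loop in Python. Everything here is illustrative: `toy_teacher` is a hypothetical stand-in for a large reasoning model, the arithmetic questions are invented, the prompt/completion layout is just one possible format, and discarding answers that fail a check is an assumption of this sketch rather than something stated in these notes.

```python
import json

def toy_teacher(question):
    """Hypothetical stand-in for a large reasoning model."""
    a, b = question
    # Deliberately wrong on one input so the filter below has work to do.
    total = a + b + 1 if a == 7 else a + b
    return {
        "reasoning": f"Add {a} and {b}: ones first, then tens.",
        "answer": str(total),
    }

def build_distillation_set(questions, teacher, reference):
    """Collect teacher outputs whose final answer passes the check."""
    records = []
    for q in questions:
        out = teacher(q)
        if out["answer"] != reference(q):
            continue  # assumption: keep only verified-correct examples
        records.append({
            "prompt": f"What is {q[0]} + {q[1]}?",
            "completion": f"<think>{out['reasoning']}</think> {out['answer']}",
        })
    return records

questions = [(2, 3), (10, 15), (7, 8)]
dataset = build_distillation_set(
    questions, toy_teacher, reference=lambda q: str(q[0] + q[1])
)

# One JSON object per line, a common layout for fine-tuning data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(jsonl)
```

A smaller model would then be fine-tuned on records like these with ordinary supervised training, which is far cheaper than repeating the reinforcement learning that produced the teacher.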

Actionable Takeaways

1. Generate multiple options before committing: produce 5+ solutions to a problem and grade them against each other before picking one
2. Pause deliberately when facing a hard question: say "wait" out loud and double-check your logic; the extra time consistently pays off
3. Prioritize practice over theory: learn fundamentals minimally, then do the task and self-correct through failure rather than consuming endless tutorials

Quotes Worth Keeping

"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar." (OpenAI, GPT-4 technical report)

"It started generating words like 'wait' or 'let me recalculate.' And over time it realizes that spending more time thinking leads to a higher score. So it started thinking longer and longer by itself."

"You need the genius to write it, but not to read it."

"We are not maximizing money here. We are maximizing meaning."
