New DeepSeek Research - The Future Is Here!
DeepSeek released an 80-page paper providing what may be the first fully open, reproducible recipe for creating ChatGPT-level AI. Five key techniques, from teacher-free training to distillation, together produce small, cheap models that dramatically outperform older frontier models, and the entire framework is free and public.
Key Concepts
| Concept | Definition |
|---|---|
| GRPO (Group Relative Policy Optimization) | Training method where the model generates multiple answers to one question and the answers are graded against each other — no second "teacher" model required |
| Reinforcement Learning from scratch | Training AI purely by trial and error against rule-based rewards, with no human-written examples provided |
| Cold start / flashlight nudge | Providing just a few examples at the start of RL training to stabilize language and direction without constraining reasoning ability |
| Distillation | Using a large, capable model to generate a training dataset (800K examples) that teaches smaller models to replicate its reasoning style |
| Emergent "aha" moment | AI spontaneously learning to pause, reconsider, and think longer — without being explicitly taught to do so |
Notes
The Openness Problem with Existing Labs
- OpenAI's GPT-4 technical report explicitly states that it contains "no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar"
- DeepSeek's new paper is 80 pages (up from 20 pages a year prior) and contains substantive, reproducible methodology
- Author frames this as a meaningful step toward science being open and reproducible
Technique 1 — Group Relative Policy Optimization (GRPO)
- Traditional training runs a second, equally large AI alongside the model as a "teacher" (the critic, or value model, in PPO-style training) to grade its output → expensive and slow
- GRPO eliminates the teacher model entirely
- Instead: model generates **16 different answers** to one question
- Answers are graded against each other using objective criteria (e.g., did the code run? was the answer correct?)
- Answers that score above the group average are reinforced; answers that score below it are discouraged
- Cheap enough to run at massive scale
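
The grading step described above can be sketched as a group-relative score: each answer's reward is compared with the average reward of its own group, so no separate teacher model is needed to set a baseline. This is a minimal illustration, not DeepSeek's implementation. The function name, the 0/1 reward values, the use of the population standard deviation, and the small `eps` term are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to the other answers in its group.

    Assumption: rewards are normalized by the group mean and the population
    standard deviation; eps only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# 16 sampled answers to one question (hypothetical rewards):
# 1.0 = passed the objective check, 0.0 = failed it.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Correct answers end up with a positive score and incorrect ones with a negative score. If every answer in the group earns the same reward, all scores are zero and that question contributes no learning signal.
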
Technique 2 — Emergent Thinking (The "Aha Moment")
- Model was observed spontaneously generating phrases like "wait" or "let me recalculate"
- Over time it learned that **longer thinking → higher scores**, so it extended its own reasoning without being told to
- No human explicitly taught this behavior — it emerged from the training process
- Presented as the first documented case of an AI learning on its own to think before responding
Technique 3 — Pure Reinforcement Learning (No Human Examples)
- Analogy: learning chess by playing millions of games vs. reading a textbook — self-play has no ceiling
- DeepSeek proved you can train reasoning ability with **zero human demonstration examples**
- Model started at ~15% success rate on competition math; reached ~80% on its own
- Discovered novel strategies not present in any training data
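
Training without human demonstrations still needs a signal, and here that signal comes from rules that can be checked automatically. The sketch below shows one such rule for math-style questions. It is an illustration under stated assumptions: the `<answer>` tag format, the exact-match comparison, and the function name `rule_based_reward` are hypothetical choices for this example, not details taken from these notes.

```python
import re

# Assumption: the model is prompted to wrap its final answer in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the tagged final answer matches the reference, else 0.0."""
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        # No parseable final answer: nothing to grade, so no reward.
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


completion = "Wait, let me recalculate. 6 * 7 = 42. <answer> 42 </answer>"
print(rule_based_reward(completion, "42"))
```

Because the check only looks at the final answer, the model is free to reason in whatever way raises its score, which is consistent with the longer "wait, let me recalculate" style of thinking described in Technique 2.
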
Technique 4 — The Flashlight Nudge (Cold Start)
- Starting from zero is possible but causes instability: gibberish output, random language switching
- Providing a **small number of examples** at the start fixes direction immediately
- Effect on math: minimal improvement (only ~2 percentage points, sometimes negative) — math is language-agnostic
- Effect on natural language tasks (AlpacaEval): **more than tripled performance** — language coherence matters here
- Takeaway: a small guided start is high value for language quality, low cost to reasoning quality
Technique 5 — Distillation (Learn from Giants)
- DeepSeek R1 (large model) was used to generate **800,000 reasoning examples** — essentially a textbook
- Small, cheap models are then trained on this textbook
- Result: a **7-billion parameter model** scores nearly **6×** higher than the older GPT-4o on competition math
- 7B models can run on many laptops today; likely on phones within a few years
- The older frontier model it outperforms reportedly cost billions of dollars to train; the distilled model is free to download
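
The "textbook" idea can be sketched as a small data pipeline: a teacher model writes a reasoning trace for each problem, and the surviving traces become prompt and completion pairs for training a smaller model. This is a rough sketch, not the paper's pipeline. The toy teacher, the canned traces, the `Answer:` suffix convention, and the step that drops traces with a wrong final answer are all assumptions made so the example runs on its own.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_distillation_set(
    problems: Iterable[Dict[str, str]],
    teacher: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Collect (prompt, completion) pairs from teacher traces that pass a check."""
    examples = []
    for problem in problems:
        trace = teacher(problem["question"])
        # Assumed filter: keep a trace only if its final answer verifies.
        if is_correct(trace, problem["answer"]):
            examples.append({"prompt": problem["question"], "completion": trace})
    return examples


# Toy stand-ins so the sketch runs end to end (hypothetical, not a real model).
CANNED_TRACES = {
    "What is 2 + 2?": "2 plus 2 is 4. Answer: 4",
    "What is 3 * 3?": "3 plus 3 is 6. Answer: 6",  # deliberately wrong trace
}


def toy_teacher(question: str) -> str:
    return CANNED_TRACES[question]


def ends_with_answer(trace: str, answer: str) -> bool:
    return trace.strip().endswith(f"Answer: {answer}")


problems = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is 3 * 3?", "answer": "9"},
]
dataset = build_distillation_set(problems, toy_teacher, ends_with_answer)
jsonl = "\n".join(json.dumps(example) for example in dataset)
print(jsonl)
```

The small model is then trained on the resulting pairs with ordinary supervised learning, which is why it only needs to "read" the reasoning rather than discover it.
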
Actionable Takeaways
- **Generate multiple options** before committing — produce 5+ solutions to a problem and grade them against each other before picking one
- **Pause deliberately** when facing a hard question — say "wait" out loud, double-check logic; the extra time consistently pays off
- **Prioritize practice over theory** — learn fundamentals minimally, then do the task and self-correct through failure rather than consuming endless tutorials
Quotes Worth Keeping
"Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar." (OpenAI, GPT-4 technical report)
"It started generating words like 'wait' or 'let me recalculate.' And over time it realizes that spending more time thinking leads to a higher score. So it started thinking longer and longer by itself."
"You need the genius to write it, but not to read it."
"We are not maximizing money here. We are maximizing meaning."