Transformers, the tech behind LLMs | Deep Learning Chapter 5

3Blue1Brown · 2026-05-22

TL;DR

A visual, ground-up explanation of how data flows through a Transformer, from tokenization and word embeddings through attention blocks and MLP layers to the final probability prediction. Uses GPT-3's specific architecture numbers to make the abstract concrete, and lays the conceptual groundwork needed to understand the attention mechanism.

Key Concepts

Transformer

A specific neural network architecture underlying modern LLMs, originally designed for translation (Google, 2017)

Token

The atomic unit of text input — words, sub-words, punctuation, or character combinations; can also be image patches or audio chunks

Embedding

Converting a token into a high-dimensional vector such that directions in that space carry semantic meaning

Attention block

An operation where vectors "talk to each other" to update their meanings based on context

Multi-layer perceptron (MLP) / feed-forward layer

Each vector is updated independently and in parallel; conceptually like asking a list of questions about each vector

Context size

The fixed maximum number of vectors (tokens) a Transformer can process at once (GPT-3: 2,048)

Weights

The learned parameters of the model, always organized as matrices; distinct from the data being processed

Logits

The raw, unnormalized output values before softmax is applied

Softmax

A function that converts an arbitrary list of numbers into a valid probability distribution

Temperature

A scalar parameter that divides the logits before softmax is applied; it controls how peaked or uniform the output distribution is

Embedding Matrix (W_E)

Maps each token in the vocabulary to its embedding vector

Unembedding Matrix (W_U)

Maps the final context vector to one logit per token in the vocabulary; softmax then turns those logits into a probability distribution over all tokens

Notes

§How LLMs Generate Text

Model takes in a text snippet, outputs a probability distribution over possible next tokens
To generate: sample from that distribution → append to text → repeat
This is exactly what happens when ChatGPT produces one word at a time
GPT-2 (small) produces incoherent stories; GPT-3 (same architecture, ~100× larger) produces coherent ones
Chatbot behavior is achieved by prepending a system prompt that establishes the "helpful AI assistant" context, then letting the model autocomplete dialogue
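
The predict, sample, append loop described above can be sketched in a few lines. This is only an illustration: `toy_model` and `VOCAB` are hypothetical stand-ins invented here, not a real Transformer or GPT-3's vocabulary, and the probabilities are made up.

```python
import random

# Tiny made-up vocabulary (assumption for illustration only).
VOCAB = ["the", "cat", "sat", "down", "."]


def toy_model(tokens):
    """Hypothetical stand-in for a Transformer: one probability per VOCAB entry."""
    last = tokens[-1] if tokens else "."
    last_index = VOCAB.index(last) if last in VOCAB else -1
    next_index = (last_index + 1) % len(VOCAB)
    # Put 0.8 on the "next" word and spread the remaining 0.2 over the others.
    probs = [0.05] * len(VOCAB)
    probs[next_index] = 0.8
    return probs


def generate(prompt_tokens, steps, rng):
    tokens = list(prompt_tokens)
    for _ in range(steps):
        probs = toy_model(tokens)                            # 1. predict a distribution
        choice = rng.choices(VOCAB, weights=probs, k=1)[0]   # 2. sample from it
        tokens.append(choice)                                # 3. append, then repeat
    return tokens


sample = generate(["the"], steps=6, rng=random.Random(0))
print(" ".join(sample))
```

Because each step samples rather than always taking the top choice, different seeds can produce different continuations from the same prompt.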

§High-Level Data Flow Through a Transformer

§Deep Learning Background

Input: always an array (tensor) of real numbers
Layers: data is progressively transformed through many intermediate arrays
Parameters (weights): learned during training; interact with data only via weighted sums (i.e., matrix-vector multiplication)
Backpropagation: the training algorithm; for it to work well at scale, models have to follow a specific format, which is why the weights act on data through weighted sums
GPT-3 has 175 billion weights organized into ~28,000 distinct matrices across 8 categories
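
To make "weighted sums" concrete, here is a minimal sketch of a matrix-vector product in plain Python. The numbers are invented; `matvec`, `weights`, and `data` are illustrative names, not anything from GPT-3.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector.

    Each output entry is a weighted sum of the input entries,
    with one row of the matrix supplying the weights.
    """
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Learned parameters: fixed once training is done (made-up values).
weights = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
]

# Data: whatever input happens to be flowing through on this run.
data = [2.0, 4.0, 6.0]

output = matvec(weights, data)
print(output)  # [-4.0, 6.0]
```

The same `weights` would be reused for every input, while `data` changes from run to run.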

§Tokenization and Embedding

GPT-3 vocabulary: 50,257 tokens
Embedding dimension: 12,288
Embedding matrix size: 50,257 × 12,288 ≈ 617 million weights
Each token is initially just its column from the embedding matrix — no context yet
The network's job is to progressively enrich each vector with contextual meaning
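
The embedding numbers above can be checked with a line of arithmetic, and the lookup itself is just indexing. In this sketch the three-word vocabulary and the 4-dimensional vectors are made up; each inner list plays the role of one column of the embedding matrix.

```python
VOCAB_SIZE = 50_257   # GPT-3 vocabulary size, from the notes
EMBED_DIM = 12_288    # GPT-3 embedding dimension, from the notes

embedding_params = VOCAB_SIZE * EMBED_DIM
print(f"{embedding_params:,}")  # 617,558,016

# Toy lookup table (hypothetical values, 4 dimensions instead of 12,288).
toy_vocab = {"king": 0, "queen": 1, "cat": 2}
toy_embeddings = [
    [0.9, 0.1, 0.0, 0.3],  # "king"
    [0.9, -0.1, 0.0, 0.3],  # "queen"
    [0.0, 0.0, 0.8, 0.1],  # "cat"
]


def embed(tokens):
    """Look up each token's vector; no context is involved at this stage."""
    return [toy_embeddings[toy_vocab[t]] for t in tokens]


vectors = embed(["king", "cat", "king"])
# Both occurrences of "king" start out with the identical vector.
print(vectors[0] == vectors[2])  # True
```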

§Semantic Structure of the Embedding Space

Similar words land near each other in the high-dimensional space
Directions carry semantic meaning (not just individual points)
Classic example: king − man + woman ≈ queen
Other examples: Italy − Germany + Hitler ≈ Mussolini; Germany − Japan + Sushi ≈ Bratwurst
Dot product measures alignment between vectors:
  Positive → vectors point in similar directions
  Zero → perpendicular
  Negative → opposite directions
  Useful for testing semantic directions (e.g., a "plurality direction": cats − cat)
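
A toy version of the vector arithmetic and dot-product test is sketched below. The 3-dimensional vectors are hand-picked so the analogy works exactly; real embeddings are learned, have thousands of dimensions, and only satisfy such relations approximately.

```python
def dot(a, b):
    """Dot product: positive if aligned, zero if perpendicular, negative if opposed."""
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


# Hypothetical embeddings; dimensions chosen by hand as [royalty, gender, plurality].
king = [1.0, 1.0, 0.0]
queen = [1.0, -1.0, 0.0]
man = [0.0, 1.0, 0.0]
woman = [0.0, -1.0, 0.0]
kings = [1.0, 1.0, 1.0]
cat = [0.0, 0.0, 0.0]
cats = [0.0, 0.0, 1.0]

# king - man + woman lands on queen in this toy space.
analogy = add(sub(king, man), woman)
print(analogy == queen)  # True

# A "plurality direction", then a dot product to test other words against it.
plural_direction = sub(cats, cat)
print(dot(kings, plural_direction))  # 1.0 (plural: positive)
print(dot(king, plural_direction))   # 0.0 (singular: perpendicular)
print(dot(man, woman))               # -1.0 (opposite directions)
```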

§Context Size

GPT-3 context size: 2,048 tokens
Data flowing through the network: array of 2,048 columns × 12,288 dimensions
Exceeding context size = model "forgets" earlier parts of conversation
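
One simple way to picture the "forgetting" is a sliding window that keeps only the most recent tokens. This is an assumed strategy for illustration; the notes do not say how any particular chatbot handles overflow, and `fit_to_context` is a hypothetical helper.

```python
CONTEXT_SIZE = 2_048  # GPT-3's context size, from the notes


def fit_to_context(token_ids, context_size=CONTEXT_SIZE):
    """Keep only the most recent `context_size` tokens; earlier ones are dropped."""
    if context_size <= 0:
        raise ValueError("context_size must be positive")
    return token_ids[-context_size:]


conversation = list(range(3_000))  # stand-in token IDs 0..2999
window = fit_to_context(conversation)

print(len(window))  # 2048
print(window[0])    # 952, so tokens 0..951 are no longer visible to the model
```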

§The Unembedding Matrix and Final Prediction

A second matrix maps the last vector in the context to 50,257 logits (one per token)
During training, every vector in the final layer simultaneously predicts the token that follows its own position, so a single pass yields many training signals instead of one
Unembedding matrix: 50,257 × 12,288 ≈ another ~617 million weights
Running total so far: ~1.2 billion (out of 175 billion total)
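
The running total can be verified directly, and the unembedding step itself is one more matrix-vector product. In this sketch the 3-token, 2-dimensional `toy_unembedding` and `last_vector` are invented values; only the GPT-3 sizes come from the notes.

```python
VOCAB_SIZE = 50_257
EMBED_DIM = 12_288
TOTAL_PARAMS = 175_000_000_000  # GPT-3 total, from the notes

embedding_params = VOCAB_SIZE * EMBED_DIM
unembedding_params = VOCAB_SIZE * EMBED_DIM
running_total = embedding_params + unembedding_params

print(f"{running_total:,}")                  # 1,235,116,032
print(f"{running_total / TOTAL_PARAMS:.2%}")  # 0.71%

# Toy unembedding: one row per vocabulary token, one column per embedding dimension.
toy_unembedding = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
last_vector = [2.0, -1.0]  # final-layer vector at the last position (made up)

# One logit per token: the dot product of that token's row with the vector.
logits = [sum(w * x for w, x in zip(row, last_vector)) for row in toy_unembedding]
print(logits)  # [2.0, -1.0, 1.0]
```

These logits are still raw scores; they only become probabilities after softmax.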

§Softmax

Converts arbitrary real numbers (logits) into a valid probability distribution (values ∈ [0,1], sum = 1)
Mechanics: exponentiate each value → divide each by the sum
Largest logit dominates but smaller values still get weight — "softer than argmax"
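Temperature T: divides each logit inside the exponent, so each term becomes exp(x_i / T)
  Higher T → more uniform distribution → more creative / risky outputs
  Lower T → distribution more peaked → more predictable outputs
  T = 0 (taken as the limit T → 0, since dividing by zero is undefined) → all weight goes to the largest logit, so the single most probable token is always picked (deterministic)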
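
A minimal softmax with temperature might look like the following. Treating `temperature == 0` as "put all weight on the largest logit" and subtracting the maximum before exponentiating are implementation choices assumed here, not details taken from the video.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn arbitrary real numbers into probabilities that sum to 1."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit T -> 0: all probability mass goes to the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    # Subtracting the max avoids overflow and does not change the result.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 4.0]
for t in (0, 0.5, 1.0, 5.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
```

Running it shows the largest logit taking nearly all the weight at low T and the distribution flattening toward uniform as T grows.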

Actionable Takeaways

1. When reading about Transformers, mentally separate weights (learned, static during inference) from data (the changing vectors flowing through); they play fundamentally different roles
2. To build intuition for attention, first get comfortable with dot products as similarity measures and softmax as a normalization tool; both appear repeatedly inside attention blocks
3. Use the GPT-3 parameter count as a sanity-check scaffold: embedding matrix (~617M) + unembedding matrix (~617M) account for ~1.2B of 175B total; the remaining ~174B live in attention and MLP layers
4. Think of each token vector not as a fixed word representation but as a mutable container that accumulates contextual meaning as it passes through the network

Quotes Worth Keeping

“You should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word King might progressively get tugged and pulled by various blocks in this network so that by the end it points in a much more specific and nuanced direction.”

“You should draw a very sharp distinction in your mind between the weights of the model (the actual brains, the things learned during training) and the data being processed, which simply encodes whatever specific input is fed into the model for a given run.”

“It really doesn't feel like this should actually work.”
