Transformers, the tech behind LLMs | Deep Learning Chapter 5

3Blue1Brown · 2026-05-22 · via captions

A visual, ground-up explanation of how data flows through a Transformer, from tokenization and word embeddings through attention blocks and MLP layers to final probability prediction. Uses GPT-3's specific architecture numbers to make the abstract concrete, and lays the conceptual groundwork needed to understand the attention mechanism.

Key Concepts

| Concept | Definition |
|---|---|
| Transformer | A specific neural network architecture underlying modern LLMs, originally designed for translation (Google, 2017) |
| Token | The atomic unit of text input: words, sub-words, punctuation, or character combinations; can also be image patches or audio chunks |
| Embedding | Converting a token into a high-dimensional vector such that directions in that space carry semantic meaning |
| Attention block | An operation where vectors "talk to each other" to update their meanings based on context |
| Multi-layer perceptron (MLP) / feed-forward layer | An operation in which each vector is updated independently and in parallel; conceptually like asking a list of questions about each vector |
| Context size | The fixed maximum number of vectors (tokens) a Transformer can process at once (GPT-3: 2,048) |
| Weights | The learned parameters of the model, always organized as matrices; distinct from the data being processed |
| Logits | The raw, unnormalized output values before softmax is applied |
| Softmax | A function that converts an arbitrary list of numbers into a valid probability distribution |
| Temperature | A scalar parameter added to softmax that controls how peaked or uniform the output distribution is |
| Embedding Matrix (W_E) | Maps each token in the vocabulary to its embedding vector |
| Unembedding Matrix (W_U) | Maps the final context vector to one logit per token in the vocabulary; softmax then turns those logits into a probability distribution |

Notes

How LLMs Generate Text

  • Model takes in a text snippet, outputs a **probability distribution** over possible next tokens
  • To generate: sample from that distribution → append to text → repeat
  • This repeated predict-and-sample loop is essentially what happens when ChatGPT produces text one word at a time
  • The much smaller GPT-2 produces incoherent stories; GPT-3 (same basic architecture, roughly 100× the parameters of the largest GPT-2) produces coherent ones
  • Chatbot behavior is achieved by prepending a **system prompt** that establishes the "helpful AI assistant" context, then letting the model autocomplete dialogue
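
The predict, sample, append, repeat loop above can be sketched in a few lines. This is only an illustration: `TOY_MODEL` and `next_token_distribution` are hypothetical stand-ins for a trained network, and the tiny vocabulary and probabilities are made up. A real LLM conditions on the whole context rather than just the last token.

```python
import random

# Hypothetical stand-in for a trained model: maps the most recent token to a
# probability distribution over the next token. All numbers are invented.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.2, "ran": 0.8},
    "sat": {"the": 1.0},
    "ran": {"the": 1.0},
}


def next_token_distribution(tokens):
    """Return {token: probability} for whatever follows `tokens`."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, n_new_tokens, rng):
    """Extend the prompt by repeatedly predicting and sampling."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        dist = next_token_distribution(tokens)            # 1. predict
        candidates = list(dist.keys())
        weights = list(dist.values())
        chosen = rng.choices(candidates, weights=weights, k=1)[0]  # 2. sample
        tokens.append(chosen)                             # 3. append, repeat
    return tokens


rng = random.Random(0)  # seeded so repeated runs give the same text
story = generate(["the"], 5, rng)
print(" ".join(story))
```

Because the next token is sampled rather than always taken as the most likely one, different seeds can produce different continuations from the same prompt.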

High-Level Data Flow Through a Transformer

    Deep Learning Background

    • **Input**: always an array (tensor) of real numbers
    • **Layers**: data is progressively transformed through many intermediate arrays
    • **Parameters (weights)**: learned during training; interact with data only via **weighted sums** (i.e., matrix-vector multiplication)
    • **Backpropagation**: the training algorithm; for it to work well at scale, models have to follow a specific format
    • GPT-3 has 175 billion weights organized into ~28,000 distinct matrices across 8 categories
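
To make "weighted sums" concrete, here is a minimal sketch of a matrix-vector product in plain Python. The helper name `matvec` and the small numbers are illustrative assumptions; in a real model the matrix holds learned weights and the vector is the data being processed.

```python
def matvec(weights, data):
    """Multiply a matrix (list of rows) by a vector (list of numbers).

    Each output entry is a weighted sum of the input entries, with one
    row of the matrix supplying the weights.
    """
    return [sum(w * x for w, x in zip(row, data)) for row in weights]


# Weights: fixed once training is done (toy values here).
weights = [
    [1.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
]

# Data: changes with every input fed to the model.
data = [2.0, 4.0, 6.0]

output = matvec(weights, data)
print(output)  # [-4.0, 6.0]
```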

    Tokenization and Embedding

    • GPT-3 vocabulary: **50,257 tokens**
    • Embedding dimension: **12,288**
    • Embedding matrix size: 50,257 × 12,288 ≈ **617 million weights**
    • Each token is initially just its column from the embedding matrix — no context yet
    • The network's job is to progressively enrich each vector with contextual meaning
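
The parameter count and the lookup step can be checked with a short sketch. The GPT-3 sizes come from the notes above; the three-word vocabulary, the 4-dimensional vectors, and the `embed` helper are invented for illustration. The toy matrix is stored with one list per token for convenience, whereas the video draws each token's vector as a column.

```python
VOCAB_SIZE = 50_257   # GPT-3 vocabulary size
EMBED_DIM = 12_288    # GPT-3 embedding dimension

EMBEDDING_WEIGHTS = VOCAB_SIZE * EMBED_DIM
print(EMBEDDING_WEIGHTS)  # 617558016, i.e. about 617 million

# Toy embedding matrix: one made-up 4-dimensional vector per token.
toy_vocab = {"the": 0, "king": 1, "queen": 2}
toy_embedding = [
    [0.1, 0.0, 0.2, 0.3],   # "the"
    [0.9, 0.4, 0.0, 0.7],   # "king"
    [0.9, 0.5, 0.8, 0.7],   # "queen"
]


def embed(tokens):
    """Look up each token's vector; no context is mixed in at this stage."""
    return [toy_embedding[toy_vocab[t]] for t in tokens]


vectors = embed(["the", "king"])
print(vectors)
```

At this point the same word always maps to the same vector regardless of its neighbors, which is why the later layers are needed.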

    Semantic Structure of the Embedding Space

    • Similar words land near each other in the high-dimensional space
    • **Directions** carry semantic meaning (not just individual points)
    • Classic example: `king − man + woman ≈ queen`
    • Other examples: `Italy − Germany + Hitler ≈ Mussolini`; `Germany − Japan + Sushi ≈ Bratwurst`
    • **Dot product** measures alignment between vectors:
        • Positive → vectors point in similar directions
        • Zero → perpendicular
        • Negative → opposite directions
        • Useful for testing semantic directions: e.g., the dot product with a "plurality direction" (`cats − cat`) tends to be higher for plural nouns than for singular ones
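
The vector arithmetic and the sign of the dot product can be illustrated with hand-made vectors. Everything in `toy` is an assumption chosen so the analogy works out exactly; real embeddings are learned, have thousands of dimensions, and only satisfy such relations approximately.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up 3-dimensional "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "man":   [1.0, 0.0, 0.25],
    "woman": [1.0, 1.0, 0.25],
    "king":  [3.0, 0.0, 0.75],
    "queen": [3.0, 1.0, 0.75],
}

# king - man + woman lands on (here: exactly at) queen.
candidate = add(sub(toy["king"], toy["man"]), toy["woman"])


def squared_distance(a, b):
    return dot(sub(a, b), sub(a, b))


nearest = min(toy, key=lambda word: squared_distance(toy[word], candidate))
print(nearest)  # queen

# Dot products against a "gender direction" show the three sign cases.
gender = sub(toy["woman"], toy["man"])                # [0.0, 1.0, 0.0]
print(dot(gender, sub(toy["queen"], toy["king"])))    # positive: aligned
print(dot(gender, [1.0, 0.0, 0.0]))                   # zero: perpendicular
print(dot(gender, sub(toy["man"], toy["woman"])))     # negative: opposite
```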

    Context Size

    • GPT-3 context size: **2,048 tokens**
    • Data flowing through the network: array of 2,048 columns × 12,288 dimensions
    • Exceeding context size = model "forgets" earlier parts of conversation
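
One simple way the "forgetting" can arise is sketched below: if only the most recent tokens are kept, anything earlier never reaches the model. Keeping the last `CONTEXT_SIZE` tokens is an assumed strategy for illustration, and `fit_to_context` is a hypothetical helper; the notes do not say how any particular chatbot handles overflow.

```python
CONTEXT_SIZE = 2_048  # GPT-3 context size, in tokens


def fit_to_context(token_ids, context_size=CONTEXT_SIZE):
    """Keep only the most recent `context_size` tokens (assumed strategy)."""
    return token_ids[-context_size:]


conversation = list(range(3_000))        # stand-in for 3,000 token ids
visible = fit_to_context(conversation)

print(len(visible))   # 2048
print(visible[0])     # 952: the first 952 tokens are no longer seen
```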

    The Unembedding Matrix and Final Prediction

    • A second matrix maps the **last vector** in the context to 50,257 logits (one per token)
    • During training, **every** vector in the final layer simultaneously predicts what comes after its position, which makes training much more efficient
    • Unembedding matrix: 50,257 × 12,288 ≈ **another ~617 million weights**
    • Running total so far: ~1.2 billion (out of 175 billion total)
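
The final step and the running parameter total can be sketched as follows. The GPT-3 sizes are taken from the notes; `toy_W_U`, the two-position `context`, and the `unembed` helper are made-up illustrations with a 3-token vocabulary and 2-dimensional vectors.

```python
VOCAB_SIZE = 50_257
EMBED_DIM = 12_288

embedding_weights = VOCAB_SIZE * EMBED_DIM
unembedding_weights = VOCAB_SIZE * EMBED_DIM
total = embedding_weights + unembedding_weights
print(total)  # 1235116032, i.e. roughly 1.2 billion


def unembed(W_U, last_vector):
    """One logit per vocabulary token: each row of W_U dotted with the vector."""
    return [sum(w * x for w, x in zip(row, last_vector)) for row in W_U]


# Toy unembedding matrix: 3 tokens (rows) x 2 embedding dimensions (columns).
toy_W_U = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

# Toy context after the last layer: one vector per position.
context = [[0.5, 0.5], [2.0, -1.0]]

# For next-token prediction, only the last vector is used.
logits = unembed(toy_W_U, context[-1])
print(logits)  # [2.0, -1.0, 1.0]
```

The logits are still arbitrary real numbers at this point; turning them into probabilities is the job of softmax.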

    Softmax

    • Converts arbitrary real numbers (logits) into a valid probability distribution (values ∈ [0,1], sum = 1)
    • Mechanics: exponentiate each value → divide each by the sum
    • Largest logit dominates but smaller values still get weight — "softer than argmax"
    • **Temperature T**: inserted as a divisor in the exponent
        • Higher T → more uniform distribution → more creative / risky outputs
        • Lower T → distribution more peaked → more predictable outputs
        • T = 0 → all weight goes to the largest logit, so the single most probable token is always picked (deterministic); strictly, this is the limit as T → 0, since the formula divides by T
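
A minimal softmax-with-temperature sketch follows, using only the standard library. Two details are additions not taken from the notes: subtracting the largest value before exponentiating, a common numerical-stability trick that leaves the result unchanged, and the explicit `temperature == 0` branch, which implements the limiting behavior instead of dividing by zero.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn arbitrary real numbers into probabilities that sum to 1."""
    if temperature == 0:
        # Limit as T -> 0: all weight on the largest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    # Subtracting the max avoids overflow and does not change the output.
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
base = softmax(logits)                     # T = 1
flat = softmax(logits, temperature=5.0)    # higher T: closer to uniform
sharp = softmax(logits, temperature=0.5)   # lower T: more peaked
greedy = softmax(logits, temperature=0)    # all weight on the max

for name, probs in [("T=1", base), ("T=5", flat), ("T=0.5", sharp), ("T=0", greedy)]:
    print(name, [round(p, 3) for p in probs])
```

In each case the largest logit keeps the largest probability; the temperature only changes how much of the total it takes.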

    Actionable Takeaways

    1. When reading about Transformers, mentally separate **weights** (learned, static during inference) from **data** (the changing vectors flowing through) — they play fundamentally different roles
    2. To build intuition for attention, first get comfortable with dot products as similarity measures and softmax as a normalization tool — both appear repeatedly inside attention blocks
    3. Use the GPT-3 parameter count as a sanity-check scaffold: embedding matrix (~617M) + unembedding matrix (~617M) account for ~1.2B of 175B total; the remaining ~174B live in attention and MLP layers
    4. Think of each token vector not as a fixed word representation but as a **mutable container** that accumulates contextual meaning as it passes through the network

    Quotes Worth Keeping

    You should think of them as having the capacity to soak in context. A vector that started its life as the embedding of the word King might progressively get tugged and pulled by various blocks in this network so that by the end it points in a much more specific and nuanced direction.
    You should draw a very sharp distinction in your mind between the weights of the model — the actual brains, the things learned during training — and the data being processed, which simply encodes whatever specific input is fed into the model for a given run.
    It really doesn't feel like this should actually work.