The test sentence (neutral, fresh)
Let’s take a fresh sentence, so you can see that the steps generalize:
“Emma thanked David politely.”
We will follow one token’s journey (say, “thanked”) through the Transformer.
Step 0 — Tokenization (splitting the input)
For simplicity (as in the blogs), we treat whole words as tokens:
[Emma] [thanked] [David] [politely]
💡(Real tokenizers may split further — concept unchanged.)
Step 1 — Token IDs (labels, not meaning)
Each token is mapped to an ID:
Emma → 314
thanked → 982
David → 217
politely → 654
💡 IDs are just name tags.
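In code, Steps 0–1 amount to a simple lookup. This toy vocabulary is made up to match the IDs above; real models learn vocabularies with tens of thousands of entries:

```python
# Hypothetical toy vocabulary; real tokenizers learn ~50k+ subword entries.
vocab = {"Emma": 314, "thanked": 982, "David": 217, "politely": 654}

tokens = "Emma thanked David politely".split()
ids = [vocab[t] for t in tokens]  # [314, 982, 217, 654]
```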
Step 2 — Embeddings (profiles)
Each ID looks up a learned embedding vector.
Assume tiny dimension d = 4 (toy world):
Emma → [ 0.10, 0.30, -0.20, 0.40]
thanked → [ 0.50, -0.10, 0.20, 0.10]
David → [ 0.15, 0.25, -0.30, 0.35]
politely → [-0.20, 0.60, 0.10, -0.10]
💡 Now tokens have “profiles”.
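The embedding step is also just a lookup, this time returning a vector instead of a number. A minimal sketch using the toy values above:

```python
import numpy as np

# Toy lookup table: token ID -> learned 4-dim embedding (values from above).
embedding_table = {
    314: [0.10, 0.30, -0.20, 0.40],   # Emma
    982: [0.50, -0.10, 0.20, 0.10],   # thanked
    217: [0.15, 0.25, -0.30, 0.35],   # David
    654: [-0.20, 0.60, 0.10, -0.10],  # politely
}
e_thanked = np.array(embedding_table[982])
```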
Step 3 — Positional Encoding (seat numbers)
Assign positions:
Emma → position 0
thanked → position 1
David → position 2
politely → position 3
Assume simple positional vectors:
p0 → [0.01, 0.00, 0.00, 0.00]
p1 → [0.00, 0.01, 0.00, 0.00]
p2 → [0.00, 0.00, 0.01, 0.00]
p3 → [0.00, 0.00, 0.00, 0.01]
Add embedding + position:
x_thanked = [0.50, -0.10, 0.20, 0.10]
+ [0.00, 0.01, 0.00, 0.00]
= [0.50, -0.09, 0.20, 0.10]
💡 Profile + seat number
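The addition above can be checked in a couple of lines. Real positional encodings use sinusoids or learned vectors, but the addition step is identical:

```python
import numpy as np

e_thanked = np.array([0.50, -0.10, 0.20, 0.10])  # embedding (profile)
p1 = np.array([0.00, 0.01, 0.00, 0.00])          # position-1 vector (seat)

x_thanked = e_thanked + p1  # elementwise addition
# -> [0.50, -0.09, 0.20, 0.10]
```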
Step 4 — Q, K, V (three learned views)
From the same vector x, create three views:
Q = x · WQ
K = x · WK
V = x · WV
Assume tiny learned matrices produce:
Q_thanked = [0.6, 0.2]
K_thanked = [0.4, 0.1]
V_thanked = [0.3, 0.5]
Every token gets its own Q/K/V.
💡 Same person, three badges.
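A sketch of the three projections. The matrices here are random stand-ins for the learned weights WQ/WK/WV (the real values come from training):

```python
import numpy as np

x = np.array([0.50, -0.09, 0.20, 0.10])  # "thanked" after Step 3

# Hypothetical 4x2 projection matrices; real models learn these.
rng = np.random.default_rng(0)
W_Q = rng.normal(scale=0.5, size=(4, 2))
W_K = rng.normal(scale=0.5, size=(4, 2))
W_V = rng.normal(scale=0.5, size=(4, 2))

# Three different 2-dim views of the same token vector.
q, k, v = x @ W_Q, x @ W_K, x @ W_V
```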
Step 5 — Self-Attention (who matters to me?)
Now “thanked” compares its Query with everyone’s Key:
score(thanked, Emma) = Q_thanked · K_Emma
score(thanked, thanked) = Q_thanked · K_thanked
score(thanked, David) = Q_thanked · K_David
score(thanked, politely) = Q_thanked · K_politely
Assume raw scores (real Transformers also divide these by √d_k to keep them in a stable range):
[ 1.2, 2.4, 1.8, 0.6 ]
Apply softmax:
attention weights ≈ [0.15, 0.50, 0.27, 0.08]
Now compute weighted sum of Values:
output_thanked = 0.15·V_Emma + 0.50·V_thanked + 0.27·V_David + 0.08·V_politely
This becomes a context-aware vector.
💡 “thanked” now understands who did what to whom.
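The whole of Step 5 fits in a few lines. The scores are the ones assumed above; the Value vectors are made-up 2-dim stand-ins:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability (doesn't change result)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.2, 2.4, 1.8, 0.6])  # raw scores from the example
weights = softmax(scores)                # ~ [0.15, 0.50, 0.27, 0.08]

# Hypothetical 2-dim Value vectors for Emma, thanked, David, politely.
V = np.array([[0.2, 0.4],
              [0.3, 0.5],
              [0.1, 0.6],
              [0.4, 0.2]])

output_thanked = weights @ V  # context-aware vector for "thanked"
```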
Step 6 — Multi-Head Attention
Repeat Step 5 multiple times in parallel, each with different WQ/WK/WV.
Each head notices something different:
- one focuses on subject–verb,
- one on verb–object,
- one on adverbs.
Concatenate all head outputs → project back.
💡 Multiple spotlights.
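A minimal multi-head sketch, with random stand-ins for the learned weights. Each head runs the Step-5 recipe with its own WQ/WK/WV, then the outputs are concatenated and projected back to the model dimension:

```python
import numpy as np

d_model, n_heads, d_head = 4, 2, 2   # toy sizes; d_model = n_heads * d_head
X = np.random.default_rng(1).normal(size=(4, d_model))  # 4 token vectors

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)                  # scaled dot products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V

rng = np.random.default_rng(2)
heads = []
for _ in range(n_heads):  # each head gets its own (hypothetical) weights
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X, W_Q, W_K, W_V))

W_O = rng.normal(size=(n_heads * d_head, d_model))      # output projection
out = np.concatenate(heads, axis=-1) @ W_O              # back to (4, d_model)
```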
Step 7 — Feed-Forward Network (private thinking)
Each token independently goes through:
Linear → ReLU → Linear
Example:
[0.7, -0.3, 0.2, 0.1]
→ ReLU
[0.7, 0.0, 0.2, 0.1]
💡 Refinement, not interaction.
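The feed-forward step in miniature. The weights here are random placeholders; real FFNs expand to a larger hidden size (often 4× d_model) before projecting back down:

```python
import numpy as np

x = np.array([0.7, -0.3, 0.2, 0.1])  # one token's vector

def relu(z):
    return np.maximum(z, 0.0)  # zeroes out negatives, as in the example

# Hypothetical learned weights: expand 4 -> 8, then project 8 -> 4.
rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)

h = relu(x @ W1 + b1)  # Linear -> ReLU
y = h @ W2 + b2        # -> Linear; each token is processed independently
```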
Step 8 — Masking (no peeking)
When generating next words, tokens cannot attend to future positions.
Future scores are set to −∞ before the softmax, so their attention weights come out as zero.
💡 Language must be earned left-to-right.
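The causal mask in code. Setting future positions to −∞ means the softmax assigns them exactly zero weight:

```python
import numpy as np

T = 4  # sequence length
scores = np.random.default_rng(4).normal(size=(T, T))  # stand-in raw scores

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                            # -inf -> softmax weight 0

w = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = w / w.sum(axis=-1, keepdims=True)       # row-wise softmax
```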
Step 9 — Logits → Softmax → Next Word
Final vector → linear layer → logits:
[ "politely": 2.1,
"sincerely": 1.7,
"today": 0.4,
"." : 1.9 ]
Softmax → probabilities:
politely → 0.37
"." → 0.31
sincerely → 0.25
today → 0.07
Choose one → append → repeat.
💡 The model speaks.
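The final step, using the logits from the example. Greedy decoding simply picks the highest-probability token (real systems often sample instead):

```python
import numpy as np

# Logits from the example above: one score per candidate next token.
logits = {"politely": 2.1, "sincerely": 1.7, "today": 0.4, ".": 1.9}

z = np.array(list(logits.values()))
p = np.exp(z - z.max())
p /= p.sum()  # softmax: scores -> probabilities summing to 1
probs = dict(zip(logits, p))  # politely~0.37, sincerely~0.25, today~0.07, "."~0.31

next_token = max(probs, key=probs.get)  # greedy pick -> "politely"
```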
The Transformer Mnemonic
Think of the Transformer as a well-run conversation, remembered as:
“Meet → Sit → Look → Listen → Think → Block → Speak”
Now map it gently:
- Meet → Tokenization + IDs: words enter the room.
- Sit → Embeddings + Positional Encoding: everyone gets a profile and a seat.
- Look → Queries (Q): “Who should I pay attention to?”
- Listen → Keys & Values (K, V) + Self-Attention: “What do others offer, and what do I take in?”
- Think → Multi-Head Attention + Feed-Forward Network: multiple perspectives, then private reflection.
- Block → Masking: no peeking into the future.
- Speak → Logits → Softmax → Next Token: choose the next word and say it.