The test sentence (neutral, fresh)
Let’s take a fresh sentence, so you can see that the steps generalize:
“Emma thanked David politely.”
We will follow one token’s journey (say, “thanked”) through the Transformer.
Step 0 — Tokenization (splitting the input)
For simplicity (as in the blogs), we treat whole words as tokens:
[Emma] [thanked] [David] [politely]
💡(Real tokenizers may split further — concept unchanged.)
Step 1 — Token IDs (labels, not meaning)
Each token is mapped to an ID:
Emma → 314
thanked → 982
David → 217
politely → 654
💡 IDs are just name tags.
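In code, Steps 0–1 amount to a simple lookup. This toy vocabulary is made up to match the IDs above; real models learn vocabularies with tens of thousands of entries:

```python
# Hypothetical toy vocabulary; real tokenizers learn ~50k+ subword entries.
vocab = {"Emma": 314, "thanked": 982, "David": 217, "politely": 654}

tokens = "Emma thanked David politely".split()
ids = [vocab[t] for t in tokens]  # [314, 982, 217, 654]
```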
Step 2 — Embeddings (profiles)
Each ID looks up a learned embedding vector.
Assume tiny dimension d = 4 (toy world):
Emma → [ 0.10, 0.30, -0.20, 0.40]
thanked → [ 0.50, -0.10, 0.20, 0.10]
David → [ 0.15, 0.25, -0.30, 0.35]
politely → [-0.20, 0.60, 0.10, -0.10]
💡 Now tokens have “profiles”.
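The embedding step is also just a lookup, this time returning a vector instead of a number. A minimal sketch using the toy values above:

```python
import numpy as np

# Toy lookup table: token ID -> learned 4-dim embedding (values from above).
embedding_table = {
    314: [0.10, 0.30, -0.20, 0.40],   # Emma
    982: [0.50, -0.10, 0.20, 0.10],   # thanked
    217: [0.15, 0.25, -0.30, 0.35],   # David
    654: [-0.20, 0.60, 0.10, -0.10],  # politely
}
e_thanked = np.array(embedding_table[982])
```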
Step 3 — Positional Encoding (seat numbers)
Assign positions:
Emma → position 0
thanked → position 1
David → position 2
politely → position 3
Assume simple positional vectors:
p0 → [0.01, 0.00, 0.00, 0.00]
p1 → [0.00, 0.01, 0.00, 0.00]
p2 → [0.00, 0.00, 0.01, 0.00]
p3 → [0.00, 0.00, 0.00, 0.01]
Add embedding + position:
x_thanked = [0.50, -0.10, 0.20, 0.10]
+ [0.00, 0.01, 0.00, 0.00]
= [0.50, -0.09, 0.20, 0.10]
💡 Profile + seat number
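The addition above can be checked in a couple of lines. Real positional encodings use sinusoids or learned vectors, but the addition step is identical:

```python
import numpy as np

e_thanked = np.array([0.50, -0.10, 0.20, 0.10])  # embedding (profile)
p1 = np.array([0.00, 0.01, 0.00, 0.00])          # position-1 vector (seat)

x_thanked = e_thanked + p1  # elementwise addition
# -> [0.50, -0.09, 0.20, 0.10]
```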
Step 4 — Q, K, V (three learned views)
From the same vector x, create three views:
Q = x · WQ
K = x · WK
V = x · WV
Assume tiny learned matrices produce:
Q_thanked = [0.6, 0.2]
K_thanked = [0.4, 0.1]
V_thanked = [0.3, 0.5]
Every token gets its own Q/K/V.
💡 Same person, three badges.
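A sketch of the three projections. The matrices here are random stand-ins for the learned weights WQ/WK/WV (the real values come from training):

```python
import numpy as np

x = np.array([0.50, -0.09, 0.20, 0.10])  # "thanked" after Step 3

# Hypothetical 4x2 projection matrices; real models learn these.
rng = np.random.default_rng(0)
W_Q = rng.normal(scale=0.5, size=(4, 2))
W_K = rng.normal(scale=0.5, size=(4, 2))
W_V = rng.normal(scale=0.5, size=(4, 2))

# Three different 2-dim views of the same token vector.
q, k, v = x @ W_Q, x @ W_K, x @ W_V
```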
Step 5 — Self-Attention (who matters to me?)
Now “thanked” compares its Query with everyone’s Key:
score(thanked, Emma) = Q_thanked · K_Emma
score(thanked, thanked) = Q_thanked · K_thanked
score(thanked, David) = Q_thanked · K_David
score(thanked, politely) = Q_thanked · K_politely
Assume raw scores (real Transformers also divide these by √d_k to keep them in a stable range):
[ 1.2, 2.4, 1.8, 0.6 ]
Apply softmax:
attention weights ≈ [0.15, 0.50, 0.27, 0.08]
Now compute weighted sum of Values:
output_thanked = 0.15·V_Emma + 0.50·V_thanked + 0.27·V_David + 0.08·V_politely
This becomes a context-aware vector.
💡 “thanked” now understands who did what to whom.
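The whole of Step 5 fits in a few lines. The scores are the ones assumed above; the Value vectors are made-up 2-dim stand-ins:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability (doesn't change result)
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.2, 2.4, 1.8, 0.6])  # raw scores from the example
weights = softmax(scores)                # ~ [0.15, 0.50, 0.27, 0.08]

# Hypothetical 2-dim Value vectors for Emma, thanked, David, politely.
V = np.array([[0.2, 0.4],
              [0.3, 0.5],
              [0.1, 0.6],
              [0.4, 0.2]])

output_thanked = weights @ V  # context-aware vector for "thanked"
```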
Step 6 — Multi-Head Attention
Repeat Step 5 multiple times in parallel, each with different WQ/WK/WV.
Each head notices something different:
- one focuses on subject–verb,
- one on verb–object,
- one on adverbs.
Concatenate all head outputs → project back.
💡 Multiple spotlights.
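A minimal multi-head sketch, with random stand-ins for the learned weights. Each head runs the Step-5 recipe with its own WQ/WK/WV, then the outputs are concatenated and projected back to the model dimension:

```python
import numpy as np

d_model, n_heads, d_head = 4, 2, 2   # toy sizes; d_model = n_heads * d_head
X = np.random.default_rng(1).normal(size=(4, d_model))  # 4 token vectors

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)                  # scaled dot products
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V

rng = np.random.default_rng(2)
heads = []
for _ in range(n_heads):  # each head gets its own (hypothetical) weights
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X, W_Q, W_K, W_V))

W_O = rng.normal(size=(n_heads * d_head, d_model))      # output projection
out = np.concatenate(heads, axis=-1) @ W_O              # back to (4, d_model)
```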
Step 7 — Feed-Forward Network (private thinking)
Each token independently goes through:
Linear → ReLU → Linear
Example:
[0.7, -0.3, 0.2, 0.1]
→ ReLU
[0.7, 0.0, 0.2, 0.1]
💡 Refinement, not interaction.
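The feed-forward step in miniature. The weights here are random placeholders; real FFNs expand to a larger hidden size (often 4× d_model) before projecting back down:

```python
import numpy as np

x = np.array([0.7, -0.3, 0.2, 0.1])  # one token's vector

def relu(z):
    return np.maximum(z, 0.0)  # zeroes out negatives, as in the example

# Hypothetical learned weights: expand 4 -> 8, then project 8 -> 4.
rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)

h = relu(x @ W1 + b1)  # Linear -> ReLU
y = h @ W2 + b2        # -> Linear; each token is processed independently
```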
Step 8 — Masking (no peeking)
When generating next words, tokens cannot attend to future positions.
Future scores are set to −∞ before the softmax, so their attention weights come out as zero.
💡 Language must be earned left-to-right.
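The causal mask in code. Setting future positions to −∞ means the softmax assigns them exactly zero weight:

```python
import numpy as np

T = 4  # sequence length
scores = np.random.default_rng(4).normal(size=(T, T))  # stand-in raw scores

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                            # -inf -> softmax weight 0

w = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = w / w.sum(axis=-1, keepdims=True)       # row-wise softmax
```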
Step 9 — Logits → Softmax → Next Word
Final vector → linear layer → logits:
[ "politely": 2.1,
"sincerely": 1.7,
"today": 0.4,
"." : 1.9 ]
Softmax → probabilities:
politely → 0.37
"." → 0.31
sincerely → 0.25
today → 0.07
Choose one → append → repeat.
💡 The model speaks.
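The final step, using the logits from the example. Greedy decoding simply picks the highest-probability token (real systems often sample instead):

```python
import numpy as np

# Logits from the example above: one score per candidate next token.
logits = {"politely": 2.1, "sincerely": 1.7, "today": 0.4, ".": 1.9}

z = np.array(list(logits.values()))
p = np.exp(z - z.max())
p /= p.sum()  # softmax: scores -> probabilities summing to 1
probs = dict(zip(logits, p))  # politely~0.37, sincerely~0.25, today~0.07, "."~0.31

next_token = max(probs, key=probs.get)  # greedy pick -> "politely"
```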
The Transformer Mnemonic
Think of the Transformer as a well-run conversation, remembered as:
“Meet → Sit → Look → Listen → Think → Block → Speak”
Now map it gently:
- Meet → Tokenization + IDs: words enter the room.
- Sit → Embeddings + Positional Encoding: everyone gets a profile and a seat.
- Look → Queries (Q): “Who should I pay attention to?”
- Listen → Keys & Values (K, V) + Self-Attention: “What do others offer, and what do I take in?”
- Think → Multi-Head Attention + Feed-Forward Network: multiple perspectives, then private reflection.
- Block → Masking: no peeking into the future.
- Speak → Logits → Softmax → Next Token: choose the next word and say it.