Transformers are usually taught like someone dumped tokenization + embeddings + Q/K/V + softmax + ReLU + multi-head into a blender, hit “Turbo,” and called it intuition. If you’ve ever nodded politely while your mind quietly left your body—welcome…
In this mini-series, we’ll learn the same mechanics using one friendly setting: A networking event.
Each word is a guest.
Attention is matchmaking.
And every step will follow a simple rhythm: This happens… but it’s not enough because… so we add the next step.
We’ll also do mini walk-throughs with small numbers, just enough math to see where the numbers come from, without turning this into a calculus documentary.
Our running sentence across all 5 parts:
“Alice introduced Bob to Chloe.”
Alright! Name tags on… Let’s walk in.
Name Tags vs Personality Profiles (Vectorization vs Embeddings)
Picture this: You walk into a buzzing networking mixer. People are smiling too hard, clutching paper cups of coffee like emotional-support animals, and pretending they totally love small talk.
Somewhere in this room is the future of your career… or at least a free cookie.
That mixer is exactly how we’re going to understand Transformers and Large Language Models (LLMs) without the “math blender” effect.
This is Part 1 of a 5-blog mini-series where we follow one tiny story again and again until your brain goes, “Oh. That’s it?”
Our running sentence (neutral, simple, and perfectly non-romantic): “Alice introduced Bob to Chloe.”
Now, before the model can do any fancy “Attention” matchmaking, it needs one basic thing: It must turn human text into something a machine can actually hold and manipulate.
Step 1: From sentence to tokens (the bouncer at the door)
A computer doesn’t read the sentence the way we do. First, it breaks the sentence into tokens (pieces). For now, imagine tokens are just words:
“Alice” | “introduced” | “Bob” | “to” | “Chloe”
Quick note: Real LLM tokenizers often split text into subword pieces (e.g., “introduc” + “ed”, or even smaller chunks).
For our first-pass intuition, we’ll treat each word as a single token—same story, less mental traffic.
At our networking event, tokens are the guests arriving at the entrance.
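The steps above can be sketched in a few lines. This is a deliberately naive word-level tokenizer, not what real LLMs use (they rely on subword schemes like BPE), but it matches the teaching simplification:

```python
# Toy word-level tokenizer: strip the final period, split on whitespace.
# Real LLM tokenizers (BPE, WordPiece, etc.) split into subword pieces instead.
def tokenize(sentence):
    return sentence.replace(".", "").split()

tokens = tokenize("Alice introduced Bob to Chloe.")
print(tokens)  # ['Alice', 'introduced', 'Bob', 'to', 'Chloe']
```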
Step 2: Token IDs (the registration number)
Next, each token gets a token ID from a big guest list (a vocabulary). Think of this as the registration desk handing out numbers:
- Alice → 1021
- introduced → 551
- Bob → 2044
- to → 17
- Chloe → 778
This step is what many people loosely call vectorization, i.e., turning text into numbers.
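In code, the registration desk is just a dictionary lookup. The ID values here are the made-up ones from the list above; real vocabularies assign IDs their own way:

```python
# Toy vocabulary: the "registration desk" mapping each token to an ID.
# These IDs are illustrative, not from any real tokenizer.
vocab = {"Alice": 1021, "introduced": 551, "Bob": 2044, "to": 17, "Chloe": 778}

tokens = ["Alice", "introduced", "Bob", "to", "Chloe"]
token_ids = [vocab[t] for t in tokens]
print(token_ids)  # [1021, 551, 2044, 17, 778]
```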
But here’s the catch.
These numbers are not meaning. They are just labels. Bob’s ID being 2044 does not make Bob twice as meaningful as a token whose ID is 1022. It’s like a jersey number: useful for identification, useless for understanding.
So the model asks a very reasonable question:
“Okay cool. I have registration numbers. But… who are these people?”
That’s where embeddings enter.
Step 3: Embeddings (the personality profile)
An embedding is a learned vector (a list of numbers) assigned to each token, like a personality profile attached to the name tag.
At a mixer, a name tag that says “Bob” helps you spot Bob. But a mini-profile helps you understand Bob:
- Role: engineer
- Interests: security, startups
- Style: quiet, sharp
In the model, that mini-profile becomes numbers. So instead of:
Bob → 2044
we now have:
Bob → [0.12, −0.07, 0.33, 0.90, …]
Each token becomes a vector of length d_model (think of it as how many “profile traits” we store). For teaching, we’ll keep it tiny.
Tiny toy example (so you see where numbers “come from”)
Imagine our world has only 6 tokens:
[Alice, introduced, Bob, to, Chloe, .]
And we choose a tiny embedding size of 4. Then we can imagine an embedding table like this (the model learns these values over training):
- Alice → [ 0.20, 0.10, −0.40, 0.90]
- introduced → [−0.30, 0.80, 0.10, −0.20]
- Bob → [ 0.50, −0.60, 0.00, 0.70]
- to → [ 0.05, 0.02, 0.01, 0.00]
- Chloe → [−0.10, 0.90, −0.20, 0.40]
- . → [ 0.00, 0.00, 0.00, 0.00]
So when the sentence arrives, the model replaces each token ID with its row from this table. That’s it. No magic yet. Just “swap ID for profile.”
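The “swap ID for profile” step can be written as a plain lookup. For readability this sketch keys the toy table by token string; a real model indexes a matrix by token ID (shown in the next section):

```python
# Toy embedding table with the hand-picked values from the example above.
# Real models learn these numbers during training.
E = {
    "Alice":      [ 0.20,  0.10, -0.40,  0.90],
    "introduced": [-0.30,  0.80,  0.10, -0.20],
    "Bob":        [ 0.50, -0.60,  0.00,  0.70],
    "to":         [ 0.05,  0.02,  0.01,  0.00],
    "Chloe":      [-0.10,  0.90, -0.20,  0.40],
    ".":          [ 0.00,  0.00,  0.00,  0.00],
}

# "Swap ID for profile": each token becomes its row from the table.
sentence = ["Alice", "introduced", "Bob", "to", "Chloe"]
embedded = [E[t] for t in sentence]
print(embedded[2])  # Bob's profile: [0.5, -0.6, 0.0, 0.7]
```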
“Wait… who created those numbers?”
Great question.
The model doesn’t download them from heaven.
Here’s the honest truth behind the “personality profile” analogy:
Nobody hand-assigns these values (no one sits and decides “dimension 1 = extroversion” 😄).
An embedding table is just a big matrix of learnable numbers. It usually starts with small random-ish values. During training, the model makes predictions, gets corrected, and those numbers are nudged so the model predicts better next time.
The simplest “how the numbers change” picture
Think of the embedding table as a spreadsheet E with one row per token:
E has shape: (vocab_size × d_model)
If the token ID is t, the embedding vector is simply the row:
eₜ = E[t]
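In matrix form, the lookup really is just row indexing. A minimal NumPy sketch, with the shapes and the small random initialization from the text (the vocab size, d_model, and scale here are illustrative):

```python
import numpy as np

# Embedding table E: one learnable row per token in the vocabulary.
vocab_size, d_model = 6, 4
rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(vocab_size, d_model))  # small random-ish start

t = 2            # say token ID 2 is "Bob" in this toy vocab
e_t = E[t]       # the embedding vector is simply row t of the table
print(E.shape)   # (6, 4)
print(e_t.shape) # (4,)
```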
When training produces an error, the model computes “who contributed to the error?” and updates parameters.
The embedding row gets updated like this:
E[t] ← E[t] − η · (gradient for that row)
where η is a small learning rate.
Tiny numeric peek (toy example)
Suppose Bob’s embedding is:
Bob → [0.50, −0.60, 0.00, 0.70]
And one training step computes a gradient (how Bob’s row should change) as:
grad → [0.02, −0.01, 0.05, −0.03]
With a small learning rate η = 0.1, the update becomes:
Bob(new) = [0.50, −0.60, 0.00, 0.70] − 0.1·[0.02, −0.01, 0.05, −0.03]
= [0.498, −0.599, −0.005, 0.703]
That’s it! Tiny nudges, repeated millions of times.
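The toy update above is small enough to reproduce in a couple of lines, a sketch of one bare gradient step (real training frameworks handle this for you, along with fancier optimizers than plain SGD):

```python
# One toy SGD step on Bob's embedding row: E[t] <- E[t] - eta * grad
bob  = [0.50, -0.60, 0.00, 0.70]
grad = [0.02, -0.01, 0.05, -0.03]
eta  = 0.1  # learning rate

bob_new = [round(w - eta * g, 3) for w, g in zip(bob, grad)]
print(bob_new)  # [0.498, -0.599, -0.005, 0.703]
```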
Bonus intuition: in each step, only the rows for tokens that actually appeared get updated, like adjusting only the profiles of the guests who actually came to the mixer.
So are embedding dimensions “traits”?
In real models, individual coordinates usually are not named traits. The “traits” language is a metaphor to help intuition; meaning is often distributed across many dimensions.
We will keep using the profile analogy because it’s friendly, but we will stay honest. Embeddings are learned knobs, not pre-labeled personality sliders.
At the beginning of training, embeddings are usually random-ish. During training, the model keeps adjusting them because it gets feedback (“You predicted wrong / right”).
Over time, tokens that behave similarly end up with somewhat similar embeddings, because that helps predictions.
So embeddings are like the model’s internal map of how words behave, learned from data.
Why token IDs aren’t enough
The model first converts text into token IDs because computers need numbers. But IDs do not carry meaning; they are only labels.
So we use embeddings, the dense vectors that represent each token with a learned profile.
This blog (Blog 1) is about getting guests inside the event and giving them profiles. We haven’t started matchmaking conversations, voting, or decision-making.
What’s still missing?
Even with embeddings, the model still doesn’t know one crucial thing: Order.
“Alice introduced Bob to Chloe” isn’t the same as “Bob introduced Chloe to Alice.”
If the model can’t track who came first, we can’t trust it with language.
So in Blog 2, we’ll add the equivalent of seat numbers at the mixer, i.e., positional information.
If Blog 1 made you feel like, “Oh… IDs are just name tags, embeddings are the profile,” then you’re already out of the blender.
In Blog 2, we give these guests a seating chart and that’s where the story starts becoming properly interesting.