Seat Numbers at the Mixer (Positional Information)
In Blog 1, we got our guests into the room and gave them name tags (token IDs) plus mini personality profiles (embeddings). Everyone is officially “representable in numbers.”
Nice.
But there’s one awkward rule in the Transformer’s party hall:
It has no built-in sense of who came first.
And if you’ve ever attended a mixer where people keep introducing themselves out of order, you already know what happens next:
chaos, confusion, and someone accidentally congratulating you for a job you never had.
Let’s fix that.
The problem: Embeddings don’t know order
Here’s the same idea in one smooth breath.
Each token gets converted into an embedding vector, a dense little “profile card” such as:
Alice → [0.20, 0.10, −0.40, 0.90]
which is great for capturing what the token is, but says nothing about where the token appears in the sentence: embeddings by themselves carry no notion of position.
So if the model sees these two sentences:
- “Alice introduced Bob to Chloe.”
- “Chloe introduced Bob to Alice.”
…it might treat them as the same set of guests—just shuffled.
And language is not a bag of guests. It’s a line (or a sequence) where order changes meaning.
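To make the “bag of guests” problem concrete, here’s a toy check with made-up one-number embeddings (the values are invented purely for illustration):

```python
# Toy 1-dim "embeddings" -- values invented purely for illustration.
emb = {"Alice": 0.2, "introduced": 0.5, "Bob": 0.1, "to": 0.3, "Chloe": 0.7}

s1 = [emb[w] for w in "Alice introduced Bob to Chloe".split()]
s2 = [emb[w] for w in "Chloe introduced Bob to Alice".split()]

print(sorted(s1) == sorted(s2))  # True: the same bag of guests...
print(s1 == s2)                  # False: ...but not the same line-up
```

Without positions, the two sentences really are the same multiset of vectors; only the order distinguishes them.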
The fix: Give every guest a seat number
So we add positional information.
In our networking-event world, positional information is simply:
- “You’re standing at slot 0.”
- “You’re standing at slot 1.”
- “You’re standing at slot 2.”
For our running sentence:
Position 0: Alice
Position 1: introduced
Position 2: Bob
Position 3: to
Position 4: Chloe
Now the model has a second ingredient per token: a position vector.
Think of it as a tiny stamp on the name tag:
“Hi, I’m Bob — and I’m the 3rd person in line.”
Why do we add position to the embedding?
This is one of those deceptively simple design choices that’s secretly genius.
We want each token’s representation to carry two facts at once:
- Who am I? (embedding = profile)
- Where am I? (position = seat number)
If we “stacked” them by concatenating (making the vector longer), the next layers would have to deal with a larger dimension and extra bookkeeping.
So instead, we keep the vector size the same and blend the two facts using addition:
xᵢ = eᵢ + pᵢ
Where:
- eᵢ = embedding of the token at position i
- pᵢ = positional encoding (a position vector) for position i
- xᵢ = the combined representation the Transformer will actually use
Addition here is like putting a transparent seat-number overlay on top of the profile card.
Tiny numeric walkthrough (so you can literally see the “seat overlay”)
Let’s stay in our tiny world where vectors have 4 numbers.
Suppose Alice’s embedding (profile) from Blog 1 was:
e₀ = [0.20, 0.10, −0.40, 0.90]
Now give position 0 a small position vector:
p₀ = [0.01, 0.00, 0.02, 0.00]
Now combine them:
x₀ = e₀ + p₀
= [0.20, 0.10, −0.40, 0.90] + [0.01, 0.00, 0.02, 0.00]
= [0.21, 0.10, −0.38, 0.90]
Do this for every token position, and suddenly the model is no longer looking at “a pile of profiles.”
It’s looking at profiles-in-order.
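If you want to run the overlay yourself, here is the same arithmetic in NumPy, using the toy 4-dimensional vectors from the walkthrough:

```python
import numpy as np

e0 = np.array([0.20, 0.10, -0.40, 0.90])  # Alice's profile (embedding)
p0 = np.array([0.01, 0.00, 0.02, 0.00])   # seat overlay for position 0

x0 = e0 + p0  # combined representation: x0 == [0.21, 0.10, -0.38, 0.90]
```

Same vector size in, same vector size out, which is exactly why addition is the convenient choice here.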
Where do positional vectors come from?
There are a few common ways Transformers inject positional information. In our networking-event world, these are three different tricks the organizer can use to help everyone sense where they are standing in line.
The sinusoidal method is like the entire floor humming with several overlapping rhythms — slow waves, medium waves, fast waves — all pulsing together. As you walk across the room, the combination of those waves under your feet changes in a unique way, so even if one wave repeats, the mix of all the rhythms tells you exactly where you are and how close you are to others.
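A minimal NumPy sketch of that multi-wave hum, following the sine/cosine recipe from the original Transformer paper (toy sizes here; a real `d_model` would be in the hundreds):

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Each position gets a mix of sine and cosine waves at different
    # frequencies; even dims get sin, odd dims get cos.
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(5, 4)  # our 5-token sentence, toy 4-dim vectors
```

Notice that position 0 always comes out as `[0, 1, 0, 1, …]`, and no two positions share the same mix of waves.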
The learned positional embedding approach is simpler. The organizer hands out seat-stickers, one unique sticker-vector for each position, learned during training: practical and direct, but it only covers seats that existed during training (a fixed maximum sequence length).
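The sticker book is literally just a lookup table. Here’s a sketch in NumPy with an assumed maximum of 512 seats; in a real model this table would be a trainable layer (e.g. `nn.Embedding` in PyTorch) updated by gradient descent, whereas here it’s only randomly initialized:

```python
import numpy as np

rng = np.random.default_rng(0)
max_seq_len, d_model = 512, 4  # assumed sizes for illustration

# One row ("sticker-vector") per seat. Randomly initialized here;
# training would tune these rows like any other weights.
position_table = rng.normal(scale=0.02, size=(max_seq_len, d_model))

def positional_vectors(seq_len):
    # A plain row lookup -- this is all an embedding layer does.
    return position_table[:seq_len]

p = positional_vectors(5)  # stickers for our 5-token sentence
```

The catch is visible in the code: ask for position 512 or beyond and the lookup has nothing to hand out.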
And then there’s Rotary Positional Embedding (RoPE), which doesn’t use rhythms or stickers at all; instead, it subtly adjusts how two guests interact based on how far apart they stand. It’s like your handshake naturally changes with distance, so the model understands order through the interaction itself, not through a tag.
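A rough NumPy sketch of the rotary idea (heavily simplified; real implementations apply this to queries and keys inside attention, which we haven’t met yet). Pairs of features are rotated by an angle that grows with position, so dot products between two rotated vectors depend only on how far apart they stand:

```python
import numpy as np

def rope_rotate(x, position, base=10000):
    # Rotate consecutive feature pairs by position-dependent angles.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per pair
    theta = position * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin            # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

v = np.array([0.20, 0.10, -0.40, 0.90])

# The distance-aware handshake: both pairs below are 2 seats apart,
# so their interaction (dot product) comes out identical.
s1 = rope_rotate(v, 3) @ rope_rotate(v, 1)
s2 = rope_rotate(v, 7) @ rope_rotate(v, 5)
```

No sticker is ever added to the vector itself; the position lives entirely in how two vectors interact.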
In short:
- sinusoidal is the room’s multi-wave hum,
- learned embeddings are the organizer’s sticker book,
- and RoPE is the distance-aware handshake.
Blog 2 is about giving each token an ordered identity. We haven’t started conversations.
No voting. No probability distributions. No “Who should I listen to?” yet.
What’s still missing (and why Blog 3 exists)
Right now, every guest has:
- a profile (embedding)
- a seat number overlay (positional information)
…but they still haven’t actually talked.
We haven’t explained how “introduced” learns that it should care about “Alice” and “Bob,” or why “to” is glued to “Chloe.”
So in Blog 3, the real mixer begins:
Self-Attention, where each guest decides who matters to them.
And yes, this is where the fancy characters walk in wearing suits:
- Q (What am I looking for?)
- K (What do I advertise?)
- V (What do I actually share?)
Bring your curiosity. I’ll bring the snacks.