Multiple Spotlights and Private Thinking (Multi-Head Attention + Feed-Forward Network)
In Blog 3, the mixer finally came alive. Guests looked around, decided who mattered, and blended what they learned into richer, context-aware selves. Beautiful.
But imagine a photographer trying to capture this event with a single spotlight.
Some faces would be well lit. Others would fall into shadow. And subtle details—who nodded, who leaned in, who quietly mattered—would be missed.
That’s exactly the limitation of a single attention mechanism.
So the Transformer does two very human things next:
- it turns on multiple spotlights at once (Multi-Head Attention), and
- it lets each guest step aside and think privately about what they just learned (Feed-Forward Network).
Why one attention view isn’t enough
In self-attention, each token blends information from the others using a single learned notion of relevance. But one notion of relevance isn’t enough, because language carries many kinds of relationships at the same time. At a networking event, people connect for different reasons:
- job roles,
- seniority,
- shared interests,
- proximity in the conversation,
- subtle grammatical glue words.
One spotlight can’t catch all of that.
So we add…
Multi-Head Attention — many spotlights, same room
Instead of running attention once, the Transformer runs it several times in parallel, each time with different learned projections.
Think of it this way:
- Head 1 watches who is doing what (roles).
- Head 2 watches which words travel together (phrases).
- Head 3 watches long-range connections (who is linked far away).
- Head 4 watches tiny glue words that quietly hold meaning together.
Same guests. Same room. Different spotlights.
How this works
Each head starts with the same input vectors from Blog 3, but applies its own Q/K/V transformations:
- Head h uses its own matrices: W_Qʰ, W_Kʰ, W_Vʰ
- This creates a different “view” of relevance
Each head computes attention independently.
Then:
- all head outputs are concatenated (placed side by side), and
- passed through one final linear layer to blend them back together.
No new magic—just parallel perspectives.
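The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real model: the weight matrices are randomly initialized stand-ins for learned projections, and the dimensions (5 tokens, d_model = 16, 4 heads) are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """X: (seq_len, d_model). Returns (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Each head gets its own W_Q, W_K, W_V projections
        # (random here; learned in a real model).
        W_Q = rng.standard_normal((d_model, d_head)) * 0.1
        W_K = rng.standard_normal((d_model, d_head)) * 0.1
        W_V = rng.standard_normal((d_model, d_head)) * 0.1
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(d_head)   # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)   # each row sums to 1
        head_outputs.append(weights @ V)     # this head's "view"
    # Concatenate all heads side by side, then blend them
    # back together with one final linear layer.
    concat = np.concatenate(head_outputs, axis=-1)  # (seq_len, d_model)
    W_O = rng.standard_normal((d_model, d_model)) * 0.1
    return concat @ W_O

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))  # 5 tokens, d_model = 16
out = multi_head_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 16): same shape in, same shape out
```

Note that each head works in a smaller subspace (d_model / num_heads), so running several heads costs about the same as one full-width head.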
If a single attention head is like listening to one expert, then multi-head attention is like hosting a panel discussion.
Each panelist notices different things.
Together, they produce a richer answer.
Why attention alone still isn’t enough
At this point, every guest has:
- listened to others,
- blended information,
- and updated their understanding.
But listening is not the same as thinking.
After a good conversation, humans often step aside and process:
“Okay… what does all this mean for me?”
That’s the role of the Feed-Forward Network.
Feed-Forward Network (FFN) — the private thinking step
After attention, each token goes through a small neural network independently of all others.
Important detail: It’s the same network for every token, but applied separately.
Think of it as:
- everyone having the same style of brain,
- but each one thinking about their own notes.
What the FFN looks like
It’s very simple:
- a linear layer (expand the features),
- a non-linear activation (ReLU),
- another linear layer (compress back).
ReLU here plays a clear role:
- it keeps useful signals,
- drops unhelpful ones,
- and allows complex patterns to form.
No probabilities yet. No voting. Just refinement.
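As a sketch, the FFN is just two matrix multiplies with a ReLU in between, applied row by row. The expansion factor of 4 below is a common convention but an assumption for this example, as are the random weights.

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Applied to every token (row of X) independently:
    expand, ReLU, compress back."""
    hidden = np.maximum(0, X @ W1 + b1)  # linear expand + ReLU
    return hidden @ W2 + b2              # linear compress to d_model

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                   # d_ff = 4 * d_model (assumption)
W1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

X = rng.standard_normal((5, d_model))    # 5 tokens after attention
out = feed_forward(X, W1, b1, W2, b2)
print(out.shape)  # (5, 16)
```

The same W1, b1, W2, b2 are shared by every token, but each row of X is transformed without looking at the others: same brain, private notes.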
To summarize:
- Self-attention mixes information across tokens.
- Limitation: One view misses patterns; mixing alone doesn’t transform deeply.
- Multi-head attention captures multiple relationship types in parallel.
- Feed-forward networks let each token privately transform what it learned.
Together, they turn raw interaction into understanding.
What’s still missing (and why Blog 5 exists)
We now have guests who:
- know who they are (embeddings),
- know where they stand (positional information),
- know who matters to them (self-attention),
- see the room through many lenses (multi-head),
- and think privately (FFN).
But one crucial question remains:
“How does the model produce actual words, one step at a time, without cheating?”
That’s where masking, probabilities, and choosing the next word come in.
In Blog 5, we close the loop—from understanding to generation.
Finish your coffee. This is the finale.