No Cheating, Making Choices, and Saying the Next Word (Masking + Output Probabilities)
This is the finale of our networking event.
By now, every guest:
- has a profile (embeddings),
- knows where they stand (positional information),
- decides who matters (self-attention),
- sees the room through multiple spotlights (multi-head attention), and
- steps aside to think privately (feed-forward network).
The room is alive with understanding.
But understanding alone is not enough.
A language model must do one very specific thing:
Say the next word.
And it must do that honestly, without peeking into the future.
Why the model must not cheat
Imagine the organizer hands out tomorrow’s conversation transcript at the door.
Suddenly, everyone looks brilliant.
But none of it is real understanding.
In language generation, the model must predict the next word using only what has already been said. If it could see future words, it would learn shortcuts instead of language.
So we add a strict rule.
Masking — the velvet rope of the mixer
Masking is the velvet rope that says:
“You may look left. You may not look right.”
When the model is generating text step-by-step, each token is allowed to attend only to:
- itself, and
- the tokens that came before it.
All future positions are masked out.
In practical terms, this means:
- attention scores for future tokens are set to a very large negative number,
- softmax turns them into zero probability,
- so the model literally cannot pay attention to information it hasn’t earned yet.
No cheating. No spoilers.
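The velvet rope can be sketched in a few lines of NumPy. This is a toy illustration of the mechanism described above, not a real model's attention: future positions get a very large negative score, and softmax turns them into exactly zero attention.

```python
import numpy as np

def masked_softmax(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to a square attention-score matrix, then softmax each row."""
    seq_len = scores.shape[0]
    # Upper triangle (above the diagonal) = future positions: block them.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)  # a very large negative number
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Token 0 may only see itself; token 2 sees tokens 0, 1, and itself.
weights = masked_softmax(np.zeros((3, 3)))
print(weights.round(2))
```

Row 0 of the result puts all its attention on itself, and every entry above the diagonal is zero: no attention flows to the future.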
From understanding to choice
After all the attention, mixing, and private thinking, each token now holds a final vector that represents its understanding of the situation.
But vectors are not words.
So we need one final translation step.
Logits: Scoring every possible word
The model takes the final vector and passes it through a linear layer.
Think of this as asking:
“Given everything I understand right now, how compatible is each word in my vocabulary as the next word?”
The output is a list of raw scores called logits — one score per vocabulary word.
Important: Logits are not probabilities yet. They can be positive, negative, large, or small.
They are simply preferences.
Softmax: Turning preferences into probabilities
Now softmax returns for the final time.
Softmax takes those logits and turns them into a clean probability distribution:
- every value is between 0 and 1,
- all probabilities add up to 1.
This answers the real question:
“What are the chances of each possible next word?”
Softmax doesn’t decide the word. It creates options with confidence.
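Here is softmax itself, as a small NumPy sketch with made-up logits. Subtracting the maximum before exponentiating is a standard numerical-stability trick; it does not change the result.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability; mathematically the output is identical.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

logits = np.array([2.0, 1.0, -1.0, 0.5])  # raw preferences, any sign, any scale
probs = softmax(logits)

print(probs)        # every value between 0 and 1
print(probs.sum())  # all probabilities add up to 1
```

Notice that the ordering of the logits is preserved: the biggest logit becomes the biggest probability. Softmax only rescales preferences into chances.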
Choosing the next word
Once we have probabilities, the model must actually choose.
There are several ways to do this:
- Greedy: always pick the highest-probability word.
- Sampling: pick randomly, but weighted by probabilities.
- Top-k / Top-p: restrict choices to the k most likely words (top-k) or to the smallest set of words whose combined probability exceeds p (top-p), then sample from that shortlist.
Each choice style changes the personality of the generated text — cautious, creative, or bold.
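The three styles can be compared on a toy distribution. The vocabulary and probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["cat", "dog", "sat", "ran", "the"])  # toy vocabulary
probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

# Greedy: always pick the highest-probability word.
greedy = vocab[int(np.argmax(probs))]

# Sampling: pick randomly, weighted by the probabilities.
sampled = rng.choice(vocab, p=probs)

# Top-k (k=2): keep only the two best options, renormalize, then sample.
k = 2
top_idx = np.argsort(probs)[-k:]
top_probs = probs[top_idx] / probs[top_idx].sum()
top_k_choice = rng.choice(vocab[top_idx], p=top_probs)

print(greedy)  # "cat", every single time
print(sampled, top_k_choice)
```

Greedy is deterministic and cautious; plain sampling can occasionally pick a long-shot word; top-k stays adventurous but never strays outside the shortlist.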
Once the word is chosen:
- it gets appended to the sentence,
- the entire process repeats.
That’s how language flows forward, one word at a time.
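The append-and-repeat loop looks like this in outline. The `next_word_probs` function here is a random stand-in for the full Transformer stack described above; only the shape of the loop is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]  # toy vocabulary

def next_word_probs(context):
    """Stand-in for a real model: returns a distribution over the vocabulary.
    A real Transformer would compute this from embeddings, attention, the FFN,
    logits, and softmax, using only the tokens in `context`."""
    logits = rng.normal(size=len(vocab))
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

sentence = ["the"]
for _ in range(4):
    probs = next_word_probs(sentence)
    word = vocab[int(np.argmax(probs))]  # greedy choice, for simplicity
    sentence.append(word)                # append the word...
                                         # ...and the entire process repeats

print(" ".join(sentence))
```

One word chosen, one word appended, and the loop runs again with a slightly longer context: that is the forward flow of language generation.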
The full story (end-to-end)
- Tokens get profiles (embeddings).
- Profiles get seat numbers (positional information).
- Tokens consult each other (self-attention).
- Multiple perspectives reveal richer patterns (multi-head attention).
- Each token thinks privately (FFN).
- Future information is blocked (masking).
- Understanding becomes scores (logits).
- Scores become probabilities (softmax).
- One word is chosen.
And then… the story continues.
A quiet ending
So the next time someone says:
“Transformers are just attention,”
you’ll know the truth.
They’re not just paying attention.
They’re hosting a well-run networking event — with name tags, seating, conversations, multiple perspectives, private reflection, strict rules against cheating, and finally… a thoughtful choice of what to say next.
Up next is an appendix that takes one clean worked example, walks through the transformations step by step with tiny numerical values only where they help, and closes with a mnemonic to imprint the Transformer forever.