In Part 1 and Part 2, our little exam predictor learned how to adjust its knobs and walk downhill on the loss landscape. It became good at improving itself, but it still thought in straight lines.
Real students, however, do not behave like straight lines. Too much study can hurt, too little sleep can destroy everything, and sometimes a weird mix of both still works out.
In this part, we give our neuron four different moods (sigmoid, tanh, ReLU, and leaky ReLU), using the same exam story with small twists. Each new activation appears because the previous one leaves a gap.
By the end, you will see activation functions not as mysterious formulas, but as four simple ways of answering one question:
“Given this raw score, how should the neuron react?”
Quick reminder of the situation:
Without any activation function, our neuron does this:
(study_hours × weight_study) + (sleep_hours × weight_sleep) + bias
and directly uses that as the output.
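As a minimal sketch, this bare neuron is just a one-line function (the weights and bias below are made-up numbers, chosen only so the output looks exam-like):

```python
def raw_score(study_hours, sleep_hours, w_study, w_sleep, bias):
    """Weighted sum with no activation: a plain linear neuron."""
    return study_hours * w_study + sleep_hours * w_sleep + bias

# Hypothetical weights for illustration only.
print(raw_score(5, 7, 8.0, 3.0, 10.0))  # 71.0, a straight-line guess
```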
That is like drawing one straight line (or flat surface).
Life, sadly, is not that simple.
We’ll now see four activation functions as four ways of reacting to that raw score,
all explained using the same exam story, with tiny changes in what we’re trying to predict.
No Activation: “Just give me a straight guess”
First, imagine we do no activation at all.
We use the weighted sum directly as predicted_score:
- 5 hours study, 7 hours sleep → model says 72
- 1 hour study, 3 hours sleep → model says 38
- 8 hours study, 2 hours sleep → model says 85
This is fine if:
- The relationship between study, sleep, and score is roughly a tilted plane (a straight-ish trend).
Problem:
Real exam behaviour can curve:
- A student who studies 10 hours with only 2 hours of sleep might not do better than one who studies 7 hours and sleeps 7.
- Too much of a good thing can backfire.
A simple straight line cannot bend to capture such “helps at first, hurts later” patterns.
So we ask:
“What if we’re not trying to predict exact marks first,
but just a simpler thing like: will this student pass or fail?”
That’s where sigmoid shines.
Sigmoid: “What’s the chance this student will pass?”
Now imagine we change the task:
Instead of predicting the exact exam score, we just want:
“How likely is this student to pass?”
We want a value between 0 and 1, like:
- 0.10 indicates “10% chance of passing”
- 0.90 indicates “90% chance of passing”
The sigmoid activation does exactly that:
- If the neuron’s raw score is very low (bad combo of study and sleep),
sigmoid pushes the output close to 0.
- If the raw score is very high,
sigmoid pushes the output close to 1.
- Middle values get gently mapped between 0 and 1.
In our exam story:
- Student A: barely studied, barely slept → model’s raw score is very low → sigmoid output ≈ 0.05 → “Almost surely fail”.
- Student B: studied reasonably, slept decently → raw score moderate → sigmoid output ≈ 0.6 → “60% chance of passing”.
- Student C: worked hard and slept sensibly → raw score high → sigmoid output ≈ 0.95 → “Very likely to pass”.
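A quick sketch of sigmoid in Python; the raw scores below are made-up inputs chosen so the outputs roughly match the three students in the story:

```python
import math

def sigmoid(x):
    """Squash any raw score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

# Very negative raw scores land near 0, very positive ones near 1.
print(sigmoid(-3.0))  # ≈ 0.047 → "almost surely fail"
print(sigmoid(0.4))   # ≈ 0.60  → "60% chance of passing"
print(sigmoid(3.0))   # ≈ 0.953 → "very likely to pass"
```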
Where this is useful:
- As the final layer when the model answers a yes/no question (pass vs fail, spam vs not spam, etc.).
New problem:
Sigmoid likes to flatten out near 0 and near 1, and flat regions carry almost no slope to learn from.
When that happens deep inside a network, earlier layers receive weaker learning signals (this is the well-known vanishing-gradient problem).
So we ask:
“Can we get something similar, but better balanced around zero?”
That brings us to tanh.
Tanh: “How strongly good or bad is this pattern?”
Now imagine we are not predicting pass vs fail,
but trying to represent how strongly helpful or harmful a pattern is.
For example:
- “This combination looks strongly harmful for performance.”
- “This combination looks strongly helpful.”
- “This combination is kind of neutral.”
Tanh activation does this nicely:
- It squashes outputs between -1 and 1.
- Negative values can mean “bad influence”.
- Positive values can mean “good influence”.
- Values near zero mean “almost neutral”.
In our exam story:
- Student with 0 hours study and 3 hours sleep → tanh output might be -0.8 → “Strongly bad pattern”.
- Student with 5 hours study and 7 hours sleep → tanh output might be +0.7 → “Strongly good pattern”.
- Student with odd but not extreme pattern → tanh output near 0 → “Not clearly good or bad”.
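Python’s standard library already provides tanh; the inputs below are made-up raw scores picked so the outputs land near the story’s -0.8 and +0.7:

```python
import math

# math.tanh squashes raw scores into (-1, 1), symmetric around zero.
print(math.tanh(-1.1))  # ≈ -0.80 → strongly bad pattern
print(math.tanh(0.87))  # ≈ +0.70 → strongly good pattern
print(math.tanh(0.05))  # ≈  0.05 → roughly neutral
```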
Where this is useful:
- Often used in hidden layers in older or classical designs, because having positive and negative values balanced around zero can be convenient.
New problem:
Tanh, like sigmoid, still flattens out for large positive and large negative inputs.
When it flattens, learning slows down.
So we ask:
“Is there a simpler function that does not flatten so much,
and works really well in deep networks?”
Enter ReLU.
ReLU: “I speak only when it matters”
Now we switch back to our original task: predicting the exam score.
Inside the network, we have hidden neurons.
Each one is trying to detect some pattern, like:
- “Good balance of study and sleep.”
- “Too much study with too little sleep.”
We want these neurons to be:
- Quiet when their pattern is not present.
- Active when their pattern is present.
This is what ReLU does:
- If the raw result is negative, then output 0 (silent).
- If the raw result is positive, then output it as-is (speaks up).
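Those two rules fit in one line of Python:

```python
def relu(x):
    """Silent (0) for negative raw scores, pass-through for positive ones."""
    return max(0.0, x)

print(relu(-2.5))  # 0.0 → "nothing to add here"
print(relu(3.7))   # 3.7 → "this is my kind of pattern"
```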
In our exam story:
- For some students, a certain neuron might be negative. Then that neuron says, “Nothing to add here.”
- For others, it might be positive and large. Then that neuron says, “This is my kind of pattern; I will speak loudly.”
Where this is useful:
- ReLU is widely used in hidden layers of modern networks.
- It is simple, fast, and helps deep networks learn complex patterns.
New problem: the “Dying ReLU”
If a neuron’s raw output is negative for almost all students,
ReLU will keep outputting zero again and again.
No output → no adjustment → that neuron can effectively “die” and stay useless.
So we ask:
“Can we keep ReLU’s simplicity,
but avoid completely shutting a neuron down on the negative side?”
That’s where Leaky ReLU comes in.
Leaky ReLU: “Even your negative moods count a little”
Leaky ReLU is a gentle upgrade to ReLU.
It behaves almost the same, with one small twist:
- If the raw result is positive → same as ReLU: output it as-is.
- If the raw result is negative → instead of giving 0, it gives a small negative value.
So the neuron is never completely silent.
Even for patterns it “does not like”, it still whispers.
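In code, the only change from ReLU is what happens on the negative side. The leak factor `alpha` below is set to 0.01, a commonly used default, but it is a tunable choice:

```python
def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negatives leak through, scaled by a small alpha."""
    return x if x > 0 else alpha * x

print(leaky_relu(3.7))   # 3.7    → same as ReLU on the positive side
print(leaky_relu(-2.5))  # -0.025 → a whisper instead of silence
```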
In our exam story:
- For a student with a weird combination (say, absurdly long study with almost no sleep),
a certain neuron might think, “I really dislike this pattern,” and produce a negative value.
- ReLU would say: “Fine, I’ll output 0. I’m out.”
- Leaky ReLU says: “I will output a small negative number. I still want to be counted a little.”
Why this matters:
- Because the neuron still outputs something (even tiny),
it still receives a learning signal and its weights can change.
- This reduces the chances of neurons becoming permanently useless.
Where this is useful:
- Hidden layers, when you like ReLU’s behaviour but want to reduce the “dead neuron” problem.
New problem:
We’re still only judging one thing at a time
Notice something:
All our activation functions so far (sigmoid, tanh, ReLU, leaky ReLU) describe how one neuron reacts to one raw score.
That’s perfect when:
- we want a single numeric prediction (exam score), or
- a yes/no probability (pass vs fail).
But our exam world is often more dramatic than that.
Sometimes we do not want to ask:
“Will this student pass?”
We want to ask:
“Is this student an A, B, C, D, or F?”
Now it is not a single outcome or a simple yes/no.
It is one choice out of many, and we want proper probabilities for each grade.
Sigmoid cannot do that (it only handles yes/no).
ReLU and leaky ReLU do not give probabilities at all, just raw numbers.
So we ask:
“How do we turn a bunch of raw scores
into a clean set of probabilities that add up to 1
and highlight the most likely option?”
That’s where softmax comes in.
Softmax: “Choosing one grade out of many”
Softmax is what we use when the model has to pick one outcome out of many and we want proper probabilities for each option.
Instead of giving one single number, it takes a bunch of raw scores and turns them into a neat “who is how likely?” list.
It behaves like this:
If the model has several raw results (one for each option), then softmax converts them into probabilities between 0 and 1 that all add up to 1.
Options with higher raw results get higher probabilities,
but every option gets some share unless it is extremely unlikely.
So the model is not just shouting one answer.
It is saying, “Here is how confident I am about each choice.”
In our exam story:
Imagine we want to predict the grade category instead of the exact score.
The model first produces five raw numbers:
- one for A
- one for B
- one for C
- one for D
- one for F
By themselves, these five numbers are just messy scores.
Softmax takes them and turns them into something like:
- A: 0.08
- B: 0.55
- C: 0.30
- D: 0.05
- F: 0.02
Now the model is effectively saying:
“I think B is most likely,
C is also quite possible,
A and F are very unlikely for this student.”
We then usually pick the grade with the highest probability (here, B) as the final prediction.
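A small sketch of softmax in Python; the five raw scores are made-up numbers chosen so the result lands close to the grade probabilities above:

```python
import math

def softmax(raw_scores):
    """Turn a list of raw scores into probabilities that sum to 1.

    Subtracting the maximum first is a standard numerical-stability
    trick; it does not change the result.
    """
    m = max(raw_scores)
    exps = [math.exp(s - m) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for grades A, B, C, D, F.
probs = softmax([1.0, 2.9, 2.3, 0.5, -0.4])
for grade, p in zip("ABCDF", probs):
    print(grade, round(p, 2))  # B gets the highest share
```

Note that the largest raw score (B’s) wins the biggest probability, but every grade still gets some share.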
Why this matters:
Softmax gives us a clean probability distribution over many options.
It lets the model:
- Compare all choices at once
- Express confidence levels, not just a single raw score
- Make multi-class decisions in a human-friendly way (“most likely grade”, “most likely label”)
Where this is useful:
Softmax is typically used in the final layer when the task has more than two classes:
- Grade prediction (A/B/C/D/F)
- Digit recognition (0–9)
- Image label (cat / dog / car / tree / …)
- Any “pick one out of many” situation
Inside the network, hidden layers still use activations like ReLU or leaky ReLU.
Softmax comes at the end as the “Decision stage” that turns raw scores into clear probabilities.
Putting it all together (tiny mental map)
In our one exam story, the activation behaviours we have met are:

- No activation
  - Use: simplest case, often for the final numeric output (raw score).
  - Behaviour: straight-line relationships only.
- Sigmoid
  - Use: output layer when we want a probability like “chance of passing”.
  - Behaviour: maps values to the range 0 to 1.
- Tanh
  - Use: hidden layers in some designs, when we want “good vs bad” strength from -1 to 1.
  - Behaviour: symmetric around zero, still smooth.
- ReLU
  - Use: very common in hidden layers for deep networks.
  - Behaviour: silent for negative inputs, linear for positive ones.
- Leaky ReLU
  - Use: hidden layers when we want ReLU-like behaviour but fewer “dead” neurons.
  - Behaviour: small negative output instead of absolute zero for negative inputs.
- Softmax
  - Use: final layer when we have more than two classes and want a probability for each option (e.g., grades A/B/C/D/F).
  - Behaviour: takes several raw scores and turns them into probabilities between 0 and 1 that add up to 1, giving higher probability to the largest scores.
At this point, our little exam predictor is no longer a boring straight-line machine.
It has moods.
- With no activation, it thinks in plain straight trends.
- With sigmoid, it speaks in probabilities like “chance of passing”.
- With tanh, it mutters “strongly bad” to “strongly good” on a scale from -1 to 1.
- With ReLU, it stays silent for patterns it doesn’t care about and wakes up sharply when things matter.
- With leaky ReLU, it refuses to completely shut down, even when it dislikes what it sees.
- With softmax, the network doesn’t just blurt one answer—it calmly ranks all options as probabilities and then backs the most likely one.
Same exam story.
Different personalities for how a neuron reacts to a raw score.
Here are one-line, visual mnemonics that tie the name to what it does:
1. No activation
“Naked neuron = naked line: no curves, just a straight trend.”
2. Sigmoid
“Sigmoid is the S-shaped switch that squeezes everything into a 0–1 probability.”
3. Tanh
“Tanh is the two-sided thermometer, sliding smoothly from −1 (cold bad) to +1 (warm good).”
4. ReLU
“ReLU, the Rectifier, Ruthlessly chops everything below zero and Lets the positives Up.”
5. Leaky ReLU
“Leaky ReLU is the cracked tap: still ReLU, but it lets a tiny trickle of negatives leak through.”
6. Softmax
“Softmax is the soft spotlight that brightens the biggest score but still leaves a dim light on all others as probabilities.”
We have now given the network two superpowers:
- “It knows how to adjust its knobs and walk downhill on the loss landscape.”
- “It knows how to shape its internal reactions using activation functions, so it can bend and twist around real-world patterns instead of pretending life is a straight line.”
In the next part, we step out of the neuron’s head and into the real world.
We’ll ask uncomfortable questions like:
- “What if the network becomes that student who memorises old question papers but flops in the actual exam?”
- “How do we know if the model has actually understood patterns, not just memorised our training data?”
That’s where training vs validation sets, underfitting, overfitting, and regularisation tricks come in.
In Part 4, we’ll turn our exam predictor into a slightly more mature adult: one that not only learns, but also generalises to new students it has never seen before.