In Part 1 we built our tiny exam predictor:
- Inputs: study_hours, sleep_hours
- Knobs: weight_study, weight_sleep
- Starting bonus: bias
- Output: predicted_score
- Truth: actual_score
- “How wrong are we?” meter: loss
And we ended with this very important headache:
“We know we must change the weights and bias to make the loss smaller.
But which way should we change them, and by how much?”
Today we answer exactly that.
From Loss to Landscape: Minima, Maxima, and the Hill You’re Standing On
Our model has 3 knobs:
- weight_study
- weight_sleep
- bias
For every possible combination of these 3 numbers, the model gives some predictions and therefore some loss.
You can picture this as a landscape:
- Each choice of weights and bias = one point in space.
- The loss at that point = height of the ground there.
So:
- High loss means that you’re standing on a tall hill (your predictions are bad).
- Low loss means that you’re down in a valley (your predictions are good).
In this landscape:
- A minimum is a low point (good).
- A maximum is a high point (bad).
- A local minimum is a small valley that is the lowest point in its nearby region (but maybe not overall).
- A global minimum is the deepest valley of all.
What training wants to do is simple:
Start somewhere on this bumpy landscape
and walk down into a valley where the loss is low.
That’s it.
This is why minima matter: we want to land in a low one.
But we have two problems:
- We cannot see the whole landscape at once.
- Randomly jumping around is silly and slow.
So we need a local guide that tells us:
“From where you stand right now,
which way is downhill for the loss?”
That’s where slope and gradient come in.
Slope vs Gradient vs Gradient Descent vs Learning Rate
Let’s take these one at a time, using our exam story.
Slope: “How tilted is the ground in one direction?”
Imagine just one knob: say only weight_study can change and everything else is frozen.
If you nudge weight_study a little to the right:
- Does the loss go up or down?
- How fast does it change?
That “up or down, and how fast” is the slope of the loss with respect to that knob.
In 1D:
- Positive slope = if you increase the knob, loss increases (you’re walking uphill).
- Negative slope = if you increase the knob, loss decreases (you’re walking downhill).
- Zero slope = you’re on a flat spot (could be a minimum, maximum, or saddle).
We want to go downhill (towards a minimum),
so we should move against the slope (opposite direction of uphill).
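Here is a tiny sketch of that idea in code. All the numbers (one student, frozen weight_sleep and bias) are invented for illustration; we estimate the slope numerically by nudging the knob and watching the loss:

```python
# Toy example: one student, one movable knob, everything else frozen.
# All numbers here are invented for illustration.
def loss(weight_study):
    study_hours, sleep_hours, actual_score = 5.0, 8.0, 70.0
    weight_sleep, bias = 1.0, 10.0  # frozen knobs
    predicted_score = weight_study * study_hours + weight_sleep * sleep_hours + bias
    return (predicted_score - actual_score) ** 2  # squared error

def slope(weight_study, nudge=1e-5):
    # Nudge the knob slightly and see how fast the loss changes.
    return (loss(weight_study + nudge) - loss(weight_study)) / nudge

print(slope(2.0))   # negative slope: increasing weight_study is downhill here
print(slope(20.0))  # positive slope: increasing weight_study is uphill here
```

Notice the sign flips depending on where you stand: the slope is a local report, not a global map.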
Gradient: “All the slopes at once”
In reality, we have multiple knobs:
- weight_study
- weight_sleep
- bias
- (and in a big network, thousands more)
For each knob, the loss has its own slope:
- “If I change just weight_study, what happens to the loss?”
- “If I change just weight_sleep, what happens?”
- “If I change just bias, what happens?”
The gradient is simply:
“All these slopes together, at the same time.”
You can think of it as:
- A little arrow in “knob space” that points in the direction where loss increases the fastest.
- Its opposite direction is where loss decreases the fastest.
So:
- Slope: one direction, one knob.
- Gradient: all slopes for all knobs = one combined “uphill” direction.
Now that we know the direction of steepest uphill,
we can walk in the opposite direction to go downhill.
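As a sketch (one invented student again, same squared-error loss), the gradient is just one slope per knob, collected together:

```python
# Toy example: the gradient of a squared-error loss for our 3-knob model.
# One invented student: 5 study hours, 8 sleep hours, actual score 70.
def gradient(weight_study, weight_sleep, bias):
    study_hours, sleep_hours, actual_score = 5.0, 8.0, 70.0
    predicted = weight_study * study_hours + weight_sleep * sleep_hours + bias
    error = predicted - actual_score
    # One slope per knob (from the calculus of the squared error):
    return (2 * error * study_hours,   # slope w.r.t. weight_study
            2 * error * sleep_hours,   # slope w.r.t. weight_sleep
            2 * error)                 # slope w.r.t. bias

print(gradient(2.0, 1.0, 10.0))  # all negative: turning every knob up is downhill
```

With these knob settings the prediction (28) is far below the actual score (70), so every slope comes out negative: increasing any knob would reduce the loss.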
That downhill walking has a name: Gradient descent.
Gradient Descent: “The rule for taking steps downhill”
Gradient descent is just a method, a rule:
“Use the gradient (the uphill direction),
step a little in the opposite direction
so the loss becomes smaller.”
Each training step goes like this:
- Use current weights and bias to predict exam scores.
- Compute the loss.
- Compute the gradient of that loss with respect to all weights and bias.
- Move each knob in the direction that reduces the loss.
You repeat this again and again:
- You measure how wrong you are.
- You adjust yourself using the gradient.
- You gradually slide toward a minimum of the loss.
But now a new question appears:
“When I step in the direction opposite to the gradient,
how big should that step be?”
If you step too far, you overshoot the valley.
If you step too little, you progress like a sleepy snail.
This is where learning rate comes in.
Learning Rate: “How big are your steps?”
The learning rate is simply:
“The size of each step you take during gradient descent.”
It multiplies the gradient to decide how far you move.
- Too large a learning rate: you keep jumping over the minimum, and the loss may bounce around or explode.
- Too small a learning rate: you move in the right direction, but painfully slowly. Your model is technically learning, but practically, you grow old watching it.
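You can watch both failure modes on a one-knob toy loss, (w − 3)², whose minimum sits at w = 3 (the loss and the learning rates here are invented purely for illustration):

```python
# One-knob toy loss (w - 3)^2, minimum at w = 3 (invented for illustration).
def descend(learning_rate, steps=20):
    w = 0.0  # start away from the minimum
    for _ in range(steps):
        grad = 2 * (w - 3.0)          # slope of (w - 3)^2 at the current w
        w = w - learning_rate * grad  # step opposite the gradient
    return w

print(descend(0.1))    # sensible steps: ends up close to 3
print(descend(0.001))  # sleepy snail: barely left the starting point
print(descend(1.1))    # too large: overshoots back and forth, blows up
```

Same rule, same gradient, three very different fates — the only thing that changed was the step size.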
So:
- Gradient tells us direction.
- Learning rate decides step size.
- Gradient descent is the procedure that uses both to walk downhill.
- Minima are the valleys where this walking tends to settle.
That’s the entire family:
- Slope is the basic tilt for one knob.
- Gradient is all the slopes together, one per knob.
- Gradient descent is the rule: “Walk opposite to the gradient.”
- Learning rate is how big each step of that walk is.
The Daily Life of a Network: Forward, Backward, Update
When we train our exam network, each cycle looks like this:
- Forward pass
  - Feed inputs: study_hours, sleep_hours.
  - The network computes outputs: predicted_score for each student.
- Loss computation
  - Compare predicted_score with actual_score.
  - Turn the differences into a single loss value.
- Backward pass
  - Compute gradients:
    - How does the loss change if we tweak weight_study?
    - How about weight_sleep?
    - How about bias?
  - This is where backpropagation lives: it efficiently calculates these slopes through all layers.
- Update (gradient descent step)
  - Use the gradients plus the learning rate.
  - Nudge each weight and bias in the direction that reduces the loss.
Repeat this over and over on many batches of students.
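Putting the whole cycle together, here is a minimal sketch in plain Python (no libraries; the student data, starting knobs, and learning rate are all invented for illustration):

```python
# Invented mini-dataset: (study_hours, sleep_hours, actual_score).
students = [
    (2.0, 8.0, 55.0),
    (6.0, 7.0, 80.0),
    (9.0, 6.0, 90.0),
]

def mean_loss(w_study, w_sleep, b):
    return sum((w_study * s + w_sleep * sl + b - actual) ** 2
               for s, sl, actual in students) / len(students)

weight_study, weight_sleep, bias = 0.0, 0.0, 0.0
learning_rate = 0.005
loss_before = mean_loss(weight_study, weight_sleep, bias)

for epoch in range(500):
    g_study = g_sleep = g_bias = 0.0
    for study, sleep, actual in students:
        # Forward pass: predict, then measure the error.
        predicted = weight_study * study + weight_sleep * sleep + bias
        error = predicted - actual
        # Backward pass: accumulate one slope per knob.
        g_study += 2 * error * study
        g_sleep += 2 * error * sleep
        g_bias += 2 * error
    n = len(students)
    # Update: nudge each knob opposite its slope, scaled by the learning rate.
    weight_study -= learning_rate * g_study / n
    weight_sleep -= learning_rate * g_sleep / n
    bias -= learning_rate * g_bias / n

loss_after = mean_loss(weight_study, weight_sleep, bias)
print(loss_before, "->", loss_after)  # the loss drops as the knobs settle
```

In a real network the backward pass is done by backpropagation through many layers instead of by hand, but the rhythm is exactly this: predict, measure, compute slopes, nudge, repeat.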
Over time, the network “settles” into a low-loss configuration, a minimum.
Cool.
We now understand how the knobs get updated.
Next problem:
“What kind of curve can this network actually learn?”
Right now, without any extra spice, our neuron is basically drawing straight lines.
That’s too boring for real life.
Enter: activation functions.
Why We Need Activation Functions: From Straight Lines to Curved Reality
Our basic neuron does:
inputs → multiply by weights → add bias → output
That is essentially a straight line (or flat plane in higher dimensions).
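A quick illustration with invented weights: no matter where you start, one extra study hour moves the prediction by exactly the same amount — which is what “straight line” means here:

```python
# Invented weights for illustration: the bare neuron with no activation.
def neuron(study_hours, sleep_hours):
    weight_study, weight_sleep, bias = 4.0, 2.0, 20.0
    return weight_study * study_hours + weight_sleep * sleep_hours + bias

# One extra study hour always adds the same amount — no bend anywhere.
print(neuron(3, 8) - neuron(2, 8))  # 4.0
print(neuron(9, 8) - neuron(8, 8))  # 4.0
```

A model like this can never say “more study helps up to a point, then hurts” — the tilt is the same everywhere.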
But exam performance is messy:
- Up to a point, more study helps.
- After some extreme over-studying, performance might drop because of burnout.
- Very little sleep is bad.
- Too much sleep the night before might mean no preparation.
These are curved, twisty relationships, not just one straight tilt.
So we need a way to bend our model’s behaviour.
In the next part, we give our model a personality with activation functions so it can bend its behaviour, squish, and decide when to stay quiet or speak loudly about a pattern.