By now, our little exam predictor has grown up quite a bit:
- In Part 1, it learned what it’s trying to do:
take study_hours and sleep_hours, combine them with weights and bias, and make a prediction. Then it measures how wrong it is with a loss.
- In Part 2, it learned how to improve itself:
treat the loss as a landscape, and use gradients and gradient descent to walk downhill toward a low-loss valley.
- In Part 3, it got a personality:
with activation functions, it can now bend and squish its behaviour instead of pretending life is a straight line.
At this point, the network looks smart on paper. But now we hit a very human problem:
The model can become that kid who memorises last year’s question paper perfectly…
and still flops in the real exam.
This is the world of underfitting, overfitting, regularisation, and a few quiet tricks that keep the network honest.
Training Loss Alone Can Lie
Imagine you are coaching a group of students.
You give them a stack of practice questions and let them solve them again and again. At the end, you test them… on the same practice set.
Of course they score well.
You proudly conclude:
“My teaching is amazing! They get 99% on practice!”
Then the actual board exam happens… and scores crash.
They didn’t really learn the subject. They learned the practice sheet.
Your neural network can do the same thing.
If you only look at training loss (how wrong the model is on the data it sees during training), you may think it is brilliant… while it is quietly memorising patterns instead of learning a general rule.
So we need to ask a tougher question:
“How do you behave on students you have never seen before?”
That leads to our first tool: a validation set.
Training vs Validation: Two Groups of Students
We take all our past student data and split it into at least two groups:
- Training set
Students the model is allowed to learn from.
Their data is shown again and again during training.
The model adjusts its weights and bias to reduce loss on them.
- Validation set
Students the model is not allowed to learn from directly.
Their data is kept aside.
The model does not update its weights using them.
We only use them to test how well the model generalises.
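In code, the split itself is just bookkeeping: shuffle the data, then slice it. Here is a minimal sketch; the student numbers and the 75/25 split ratio are made up for illustration:

```python
import random

# Hypothetical past-student data: (study_hours, sleep_hours, exam_score)
students = [(5.0, 7.0, 78), (2.0, 6.0, 55), (8.0, 5.0, 82),
            (3.5, 8.0, 65), (6.0, 6.5, 74), (1.0, 9.0, 48),
            (7.5, 7.5, 88), (4.0, 4.0, 60)]

random.seed(0)           # shuffle reproducibly so the split isn't biased by ordering
random.shuffle(students)

split = int(0.75 * len(students))   # e.g. 75% training, 25% validation
train_set = students[:split]        # the model learns from these
val_set = students[split:]          # kept aside; only used to measure generalisation

print(len(train_set), len(val_set))
```

The important property is that no student appears in both groups — otherwise the validation score is contaminated by memorisation.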
The training process now looks like this:
- During training rounds:
Feed training students → compute training loss → update weights.
- Every so often:
Feed validation students (no weight updates) → compute validation loss.
Now we watch two curves as training goes on:
- Training loss over time.
- Validation loss over time.
These two together tell us what kind of student our model is becoming.
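To make that rhythm concrete, here is a hedged sketch using a toy linear model, predict = w_study · study_hours + w_sleep · sleep_hours + bias. The data, learning rate, and epoch count are all made up; the point is only the shape of the loop — training data updates weights, validation data never does:

```python
# Toy linear model: predict = w_study * study_hours + w_sleep * sleep_hours + bias.
train_set = [(5.0, 7.0, 78), (2.0, 6.0, 55), (8.0, 5.0, 82), (3.5, 8.0, 65)]
val_set = [(6.0, 6.5, 74), (1.0, 9.0, 48)]

w_study, w_sleep, bias = 0.1, 0.1, 0.0
lr = 0.001  # learning rate

def mse(data):
    # Mean squared error: average of (prediction - actual_score)^2.
    return sum((w_study * s + w_sleep * sl + bias - y) ** 2
               for s, sl, y in data) / len(data)

for epoch in range(200):
    # Training step: weights are updated using TRAINING students only.
    for s, sl, y in train_set:
        err = (w_study * s + w_sleep * sl + bias) - y
        w_study -= lr * 2 * err * s
        w_sleep -= lr * 2 * err * sl
        bias -= lr * 2 * err
    # Validation step: just measure the loss, never update weights here.
    if epoch % 50 == 0:
        print(f"epoch {epoch}: train {mse(train_set):.1f}, val {mse(val_set):.1f}")
```

Watching those two printed numbers over time is exactly the "two curves" idea in miniature.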
In many real projects, there is even a third group of students called the test set.
The model never learns from them, and we don’t even use them while tuning settings. We look at test results only once, at the very end, as the honest “board exam score”: how well the final, chosen model behaves on completely unseen data.
Underfitting: The Model Is Just Too Dumb
Case 1:
- Training loss is high.
- Validation loss is also high.
The model is doing badly even on questions it has seen many times.
This is like a student who:
- does not study enough,
- uses an oversimplified method,
- and consistently gets things wrong in both practice and exams.
This is called underfitting.
In our exam story, underfitting might happen if:
- Our network is too simple (say, just a straight-line model, no useful hidden layer).
- We ignore important inputs (maybe we only use study_hours and ignore sleep_hours and other factors).
- We have not trained for long enough, or the learning rate is too tiny.
How to fix underfitting (in words):
- Give the model a richer structure:
Add a hidden layer or a few more neurons.
- Add more meaningful inputs if possible:
revision_hours, mock_test_scores, etc.
- Let it learn a bit longer if training loss is still clearly going down.
Underfitting is the “Too simple, doesn’t try hard enough” end of the spectrum.
Overfitting: The Model Becomes a Memorising Parrot
Case 2:
- Training loss is low (model looks perfect on practice).
- Validation loss is high (model fails on unseen students).
This means:
The model has memorised the training data,
but failed to learn a general rule.
In our exam story, the model might be acting like this:
- “Ah, this exact combo of study_hours = 5.2 and sleep_hours = 6.8 belonged to Student 37 who scored 83, so anything near that must also be near 83.”
It is storing quirky details about specific students, instead of learning the overall relationship between study, sleep, and performance.
This is overfitting.
Overfitting is like a student who:
- memorises last year’s question papers word for word,
- but panics when the board asks the same concept in a slightly different way.
The model’s brain is hugging each training point too tightly and ignoring the big picture.
A Quiet Influencer: How We Start the Weights
Before we talk about how to control the model, it’s worth mentioning one often invisible choice:
“Where do all these weights and biases start from?”
This is called initialisation.
If we start in a very bad place, learning can be slow, unstable, or stuck.
Two things we usually avoid:
- Starting all weights at exactly zero
Then every neuron in a layer behaves identically. They see the same thing, produce the same output, and get the same update.
It’s like hiring a panel of teachers who all copy each other’s answer key. Zero diversity, zero specialisation.
- Starting with weights that are too huge or too tiny
- If they are huge, the internal signals or losses can explode.
- If they are too tiny, signals can shrink and almost disappear as they move through layers.
So in practice, we start with small random values, chosen in a way that keeps signals from exploding or vanishing too quickly.
You may have seen names like:
- Xavier / Glorot initialisation: It picks small random starting weights so that, on average, the signals going into and out of each layer stay at a reasonable scale. Most often used with smoother activations like sigmoid or tanh, it helps those functions avoid getting stuck in their flat (saturated) regions.
- He initialisation: It also picks small random starting weights, but tuned specifically for ReLU‑style activations. It keeps ReLU neurons from seeing values that are too huge or too tiny at the start, which reduces the chance of exploding outputs or everything immediately dying at zero.
For this blog, you don’t need formulas. It’s enough to remember:
“We don’t start from all zeros. We start from sensible small random values so that learning has a fair chance to move in the right direction.”
This doesn’t itself solve overfitting or underfitting, but it’s like giving the model a decent first impression before training begins.
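If you’re curious what those recipes look like as code rather than formulas, here is a small sketch of both. The layer sizes and the seed are made up; real libraries provide these initialisers for you:

```python
import math
import random

random.seed(0)  # made-up seed, just so the "random" start is reproducible

def xavier_init(fan_in, fan_out):
    # Glorot/Xavier (uniform flavour): scale chosen so signals going into
    # and out of the layer stay at a reasonable size; common with sigmoid/tanh.
    limit = math.sqrt(6 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def he_init(fan_in, fan_out):
    # He: a slightly larger scale, tuned for ReLU-style layers.
    std = math.sqrt(2 / fan_in)
    return [[random.gauss(0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

# e.g. 2 inputs (study_hours, sleep_hours) feeding 4 hidden neurons
W = xavier_init(2, 4)
print(W)  # small random values, not all zeros
```

Note that every weight is different, so each hidden neuron starts with its own "point of view" instead of copying its neighbours.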
Regularisation: Gentle Rules to Keep the Model Humble
Now back to our main problem: overfitting.
A deep network with many weights can twist itself into weird shapes that perfectly hug the training students but fail on new ones.
We need a way to say:
“Yes, fit the training data… but not too aggressively.
Don’t become weird just to please a few noisy points.”
This is what regularisation does.
You can think of it as adding a good behaviour clause to the loss:
- The model still tries to reduce prediction error.
- But it also gets a small penalty for becoming too complex or using extreme internal settings.
We’ll look at three kinds:
- Weight penalties (Ridge and Lasso).
- Dropout.
- Early stopping.
Weight Penalties: Ridge and Lasso
Weight penalties nudge the model towards simpler, calmer internal settings by discouraging certain kinds of weights.
Ridge (L2): “Shrink the loud knobs”
Ridge regularisation (often called L2) adds a small cost when weights become large.
In plain language:
“You may use many inputs, but don’t shout.
Keep your weights reasonable instead of blowing up a few of them.”
In our exam story:
- If weight_study tries to grow giant while others stay small,
a Ridge penalty quietly says, “Relax. You’re relying too heavily on this one factor.”
Effect:
- It shrinks all weights towards zero,
- But usually does not push them exactly to zero.
- The model still uses many features, just less aggressively.
You can think of Ridge as:
“Talk to all your inputs, but use an indoor voice.”
Lasso (L1): “Turn off the useless knobs”
Lasso regularisation (often called L1) goes one step further:
“If some inputs are not really helping,
I’d rather you ignore them completely.”
In our exam story:
- Suppose we add many features: study_hours, sleep_hours, revision_hours, group_study_hours, snacks_eaten, etc.
- Lasso will tend to push the weights of truly useless features all the way to zero.
Effect:
- It acts like an automatic feature selector.
- Some weights become exactly zero. Those inputs are effectively removed from the model.
- The model becomes simpler and often easier to interpret.
You can think of Lasso as:
“Instead of listening to everyone,
pick a smaller, truly useful group and ignore the rest.”
Together in your mental map
- Ridge (L2) says, “Keep all knobs, but keep them small.”
The model still listens to many inputs, just not too loudly.
- Lasso (L1) says, “Keep only the helpful knobs.”
The model cuts off some inputs completely by setting their weights to zero.
In practice, people sometimes use a mix of the two (called Elastic Net), but for now, just knowing their personalities is enough.
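As a sketch of how a weight penalty actually changes the loss: the model’s total loss becomes the ordinary prediction error plus a small extra charge for the weights themselves. The weights, base loss, and penalty strength `lam` below are all made up:

```python
weights = [4.0, -0.5, 0.0, 2.5]   # hypothetical learned weights
base_loss = 10.0                  # ordinary prediction error
lam = 0.1                         # penalty strength (a knob you tune)

# Ridge (L2): big weights cost a lot, because the penalty grows with the square.
ridge_penalty = lam * sum(w ** 2 for w in weights)

# Lasso (L1): every nonzero weight costs something, however it is used.
lasso_penalty = lam * sum(abs(w) for w in weights)

print(round(base_loss + ridge_penalty, 2))  # total loss under Ridge
print(round(base_loss + lasso_penalty, 2))  # total loss under Lasso
```

Notice that the zero weight contributes nothing to either penalty — which is exactly why Lasso’s "pay per nonzero weight" pressure pushes useless weights all the way to zero, while Ridge’s squared charge mostly shrinks the big ones.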
Dropout: “Some neurons are randomly told to take a break”
Dropout is a surprisingly powerful regularisation trick.
During training:
- For each batch of students, we randomly pick some neurons and temporarily set their output to zero.
- In that training step, those neurons are effectively “on leave”.
Why is this helpful?
- The network never knows exactly which neurons will be available.
- It cannot rely too heavily on one super-memoriser neuron.
- Knowledge is forced to spread across many neurons.
In our exam story:
- Maybe one neuron discovered a very oddly specific pattern in a small group of students.
- With dropout, that neuron is sometimes turned off.
- The model must find multiple ways to make good predictions, leading to more robust patterns.
At test time (with new students):
- We do not drop neurons.
- We use the full network, but it has already learned to work in a more balanced, shared way.
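Here is a minimal sketch of the training-time trick, using the common “inverted dropout” convention (survivors are scaled up so the expected total signal stays the same). The activation values and drop probability are made up:

```python
import random

def dropout(activations, p_drop=0.5, training=True):
    # Training: each neuron's output is zeroed with probability p_drop,
    # and survivors are scaled by 1/(1 - p_drop) so the expected signal
    # reaching the next layer is unchanged ("inverted dropout").
    # Test time: everyone works; pass all activations through untouched.
    if not training:
        return activations
    keep = 1 - p_drop
    return [a / keep if random.random() < keep else 0.0 for a in activations]

hidden = [0.8, 0.3, 1.2, 0.5]
print(dropout(hidden, 0.5, training=True))   # some neurons "on leave"
print(dropout(hidden, 0.5, training=False))  # full network for new students
```

Run the training-mode line a few times and you’ll see a different subset of neurons knocked out each time — which is precisely what stops any single neuron from becoming the indispensable memoriser.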
Early Stopping: “Stop when you begin to overthink”
Remember our two curves:
- Training loss (on training students).
- Validation loss (on validation students).
Typically:
- At first, both go down.
The model is learning genuinely helpful patterns.
- After some point, something tricky happens:
- Training loss keeps going down. The model is fitting the training data better and better.
- Validation loss stops improving, or even starts going up.
This is the sign of overfitting kicking in.
Early stopping is a simple, powerful rule:
“Keep training as long as validation loss is improving.
Once it stops improving for a while (or gets worse),
stop training and keep the version of the model that had the best validation loss.”
In human exam terms:
- You watch your student practise.
- Once you see that extra late-night questions are no longer improving mock test scores—and might even be confusing them—you say:
“That’s enough. Close the book. Sleep.”
Early stopping prevents the model from sliding out of the “Good understanding” zone into the “I memorised weird noise” zone.
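The rule itself fits in a few lines. Here is a hedged sketch; the validation-loss curve and the `patience` value (how many epochs of no improvement we tolerate) are made up:

```python
# Made-up validation losses per epoch: improving at first, then getting worse.
val_losses = [2.0, 1.5, 1.2, 1.1, 1.15, 1.3, 1.5]

best_loss = float("inf")
best_epoch = 0
patience = 2  # how long we wait without improvement before giving up

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        # Validation is still improving: remember this as the best model so far.
        best_loss, best_epoch = loss, epoch
    elif epoch - best_epoch >= patience:
        # No improvement for `patience` epochs: stop, keep the best checkpoint.
        print(f"stopping at epoch {epoch}; keeping model from epoch {best_epoch}")
        break
```

In a real training loop you would also save the model’s weights at each new best epoch, so "keep the best checkpoint" is more than a print statement.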
A Tiny Sanity Checklist for Your Exam Network
Here’s a simple mental checklist you can keep:
- Training loss high, validation loss high
- Likely underfitting.
- Model too simple, not trained enough, or missing important inputs.
- Fix: Add capacity (more layers/neurons), better features, train longer.
- Training loss low, validation loss high
- Classic overfitting.
- Model is memorising training students.
- Fix: Use regularisation (Ridge/Lasso, dropout), early stopping, or get more data.
- Training unstable (loss bouncing or exploding)
- Maybe the learning rate is too high or the weight initialisation is not sensible.
- Fix: Lower the learning rate, use standard initialisation methods, or normalise inputs.
The goal is not to make the training loss as microscopic as possible.
The real goal is:
“Find a point where training loss is low enough,
and validation loss is also low and stays stable.”
That balance means the model has actually learned something general about the relationship between study, sleep, and performance.
Where We Are in the Story (and What Could Come Next)
At this point in the series, your exam predictor:
- Understands the basic task
Inputs → weights → bias → prediction → loss.
- Knows how to improve itself
Uses gradients and gradient descent to move downhill on the loss landscape.
- Has flexible inner reactions
Activation functions (sigmoid, tanh, ReLU, leaky ReLU, softmax) let it handle straight lines, curves, probabilities, and multi-class choices.
- Behaves better in the real world
Uses training vs validation sets, understands underfitting vs overfitting,
and uses regularisation (Ridge, Lasso, dropout, early stopping) with sensible initialisation to stay sane.
But even if you stop here, you already have something precious:
a complete neural network journey you can hold in your head, from raw scores and loss, through gradients and activations, all the way to generalisation, without a single equation, and with enough intuition not to get bullied by math later.