Lesson 2 of 6
Understanding AI Lesson 2 - How Machines Learn from Data

How Machines Learn from Data

Features, labels, training sets and decision boundaries. This lesson explains the mechanics of how a machine learning model actually gets better at its job.

GCSE and A-Level · Free interactive classifier

Netflix noticed something strange in their viewing data. Users who watched longer episode previews were far less likely to cancel their subscription. Nobody had a theory about this. No programmer had written a rule for it. The pattern emerged entirely from analysing the behaviour of millions of users across billions of data points.

Netflix did not decide that preview length mattered. The data showed them. That is the essential idea behind machine learning: you do not reason your way to the answer. You let the data reveal it.

Netflix Engineering Blog, 2016.

Think: If a programmer had tried to predict this, they probably would have focused on content quality, price, or customer service. The data found a pattern nobody was looking for. What does this tell you about the relationship between data and insight?

The vocabulary of machine learning

Before we can understand how learning happens, we need four terms. They appear in virtually every ML system.

Features
The inputs to the model. Each measurable property of a data point. A house has features: square footage, number of rooms, postcode. An email has features: word frequency, sender, time sent.
Labels
The correct answer for each training example. "Spam" or "not spam." "Cat" or "not cat." "Fraudulent" or "legitimate." Labels are what the model is trying to predict.
Training data
The examples used to teach the model. A collection of feature-label pairs. The model sees these during training and adjusts its internal settings based on them.
Test data
A held-back set of examples the model never sees during training. Used to measure how well it generalises to new data. If test accuracy is much lower than training accuracy, something is wrong.
Why split training and test data?
If you tested the model on the same data it trained on, you'd be asking "can you remember what you were shown?" That's not learning. Test data measures whether the model can handle genuinely new situations it has never encountered. This is called generalisation.
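The split itself is only a few lines of code. The sketch below uses a tiny made-up dataset (weights and colour scores are invented for illustration) and a 75/25 split, which is one common convention rather than a fixed rule:

```python
import random

# Toy labelled dataset: (features, label) pairs.
# Features are hypothetical: (weight in grams, colour score 0-10).
dataset = [
    ((150, 2), "apple"),  ((170, 1), "apple"),
    ((140, 3), "apple"),  ((160, 2), "apple"),
    ((130, 8), "orange"), ((155, 9), "orange"),
    ((145, 7), "orange"), ((165, 8), "orange"),
]

random.seed(0)           # fixed seed so the split is repeatable
random.shuffle(dataset)  # shuffle so the split is not biased by ordering

# Hold back 25% as test data the model never sees during training.
split = int(len(dataset) * 0.75)
train_data = dataset[:split]
test_data = dataset[split:]
```

Accuracy measured on `test_data` is the number that matters: it estimates how the model will behave on examples it has never seen.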

Finding a decision boundary

Imagine you want to classify fruit as either an apple or an orange. You measure two features for each piece of fruit: its weight and its colour score (how orange it is on a scale of 0 to 10). You plot each fruit as a point on a graph.

After plotting enough examples, a pattern emerges. The apples cluster in one region of the graph. The oranges cluster in another. The model's job is to find a decision boundary: a line (or curve) that separates the two groups as cleanly as possible.

Once that boundary is found, classifying a new fruit is simple: plot its features on the graph and check which side of the boundary it falls on. The model never saw that specific fruit during training. But it can make a confident prediction based on the pattern it learned.
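For a straight-line boundary, "which side does the point fall on" is just the sign of a weighted sum of the features. In the sketch below the weights and bias are hand-picked for illustration (they place the boundary at colour score 5), not learned from data:

```python
def classify(weight, colour, w=(0.0, 1.0), b=-5.0):
    """Classify a fruit by which side of a linear decision boundary
    the point (weight, colour) falls on. The weights w and bias b
    define the line w[0]*weight + w[1]*colour + b = 0; these values
    are hypothetical, placing the boundary at colour score 5."""
    score = w[0] * weight + w[1] * colour + b
    return "orange" if score > 0 else "apple"

# A fruit the model never saw during training:
print(classify(152, 8))  # high colour score -> "orange"
```

In a real system the training process adjusts `w` and `b` until the boundary separates the training examples as cleanly as possible; the classification step afterwards is exactly this simple.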

Overfitting vs Underfitting
Overfitting: The model learns the training data too precisely, including its noise and quirks. It performs brilliantly on training data but badly on test data. Like memorising last year's exam answers rather than understanding the subject.

Underfitting: The model is too simple to capture the real pattern. It performs badly on both training and test data. Like using a single straight line to separate two groups that need a curved boundary.
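The contrast can be made concrete with two toy "models" on a one-feature dataset (colour score only; all numbers are invented). A lookup table that memorises the training pairs is overfitting taken to its extreme; a single threshold is a simpler rule that captures the pattern:

```python
# (colour score, label) pairs; values are illustrative.
train = [(2, "apple"), (3, "apple"), (8, "orange"), (9, "orange")]
test = [(1, "apple"), (7, "orange")]  # never seen during training

# Extreme overfitting: a pure lookup table. Perfect recall of the
# training data, but no opinion about anything it has not seen.
memory = dict(train)
def memoriser(colour):
    return memory.get(colour, "apple")  # blind fallback guess

# A simpler rule: one threshold. It ignores individual quirks
# and captures the underlying pattern instead.
def threshold_rule(colour):
    return "orange" if colour >= 5 else "apple"

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memoriser, train))      # 1.0 - looks brilliant
print(accuracy(memoriser, test))       # 0.5 - no better than chance
print(accuracy(threshold_rule, test))  # 1.0 - generalises
```

The memoriser is the student who learned last year's answers: flawless on the past paper, lost on the real exam.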

Build your own classifier

Fruit Classifier
Plot training data, then train the model to find a decision boundary

You are training a model to classify apples vs oranges based on weight (x-axis) and colour score (y-axis). Click the canvas to add training examples. Select which fruit you are adding, then click Train Model to see the decision boundary the model finds.

[Interactive canvas. Legend: apple (click to add) · orange (click to add) · decision boundary. Axes: weight from light to heavy (x-axis), colour score from low to high (y-axis).]
Add at least 4 points of each type, then click Train Model to see the decision boundary.

Diagnose the model

A data scientist trained three models on different datasets. Each produced a report showing accuracy on training data vs. accuracy on unseen test data. For each model, decide what went wrong - or right.

Model A
Classifying medical images as cancerous or benign
Training accuracy
97%
Test accuracy
63%
Model B
Predicting whether a customer will cancel their subscription
Training accuracy
91%
Test accuracy
88%
Model C
Detecting spam emails
Training accuracy
61%
Test accuracy
58%
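The reasoning behind this kind of diagnosis follows a rule of thumb that can be written down directly. The `gap` and `floor` thresholds below are illustrative choices for teaching, not standard values:

```python
def diagnose(train_acc, test_acc, gap=0.10, floor=0.70):
    """Rough diagnosis from a train/test accuracy report.
    A large train-test gap suggests overfitting; low accuracy on
    both sets suggests underfitting. Thresholds are illustrative."""
    if train_acc - test_acc > gap:
        return "overfitting"
    if train_acc < floor and test_acc < floor:
        return "underfitting"
    return "looks healthy"

print(diagnose(0.99, 0.62))  # big gap -> "overfitting"
print(diagnose(0.55, 0.52))  # low on both -> "underfitting"
```

Try applying the same reasoning by hand to Models A, B, and C above before checking your answers.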

Questions worth thinking about

Question 1
Why is it essential to use test data that the model has never seen during training?
Key points: If you evaluate the model on its training data, you are only testing its memory, not its ability to generalise. A model could memorise all training examples perfectly (overfitting) and score 100% on training data while being useless on new data. Test data simulates real-world use: the model must handle inputs it has never encountered. Only test accuracy tells you whether the model has genuinely learned the underlying pattern.
Question 2
Labelling training data is expensive and time-consuming. Who does it, and what happens if they make mistakes?
Key points: Labelling is often done by human annotators, sometimes thousands of people working via platforms like Amazon Mechanical Turk. For specialist domains (medical images, legal text), expert labellers are needed, which is very expensive. Labelling errors introduce noise into the training data. If 5% of labels are wrong, the model learns those mistakes as if they were correct. This is why data quality auditing is a critical and often underestimated part of building ML systems.
Question 3
Could you train a model to predict exam grades from student features? What features would you choose, and what ethical problems might arise?
Key points: You could attempt this using features like prior grades, attendance, time spent on tasks, or socioeconomic indicators. The ethical issues are serious: if the model learns that students from certain postcodes or schools typically underperform, it may predict low grades for students from those areas regardless of their individual ability. This becomes a self-fulfilling prophecy if the prediction affects resources or support given. Using protected characteristics (directly or indirectly via proxy features) in automated decisions about people is legally and ethically problematic in the UK under the Equality Act 2010.

What to remember

Core takeaways - Lesson 2
1. Features are the inputs to a model - the measurable properties of each example. The choice of features strongly influences what patterns the model can detect.
2. Labels are the correct answers in training data. Without labels, the model has nothing to compare its predictions against and cannot improve.
3. Training data teaches the model. Test data measures whether the model has genuinely learned or just memorised. Both are essential.
4. A decision boundary separates data into classes. The model's goal is to find the boundary that best separates the training examples, then apply it to new data.
5. Overfitting means memorising, not learning. A model that fits training data too precisely will fail on real-world data that looks slightly different.

Check your understanding

5 Questions
Answer all five, then submit for instant feedback
Question 1
In machine learning, what is a "feature"?
The correct answer for a training example
A measurable input property of a data point used by the model
A special capability of an advanced AI system
The output prediction made by the model
Question 2
Why is data split into training and test sets?
To make training faster by using less data
To measure whether the model genuinely generalises to new data, not just memorises training examples
Because the model cannot process all the data at once
To avoid the model learning from labelled data
Question 3
A model scores 99% on training data but only 62% on test data. What is the most likely explanation?
The model is underfitting
The test data is corrupted
The model is overfitting - it has memorised training examples rather than learning the underlying pattern
The model needs more training data
Question 4
What is a decision boundary in a classification model?
The maximum number of training examples the model can process
A line or boundary that separates different classes in the feature space
The point at which the model stops training
The threshold of accuracy required before deployment
Question 5
An engineer builds a fraud detection model using only fraudulent transactions as training data, with no legitimate transactions included.
What is the fundamental problem with this approach?
The model will run too slowly without both classes
The model needs labelled examples of both classes to learn the difference between them
Fraudulent transactions are too complex to use as training data
The model cannot be tested without both classes

Exam-style practice

Write a structured answer
A company wants to train a machine learning model to predict whether a customer will cancel their subscription in the next 30 days. Describe what training data would be needed and explain what features and labels would be required. [4 marks]
Mark scheme - 4 marks
Training data would consist of historical customer records with known outcomes (e.g. did they cancel within 30 days). (1 mark)
Features (inputs) could include: number of logins in the last month, time since last activity, number of support tickets raised, subscription length, plan type, payment method. Any two valid examples. (1 mark)
Labels would be binary: "cancelled within 30 days" (1) or "did not cancel" (0), applied to historical customer records by checking what actually happened. (1 mark)
The data should be split into training and test sets so the model's ability to generalise to new customers can be measured. (1 mark)
Accept any reasonable features. Full marks require both features and labels to be identified and explained, not just listed.
Printable Worksheets

Practice what you've learned

Three printable worksheets covering supervised learning, training data, and overfitting at three levels: Recall, Apply, and Exam-style.

Exam Practice
Lesson 2: How Machines Learn from Data
GCSE-style written questions covering AI concepts. Work through them like an exam.
Start exam practice · Download PDF exam
Lesson 2 - Teacher Resources
How Machines Learn from Data
Suggested starter (5 min)
Draw two curves on the board fitted to the same 6 data points: one that passes through every point exactly, and one smooth line that misses some. Ask: which would you use to predict a new value you haven't seen before? Take answers. This makes overfitting intuitive before any formal definition - students see the problem instantly.
Lesson objectives
1. Describe the role of a training dataset in machine learning and explain what the model is doing when it trains.
2. Explain what overfitting is, why it happens, and why it reduces a model's real-world performance.
3. Describe what a decision boundary is and what it represents in a classification problem.
Key vocabulary (board-ready)
Training set
The portion of a dataset used to teach a machine learning model the patterns it needs.
Test set
A separate portion of data the model has never seen, used to evaluate how well it generalises to new examples.
Overfitting
When a model performs very well on training data but poorly on new data, because it has learned specific examples rather than underlying patterns.
Decision boundary
A line or surface that separates different classes in a classification problem, learned from the training data.
Feature
An individual measurable property of the data used as input to a machine learning model (e.g., email length, number of capital letters).
Discussion prompts
A music streaming service trains its recommendation model on 2015-2020 data, then deploys it in 2025. What problems might arise - and why does this happen with ML systems?
A student memorises mark scheme answers and gets 100% on practice papers, then fails the real exam. How is this like overfitting in machine learning?
Should a model that achieves 99% accuracy on its training data be trusted in production? What other information would you need before deploying it?
Common misconceptions
"High accuracy on training data means the model is good" - a model can be 100% accurate on training data and useless on new data. Training accuracy alone means nothing.
"More training data always makes a model better" - quality matters as much as quantity. Biased or mislabelled data makes a larger model worse, not better.
"The test set is used to improve the model" - the test set must never influence training. Using it to adjust the model invalidates the evaluation entirely.
Exit ticket questions
Define overfitting in the context of machine learning.
[1 mark]
A model achieves 98% accuracy on its training data and 61% on new data. What does this suggest about the model?
[1 mark]
Explain why a machine learning model needs a separate test set that is not used during training.
[2 marks]
Homework idea
A company trains a fraud detection model on bank transactions from January to June. It performs well in testing. They deploy it in December and it starts missing fraud cases. Write a paragraph explaining why this might happen and what the company should do to fix it.
Classroom tips
The overfitting concept maps directly onto "learning the mark scheme" rather than understanding. Students grasp this analogy immediately - use it early.
Pair activity: give students a small dataset (5-6 points on a board) and ask them to draw a decision boundary. Compare across pairs and discuss whose generalises better.
Timing: 20 minutes independent / 35 minutes with discussion.
Resources
AI Ethics Exam Practice · Download student worksheet (PDF) · Set as class homework (coming soon)