Lesson 2 of 6
Understanding AI Lesson 2 - How Machines Learn from Data

How Machines Learn from Data

Features, labels, training sets and decision boundaries. This lesson explains the mechanics of how a machine learning model actually gets better at its job.

GCSE and A-Level · Free interactive classifier

Netflix noticed something strange in their viewing data. Users who watched longer episode previews were far less likely to cancel their subscription. Nobody had a theory about this. No programmer had written a rule for it. The pattern emerged entirely from analysing the behaviour of millions of users across billions of data points.

Netflix did not decide that preview length mattered. The data showed them. That is the essential idea behind machine learning: you do not reason your way to the answer. You let the data reveal it.

Netflix Engineering Blog, 2016.

Think: If a programmer had tried to predict this, they probably would have focused on content quality, price, or customer service. The data found a pattern nobody was looking for. What does this tell you about the relationship between data and insight?

The vocabulary of machine learning

Before we can understand how learning happens, we need four terms. They appear in virtually every ML system.

Features
The inputs to the model. Each measurable property of a data point. A house has features: square footage, number of rooms, postcode. An email has features: word frequency, sender, time sent.
Labels
The correct answer for each training example. "Spam" or "not spam." "Cat" or "not cat." "Fraudulent" or "legitimate." Labels are what the model is trying to predict.
Training data
The examples used to teach the model. A collection of feature-label pairs. The model sees these during training and adjusts its internal settings based on them.
Test data
A held-back set of examples the model never sees during training. Used to measure how well it generalises to new data. If test accuracy is much lower than training accuracy, something is wrong.
Why split training and test data?
If you tested the model on the same data it trained on, you'd be asking "can you remember what you were shown?" That's not learning. Test data measures whether the model can handle genuinely new situations it has never encountered. This is called generalisation.
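The split itself is only a few lines of code. The sketch below uses a tiny made-up dataset (weights and colour scores are invented for illustration) and a 75/25 split, which is one common convention rather than a fixed rule:

```python
import random

# Toy labelled dataset: (features, label) pairs.
# Features are hypothetical: (weight in grams, colour score 0-10).
dataset = [
    ((150, 2), "apple"),  ((170, 1), "apple"),
    ((140, 3), "apple"),  ((160, 2), "apple"),
    ((130, 8), "orange"), ((155, 9), "orange"),
    ((145, 7), "orange"), ((165, 8), "orange"),
]

random.seed(0)           # fixed seed so the split is repeatable
random.shuffle(dataset)  # shuffle so the split is not biased by ordering

# Hold back 25% as test data the model never sees during training.
split = int(len(dataset) * 0.75)
train_data = dataset[:split]
test_data = dataset[split:]
```

Accuracy measured on `test_data` is the number that matters: it estimates how the model will behave on examples it has never seen.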

Finding a decision boundary

Imagine you want to classify fruit as either an apple or an orange. You measure two features for each piece of fruit: its weight and its colour score (how orange it is on a scale of 0 to 10). You plot each fruit as a point on a graph.

After plotting enough examples, a pattern emerges. The apples cluster in one region of the graph. The oranges cluster in another. The model's job is to find a decision boundary: a line (or curve) that separates the two groups as cleanly as possible.

Once that boundary is found, classifying a new fruit is simple: plot its features on the graph and check which side of the boundary it falls on. The model never saw that specific fruit during training. But it can make a confident prediction based on the pattern it learned.
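For a straight-line boundary, "which side does the point fall on" is just the sign of a weighted sum of the features. In the sketch below the weights and bias are hand-picked for illustration (they place the boundary at colour score 5), not learned from data:

```python
def classify(weight, colour, w=(0.0, 1.0), b=-5.0):
    """Classify a fruit by which side of a linear decision boundary
    the point (weight, colour) falls on. The weights w and bias b
    define the line w[0]*weight + w[1]*colour + b = 0; these values
    are hypothetical, placing the boundary at colour score 5."""
    score = w[0] * weight + w[1] * colour + b
    return "orange" if score > 0 else "apple"

# A fruit the model never saw during training:
print(classify(152, 8))  # high colour score -> "orange"
```

In a real system the training process adjusts `w` and `b` until the boundary separates the training examples as cleanly as possible; the classification step afterwards is exactly this simple.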

Overfitting vs Underfitting
Overfitting: The model learns the training data too precisely, including its noise and quirks. It performs brilliantly on training data but badly on test data. Like memorising last year's exam answers rather than understanding the subject.

Underfitting: The model is too simple to capture the real pattern. It performs badly on both training and test data. Like using a single straight line to separate two groups that need a curved boundary.
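The contrast can be made concrete with two toy "models" on a one-feature dataset (colour score only; all numbers are invented). A lookup table that memorises the training pairs is overfitting taken to its extreme; a single threshold is a simpler rule that captures the pattern:

```python
# (colour score, label) pairs; values are illustrative.
train = [(2, "apple"), (3, "apple"), (8, "orange"), (9, "orange")]
test = [(1, "apple"), (7, "orange")]  # never seen during training

# Extreme overfitting: a pure lookup table. Perfect recall of the
# training data, but no opinion about anything it has not seen.
memory = dict(train)
def memoriser(colour):
    return memory.get(colour, "apple")  # blind fallback guess

# A simpler rule: one threshold. It ignores individual quirks
# and captures the underlying pattern instead.
def threshold_rule(colour):
    return "orange" if colour >= 5 else "apple"

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(accuracy(memoriser, train))      # 1.0 - looks brilliant
print(accuracy(memoriser, test))       # 0.5 - no better than chance
print(accuracy(threshold_rule, test))  # 1.0 - generalises
```

The memoriser is the student who learned last year's answers: flawless on the past paper, lost on the real exam.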

Build your own classifier

Fruit Classifier
Plot training data, then train the model to find a decision boundary

You are training a model to classify apples vs oranges based on weight (x-axis) and colour score (y-axis). Click the canvas to add training examples. Select which fruit you are adding, then click Train Model to see the decision boundary the model finds.

[Interactive canvas. Legend: apple (click to add) · orange (click to add) · decision boundary. Axes: weight from light to heavy (x-axis), colour score from low to high (y-axis).]
Add at least 4 points of each type, then click Train Model to see the decision boundary.

Diagnose the model

A data scientist trained three models on different datasets. Each produced a report showing accuracy on training data vs. accuracy on unseen test data. For each model, decide what went wrong - or right.

Model A
Classifying medical images as cancerous or benign
Training accuracy
97%
Test accuracy
63%
Model B
Predicting whether a customer will cancel their subscription
Training accuracy
91%
Test accuracy
88%
Model C
Detecting spam emails
Training accuracy
61%
Test accuracy
58%
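The reasoning behind this kind of diagnosis follows a rule of thumb that can be written down directly. The `gap` and `floor` thresholds below are illustrative choices for teaching, not standard values:

```python
def diagnose(train_acc, test_acc, gap=0.10, floor=0.70):
    """Rough diagnosis from a train/test accuracy report.
    A large train-test gap suggests overfitting; low accuracy on
    both sets suggests underfitting. Thresholds are illustrative."""
    if train_acc - test_acc > gap:
        return "overfitting"
    if train_acc < floor and test_acc < floor:
        return "underfitting"
    return "looks healthy"

print(diagnose(0.99, 0.62))  # big gap -> "overfitting"
print(diagnose(0.55, 0.52))  # low on both -> "underfitting"
```

Try applying the same reasoning by hand to Models A, B, and C above before checking your answers.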

Questions worth thinking about

Question 1
Why is it essential to use test data that the model has never seen during training?
Key points: If you evaluate the model on its training data, you are only testing its memory, not its ability to generalise. A model could memorise all training examples perfectly (overfitting) and score 100% on training data while being useless on new data. Test data simulates real-world use: the model must handle inputs it has never encountered. Only test accuracy tells you whether the model has genuinely learned the underlying pattern.
Question 2
Labelling training data is expensive and time-consuming. Who does it, and what happens if they make mistakes?
Key points: Labelling is often done by human annotators, sometimes thousands of people working via platforms like Amazon Mechanical Turk. For specialist domains (medical images, legal text), expert labellers are needed, which is very expensive. Labelling errors introduce noise into the training data. If 5% of labels are wrong, the model learns those mistakes as if they were correct. This is why data quality auditing is a critical and often underestimated part of building ML systems.
Question 3
Could you train a model to predict exam grades from student features? What features would you choose, and what ethical problems might arise?
Key points: You could attempt this using features like prior grades, attendance, time spent on tasks, or socioeconomic indicators. The ethical issues are serious: if the model learns that students from certain postcodes or schools typically underperform, it may predict low grades for students from those areas regardless of their individual ability. This becomes a self-fulfilling prophecy if the prediction affects resources or support given. Using protected characteristics (directly or indirectly via proxy features) in automated decisions about people is legally and ethically problematic in the UK under the Equality Act 2010.

What to remember

Core takeaways - Lesson 2
1. Features are the inputs to a model - the measurable properties of each example. The choice of features strongly influences what patterns the model can detect.
2. Labels are the correct answers in training data. Without labels, the model has nothing to compare its predictions against and cannot improve.
3. Training data teaches the model. Test data measures whether the model has genuinely learned or just memorised. Both are essential.
4. A decision boundary separates data into classes. The model's goal is to find the boundary that best separates the training examples, then apply it to new data.
5. Overfitting means memorising, not learning. A model that fits training data too precisely will fail on real-world data that looks slightly different.

Check your understanding

5 Questions
Answer all five, then submit for instant feedback
Question 1
In machine learning, what is a "feature"?
The correct answer for a training example
A measurable input property of a data point used by the model
A special capability of an advanced AI system
The output prediction made by the model
Question 2
Why is data split into training and test sets?
To make training faster by using less data
To measure whether the model genuinely generalises to new data, not just memorises training examples
Because the model cannot process all the data at once
To avoid the model learning from labelled data
Question 3
A model scores 99% on training data but only 62% on test data. What is the most likely explanation?
The model is underfitting
The test data is corrupted
The model is overfitting - it has memorised training examples rather than learning the underlying pattern
The model needs more training data
Question 4
What is a decision boundary in a classification model?
The maximum number of training examples the model can process
A line or boundary that separates different classes in the feature space
The point at which the model stops training
The threshold of accuracy required before deployment
Question 5
An engineer builds a fraud detection model using only fraudulent transactions as training data, with no legitimate transactions included.
What is the fundamental problem with this approach?
The model will run too slowly without both classes
The model needs labelled examples of both classes to learn the difference between them
Fraudulent transactions are too complex to use as training data
The model cannot be tested without both classes

Exam-style practice

Write a structured answer
A company wants to train a machine learning model to predict whether a customer will cancel their subscription in the next 30 days. Describe what training data would be needed and explain what features and labels would be required. [4 marks]
Mark scheme - 4 marks
Training data would consist of historical customer records with known outcomes (e.g. did they cancel within 30 days). (1 mark)
Features (inputs) could include: number of logins in the last month, time since last activity, number of support tickets raised, subscription length, plan type, payment method. Any two valid examples. (1 mark)
Labels would be binary: "cancelled within 30 days" (1) or "did not cancel" (0), applied to historical customer records by checking what actually happened. (1 mark)
The data should be split into training and test sets so the model's ability to generalise to new customers can be measured. (1 mark)
Accept any reasonable features. Full marks require both features and labels to be identified and explained, not just listed.
Printable Worksheets

Practice what you've learned

Three printable worksheets covering supervised learning, training data, and overfitting at three levels: Recall, Apply, and Exam-style.

Exam Practice
Lesson 2: How Machines Learn from Data
GCSE-style written questions covering AI concepts. Work through them like an exam.
Start exam practice · Download PDF exam
Lesson 2 - Teacher Resources
How Machines Learn from Data
Suggested starter (5 min)
Draw two curves on the board fitted to the same 6 data points: one that passes through every point exactly, and one smooth line that misses some. Ask: which would you use to predict a new value you haven't seen before? Take answers. This makes overfitting intuitive before any formal definition - students see the problem instantly.
Lesson objectives
1. Describe the role of a training dataset in machine learning and explain what the model is doing when it trains.
2. Explain what overfitting is, why it happens, and why it reduces a model's real-world performance.
3. Describe what a decision boundary is and what it represents in a classification problem.
Key vocabulary (board-ready)
Training set
The portion of a dataset used to teach a machine learning model the patterns it needs.
Test set
A separate portion of data the model has never seen, used to evaluate how well it generalises to new examples.
Overfitting
When a model performs very well on training data but poorly on new data, because it has learned specific examples rather than underlying patterns.
Decision boundary
A line or surface that separates different classes in a classification problem, learned from the training data.
Feature
An individual measurable property of the data used as input to a machine learning model (e.g., email length, number of capital letters).
Discussion prompts
A music streaming service trains its recommendation model on 2015-2020 data, then deploys it in 2025. What problems might arise - and why does this happen with ML systems?
A student memorises mark scheme answers and gets 100% on practice papers, then fails the real exam. How is this like overfitting in machine learning?
Should a model that achieves 99% accuracy on its training data be trusted in production? What other information would you need before deploying it?
Common misconceptions
"High accuracy on training data means the model is good" - a model can be 100% accurate on training data and useless on new data. Training accuracy alone means nothing.
"More training data always makes a model better" - quality matters as much as quantity. Biased or mislabelled data makes a larger model worse, not better.
"The test set is used to improve the model" - the test set must never influence training. Using it to adjust the model invalidates the evaluation entirely.
Exit ticket questions
Define overfitting in the context of machine learning.
[1 mark]
A model achieves 98% accuracy on its training data and 61% on new data. What does this suggest about the model?
[1 mark]
Explain why a machine learning model needs a separate test set that is not used during training.
[2 marks]
Homework idea
A company trains a fraud detection model on bank transactions from January to June. It performs well in testing. They deploy it in December and it starts missing fraud cases. Write a paragraph explaining why this might happen and what the company should do to fix it.
Classroom tips
The overfitting concept maps directly onto "learning the mark scheme" rather than understanding. Students grasp this analogy immediately - use it early.
Pair activity: give students a small dataset (5-6 points on a board) and ask them to draw a decision boundary. Compare across pairs and discuss whose generalises better.
Timing: 20 minutes independent / 35 minutes with discussion.
Resources
AI Ethics Exam Practice · Download student worksheet (PDF) · Set as class homework (coming soon)