How Machines Learn from Data
Features, labels, training sets and decision boundaries. This lesson explains the mechanics of how a machine learning model actually gets better at its job.
Netflix noticed something strange in their viewing data. Users who watched longer episode previews
were far less likely to cancel their subscription. Nobody had a theory about this. No programmer
had written a rule for it. The pattern emerged entirely from analysing the behaviour of
millions of users across billions of data points.
Netflix did not decide that preview length mattered. The data showed them.
That is the essential idea behind machine learning: you do not reason your way to the answer.
You let the data reveal it.
Netflix Engineering Blog, 2016.
The vocabulary of machine learning
Before we can understand how learning happens, we need four terms. They appear in nearly every machine learning system.
Finding a decision boundary
Imagine you want to classify fruit as either an apple or an orange. You measure two features for each piece of fruit: its weight and its colour score (how orange it is on a scale of 0 to 10). You plot each fruit as a point on a graph.
After plotting enough examples, a pattern emerges. The apples cluster in one region of the graph. The oranges cluster in another. The model's job is to find a decision boundary: a line (or curve) that separates the two groups as cleanly as possible.
Once that boundary is found, classifying a new fruit is simple: plot its features on the graph and check which side of the boundary it falls on. The model never saw that specific fruit during training. But it can make a confident prediction based on the pattern it learned.
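The fruit example can be sketched in a few lines of Python. This is only an illustration, not the method behind any particular tool: the measurements are invented, and the "nearest centroid" rule used here is just one simple way to arrive at a straight-line boundary. The names TRAINING_SET, train and predict are made up for this sketch.

```python
# Hypothetical training set: ((weight in grams, colour score 0-10), label).
TRAINING_SET = [
    ((150, 2.0), "apple"),
    ((170, 1.5), "apple"),
    ((160, 3.0), "apple"),
    ((140, 8.5), "orange"),
    ((130, 9.0), "orange"),
    ((155, 7.5), "orange"),
]

def train(examples):
    """Learn one centroid (average point) per label."""
    grouped = {}
    for features, label in examples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in grouped.items()
    }

def predict(model, features):
    """Return the label whose centroid is closest to the new point."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))

model = train(TRAINING_SET)
print(predict(model, (145, 8.0)))  # a fruit the model never saw: orange
print(predict(model, (165, 2.5)))  # another unseen fruit: apple
```

The decision boundary here is the straight line of points that are equally far from both centroids. Because weight is measured in much larger numbers than colour score, it has more influence on the distance; practical systems usually rescale features before comparing them.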
Underfitting: The model is too simple to capture the real pattern. It performs badly on both training and test data. It is like using a single straight line to separate two groups that need a curved boundary.
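Underfitting can be demonstrated with a small experiment. The sketch below assumes invented data in which one class sits near the origin and the other surrounds it in a ring, so no straight line can separate them. It compares a deliberately simple model (the best single vertical line) with a more flexible one (a circular boundary). The function names and data are hypothetical.

```python
import math

def ring(radius, count, offset=0.0):
    """Return `count` points evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos(offset + 2 * math.pi * k / count),
         radius * math.sin(offset + 2 * math.pi * k / count))
        for k in range(count)
    ]

# Hypothetical data: class 0 sits near the origin, class 1 surrounds it.
train_data = [(p, 0) for p in ring(1.0, 8)] + [(p, 1) for p in ring(3.0, 8)]
test_data = [(p, 0) for p in ring(1.2, 8, 0.4)] + [(p, 1) for p in ring(2.8, 8, 0.4)]

def accuracy(predict, data):
    """Fraction of points whose predicted label matches the true label."""
    return sum(predict(p) == label for p, label in data) / len(data)

def fit_straight_line(data):
    """Too-simple model: the best vertical line x = t, tried in both orientations."""
    best = None
    for t in sorted(p[0] for p, _ in data):
        for side in (0, 1):
            rule = lambda p, t=t, side=side: side if p[0] > t else 1 - side
            score = accuracy(rule, data)
            if best is None or score > best[0]:
                best = (score, rule)
    return best[1]

def fit_circle(data):
    """More flexible model: a circle midway between the two classes' mean radii."""
    def mean_radius(label):
        radii = [math.hypot(*p) for p, l in data if l == label]
        return sum(radii) / len(radii)
    r = (mean_radius(0) + mean_radius(1)) / 2
    return lambda p: 1 if math.hypot(*p) > r else 0

line_model = fit_straight_line(train_data)
circle_model = fit_circle(train_data)
for name, model in [("straight line", line_model), ("circle", circle_model)]:
    print(name, accuracy(model, train_data), accuracy(model, test_data))
```

The straight-line model scores poorly on the training data and on the test data alike, which is the signature of underfitting. The circular boundary classifies every point in both sets correctly because it matches the shape of the real pattern.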
Build your own classifier
You are training a model to classify apples vs oranges based on weight (x-axis) and colour score (y-axis). Click the canvas to add training examples. Select which fruit you are adding, then click Train Model to see the decision boundary the model finds.
Diagnose the model
A data scientist trained three models on different datasets. Each produced a report showing accuracy on training data vs. accuracy on unseen test data. For each model, decide what went wrong, or what went right.
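One way to reason about such a report is to compare the two accuracy figures directly. The helper below is a rough sketch, not a standard rule: the function name diagnose, the thresholds good and max_gap, and the example figures are all assumptions chosen for illustration, and sensible cut-offs depend on the task.

```python
def diagnose(train_acc, test_acc, good=0.85, max_gap=0.10):
    """Classify a model report using illustrative, assumed thresholds."""
    if train_acc < good:
        # Poor even on data it has already seen: the model is too simple.
        return "underfitting"
    if train_acc - test_acc > max_gap:
        # Strong on training data but much weaker on unseen data.
        return "overfitting"
    return "good fit"

# Hypothetical reports: (training accuracy, test accuracy).
reports = {"A": (0.62, 0.60), "B": (0.99, 0.70), "C": (0.93, 0.91)}
for name, (train_acc, test_acc) in reports.items():
    print(name, diagnose(train_acc, test_acc))
```

With these invented figures, model A is flagged as underfitting, model B as overfitting, and model C as a good fit.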
Questions worth thinking about
What to remember
Explore further
Wikipedia makes an excellent starting point for established computing concepts. For any specific fact or claim, scroll to the References section at the bottom of the article and go to the primary source directly.
Check your understanding
Exam-style practice
Practice what you've learned
Three printable worksheets covering supervised learning, training data, and overfitting at three levels: Recall, Apply, and Exam-style.