Lesson 6 of 6 · Data & Binary

Data Compression

Files are big. Networks are slow. Storage is expensive. Compression is how computers reconcile these facts. Two approaches, very different trade-offs.

Learning objectives

Distinguish between lossy and lossless compression with examples
Explain and apply run-length encoding (RLE) to bitmap data
Describe how Huffman coding assigns shorter codes to frequent characters
Calculate the space saving from a given compression scheme

Why compression matters

From the previous lessons: an uncompressed 4K image is around 25 MB. A 3-minute stereo audio recording is about 30 MB. A 2-hour 4K film in raw form would be around 400 GB. None of these numbers are practical for everyday storage or streaming.

Compression reduces file size by finding and eliminating redundancy in data. There are two fundamentally different approaches:

Property | Lossy compression | Lossless compression
Can original be recovered? | No -- data is permanently removed | Yes -- original exactly reconstructed
File size reduction | Large (5:1 to 20:1 or more) | Moderate (typically 2:1 to 4:1)
Best for | Images (JPEG), audio (MP3), video (MP4) | Text, program files, archives (ZIP), PNG
When NOT to use | Text, code, databases -- errors corrupt meaning | Where maximum size reduction is needed
The key question: Can the recipient tolerate imperfection? A compressed music track that sounds slightly different is fine. A compressed database record with one wrong character could corrupt financial data or a patient's medical record. Know the use case before choosing compression type.

Run-length encoding (RLE)

RLE is a lossless technique. It replaces runs of repeated values with a count and the value. Instead of storing 20 red pixels separately, store "20 red".

Original bitmap row (20 pixels):
R R R R R R R R B B B B B B G G G R R R   (R=red, B=blue, G=green)

RLE encoded: 8R 6B 3G 3R (only 4 pairs)
Original: 20 values stored
Compressed: 8 numbers stored (4 counts + 4 colour codes)
Saving: 60% reduction

RLE works best on data with long runs of the same value -- simple cartoon-style images, icons, and pixel art. It performs poorly on photographic images where almost every pixel is a different colour.

Worst case for RLE: A checkerboard pattern (alternating pixels) where every pixel is different from its neighbours. RLE would produce pairs like "1A 1B 1A 1B..." which uses MORE space than the original, not less. The encoded file would be twice the size.

Worked example: Encode the following row using RLE.

Original: W W W W W W B B B B W W W W B (15 pixels)
Identify runs: 6 x W, then 4 x B, then 4 x W, then 1 x B
RLE: 6W 4B 4W 1B (8 values)
Original stored: 15 x 1 byte = 15 bytes (using 8-bit colour)
Compressed: 8 x 1 byte = 8 bytes
Saving: 7 bytes = 47% reduction
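The worked example can be sketched in a few lines of Python; this is an illustrative sketch (names are my own, not from any tool on this page), using itertools.groupby to find the runs:

```python
from itertools import groupby

def rle_encode(pixels):
    """Replace each run of repeated values with a (count, value) pair."""
    return [(len(list(run)), value) for value, run in groupby(pixels)]

row = list("WWWWWWBBBBWWWWB")       # the 15-pixel row from the worked example
encoded = rle_encode(row)
print(encoded)                      # [(6, 'W'), (4, 'B'), (4, 'W'), (1, 'B')]

# Space saving, assuming 1 byte per stored value (count or colour code)
original = len(row)                 # 15 bytes
compressed = 2 * len(encoded)       # 8 bytes (4 counts + 4 colour codes)
print(f"Saving: {(original - compressed) / original:.0%}")   # Saving: 47%

# Worst case: a checkerboard has no runs, so the output is twice the input
print(rle_encode(list("ABAB")))     # [(1, 'A'), (1, 'B'), (1, 'A'), (1, 'B')]
```

Running the checkerboard line confirms the worst case described above: four input values become four count-value pairs, i.e. eight stored values.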

Huffman coding

Huffman coding is a lossless technique used for text and similar data. It assigns shorter binary codes to more frequent characters and longer codes to rare ones. Standard ASCII uses 8 bits for every character regardless of frequency -- Huffman coding is more efficient.

To build a Huffman code for a piece of text:
1. Count the frequency of each character in the text.
2. Create a leaf node for each character, sorted by frequency (lowest first).
3. Take the two lowest-frequency nodes. Create a parent node with their combined frequency.
4. Repeat step 3 until only one node remains (the root).
5. Assign 0 to every left branch and 1 to every right branch.
6. Each character's code is read by following the path from root to that leaf.
Example: Encode "AABABCA" (7 characters)

Step 1 - Frequency count: A: 4, B: 2, C: 1
Step 2 - Sort by frequency (lowest first): C(1) B(2) A(4)
Step 3 - Merge two lowest: C(1) + B(2) = CB(3). Now: CB(3) A(4)
Step 4 - Merge remaining: CB(3) + A(4) = root(7)
Step 5 - Assign codes (left=0, right=1):
root
|- 0: CB(3)
|  |- 0: C(1) --> code: 00
|  `- 1: B(2) --> code: 01
`- 1: A(4) --> code: 1
Step 6 - Encode "AABABCA":
A  A  B  A  B  C  A
1  1  01 1  01 00 1
= 1101101001 (10 bits)

Standard ASCII: 7 chars x 8 bits = 56 bits
Huffman: 10 bits
Saving: 46 bits (82% reduction for this example!)
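The six steps can be sketched in Python. This is a minimal illustration (a real encoder must also store the code table so the recipient can decode), using a min-heap to pick the two lowest-frequency nodes at each merge:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table: frequent characters get shorter codes."""
    freq = Counter(text)                                  # step 1: count frequencies
    # Each heap entry: (frequency, tie-break id, {char: code-so-far})
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)                                   # step 2: lowest frequency first
    next_id = len(heap)
    while len(heap) > 1:                                  # steps 3-4: merge until one root
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in left.items()}  # step 5: 0 for the left branch
        merged.update({ch: "1" + c for ch, c in right.items()})  # 1 for the right branch
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]                                     # step 6: root holds every code

codes = huffman_codes("AABABCA")
encoded = "".join(codes[ch] for ch in "AABABCA")
print(codes)     # A -> '1', B -> '01', C -> '00'
print(encoded)   # 1101101001 (10 bits, versus 7 x 8 = 56 bits in ASCII)
```

A text with only one distinct character would need a special case (its code comes out empty); the lesson's examples always have at least two.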

Huffman coding works because it exploits frequency imbalance. Frequent characters get very short codes (1-2 bits). Rare characters get longer codes (3-4+ bits). As long as frequent characters dominate, the total bit count falls well below fixed-width encoding.

Compression Lab


Interactive
RLE Encoder
Huffman Coder
Dictionary Coder

Click a colour, then click cells to paint. Press Encode to see the RLE output and space saving.

Paint some pixels and press Encode.

Type or paste text. The tool builds the Huffman frequency table and codes in real time. Keep text short (under 60 chars) for clarity.

Type up to 100 words. The tool builds a dictionary of unique words, assigns each a short numeric code, and replaces every word in the text with its code.

This is the principle behind LZ compression (used in ZIP, gzip and PNG). Repeated words are replaced by much shorter references.
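The word-level dictionary idea can be sketched in Python. This is a simplified illustration of the principle, not the actual LZ77 algorithm (which works on byte sequences and back-references rather than whole words):

```python
def dictionary_encode(text):
    """Replace each word with a short numeric code from a dictionary of unique words."""
    dictionary = {}
    codes = []
    for word in text.split():
        if word not in dictionary:
            dictionary[word] = len(dictionary)   # first sighting: assign the next code
        codes.append(dictionary[word])
    return dictionary, codes

dictionary, codes = dictionary_encode("the cat sat on the mat the cat")
print(dictionary)   # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(codes)        # [0, 1, 2, 3, 0, 4, 0, 1]
```

Eight words compress to eight small codes plus a five-entry dictionary; the more often words repeat, the better this pays off.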

Exam Focus
  • Lossy compression: data is permanently lost; the original cannot be recovered. Lossless: the original can be perfectly restored.
  • RLE worked examples: always count runs carefully from left to right. Show each pair (count + colour/value) explicitly.
  • When does RLE not help? When there are no long runs of repeated values -- photographic images, for instance. Some RLE-encoded files can be LARGER than the original.
  • Huffman coding: frequent characters get shorter codes, rare characters get longer codes. This reduces the total number of bits used.
  • To calculate space saving: (original bits - compressed bits) / original bits x 100%. Always show your working.
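The space-saving formula in the last bullet, written as a small Python helper (illustrative only; in an exam, show the working by hand):

```python
def space_saving(original_bits, compressed_bits):
    """Percentage space saving: (original - compressed) / original x 100."""
    return (original_bits - compressed_bits) / original_bits * 100

print(space_saving(200, 80))   # 60.0 -- a 200-bit message compressed to 80 bits
print(space_saving(15, 8))     # ~46.7 -- the RLE worked example (15 bytes -> 8 bytes)
```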

Check your understanding

1. A student needs to compress a text document that will be sent to a colleague who must be able to read every word exactly. Which type of compression should they use?
Lossless, because the original data must be exactly reconstructed
Lossy, because it achieves a much higher compression ratio
Either type is fine for text documents
No compression, because text cannot be compressed
Lossy compression permanently removes data. For a text document, even one wrong character changes the meaning. Lossless compression (e.g. ZIP) must be used so the original text can be perfectly recovered.
2. Apply RLE to this row: W W W W B B B W W W W W W. What is the encoded form?
3W 4B 6W
4W 3B 6W
4W 2B 7W
5W 3B 5W
Count carefully: W W W W (4) then B B B (3) then W W W W W W (6). Encoded: 4W 3B 6W. Original: 13 values. Encoded: 6 values. Good saving!
3. In Huffman coding, which character should be assigned the shortest binary code?
The character with the fewest occurrences in the text
The character with the highest ASCII value
The character with the most occurrences in the text
The character that appears first in the text
Huffman coding assigns shorter codes to MORE frequent characters. This minimises the total number of bits needed, because the frequently used codes (which appear many times) are short.
4. A 200-bit message is compressed to 80 bits using Huffman coding. What is the percentage space saving?
40%
60%
80%
25%
Saving = (200 - 80) / 200 x 100 = 120 / 200 x 100 = 60%. The file is now 40% of its original size, saving 60%.
5. Describe one situation where RLE compression would make a file LARGER rather than smaller.
When the file contains many different character types
When every pixel or value is different from its neighbours -- e.g. a photographic image or checkerboard pattern
When the file is already less than 1 KB
When using lossy compression instead of lossless
RLE stores count-value pairs. If every value differs from its neighbour, each pair has count=1. "1A 1B 1C..." uses twice as many values as the original "A B C..." and the file grows.

Think Deeper

JPEG is lossy; PNG is lossless. Both are widely used for images on the web. When would you choose PNG over JPEG, and why?
PNG is preferable when: (1) The image contains text, sharp edges, or flat areas of colour -- JPEG's lossy compression blurs these edges, producing visible compression artefacts. (2) The image needs a transparent background (PNG supports alpha channels; baseline JPEG does not). (3) The image will be edited and re-saved repeatedly -- each save of a JPEG re-applies lossy compression, degrading quality cumulatively, whereas PNG re-saves at identical quality. JPEG is better for photographs, where a small quality loss is imperceptible and the large compression ratio saves significant bandwidth.
Huffman coding is "optimal" among prefix codes when symbols are encoded one at a time. But modern compressors (gzip, zstd) achieve much higher compression ratios. What techniques might they use beyond Huffman?
Modern compressors combine multiple techniques. The key addition over Huffman is LZ77/LZ78 dictionary coding -- rather than encoding single characters, they find repeated sequences of characters anywhere in the data and replace them with back-references: "this sequence appeared 200 bytes ago, and it was 15 characters long". This is far more powerful than Huffman alone for text or code files, where whole phrases repeat. The result is then entropy-coded. Gzip uses DEFLATE (LZ77 followed by Huffman coding). Zstandard uses a more sophisticated dictionary coder followed by Finite State Entropy coding (based on asymmetric numeral systems, with compression close to arithmetic coding). These routinely achieve 5:1 to 10:1 ratios on typical text files.

Series Complete!

You have worked through all 6 lessons of Data & Binary. You can now convert, encode, calculate file sizes, and explain compression -- all the core skills for this topic.

View Series Overview
Printable Worksheets

Practice what you've learned

Three printable worksheets covering data compression at three levels: Recall, Apply, and Exam-style.

Recall
Worksheet 1
Key term matching + RLE practice + lossy/lossless True/False • 16 marks
Apply
Worksheet 2
RLE encoding + space saving calculations + Huffman reasoning • 17 marks
Exam-style
Worksheet 3
Extended RLE, Huffman and compression evaluation questions • 20 marks
Exam Practice
Lesson 6: Data Compression
5 MCQ with instant feedback + 3 written questions with mark schemes.
Start exam practice Download PDF exam
Teacher Panel
Lesson 6 -- Data Compression
Lesson Objectives
Distinguish lossy from lossless compression: lossy removes data permanently, lossless allows exact reconstruction
Apply RLE to a row of pixel data: identify runs, express as count-value pairs
Explain when RLE is ineffective or counterproductive
Describe Huffman coding: shorter codes for more frequent characters
Calculate bit savings from a compression scheme
Timing Guide
0-5 min: Hook discussion: "Why is a 30 MB song stored as 3 MB on Spotify?"
5-15 min: Lossy vs lossless -- definitions, examples, when to use each
15-25 min: RLE -- method, worked example, worst case (checkerboard)
25-35 min: Huffman coding -- frequency table, tree building, code assignment
35-40 min: Compression Lab tool + space saving calculation practice
Common Misconceptions
"Lossy and lossless both keep all the data" -- lossy permanently discards data. Only lossless preserves all original information.
"RLE always makes files smaller" -- RLE can INCREASE file size when there are no long runs. The worst case is when every value differs from its neighbours.
"Huffman coding loses data" -- it is a lossless technique. The original text is recoverable from the Huffman codes if the code table is also stored.
Students sometimes confuse "compression ratio" with "space saving percentage". If a 100-bit file compresses to 40 bits, the compression ratio is 2.5:1 but the saving is 60%.
Marking Guidance

Lossy vs lossless (2 marks): 1 mark for correct term, 1 mark for explanation with example. "JPEG is lossy -- data is permanently removed so the original cannot be recovered" scores both marks.

RLE worked example (2 marks): 1 mark for correct run identification, 1 mark for correct count-value pairs. A miscounted run loses the second mark but not the first if method is clear.

Huffman description (2 marks): 1 mark for "shorter codes for more frequent characters", 1 mark for "reduces total bits / less storage". Do not require students to build the full tree unless specifically asked.

Space saving (2 marks): 1 mark for correct calculation, 1 mark for showing the formula (original - compressed) / original x 100.

RLE Checkerboard Challenge
After the RLE tool exercise, ask students to paint an alternating checkerboard (B W B W B W...) and encode it. The output will be longer than the input. Discussion: "Has compression failed? Why? What does this tell us about choosing the right compression for the data type?"
This makes the "when NOT to use RLE" point memorable and concrete -- students have created a real example of compression making things worse.
Exit Tickets
Explain the difference between lossy and lossless compression. Give one example of each. [2 marks]
Apply RLE to: R R R R G G B B B B B B B G G G. Show the encoded form and calculate the space saving as a percentage. [3 marks]
In Huffman coding, the letter 'E' appears 40 times and 'Z' appears 2 times. Which should get the shorter code, and why? [2 marks]
Differentiation
Grade 4 Lossy vs lossless definitions and one example each. RLE on a simple row. State that Huffman assigns shorter codes to frequent characters.
Grade 7 RLE worked examples with space saving calculation. When does RLE increase file size? Huffman table with given codes -- calculate total bits.
Grade 9 Build a Huffman tree from a frequency table. Justify choice of compression type for a specific context. Discuss the overhead of storing the Huffman code table itself as part of the compressed file.