Computer Systems Lesson 3 of 5

Why is your CPU faster than mine?

Clock speed, cores, cache size and pipelining. Four factors that determine how quickly a processor can work - and each involves a genuine engineering trade-off.

40-50 minutes · Core GCSE CS content · Interactive comparisons

Compare two processors - an Intel Core i3 at 3.6 GHz and an Intel Core i9 at 5.0 GHz. The i9 has a higher clock speed. But the i9 also costs four times as much. Is it four times as fast? And why does an Apple M4 at 4.4 GHz often outperform an Intel running at 5.0 GHz?

The answer: Clock speed is just one of four performance factors. Understanding all four explains why processor benchmarks are more complicated than a single GHz number.

What actually affects CPU performance?

Four factors determine CPU performance:

⏱️ Clock Speed - how many cycles per second the CPU can complete
🔲 Number of Cores - how many independent processing units the chip contains
💾 Cache Size - how much fast-access memory sits close to the CPU
🔄 Pipelining - how the CPU overlaps multiple FDE cycles simultaneously

Clock Speed

The clock speed (measured in GHz) determines how many FDE cycles the CPU can complete per second. 1 GHz = 1 billion cycles per second, so a 4 GHz CPU can theoretically complete 4 billion cycles per second - at best, roughly one simple instruction per cycle.

The limit: Increasing clock speed produces heat. Beyond around 5-6 GHz, modern chips generate so much heat they become unreliable unless cooled with extreme methods. This is why clock speeds have not increased dramatically since 2005 - engineers found other ways to improve performance instead.
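The GHz arithmetic above can be checked with a few lines of Python - a sketch using only the conversion stated in this section:

```python
# A minimal sketch: converting a clock speed in GHz to cycles per second,
# and finding how long a single cycle lasts.

def cycles_per_second(ghz):
    """1 GHz = 1 billion (1e9) cycles per second."""
    return ghz * 1_000_000_000

def cycle_time_ns(ghz):
    """Duration of one clock cycle in nanoseconds (1e9 cycles/s -> 1/ghz ns)."""
    return 1 / ghz

print(cycles_per_second(3.6))  # 3600000000.0 - the 3.6 GHz example below
print(cycle_time_ns(4.0))      # 0.25 - a 4 GHz clock ticks every quarter nanosecond
```

Students can use this to check their own GHz-to-cycles calculations from the lesson.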

Number of Cores

Instead of making one core faster, chip designers added more cores - each an independent processor that can fetch, decode and execute its own instructions simultaneously.

The catch: Multiple cores only help if the software is written to use them (multi-threaded). A single-threaded program runs on one core only - adding more cores gives it no benefit at all. Video editing, 3D rendering and scientific simulations benefit enormously from many cores. Opening a spreadsheet on a single tab barely uses more than one.
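A toy model (not a real benchmark) of why core count only helps multi-threaded software - the task count and timing here are illustrative values, not measured figures:

```python
import math

# A minimal model: if a program has N equal, fully independent tasks,
# estimated wall time is tasks / cores, rounded up to whole rounds.
# A single-threaded program behaves as if cores == 1, whatever the chip has.

def estimated_time(tasks, cores, seconds_per_task=1, multithreaded=True):
    usable_cores = cores if multithreaded else 1
    # Tasks are dealt out in rounds; each round takes seconds_per_task.
    return math.ceil(tasks / usable_cores) * seconds_per_task

print(estimated_time(8, 1))                       # 8 - one core, one task at a time
print(estimated_time(8, 4))                       # 2 - two rounds of four tasks
print(estimated_time(8, 4, multithreaded=False))  # 8 - extra cores sit idle
```

The third call is the key exam point: a single-threaded program gains nothing from the extra three cores.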

Cache Size

Fetching data from RAM is slow relative to CPU speed. Cache stores recently used data much closer to the CPU core, reducing how often the processor has to wait. A larger cache means more data can be held nearby - fewer cache misses, fewer stalls.

The limit: Cache is extremely expensive to manufacture per byte - far more than RAM. Large caches also increase chip size and power consumption. The design becomes a careful balance between hit rate, cost, size and heat.

Pipelining

Without pipelining, the CPU finishes all three FDE stages of instruction 1 before starting instruction 2. With pipelining, while instruction 1 is being decoded, instruction 2 is already being fetched. Each stage works on a different instruction simultaneously - like an assembly line in a factory.

The complication: If an instruction's result is needed by the next instruction (a data hazard), the pipeline must stall and wait. Modern CPUs use sophisticated techniques including branch prediction and out-of-order execution to minimise these stalls.

Clock speed in numbers

Example: 3.6 GHz = 3,600,000,000 clock cycles per second. Consumer CPUs span roughly 1.0 GHz (slower, low-power) through 2.5-3.5 GHz (typical mid-range laptop) up to 5.0 GHz (faster, high-end desktop).

Cache hits, cache misses - and why they matter

When the CPU needs a piece of data, it does not go straight to RAM. It checks each cache level in turn, starting with the fastest. If the data is there, it is a cache hit and the CPU gets the data almost immediately. If not, it is a cache miss - the CPU must go to the next level, which takes longer.
Memory hierarchy - speed vs size
Each row is where the CPU looks next on a miss.
Level                Size          Cost of a hit
L1 Cache             32-64 KB      ~1-4 cycles
L2 Cache             256 KB-1 MB   ~10-20 cycles
L3 Cache             8-64 MB       ~40-50 cycles
Main Memory (RAM)    8-128 GB      ~200+ cycles
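The lookup walk the table describes can be sketched in a few lines - the latencies are the illustrative figures from the table, not exact values for any real chip:

```python
# Toy model of the memory hierarchy: each level is checked in turn,
# and the access costs the latency of the level where the data is found.
# Latencies (in cycles) are the illustrative figures from the table above.

LEVELS = [
    ("L1", 4),
    ("L2", 20),
    ("L3", 50),
    ("RAM", 200),
]

def access_cost(found_at):
    """Cycles to fetch data that lives at level `found_at` (simplified)."""
    for name, cycles in LEVELS:
        if name == found_at:
            return cycles
    raise ValueError(f"unknown level: {found_at}")

print(access_cost("L1"))   # 4 - a hit in the fastest cache
print(access_cost("RAM"))  # 200 - missed every cache level
```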
A program that repeatedly uses the same variables keeps them in L1 cache - each access costs just 1-4 cycles. A program that constantly reads from large arrays or jumps around in memory causes many L3 misses or even RAM accesses. At 200+ cycles each, those misses add up fast. This is why algorithms that access memory in predictable patterns (like iterating through an array) are faster than those that jump around unpredictably.
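The effect of access patterns can be demonstrated with a toy LRU cache model - the cache size and block size here are made-up illustrative parameters, not real hardware values:

```python
import random

# Toy cache model: holds the 8 most recently used blocks, where 8
# consecutive array elements share one block. Sequential scans reuse a
# freshly loaded block seven more times; random jumps mostly do not.

def hit_rate(addresses, cache_size=8):
    cache, hits = [], 0
    for addr in addresses:
        block = addr // 8          # 8 elements per cache block (illustrative)
        if block in cache:
            hits += 1
            cache.remove(block)    # move to most-recently-used position
        cache.append(block)
        if len(cache) > cache_size:
            cache.pop(0)           # evict the least-recently-used block
    return hits / len(addresses)

sequential = list(range(1000))
scattered = [random.randrange(1000) for _ in range(1000)]
print(f"sequential scan hit rate: {hit_rate(sequential):.0%}")  # 88% (7 of every 8 hit)
print(f"random access hit rate:   {hit_rate(scattered):.0%}")   # much lower, varies per run
```

Asking students to predict the two percentages before running it mirrors the classroom tip later in this lesson.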
Exam focus

When explaining how cache improves performance: say "reduces the number of slow accesses to main memory" and "stores recently/frequently used data closer to the CPU." A cache miss means the CPU must wait while data is fetched from the next level. A larger cache reduces the frequency of cache misses.

More cores: when it helps and when it does not

A multi-core processor contains multiple complete processing units (cores) on a single chip. Each core has its own ALU, CU, registers and L1/L2 cache. All cores share L3 cache and main memory. This allows genuinely parallel execution - different cores work on different tasks at the same time.
Simulation: 8 independent tasks on 1 core run one after another (8 time units); on 4 cores they run in two batches of four (2 time units) - roughly a 4x speedup.
The simulation above assumes all 8 tasks are independent (parallelisable). In reality, many programs have tasks that depend on each other's results. If Task B needs the output from Task A, it must wait - even if other cores are free. This is why the theoretical speedup from adding cores is rarely achieved in practice.
Single-threaded vs multi-threaded: real examples
Uses multiple cores well
  • Video rendering / encoding
  • 3D modelling and animation
  • Scientific simulations
  • Compiling large codebases
  • Running virtual machines
Mostly single-threaded
  • Simple web browsing
  • Spreadsheet calculation
  • Many older games
  • Sequential data processing
  • Most command-line scripts
Exam focus

The key phrase is: "Multiple cores allow multiple instruction streams to execute simultaneously, which improves performance for multi-threaded applications. Single-threaded programs cannot benefit from additional cores as they can only use one core at a time."

How pipelining overlaps instructions

The table below shows two approaches for executing 4 instructions. Without pipelining, each instruction must complete all three stages before the next begins. With pipelining, stages overlap - dramatically increasing throughput.

Without pipelining - 12 clock cycles for 4 instructions:

Instruction  C1  C2  C3  C4  C5  C6  C7  C8  C9  C10 C11 C12
I1           F   D   E   -   -   -   -   -   -   -   -   -
I2           -   -   -   F   D   E   -   -   -   -   -   -
I3           -   -   -   -   -   -   F   D   E   -   -   -
I4           -   -   -   -   -   -   -   -   -   F   D   E

With pipelining - 6 clock cycles for 4 instructions:

Instruction  C1  C2  C3  C4  C5  C6
I1           F   D   E   -   -   -
I2           -   F   D   E   -   -
I3           -   -   F   D   E   -
I4           -   -   -   F   D   E
F = Fetch D = Decode E = Execute

Live simulation: a real program through the pipeline

The program below adds two numbers and stores the result. Step through it cycle by cycle to see exactly which stage each instruction occupies - and what the CPU is doing at each moment.
Pipeline Simulator
Program: load 5, add 3, store result, halt
Stepping through the 6 cycles, the simulator shows the PC, CIR, ACC and MEM[8] (where the result is stored) updating, and which of the F, D and E stages each instruction occupies in each cycle.
F = Fetch (get instruction from memory) D = Decode (work out what it means) E = Execute (carry out the operation)
Notice that after cycle 3, all three pipeline stages are busy simultaneously. This is the steady state of a pipelined CPU - three instructions at different stages, all being processed at once. Without pipelining, only one stage would be active at any given cycle.
What about data hazards?
A data hazard occurs when one instruction needs the result of the previous one before it has finished executing. For example, if I3 needed the value computed by I2 but I2 is still in its Execute stage when I3 reaches Decode, the pipeline must stall - inserting empty cycles (called "bubbles") to wait. Modern CPUs use out-of-order execution to rearrange independent instructions and keep the pipeline full as often as possible.
Exam focus

Pipelining increases throughput (instructions completed per second) but does not reduce the time for any single instruction. A data hazard can cause a pipeline stall. For the exam: without pipelining, N instructions take 3N cycles. With pipelining, N instructions take N+2 cycles (2 cycles to fill the pipeline initially).
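The two exam formulas translate directly into code:

```python
# The exam formulas: without pipelining, each of the three stages (F, D, E)
# takes one cycle per instruction; with pipelining, the pipe takes 2 cycles
# to fill and then completes one instruction every cycle.

def cycles_without_pipelining(n):
    return 3 * n

def cycles_with_pipelining(n):
    return n + 2

for n in (4, 100):
    print(n, cycles_without_pipelining(n), cycles_with_pipelining(n))
# 4 instructions:   12 vs 6 cycles (matching the tables in this lesson)
# 100 instructions: 300 vs 102 cycles - the speedup approaches 3x as n grows
```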

How the factors interact

Processor                       Cores   Clock Speed   L3 Cache   Best for
Intel Core i3 (budget laptop)   4       3.6 GHz       12 MB      Web browsing, office work, light multitasking
Intel Core i7 (mid-range)       16      4.7 GHz       24 MB      Video editing, software development, gaming
Intel Core i9 (high-end)        24      5.8 GHz       36 MB      3D rendering, scientific computing, high-end gaming
Apple M4 (ARM architecture)     10      4.4 GHz       16 MB      Efficiency-focused: performance per watt, longer battery life
The Apple M4 runs at a lower clock speed than the Intel i9 but often matches or beats it in real-world tasks. This is because architecture matters too - Apple's ARM-based chips complete more work per clock cycle than these x86 chips. That measure is called IPC (Instructions Per Clock), and it is why raw GHz is not the full picture.
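The IPC point can be sketched in Python - the IPC values below are invented for illustration, not measured benchmark figures:

```python
# Why GHz alone misleads: effective throughput ~ clock speed x IPC
# (instructions per clock). The IPC figures here are hypothetical
# illustrative values, not real benchmark results.

def instructions_per_second(ghz, ipc):
    return ghz * 1e9 * ipc

intel = instructions_per_second(5.0, 2.0)  # higher clock, lower IPC (hypothetical)
apple = instructions_per_second(4.4, 2.5)  # lower clock, higher IPC (hypothetical)
print(f"Intel: {intel:.1e}/s  Apple: {apple:.1e}/s")
# The 4.4 GHz chip wins: 1.1e10 vs 1.0e10 instructions per second
```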
Lesson 3 Quiz
5 questions on CPU performance
Question 1 of 5
A CPU runs at 4 GHz. How many clock cycles does it complete per second?
Question 2 of 5
Why does adding more cores NOT always improve the performance of every program?
Question 3 of 5
What is the main benefit of larger cache memory in a CPU?
Question 4 of 5
In pipelining, what happens to the next instruction while the current instruction is being decoded?
Question 5 of 5
Why have CPU clock speeds not increased dramatically beyond ~5 GHz despite decades of improvement?
Think deeper

A company advertises a new laptop as having "2x more cores" than its predecessor, but benchmarks show it is only 30% faster in typical use. What might explain this gap between the marketing claim and the real-world result?

Several factors limit the real-world gain from doubling cores: (1) Many everyday applications are single-threaded or lightly multi-threaded and cannot use all cores simultaneously. (2) Amdahl's Law - the theoretical maximum speedup is limited by the proportion of a task that can be parallelised. (3) Other bottlenecks such as RAM bandwidth, storage speed or bus width may prevent the cores from being fully utilised. (4) Clock speed, cache size or architecture may differ between the two models, partially offsetting the core count advantage.
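Amdahl's Law (point 2) can be made concrete with a short calculation - the 50% parallel fraction is an illustrative assumption:

```python
# Amdahl's Law: if a fraction p of a task can be parallelised, the best
# possible speedup on n cores over one core is 1 / ((1 - p) + p / n).

def amdahl_speedup(p, n):
    return 1 / ((1 - p) + p / n)

# Doubling cores from 4 to 8 when only half the work is parallel:
print(amdahl_speedup(0.5, 4))  # 1.6 - speedup over a single core
print(amdahl_speedup(0.5, 8))  # ~1.78 - "2x more cores" gave only ~11% more speed
```

This is exactly the marketing-versus-benchmarks gap in the question: the serial half caps the gain no matter how many cores are added.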
Revision
Computer Systems Flashcards
27 key terms across all 5 lessons. Filter by topic, flip to reveal, mark as known.
Open flashcards
Lesson 3 - Teacher Resources
Why is your CPU faster than mine?
Teacher mode (all pages)
Shows examiner notes on the Exam Practice page
Suggested starter (5 min)
Show two laptop spec sheets side by side: one from 2015 (dual-core, 2.5 GHz, 4 MB cache) and one from today (12-core, 4.8 GHz, 24 MB cache). Ask: "Is the new laptop 12x faster? 2x? Exactly how much faster would it be at sending one email? At rendering a 4K film?" Take answers. Students quickly realise clock speed x cores is not a simple multiplier. This motivates every factor covered in the lesson.
Lesson objectives
1. Explain how clock speed (in GHz) determines the number of FDE cycles per second, and state why doubling the clock speed does not always double real-world performance.
2. Describe what CPU cores are and explain why multi-core processors improve throughput for parallelisable tasks but not for inherently serial programs.
3. Explain how cache memory (L1, L2, L3) reduces average memory access time, and describe what a cache hit and a cache miss mean in practice.
4. Describe pipelining and explain how overlapping FDE stages increases throughput - and identify at least one situation where it does not help (pipeline hazards).
Key vocabulary (board-ready)
Clock speed
The number of cycles the CPU performs per second, measured in GHz (gigahertz = 1 billion cycles/second). A 4 GHz processor performs 4 billion clock cycles per second.
CPU core
An independent processing unit within a CPU chip, each capable of executing its own FDE cycle simultaneously. More cores improve performance for parallel workloads, not the speed of a single serial task.
Cache (L1/L2/L3)
Small, fast memory built into or close to the CPU. L1 is fastest and smallest (32-64 KB per core); L3 is largest and slowest (8-64 MB, shared). Frequently used data is stored here to reduce RAM access time.
Cache hit
When the CPU requests data and finds it already in cache. No RAM access needed - an L1 hit typically costs just ~1-4 cycles (around a nanosecond).
Cache miss
When data the CPU needs is not in cache and must be fetched from RAM (50-100 ns) or SSD (microseconds). The miss penalty is the added delay per missing access.
Pipelining
A technique where the CPU begins fetching the next instruction while decoding the current one, and begins decoding while executing another. Like an assembly line - multiple instructions at different stages simultaneously.
Pipeline hazard
A situation that prevents the pipeline running efficiently. Data hazards: one stage needs output from a stage still in progress. Control hazards: a branch makes the next fetch uncertain.
Suggested lesson plan
0-5 min: Starter: two spec sheets. Students predict relative speed. Keep predictions on the board to revisit at the end.
5-15 min: Clock speed: what a Hz is, FDE cycles per second, thermal limits. Students calculate cycles per second from GHz.
15-25 min: Multi-core: parallelism vs serial code. Amdahl's Law concept for higher. Concrete example: rendering frames (parallelisable) vs sequential tax calculation (serial).
25-40 min: Cache hierarchy: L1/L2/L3, hit/miss rates and penalties. Interactive cache sim - students observe hit/miss patterns as memory access patterns change.
40-52 min: Pipelining: overlapping FDE stages, throughput vs latency, pipeline hazards and branch misprediction.
52-60 min: Revisit starter predictions. Who was right? Exit tickets.
Discussion prompts
Intel's top desktop CPU in 2003 ran at 3.0 GHz. Today's CPUs run at 5.0+ GHz - under 2x faster in clock speed. Yet a modern CPU is 50-100x faster on real workloads. Where does the extra performance come from if not clock speed?
A 12-core CPU runs a spreadsheet calculation in the same time as a 1-core CPU - but renders a video 10x faster. What is different about these two tasks?
Your browser has dozens of tabs open, each with JavaScript running. Is this parallelism at the core level, the thread level, or the process level? Does it matter how many cores your CPU has?
Cache miss rates of 5% sound small. But if 5% of your accesses take 100x longer than cache hits, what is the effect on average performance? Calculate together: 95 percent hits at 1 ns plus 5 percent misses at 100 ns.
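The suggested class calculation, worked in Python:

```python
# Average memory access time for the final discussion prompt:
# 95% of accesses hit at 1 ns, 5% miss and cost 100 ns.

def average_access_ns(hit_rate, hit_ns, miss_ns):
    return hit_rate * hit_ns + (1 - hit_rate) * miss_ns

print(round(average_access_ns(0.95, 1, 100), 2))  # 5.95 - nearly 6x slower than all-hits
```

A good follow-up: rerun with a 99% hit rate (1.99 ns) to show how sensitive average performance is to the miss rate.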
Common misconceptions
✗ "More GHz = faster computer" - clock speed is one factor among many. A 4 GHz CPU with a small cache may be slower than a 3.5 GHz CPU with a large L3 cache on real workloads.
✗ "More cores always means faster" - programs must be written to use multiple cores. Sequential code sees no improvement from additional cores beyond one.
✗ "Cache is just extra RAM" - cache is on-chip (L1/L2 per core) or very close to the chip (L3), operates at near-CPU speed, and is automatically managed by hardware. Qualitatively different from system RAM.
✗ "Pipelining doubles performance" - pipelining improves throughput over a long program but does not reduce the latency of any single instruction. Hazards break the pipeline; real gains are less than theoretical.
Exit ticket questions
A CPU runs at 3 GHz. How many FDE cycles does it complete per second?
[1 mark - 3,000,000,000 / 3 billion]
Explain why doubling the number of CPU cores does not always double program speed.
[2 marks - some code is serial and cannot be parallelised / only the parallelisable portion benefits from extra cores]
What is the difference between a cache hit and a cache miss?
[2 marks - hit: data found in cache, fast access / miss: not in cache, must fetch from RAM, significantly slower]
Describe how pipelining increases CPU throughput.
[2 marks - overlaps fetch/decode/execute of multiple instructions simultaneously / like an assembly line - each stage processes a different instruction at the same time]
Homework idea
Research and compare two real CPUs: the Intel Core i5-13600K and the AMD Ryzen 9 7950X. For each, find: clock speed (base and boost), number of cores, L3 cache size, and approximate retail price. Write two paragraphs: (1) which is better for a video editor and why, (2) which is better for a programmer compiling code one file at a time and why. Justify both answers using factors from this lesson.
Classroom tips
The cache simulation is most effective when you contrast a sequential access pattern (high hit rate) against a random access pattern (high miss rate). Ask students to predict which will be faster before running it.
Amdahl's Law does not need to be taught as a formula at GCSE. The concept - the serial part limits total speedup regardless of core count - is sufficient and very assessable.
Pipeline hazards at GCSE: students only need to know that branches cause problems because the CPU does not know which instruction to fetch next. The phrase "branch misprediction" is enough.
Revisiting the starter spec sheet predictions at the end of the lesson is very effective. Students who changed their mind should explain what they now know that changed their view.