Scatter Graphs and Correlation
The Skill
A scatter graph shows the relationship between two variables by plotting pairs of values as points.
Types of Correlation
| Type | Pattern | Example |
|---|---|---|
| Positive | As x increases, y increases | Height and shoe size |
| Negative | As x increases, y decreases | Car age and value |
| No correlation | No clear pattern | Shoe size and IQ |
Strength of Correlation
- Strong: Points close to a line
- Weak: Points more scattered but still show a trend
- None: No pattern at all
Line of Best Fit
A line of best fit shows the general trend. To draw it:
- Use a ruler
- Draw a straight line through the data
- Aim for roughly equal numbers of points above and below the line
- The line should pass through or near $(\bar{x}, \bar{y})$ — the mean point
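The mean-point rule can be checked numerically. The sketch below is not the by-eye method described above: it swaps in the least-squares line, a standard calculated line of best fit, and confirms that it passes through $(\bar{x}, \bar{y})$. The data values and the function name `least_squares_line` are made up for illustration.

```python
from statistics import mean


def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line for the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations, and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    gradient = sxy / sxx
    # Choosing the intercept this way forces the line through the mean point.
    intercept = y_bar - gradient * x_bar
    return gradient, intercept


# Invented illustrative data: hours of practice and typing speed (wpm).
hours = [2, 4, 6, 8, 10]
speed = [31, 38, 52, 59, 70]

m, c = least_squares_line(hours, speed)
x_bar, y_bar = mean(hours), mean(speed)

# Height of the line at the mean x-value; this should match the mean y-value.
line_at_mean = m * x_bar + c
print(f"mean point: ({x_bar}, {y_bar}), line at mean x: {line_at_mean:.2f}")
```

A line drawn by eye will not match the calculated one exactly, but aiming it through the mean point keeps it close.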
Using the Line of Best Fit
Interpolation: Estimating values within the data range — usually reliable.
Extrapolation: Estimating values outside the data range — less reliable because the trend may not continue.
Correlation vs Causation
Correlation does not imply causation!
Just because two variables are correlated doesn't mean one causes the other.
Example: Ice cream sales and drowning deaths are positively correlated. But ice cream doesn't cause drowning — both increase in summer due to hot weather.
Outliers
An outlier is a point that doesn't fit the pattern. It may be:
- An error in data collection
- An unusual but valid data point
Outliers should be investigated, not automatically removed.
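One way to make "doesn't fit the pattern" concrete is to measure how far each point sits from the line of best fit. The sketch below is only an illustration: the car data, the line `value = 18000 - 1000 * age`, and the "5 times the typical distance" threshold are all assumptions chosen for this example, not a standard rule.

```python
from statistics import median


def flag_outliers(points, gradient, intercept, factor=5.0):
    """Return points whose vertical distance from the line is unusually large.

    "Unusually large" means more than `factor` times the median distance,
    an arbitrary rule of thumb chosen for this sketch.
    """
    distances = [abs(y - (gradient * x + intercept)) for x, y in points]
    typical = median(distances)
    return [p for p, d in zip(points, distances) if d > factor * typical]


# Invented illustrative data: (age in years, value in pounds).
cars = [
    (1, 17500), (3, 14600), (5, 13400), (8, 9700),
    (10, 8300), (12, 5800), (15, 3200), (15, 45000),
]

# Hypothetical line of best fit: value = 18000 - 1000 * age.
suspects = flag_outliers(cars, gradient=-1000, intercept=18000)
print(suspects)
```

The function only flags points. Deciding whether a flagged point is a recording error or an unusual but valid value still needs investigation.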
The Traps
Common misconceptions and how to avoid them.
Assuming correlation means causation: "The Causation Confusion"
The Mistake in Action
A scatter graph shows positive correlation between number of fire engines at a fire and the damage caused. Conclusion: Fire engines cause more damage.
Wrong: More fire engines must cause more damage.
Why It Happens
Students assume that if two things are related (correlated), one must cause the other. This is a fundamental logical error.
The Fix
Correlation ≠ Causation
The fire engines don't cause the damage. The hidden variable is the size of the fire:
- Bigger fires cause more damage
- Bigger fires need more fire engines
Both variables are affected by a third factor.
Always ask: Could there be another explanation? Is there a hidden variable?
Spot the Mistake
Positive correlation between fire engines and damage
Therefore fire engines cause more damage
Drawing the line of best fit through all points: "The Connect-the-Dots Error"
The Mistake in Action
Draw a line of best fit for the scatter graph.
Wrong: The student draws a line that goes through every single point, zigzagging across the graph.
Why It Happens
Students treat "line of best fit" like "connect the dots" from primary school, or they think the line must touch every point.
The Fix
The line of best fit is a single straight line that shows the general trend.
Rules:
- Use a ruler for a straight line
- The line should have roughly equal numbers of points above and below it
- It doesn't need to pass through any actual points
- It should pass through or near $(\bar{x}, \bar{y})$
Think of it as: The line that best represents where ALL the points are heading, not a path between them.
Spot the Mistake
Draw a line of best fit
Zigzag line through every point
Treating extrapolation as reliable: "The Extrapolation Overreach"
The Mistake in Action
Data shows revision hours (0-10) vs test score. The line of best fit is used to predict the score for 25 hours of revision.
Wrong: "The line predicts 120% — so 25 hours would give 120%."
Why It Happens
Students apply the line of best fit beyond the data without considering that the relationship may not continue.
The Fix
Extrapolation (predicting outside the data range) is unreliable.
The relationship may not continue:
- Test scores can't exceed 100%
- There are diminishing returns to revision
- The linear pattern may change
Interpolation (within the data range) is more reliable because we have evidence the pattern holds there.
Always state: "This estimate is unreliable because it is an extrapolation beyond the data."
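The check "is this value inside the data range?" can be sketched as a small function. The line `score = 4 * hours + 20` is hypothetical, picked only because it gives the 120% prediction at 25 hours from the example above; the function name `estimate` is also an assumption.

```python
def estimate(x, gradient, intercept, x_min, x_max):
    """Estimate y from the line and label how the estimate was obtained."""
    y = gradient * x + intercept
    # Inside the range of the collected data the estimate is interpolation;
    # anywhere else it is extrapolation and should be treated with caution.
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind


# Hypothetical line: score = 4 * hours + 20, with data collected for 0 to 10 hours.
for hours in (5, 25):
    score, kind = estimate(hours, gradient=4, intercept=20, x_min=0, x_max=10)
    note = " (impossible: above 100%)" if score > 100 else ""
    print(f"{hours} hours -> {score}% by {kind}{note}")
```

The 5-hour estimate is labelled interpolation, while the 25-hour estimate is labelled extrapolation and also exceeds the maximum possible score, which is exactly the warning sign described above.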
Spot the Mistake
Use line to predict for 25 hours (data is 0-10 hours)
Prediction of 120% is valid
The Deep Dive
Apply your knowledge with these exam-style problems.
Level 1: Fully Worked
Complete solutions with commentary on each step.
Question
A scatter graph shows the relationship between temperature (°C) and ice cream sales (£). As temperature increases, sales increase. The points are close to a straight line. Describe the correlation.
Solution
Type of correlation: As temperature increases, sales increase → Positive correlation
Strength: Points are close to a straight line → Strong correlation
Full description: There is a strong positive correlation between temperature and ice cream sales.
Interpretation: Higher temperatures are associated with higher ice cream sales. (Note: We cannot say temperature causes higher sales, only that they are associated.)
Question
The scatter graph shows hours of practice vs typing speed (words per minute). The line of best fit passes through (2, 30) and (10, 70). Estimate the typing speed for someone who practises for 6 hours.
Solution
Step 1: Find the equation of the line (or use the graph).
Gradient = $\frac{70 - 30}{10 - 2} = \frac{40}{8} = 5$
Using point (2, 30): $y - 30 = 5(x - 2)$, which rearranges to $y = 5x + 20$
Step 2: Substitute x = 6. $y = 5(6) + 20 = 30 + 20 = 50$
Or read from graph: At x = 6, the line gives y ≈ 50.
Answer: Estimated typing speed is 50 words per minute.
This is interpolation (within the data range), so it is reasonably reliable.
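The same calculation can be checked with a few lines of code. This is a minimal sketch; the function name `line_through` is an assumption, and it only works when the two points have different x-values.

```python
def line_through(p, q):
    """Return (gradient, intercept) of the straight line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    gradient = (y2 - y1) / (x2 - x1)   # change in y divided by change in x
    intercept = y1 - gradient * x1     # rearranged from y1 = gradient * x1 + intercept
    return gradient, intercept


# The two points the line of best fit passes through in the question.
m, c = line_through((2, 30), (10, 70))

# Estimated typing speed after 6 hours of practice.
speed_at_6 = m * 6 + c
print(m, c, speed_at_6)  # prints: 5.0 20.0 50.0
```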
Question
Research shows a strong positive correlation between the number of books in a home and children's exam results. Does this prove that buying more books will improve exam results? Explain.
Solution
No, correlation does not prove causation.
The correlation shows an association, but there are other explanations:
Possible hidden variables:
- Parents' education level — educated parents may buy more books AND help with homework
- Family income — wealthier families can afford books AND tutoring
- Attitudes to learning — families that value education may buy books AND encourage study
Conclusion: Buying more books might help, but the correlation alone doesn't prove it. The relationship could be due to underlying factors that affect both book ownership and exam success.
Level 2: Scaffolded
Fill in the key steps.
Question
A scatter graph shows age of car (years) vs value (£). Overall, the points show a negative correlation. One point shows a 15-year-old car worth £45,000, when cars of a similar age are worth around £3,000. Comment on this point.
Level 3: Solo
Try it yourself!
Question
A line of best fit shows the relationship between advertising spend (£0-£5000) and sales (units). Use the line to estimate sales for: (a) £3000 advertising, (b) £12000 advertising. Comment on reliability.
(a) £3000 advertising: Read from line: approximately 420 units
This is interpolation (within the data range £0-£5000), so this estimate is reasonably reliable.
(b) £12000 advertising: Extending the line: approximately 900 units
This is extrapolation (outside the data range), so this estimate is less reliable because:
- The linear trend may not continue
- There may be diminishing returns at higher spending
- We have no data to support the pattern continuing
Examiner's View
Mark allocation:
- Describing correlation: 1 mark
- Drawing line of best fit: 1-2 marks
- Using line for estimation: 2 marks
- Interpretation: 1-2 marks
Common errors examiners see:
- Confusing correlation with causation
- Line of best fit drawn too high or too low, so it does not run through the middle of the data
- Extrapolating too far beyond the data
- Not identifying outliers when asked
What gains marks:
- Using correct terminology (positive, negative, strong, weak)
- Drawing an accurate line of best fit with a ruler
- Recognising limitations of extrapolation
- Identifying and commenting on outliers
AQA Notes
AQA likes asking about reliability of estimates — distinguish interpolation from extrapolation.