Scatter Plots
When we want to explore the relationship between two quantitative variables, we use a scatter plot. Each individual in the dataset becomes one point on the graph.
Explanatory variable (x): The variable we think explains or predicts the other. Plotted on the horizontal axis.
Response variable (y): The variable we are trying to predict or explain. Plotted on the vertical axis.
Think of it as: x explains (or helps predict) y. Hours studied (x) → exam score (y). Age of car (x) → resale value (y). Labeling a variable as explanatory does not by itself show that it causes the response.
Describing an Association: DOFS
When asked to describe a scatter plot on the AP exam, use DOFS (Direction, Outliers, Form, Strength) and state each in context. For example:
"There is a strong, positive, linear association between hours studied and exam score, with no obvious outliers. Students who study more tend to score higher on the exam."
Correlation (r)
The correlation coefficient r measures the strength and direction of a linear association between two quantitative variables.
Correlation ≠ Causation. A high r just means the variables move together — it does not mean one causes the other.
r only measures linear associations. A curved relationship can have r ≈ 0 even if there's a very strong pattern.
r is not resistant. Outliers can dramatically change the value of r.
r has no units — it's a pure number.
r is symmetric: the correlation of x with y equals the correlation of y with x.
Changing units (e.g. inches to cm) does not change r.
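These properties can be checked numerically. The following is a minimal Python sketch; the height and weight values and the helper name `correlation` are made up for illustration and are not data from this guide.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r, computed as the (n - 1)-averaged product of z-scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (divide by n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: heights in inches, weights in pounds
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [115, 120, 140, 155, 160, 180]

r_xy = correlation(heights_in, weights_lb)
r_yx = correlation(weights_lb, heights_in)       # swap x and y: same r
heights_cm = [2.54 * h for h in heights_in]
r_cm = correlation(heights_cm, weights_lb)       # change units: same r

# A perfect curved pattern whose linear correlation is 0: y = x^2 on symmetric x-values
xs_curve = [-3, -2, -1, 0, 1, 2, 3]
ys_curve = [x ** 2 for x in xs_curve]
r_curve = correlation(xs_curve, ys_curve)

print(round(r_xy, 4), round(r_yx, 4), round(r_cm, 4), abs(round(r_curve, 4)))
```

The first three printed values agree, and the last is 0 even though y is completely determined by x.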
Least-Squares Regression Line (LSRL)
The Least-Squares Regression Line (LSRL) is the line that minimizes the sum of the squared vertical distances (residuals) between each data point and the line. It is the "best fit" line through the data.
Slope: \(\displaystyle b = r \cdot \frac{s_y}{s_x}\)
Intercept: \(a = \bar{y} - b\bar{x}\)
The point of averages (x̄, ȳ) is always on the regression line. If you substitute x = x̄ into the equation, you always get ŷ = ȳ. This is a useful check on the AP exam.
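As a rough illustration, the slope and intercept formulas can be applied directly to summary statistics. The hours and scores below are hypothetical values invented for this sketch, and `summarize` is an illustrative helper name.

```python
import math

def summarize(xs, ys):
    """Return the means, sample standard deviations, and correlation r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    r = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    return mean_x, mean_y, s_x, s_y, r

# Hypothetical data: hours studied and exam scores
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [56, 59, 66, 69, 76, 80, 84, 91]

mean_x, mean_y, s_x, s_y, r = summarize(hours, scores)
b = r * s_y / s_x          # slope: b = r * s_y / s_x
a = mean_y - b * mean_x    # intercept: a = y-bar - b * x-bar

# Check: plugging in x-bar should return y-bar
y_hat_at_mean = a + b * mean_x
print(f"y-hat = {a:.2f} + {b:.2f}x")
print(abs(y_hat_at_mean - mean_y) < 1e-9)  # True
```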
Interpreting the LSRL
On the AP exam, you must interpret the slope and intercept in context. Generic answers lose points.
Slope \(b\): "For each additional one [unit of \(x\)], the predicted [y variable] increases/decreases by \(|b|\) [units of \(y\)]."
Intercept \(a\): "When [x variable] \(= 0\), the predicted [y variable] is \(a\) [units of \(y\)]." (Often not meaningful in context.)
A study of used cars finds: ŷ = 24,500 − 1,800x
where x = age of car (years) and ŷ = predicted resale price (dollars).
Slope interpretation: For each additional year of age, the predicted resale price of the car decreases by $1,800.
Intercept interpretation: When the car is 0 years old (brand new), the predicted resale price is $24,500. This is reasonable since a new car costs roughly that amount.
✓ Always include the variable names and units. "Increases by 1,800" with no context earns no credit.
Using the regression line to predict outside the range of the data is called extrapolation and is unreliable. For example, predicting the price of a 50-year-old car using this equation would give a negative price — clearly meaningless. Always note when a prediction requires extrapolation.
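The extrapolation check can be made explicit in code. This sketch uses the used-car equation from the example; the observed age range of 1 to 10 years is an assumption made up for illustration, since the range of the data is not stated here.

```python
def predict_resale_price(age_years, min_age=1, max_age=10):
    """Return (predicted price in dollars, is_extrapolation) for y-hat = 24,500 - 1,800x.

    min_age and max_age are hypothetical bounds on the ages in the data set.
    """
    predicted = 24_500 - 1_800 * age_years
    is_extrapolation = not (min_age <= age_years <= max_age)
    return predicted, is_extrapolation

print(predict_resale_price(5))    # (15500, False)
print(predict_resale_price(50))   # (-65500, True)
```

The second call reproduces the meaningless negative price and flags it as an extrapolation.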
Residuals & Residual Plots
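Residual = actual − predicted = y − ŷ
Positive residual → actual value is ABOVE the line
Negative residual → actual value is BELOW the line
For the LSRL, the sum of all residuals always equals 0.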
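The zero-sum property can be verified on any data set. Below is a small sketch with made-up points; `fit_lsrl` is an illustrative helper that computes the slope as Sxy / Sxx, which is algebraically the same as r · s_y / s_x.

```python
def fit_lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data
xs = [1, 3, 4, 6, 8, 9]
ys = [12, 15, 21, 24, 30, 29]

a, b = fit_lsrl(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # actual - predicted
residual_sum = sum(residuals)

# Zero up to floating-point rounding
print(abs(residual_sum) < 1e-9)  # True
```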
Using the equation ŷ = 50 + 5x (hours studied → exam score):
A student studies 6 hours and scores 82.
Predicted: ŷ = 50 + 5(6) = 80
Residual = 82 − 80 = +2
The student scored 2 points higher than predicted by the model.
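The same calculation as a short sketch. The function name `residual` is illustrative, and the second student (6 hours, score 73) is a hypothetical addition to show a negative residual.

```python
def residual(actual_score, hours, intercept=50, slope=5):
    """Residual = actual - predicted, using y-hat = 50 + 5x."""
    predicted = intercept + slope * hours
    return actual_score - predicted

print(residual(82, 6))   # 2   (above the line)
print(residual(73, 6))   # -7  (below the line; hypothetical student)
```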
Residual Plots
A residual plot graphs residuals (y-axis) against the explanatory variable or predicted values (x-axis). It tells us whether a linear model is appropriate.
If the residual plot shows no pattern (random scatter around the zero line) → the linear model is appropriate.
If the residual plot shows a curved or systematic pattern → a linear model is not appropriate; a different model is needed.
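To see what a systematic pattern looks like in numbers, the sketch below fits a line to deliberately curved data (y = x², chosen only for illustration) and lists the residuals in order of x.

```python
def fit_lsrl(xs, ys):
    """Return (intercept a, slope b) of the least-squares regression line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Illustrative curved data: y = x^2
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

a, b = fit_lsrl(xs, ys)                        # a = -12, b = 8
residuals = [round(y - (a + b * x), 6) for x, y in zip(xs, ys)]
print(residuals)  # [5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0]
```

The residuals are positive at both ends and negative in the middle: a U shape, not random scatter, so a line is the wrong model even though the residuals still sum to 0.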
Coefficient of Determination (r²)
r² measures the proportion of the variation in \(y\) that is explained by the linear relationship with \(x\).
"Approximately [r² × 100]% of the variation in [y variable] is accounted for by the linear relationship with [x variable]."
For the hours-studied vs exam-score data, suppose r = 0.92.
r² = (0.92)² ≈ 0.846
"About 84.6% of the variation in exam scores is accounted for by the linear relationship with hours studied. The remaining 15.4% is due to other factors."
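The phrase "proportion of variation explained" can be made concrete: for an LSRL, r² equals 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total squared variation of y around ȳ. The sketch below checks this on hypothetical data; the values and the helper name are invented for illustration.

```python
def r_squared_two_ways(xs, ys):
    """Return (r squared, 1 - SSE/SST) for the least-squares line through the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)          # SST: total variation in y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / (sxx * syy) ** 0.5
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # unexplained variation
    return r ** 2, 1 - sse / syy

# Hypothetical data: hours studied and exam scores
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [56, 61, 63, 72, 74, 83, 82, 91]

from_r, from_variation = r_squared_two_ways(hours, scores)
print(abs(from_r - from_variation) < 1e-9)  # True
```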
| Statistic | Range | Measures | Resistant? |
|---|---|---|---|
| r | −1 to +1 | Strength AND direction of linear association | ❌ No |
| r² | 0 to 1 | Proportion of variation in y explained by the linear relationship with x | ❌ No |
Departures from Linearity
Not all relationships are linear. The AP exam expects you to recognize when a linear model is inappropriate and to understand how transformations can help.
Influential Points vs Outliers
| Type | Definition | Effect on LSRL |
|---|---|---|
| Outlier (in regression) | Point with a large residual — far from the line vertically | Inflates s (residual std. dev.), reduces r² |
| High-leverage point | Point with an extreme x-value, far from x̄ | Can pull the line toward it |
| Influential point | Removing it significantly changes the LSRL | Can dramatically change slope or intercept |
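A point can have high leverage without being influential (if it lies along the line formed by the other points). An influential point usually has high leverage AND falls away from the pattern of the other points; because it pulls the line toward itself, its own residual can end up small. The AP exam frequently asks you to distinguish these.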
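One way to test for influence is to refit the line with and without the point. The sketch below uses made-up data: five points close to y = 2x, plus one high-leverage point at x = 20 placed either on that trend or far below it.

```python
def slope(points):
    """Least-squares slope for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Hypothetical base data, roughly following y = 2x
base = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.0)]

slope_base = slope(base)
slope_on_trend = slope(base + [(20, 40)])    # high leverage, follows the trend
slope_off_trend = slope(base + [(20, 10)])   # high leverage, breaks the trend

print(round(slope_base, 2), round(slope_on_trend, 2), round(slope_off_trend, 2))
```

The point at (20, 40) barely changes the slope, so it has high leverage but is not influential. The point at (20, 10) drags the slope far below 2, so it is influential.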
A linear model is not appropriate when:
• The scatter plot shows a curved pattern (use exponential or power model)
• The residual plot shows a systematic curved pattern
• There is no association at all (the scatter plot shows no pattern, so r ≈ 0)
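A transformation can straighten a curved pattern. The sketch below uses invented exponential data (y = 3 · 2^x) and compares the correlation before and after taking the logarithm of y. This is one common approach, shown here under the assumption that the curve is exponential.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative exponential growth: y = 3 * 2^x
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [3 * 2 ** x for x in xs]
log_ys = [math.log(y) for y in ys]     # ln(y) = ln(3) + x * ln(2), a straight line in x

r_raw = correlation(xs, ys)            # curved pattern: r noticeably below 1
r_log = correlation(xs, log_ys)        # after the transformation: r is essentially 1

print(round(r_raw, 3), round(r_log, 3))
```

A linear model fits (x, ln y) well, and its residual plot would show no curve, which is the evidence that an exponential model suits the original data.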
Multiple Choice Questions
Try each question, then reveal the answer and explanation.
A researcher finds a correlation of r = 0.85 between shoe size and reading ability in a study of elementary school children. Which of the following is the most reasonable conclusion?
- (A) Larger shoe size causes better reading ability.
- (B) Better reading ability causes larger shoe size.
- (C) There is a strong positive linear association, likely due to a lurking variable (age).
- (D) The linear model explains 85% of variation in reading scores.
- (E) The data cannot have a correlation of 0.85 because the variables are unrelated.
Answer: C. Correlation does not imply causation. The most plausible explanation is a lurking variable: age. Older children tend to have both larger feet AND better reading skills, so neither variable needs to cause the other. Note: (D) is wrong because 0.85² ≈ 0.72, so about 72% of the variation is explained, not 85%.
The regression equation for predicting weight (pounds) from height (inches) is: ŷ = −130 + 4.5x. Which of the following correctly interprets the slope?
- (A) The predicted weight for a person of height 0 is −130 pounds.
- (B) For each additional pound of weight, predicted height increases by 4.5 inches.
- (C) For each additional inch of height, predicted weight increases by 4.5 pounds.
- (D) The correlation between height and weight is 4.5.
- (E) 4.5% of the variation in weight is explained by height.
Answer: C. The slope b = 4.5 means: for each 1-inch increase in height, predicted weight increases by 4.5 pounds. (A) describes the intercept. (B) has x and y reversed. (D) and (E) misidentify what 4.5 represents.
A regression model predicts that a student who studies 5 hours will score 78 on an exam. The student actually scores 73. What is the residual?
- (A) +5
- (B) −5
- (C) +73
- (D) −78
- (E) 0
Answer: B. Residual = actual − predicted = 73 − 78 = −5.
The negative residual means the student scored 5 points below what the model predicted. The point lies below the regression line.
After fitting a linear regression model, a residual plot shows a clear U-shaped (curved) pattern. What does this indicate?
- (A) The linear model fits the data very well.
- (B) There are several influential outliers in the data.
- (C) The correlation is close to zero.
- (D) A linear model is not appropriate; a nonlinear model should be considered.
- (E) The slope of the regression line is negative.
Answer: D. A curved pattern in a residual plot indicates that the linear model is not appropriate because the relationship is actually nonlinear. A good residual plot should show random scatter with no discernible pattern.
For a regression of ice cream sales on daily temperature (°F), r² = 0.81. Which of the following is the correct interpretation?
- (A) The correlation between temperature and ice cream sales is 0.81.
- (B) Temperature causes 81% of ice cream sales.
- (C) About 81% of the variation in ice cream sales is accounted for by the linear relationship with temperature.
- (D) The slope of the regression line is 0.81.
- (E) 81% of data points lie on the regression line.
Answer: C. r² = 0.81 means about 81% of the variation in the response variable (ice cream sales) is explained by its linear relationship with the explanatory variable (temperature). The remaining 19% is due to other factors. Note: r = ±√0.81 = ±0.9, with the sign matching the sign of the slope, not 0.81.
Free Response Questions
Write a full solution before revealing the model answer. Always use context and show all work.
FRQ 1 — Regression Interpretation
~12 minutes
ŷ = 42.3 − 4.8x
where x = engine size (liters) and ŷ = predicted fuel efficiency (mpg). The correlation is r = −0.87 and r² = 0.757.
✓ Model Solution
(a) Slope interpretation:
For each additional liter of engine size, the predicted fuel efficiency decreases by 4.8 mpg. Larger engines tend to be less fuel-efficient.
(b) r² interpretation:
About 75.7% of the variation in fuel efficiency (mpg) is accounted for by the linear relationship with engine size. The remaining 24.3% is due to other factors not included in the model.
(c) Residual calculation:
Predicted: ŷ = 42.3 − 4.8(3.0) = 42.3 − 14.4 = 27.9 mpg
Residual = actual − predicted = 30 − 27.9 = +2.1 mpg
The car gets 2.1 mpg more than predicted by the model. Its actual fuel efficiency is above the regression line.
(d) Extrapolation:
No, it would not be appropriate. A 10-liter engine is well outside the range of the data collected (the study used cars typical of the dealership). Using the model would require extrapolation, which is unreliable. The model may not apply at extreme values outside the observed data range.
✓ Key AP grading notes: (a) must say "decreases" and include both variable names and units. (b) must say "accounted for" and include context. (c) must show calculation. (d) must use the word "extrapolation" or equivalent.
FRQ 2 — Scatter Plot & Association
~10 minutes
✓ Model Solution
(a) DOFS description:
Direction: Negative — students who watch more TV tend to have lower GPAs.
Outliers: There appears to be one potential outlier: a student who watches 0 hours of TV but has a relatively low GPA of 2.1, which does not follow the general trend.
Form: The association is roughly linear.
Strength: The association is moderate in strength.
(b) Effect of removing the outlier:
Removing the outlier would likely make r closer to −1 (stronger negative). The outlier (0 hours TV, low GPA = 2.1) weakens the negative pattern because it has a low x-value but also a low y-value — it doesn't fit the downward trend. Removing it would allow the remaining points to show a cleaner negative relationship.
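The effect described in (b) can be illustrated numerically. The TV-hours and GPA values below are hypothetical and are not the data from this question; only the outlier at (0 hours, GPA 2.1) is taken from the solution above.

```python
import math

def correlation(points):
    """Pearson correlation r for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (weekly TV hours, GPA) pairs following a downward trend
students = [(2, 3.8), (5, 3.6), (8, 3.5), (12, 3.1), (15, 3.0), (20, 2.6), (25, 2.4)]
outlier = (0, 2.1)   # 0 hours of TV but a low GPA

r_with = correlation(students + [outlier])
r_without = correlation(students)

print(round(r_with, 3), round(r_without, 3))
```

With these made-up values, r moves much closer to −1 once the outlier is removed, which also shows that r is not resistant.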
(c) Causation claim:
No, this is not a valid conclusion. This is an observational study — students were not randomly assigned to TV-watching groups. There may be lurking variables (such as motivation, parental involvement, or study habits) that explain both variables. Correlation does not imply causation.
✓ AP tip: For (c), always say "observational study," mention lurking variables, and state "correlation does not imply causation."