From Sample LSRL to Population Regression
In Unit 2 we computed the sample regression line \(\hat{y} = a + bx\). But that line is based on one sample — different samples give different slopes and intercepts. Unit 9 asks: what does the slope tell us about the true population relationship?
Population model: \(\mu_y = \alpha + \beta x\)
where \(\beta\) = true population slope (unknown), \(\alpha\) = true intercept
Sample estimates: We use \(b\) to estimate \(\beta\), and \(a\) to estimate \(\alpha\).
The slope \(b\) from our sample is just one value from the sampling distribution of b — just like \(\bar{x}\) is one value from the sampling distribution of \(\bar{x}\).
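To make the idea of a sampling distribution of \(b\) concrete, here is a minimal simulation sketch. The population values (\(\alpha = 50\), \(\beta = 5\), \(\sigma = 10\)), the 15 fixed x-values, and the function names are all assumptions chosen for illustration; none of them come from a real study.

```python
import random
import statistics

def lsrl_slope(xs, ys):
    """Least-squares slope b = Sxy / Sxx."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def simulate_slopes(alpha, beta, sigma, xs, reps, seed=1):
    """Draw `reps` samples from mu_y = alpha + beta*x; return each sample's slope."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(reps):
        # y varies Normally around the population line with SD sigma
        ys = [alpha + beta * x + rng.gauss(0, sigma) for x in xs]
        slopes.append(lsrl_slope(xs, ys))
    return slopes

# Hypothetical population (illustrative values only)
xs = list(range(1, 16))  # 15 fixed x-values
slopes = simulate_slopes(alpha=50, beta=5, sigma=10, xs=xs, reps=2000)

# Center and spread of the simulated sampling distribution of b
print(round(statistics.fmean(slopes), 2), round(statistics.stdev(slopes), 2))
```

Each simulated sample gives a different \(b\), but the slopes should center near the assumed \(\beta = 5\). Their standard deviation is what \(SE_b\) estimates from a single sample.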
Conditions for Inference on the Slope
The conditions for regression inference go beyond just "random" and "large sample." We must check five conditions, remembered with LINER:
L — Linear: The true relationship between x and y is linear. Check: scatter plot shows a linear pattern; residual plot shows random scatter (no curved pattern).
I — Independent: Individual observations are independent. Check: random sample and n ≤ 10% of population (if applicable).
N — Normal: For any fixed x, the y-values are Normally distributed. Check: histogram or Normal probability plot of residuals shows approximate Normality; no strong skew or outliers in residuals.
E — Equal Variance: The standard deviation of y-values is the same for all values of x. Check: residual plot shows roughly equal vertical spread across all x values (no "fan" shape).
R — Random: Data came from a random sample or randomized experiment.
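The Linear, Normal, and Equal Variance checks all look at residuals (observed y minus predicted y). As a rough sketch, residuals could be computed as follows before plotting them against x or in a histogram. The data values and function names here are hypothetical, invented only to show the arithmetic.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def residuals(xs, ys):
    """Residual = observed y minus predicted y, one per data point."""
    a, b = fit_lsrl(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, invented for illustration
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [58, 61, 68, 70, 77, 79, 88, 90]

res = residuals(hours, scores)
for x, r in zip(hours, res):
    print(x, round(r, 2))  # look for curves (L) or a fan shape (E) across x
```

Residuals from a least-squares line always sum to zero (up to rounding), so the checks are about their pattern and spread, not their average.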
t-Test for the Slope \(\beta\)
We use a t-test to determine whether the slope of the population regression line is different from zero (or some other value). A slope of zero means there is no linear relationship between x and y.
Most common: \(H_0: \beta = 0\) vs \(H_a: \beta \neq 0\) (two-sided — is there any linear relationship?)
Or one-sided: \(H_a: \beta > 0\) or \(H_a: \beta < 0\)
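Test statistic: \(t = \frac{b - \beta_0}{SE_b}\), where \(\beta_0\) is the slope stated in \(H_0\) (usually 0)
\(b\) = sample slope from LSRL | \(SE_b\) = standard error of the slope (from computer output)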
\(df = n - 2\) (lose 2 df because we estimate both \(\alpha\) and \(\beta\))
In regression we estimate two parameters (\(\alpha\) and \(\beta\)), so we lose 2 degrees of freedom. Compare: one-sample t uses df = n−1 (estimates one parameter \(\mu\)).
A study of 15 students finds the regression of exam score (y) on study hours (x): \(\hat{y} = 52.3 + 4.8x\), with \(SE_b = 1.92\). Test whether there is a positive linear relationship at α = 0.05.
Step 1 — Hypotheses: \(H_0: \beta = 0\) vs \(H_a: \beta > 0\) (one-sided right)
Step 2 — Conditions: Assume LINER conditions verified from scatter and residual plots ✓
Step 3 — Calculate: \(t = \frac{4.8 - 0}{1.92} = 2.50\) | df = 15 − 2 = 13
p-value = P(t > 2.50) with df=13 ≈ 0.013
Step 4 — Conclude: Since 0.013 < 0.05, reject H₀. There is convincing evidence of a positive linear relationship between study hours and exam score.
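The calculation in Step 3 can be reproduced with a short sketch. It uses only the Python standard library, so the upper-tail probability is approximated by numerically integrating the t density with Simpson's rule. That integration is a stand-in chosen for this example (a calculator's tcdf or a statistics package would normally be used), and the function names are assumptions.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) using Simpson's rule on [0, |t|] (steps must be even)."""
    width = abs(t)
    h = width / steps
    total = t_pdf(0.0, df) + t_pdf(width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    upper = 0.5 - area            # P(T > |t|) by symmetry
    return upper if t >= 0 else 1 - upper

def slope_t_test(b, se_b, n, beta0=0.0):
    """Return (t, df, one-sided upper-tail p-value) for H_a: beta > beta0."""
    t = (b - beta0) / se_b
    df = n - 2
    return t, df, t_upper_tail(t, df)

# Values from the study-hours example: b = 4.8, SE_b = 1.92, n = 15
t, df, p = slope_t_test(4.8, 1.92, 15)
print(round(t, 2), df, round(p, 3))
```

The printed p-value should land close to the 0.013 used in Step 4. For a two-sided alternative, the tail area would be doubled.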
Confidence Interval for the Slope \(\beta\)
\(b \pm t^* \cdot SE_b\), with \(t^*\) from the t-distribution with \(df = n - 2\)
Using the same study: \(b = 4.8\), \(SE_b = 1.92\), \(n = 15\), df = 13.
For 95% CI: \(t^* = 2.160\) (df=13)
\(CI = 4.8 \pm 2.160(1.92) = 4.8 \pm 4.147\)
Interval: (0.653, 8.947)
Interpretation: We are 95% confident that for each additional hour of studying, the true mean exam score increases by between 0.653 and 8.947 points. Since the interval does not include 0, there is convincing evidence of a positive linear relationship.
If the 95% CI for \(\beta\) does not contain 0, then a two-sided test at α = 0.05 would reject \(H_0: \beta = 0\). If it contains 0, we fail to reject. This is the same CI-test duality from Unit 6.
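The interval arithmetic and the CI-test link can be sketched in a few lines. The critical value \(t^*\) is passed in from a table, as in the worked example, and the function names are assumptions made for this illustration.

```python
def slope_ci(b, se_b, t_star):
    """Confidence interval for the slope: b +/- t* x SE_b."""
    margin = t_star * se_b
    return b - margin, b + margin

def two_sided_decision(ci, beta0=0.0):
    """Reject H0: beta = beta0 exactly when beta0 falls outside the interval."""
    lower, upper = ci
    return "fail to reject H0" if lower <= beta0 <= upper else "reject H0"

# Study-hours example: b = 4.8, SE_b = 1.92, t* = 2.160 (df = 13, 95%)
ci = slope_ci(4.8, 1.92, 2.160)
print(tuple(round(v, 3) for v in ci), two_sided_decision(ci))
# prints: (0.653, 8.947) reject H0
```

The decision matches a two-sided test only when the confidence level and α pair up, for example 95% with α = 0.05.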
Reading Computer Output
On the AP exam, regression inference is almost always presented through computer output. You must be able to extract the necessary values.
Coef (slope row): This is \(b\) — the sample slope. Use it to write the LSRL equation.
SE Coef (slope row): This is \(SE_b\) — plug into the CI formula: \(b \pm t^* \cdot SE_b\).
T (slope row): The test statistic = b / SE_b. Already calculated for you.
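P-value (slope row): Two-sided p-value for \(H_0: \beta = 0\). For a one-sided test, divide by 2, provided the sample slope \(b\) has the sign stated in \(H_a\); otherwise the one-sided p-value is 1 minus half the reported value.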
S: Standard deviation of residuals — measures typical distance of points from the regression line.
R-sq: The coefficient of determination \(r^2\), the percent of variation in y accounted for by the linear relationship with x.
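As a rough sketch of how these pieces fit together, the code below splits one slope row into named fields, recomputes T from Coef and SE Coef, converts the two-sided p-value to a one-sided one, and recovers the size of r from R-sq. The row text and R-sq are the values from FRQ 1 later in this guide. The column order (Predictor, Coef, SE Coef, T, P) and the helper names are assumptions, and the parser assumes the predictor name contains no spaces.

```python
import math

def parse_row(line):
    """Split a 'Predictor Coef SE-Coef T P' row into named fields (assumed layout)."""
    name, coef, se, t, p = line.split()
    return {"predictor": name, "coef": float(coef), "se": float(se),
            "t": float(t), "p": float(p)}

def one_sided_p(two_sided_p, b, direction):
    """Convert a two-sided p-value; direction is '>' or '<', matching H_a."""
    agrees = (b > 0) if direction == ">" else (b < 0)
    # Halve only when the sample slope points the way H_a claims
    return two_sided_p / 2 if agrees else 1 - two_sided_p / 2

row = parse_row("Sunlight 28.6 6.2 4.61 0.000")

t_check = row["coef"] / row["se"]   # should agree with the printed T
r_size = math.sqrt(0.542)           # |r| from R-sq = 54.2%; r takes the slope's sign

print(round(t_check, 2), round(r_size, 3))
```

Recomputing T this way is a quick check that the values came from the slope row rather than the constant row.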
Multiple Choice Questions
Try each question before reading the explanation that follows it.
A researcher fits a regression of crop yield (bushels) on fertilizer amount (pounds). She wants to test whether more fertilizer is associated with higher yield. What are the correct hypotheses?
- A \(H_0: b = 0\) vs \(H_a: b > 0\)
- B \(H_0: \beta = 0\) vs \(H_a: \beta > 0\)
- C \(H_0: \beta = 0\) vs \(H_a: \beta \neq 0\)
- D \(H_0: r = 0\) vs \(H_a: r > 0\)
- E \(H_0: \beta > 0\) vs \(H_a: \beta = 0\)
Answer: B. Hypotheses use the population parameter \(\beta\), not the sample statistic \(b\). "Higher yield with more fertilizer" is a one-sided right test: \(H_a: \beta > 0\). \(H_0\) always uses equality. Hypotheses about \(r\) (choice D) are not standard AP Statistics procedure.
Computer output for a regression shows: slope coefficient = 3.24, SE of slope = 1.08, n = 20. What is the t-statistic and degrees of freedom?
- A t = 3.00, df = 20
- B t = 3.00, df = 19
- C t = 3.00, df = 18
- D t = 0.333, df = 18
- E t = 3.24, df = 18
Answer: C. \(t = b/SE_b = 3.24/1.08 = 3.00\). For regression inference, \(df = n - 2 = 20 - 2 = 18\). We lose 2 degrees of freedom because we estimate both the intercept and the slope.
A residual plot for a regression analysis shows a clear fan shape — the spread of residuals increases as x increases. Which LINER condition is violated?
- A Linearity
- B Independence
- C Normality of residuals
- D Equal variance
- E Random sampling
Answer: D. A fan shape (spread increasing with x) violates the Equal Variance condition (also called homoscedasticity). The residual spread should be roughly constant across all x values. A curved pattern would indicate violation of Linearity; random scatter with no pattern is ideal.
A 95% confidence interval for the slope of a regression line is (−0.3, 2.1). Which conclusion is correct?
- A We have convincing evidence that the slope is positive.
- B We have convincing evidence that there is a linear relationship.
- C We do not have convincing evidence of a linear relationship, since 0 is in the interval.
- D The slope is definitely between −0.3 and 2.1.
- E The true slope is 0.9 (the midpoint).
Answer: C. The CI (−0.3, 2.1) contains 0, so β = 0 is plausible and we cannot conclude that a linear relationship exists. A two-sided test at α = 0.05 would fail to reject H₀: β = 0. (A) and (B) are wrong because 0 is in the interval; (D) and (E) are wrong because a confidence interval never pins down the true slope with certainty.
Computer output shows S = 4.2 and R-sq = 68.5% for a regression of weight (kg) on height (cm). Which statement correctly interprets R-sq?
- A The correlation between height and weight is 0.685.
- B The predicted weight is within 4.2 kg of the actual weight 68.5% of the time.
- C About 68.5% of the variation in weight is accounted for by the linear relationship with height.
- D Height causes 68.5% of variation in weight.
- E S = 4.2 means the slope is 4.2.
Answer: C. R² = 68.5% means 68.5% of variation in the response (weight) is explained by the linear relationship with height. Note: the correlation is r = √0.685 ≈ 0.828 (not 0.685), so choice A confuses r and r². S = 4.2 is the standard deviation of residuals, not the slope.
Free Response Questions
FRQ 1 — Inference for Slope from Computer Output
~15 minutes
Predictor   Coef   SE Coef   T      P
Constant    42.8   18.4      2.33   0.032
Sunlight    28.6   6.2       4.61   0.000
S = 32.4   R-sq = 54.2%   n = 20   df = 18
✓ Model Solution
(a) LSRL and slope interpretation:
\(\hat{y} = 42.8 + 28.6x\) or \(\widehat{\text{sales}} = 42.8 + 28.6(\text{sunlight hours})\)
Slope interpretation: For each additional hour of sunlight per day, the predicted ice cream sales increase by $28.60, on average.
(b) R² interpretation:
About 54.2% of the variation in daily ice cream sales is accounted for by the linear relationship with hours of sunlight. The remaining 45.8% is due to other factors not included in the model.
(c) 95% CI for slope:
\(CI = b \pm t^* \cdot SE_b = 28.6 \pm 2.101(6.2) = 28.6 \pm 13.026\)
Interval: (15.574, 41.626)
Interpretation: We are 95% confident that for each additional hour of sunlight, the true mean increase in ice cream sales is between $15.57 and $41.63.
(d) Conclusion:
The p-value for the slope is reported as 0.000, meaning it is below 0.0005 after rounding, which is less than α = 0.05. We reject \(H_0: \beta = 0\). There is very convincing evidence of a linear relationship between hours of sunlight and ice cream sales, and because the sample slope is positive, the relationship is positive. The CI (15.57, 41.63) also confirms this: it does not contain 0.
✓ AP tip: (a) must name the variables, not just write numbers. (b) always say "accounted for" not "caused." (c) use SE Coef from the slope row, not the constant row. (d) state the decision AND context.