Observational Study vs. Experiment
Before collecting any data, you must decide how to collect it. This decision determines what conclusions you can legitimately draw, and it is one of the most heavily tested ideas in AP Statistics.
Only a well-designed, randomized experiment can establish a cause-and-effect relationship.
An observational study can show association but not causation, no matter how strong the association or how large the study.
Scenario A: Researchers track 10,000 coffee drinkers and non-drinkers for 20 years and compare their heart disease rates. → Observational study. No one was assigned to drink coffee.
Scenario B: Researchers randomly assign 200 patients to take either a new drug or a placebo and measure blood pressure after 3 months. → Experiment. The researcher assigned the treatments.
Census vs Sample
| Term | Definition | Practical Reality |
|---|---|---|
| Population | The entire group of interest | All U.S. adults, all fish in a lake |
| Census | Collecting data from every member of the population | Very expensive, often impossible |
| Sample | A subset of the population selected for study | Practical and usually sufficient |
| Parameter | A number describing a population (e.g., μ, p) | Usually unknown — what we estimate |
| Statistic | A number describing a sample (e.g., x̄, p̂) | Calculated from data we collect |
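To make the parameter/statistic distinction concrete, here is a minimal Python sketch. The population of 10,000 heights is simulated and entirely hypothetical. In practice the parameter is usually unknown, which is exactly why a sample statistic is used to estimate it.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 10,000 individuals.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: a number computed from the entire population (a census).
mu = statistics.mean(population)

# Statistic: a number computed from a simple random sample of 100.
sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)

print(f"parameter mu    = {mu:.2f}")
print(f"statistic x-bar = {x_bar:.2f}")
```

The two printed values will be close but not identical: the statistic varies from sample to sample, while the parameter is fixed.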
Sampling Methods
The goal of sampling is to select a sample that represents the population without bias. Different methods have different strengths and weaknesses.
Stratified: You sample from every group. Groups are homogeneous within, different across.
Cluster: You sample entire groups, and select only some groups. Groups look like mini-versions of the whole population. Think of school classrooms — each classroom is a "cluster" that represents the school.
| Method | How | Best When | Weakness |
|---|---|---|---|
| SRS | Random draw in which every possible sample of size n is equally likely | Small, accessible populations | May miss subgroups |
| Stratified | SRS within each subgroup (stratum) | Population has distinct subgroups | Must know strata in advance |
| Cluster | Randomly select whole groups | Population naturally in clusters | Higher variability than stratified |
| Systematic | Every kth after random start | Long lists, assembly lines | Periodic patterns cause bias |
| Voluntary Response | People self-select | Never — always biased | Over-represents strong opinions |
| Convenience | Whoever is easiest to reach | Never — always biased | Rarely represents population |
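The four probability methods in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a survey tool: the school of 4 grades with 25 students each, the student IDs, and the function names are all hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical school: 4 grades with 25 students each (100 students total).
grades = {
    grade: [f"g{grade}-s{i:02d}" for i in range(1, 26)]
    for grade in (9, 10, 11, 12)
}
students = [s for members in grades.values() for s in members]

def simple_random_sample(population, n):
    # Every possible group of n individuals is equally likely.
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum):
    # Take an SRS inside EVERY group.
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters):
    # Randomly pick SOME groups, then keep everyone in them.
    chosen = rng.sample(list(clusters), n_clusters)
    return [s for name in chosen for s in clusters[name]]

def systematic_sample(population, k):
    # Random start among the first k, then every kth individual.
    start = rng.randrange(k)
    return population[start::k]

srs = simple_random_sample(students, 12)
strat = stratified_sample(grades, 3)   # 3 students from each of 4 grades
clus = cluster_sample(grades, 2)       # all 25 students from 2 grades
syst = systematic_sample(students, 10) # every 10th student on the list
print(len(srs), len(strat), len(clus), len(syst))
```

The same grade groups are reused as both strata and clusters only to keep the sketch short. In a real study, strata should be internally similar, while each cluster should resemble the whole population.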
Sources of Bias in Sampling
Bias means the sampling method systematically favors certain outcomes. A biased sample produces results that consistently over- or under-estimate the true population value. Bias cannot be fixed by increasing sample size.
Larger samples do NOT fix bias. A biased method with 1,000,000 people is still biased. Only better sampling design removes bias. This is a classic AP trap question.
| Type of Bias | Definition | Example |
|---|---|---|
| Undercoverage | Some members of the population are systematically excluded from the sample | Phone survey excludes people without phones; online survey excludes those without internet |
| Voluntary Response Bias | People with strong opinions are more likely to respond | Online poll about a controversial topic — only passionate people bother |
| Nonresponse Bias | Selected individuals don't respond, and non-responders differ from responders | Mailed survey — busy people don't respond, retired people do |
| Response Bias | Respondents give inaccurate answers due to question wording, interviewer presence, or social desirability | "Do you recycle as often as you should?" over-reports yes; asking about illegal behavior face-to-face |
| Question Wording Bias | Leading or loaded questions push respondents toward certain answers | "Do you support wasteful government spending?" vs "Do you support government investment in infrastructure?" |
A magazine asks readers to mail in responses to a survey about whether they enjoy the magazine.
Bias type: Voluntary response bias AND undercoverage.
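Only readers who feel strongly (usually those who love or hate it) will bother mailing back a response. People who are indifferent will not respond. The sample will not represent the typical reader's opinion. Undercoverage can also arise because the survey reaches only people who see that issue: former readers and readers who missed the issue have no chance to respond.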
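The claim that a larger sample does not fix bias can be checked with a small simulation. The sketch below assumes a hypothetical population in which 40% support a proposal, and an assumed voluntary-response mechanism in which supporters reply 90% of the time and everyone else 30% of the time. These numbers are illustrative, not taken from any real survey.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is reproducible
TRUE_P = 0.40           # hypothetical true proportion of supporters

def srs_estimate(n):
    # Every sampled person answers, so the method is unbiased.
    return sum(rng.random() < TRUE_P for _ in range(n)) / n

def voluntary_response_estimate(n):
    # Assumed response rates: supporters 90%, non-supporters 30%.
    yes = 0
    received = 0
    while received < n:
        supports = rng.random() < TRUE_P
        if rng.random() < (0.9 if supports else 0.3):
            received += 1
            yes += supports
    return yes / n

for n in (100, 10_000, 100_000):
    print(n, round(srs_estimate(n), 3), round(voluntary_response_estimate(n), 3))
```

Under these assumed response rates, the voluntary-response estimate settles near 0.36 / 0.54 ≈ 0.67 instead of 0.40. Increasing n makes the estimate more stable, but it stabilizes around the wrong value.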
Experimental Design
An experiment imposes a treatment on subjects to observe the response. The key vocabulary is tested heavily on the AP exam.
| Term | Definition | Example |
|---|---|---|
| Experimental Unit | The individual on which the experiment is performed | A patient, a plant, a car |
| Subject | Experimental unit that is a person | A student, a patient |
| Treatment | The specific condition applied to experimental units | Drug A, Drug B, placebo |
| Factor | An explanatory variable in the experiment | Type of fertilizer, dosage level |
| Level | The specific values of a factor | Low dose, medium dose, high dose |
| Response Variable | The outcome measured after treatment | Blood pressure, plant height |
| Control Group | Group receiving no treatment (or placebo) | Patients given a sugar pill |
| Placebo | An inactive treatment that looks like the real one | Sugar pill identical in appearance to the drug |
| Confounding Variable | A variable associated with both the explanatory and response variable that distorts results | Healthier people both exercise more AND eat better — hard to isolate exercise effect |
The Three Principles of Good Experiments
The three principles are control (compare treatments and keep other conditions the same), randomization (use chance to assign units to treatments), and replication (use enough units that chance differences between groups average out).
Blinding
| Type | Who Doesn't Know the Treatment | Purpose |
|---|---|---|
| Single-blind | The subjects (patients) don't know if they got the drug or placebo | Keeps the placebo effect equal across groups, so it cannot favor the drug |
| Double-blind | Neither the subjects NOR the evaluators know who got which treatment | Guards against both the placebo effect and evaluator bias; the gold standard |
If the doctor who measures "improvement" knows which patients got the real drug, they might unconsciously rate those patients higher. Double-blinding removes this bias from both ends — the patient and the evaluator.
Blocked Designs
A block is a group of experimental units that are similar in some way that might affect the response. By blocking, we control for known sources of variability.
Stratified sampling (Unit 3 sampling) — groups used in selecting who to include in the study.
Blocking (experimental design) — groups used in assigning treatments within an experiment. Same idea, different context. Block on variables that might affect your results (sex, age, health status).
A researcher tests whether a new fertilizer increases crop yield. She suspects soil type (clay vs sandy) matters.
Block by soil type: Within each soil type block, randomly assign plots to fertilizer vs no fertilizer.
This way, the comparison of fertilizer vs control is fair within each soil type — soil type cannot confound the results.
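The fertilizer example can be sketched in Python as a randomized block assignment. The 12 plot names, the 6/6 split between clay and sandy soil, and the function name are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical plots: plot1 to plot6 are clay, plot7 to plot12 are sandy.
soil_type = {f"plot{i}": ("clay" if i <= 6 else "sandy") for i in range(1, 13)}

def randomized_block_assignment(block_of, treatments):
    # Step 1: group the units into blocks.
    blocks = defaultdict(list)
    for unit, block in block_of.items():
        blocks[block].append(unit)
    # Step 2: randomize SEPARATELY inside each block.
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        for i, unit in enumerate(members):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

assignment = randomized_block_assignment(soil_type, ["fertilizer", "control"])
for plot in sorted(assignment, key=lambda p: int(p[4:])):
    print(plot, soil_type[plot], assignment[plot])
```

Because the shuffle happens inside each block, every soil type ends up with the same number of fertilized and unfertilized plots.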
Matched Pairs Design
A matched pairs design is a special case of blocking in which each "block" contains exactly 2 units, or a single subject measured twice. Common designs:
| Design Type | How It Works | Example |
|---|---|---|
| Before/After | Same person measured before and after treatment | Measure blood pressure before and after giving a drug to each patient |
| Paired individuals | Two very similar people paired; one gets treatment, one gets control | Pairs of identical twins — one twin gets new curriculum, the other gets old |
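For the paired-individuals design, chance should decide which member of each pair receives the treatment. A minimal sketch, assuming three hypothetical twin pairs and a simulated coin flip per pair:

```python
import random

rng = random.Random(11)  # fixed seed so the sketch is reproducible

# Hypothetical pairs of twins.
pairs = [("twin1a", "twin1b"), ("twin2a", "twin2b"), ("twin3a", "twin3b")]

def assign_within_pairs(pairs):
    assignment = {}
    for first, second in pairs:
        # Coin flip: swap the order half the time.
        if rng.random() < 0.5:
            first, second = second, first
        assignment[first] = "new curriculum"
        assignment[second] = "old curriculum"
    return assignment

pair_assignment = assign_within_pairs(pairs)
for a, b in pairs:
    print(a, pair_assignment[a], "|", b, pair_assignment[b])
```

Each pair always contains one unit per treatment, so the analysis compares the two members of the same pair rather than two unrelated groups.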
Drawing Conclusions — Scope of Inference
The scope of inference — what conclusions you can draw — depends on two things: (1) was there random selection? and (2) was there random assignment?
"Can we generalize to the population?" → Only if random selection (sampling) was used.
"Can we conclude causation?" → Only if random assignment (experiment) was used.
These are two completely separate questions. An experiment with a convenience sample can show causation but only for those subjects — you can't generalize the results to all people.
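The two questions can be written as a small decision function. This is only a memory aid; the function name and the wording of the returned phrases are hypothetical.

```python
def scope_of_inference(random_selection: bool, random_assignment: bool) -> str:
    # Question 1: random assignment decides whether causation is justified.
    cause = ("cause and effect can be concluded"
             if random_assignment
             else "association only, not causation")
    # Question 2: random selection decides whether results generalize.
    reach = ("results generalize to the population sampled"
             if random_selection
             else "results apply only to subjects like those studied")
    return f"{cause}; {reach}"

# Experiment run on a convenience sample of volunteers:
print(scope_of_inference(random_selection=False, random_assignment=True))
# Observational study based on a random sample:
print(scope_of_inference(random_selection=True, random_assignment=False))
```

The two arguments are evaluated independently, which mirrors the point that generalization and causation are separate questions.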
Multiple Choice Questions
Try each question before reading the explanation that follows it.
A researcher wants to determine whether listening to classical music improves concentration. She surveys 500 college students, asking how often they listen to classical music and their GPA. She finds that students who listen more frequently have higher GPAs. Which of the following is the most appropriate conclusion?
- A Listening to classical music causes higher GPA.
- B Higher GPA causes students to listen to more classical music.
- C There is an association between classical music listening and GPA, but causation cannot be established.
- D The study proves that classical music is beneficial for all students.
- E No conclusions can be drawn because the sample size is too small.
Answer: C. This is an observational study: the researcher did not assign students to listen to classical music. She merely observed their habits. No matter how strong the association, an observational study cannot establish causation. There may be lurking variables: students who study more might both listen to more classical music and get better grades.
A school district wants to survey students about cafeteria food quality. The district has 12 schools. Officials randomly select 3 schools, then survey every student in those 3 schools. What type of sampling method is this?
- A Simple random sample
- B Stratified random sample
- C Systematic sample
- D Cluster sample
- E Voluntary response sample
Answer: D. The schools are the clusters. Three clusters were randomly selected, and then all members of those clusters were surveyed. This is the defining feature of cluster sampling: entire groups are selected. In stratified sampling, you would randomly select some students from each of the 12 schools.
A polling company calls randomly selected landline phone numbers to survey adults about their opinions on a new tax policy. Which of the following is the most significant source of bias in this survey?
- A Response bias, because people will lie about their opinions
- B Undercoverage, because adults without landline phones are excluded
- C Voluntary response bias, because people choose whether to answer
- D Nonresponse bias, because the sample size is too small
- E There is no bias because phone numbers are randomly selected
Answer: B. Calling only landline numbers systematically excludes adults who only use cell phones, a large portion of the population, especially younger adults. This is undercoverage. The sampling frame (landline numbers) does not match the target population (all adults). Note that random selection within the frame doesn't fix the undercoverage of those outside the frame.
In a clinical trial, neither the patients nor the doctors measuring their outcomes know which patients received the new drug and which received the placebo. This design feature is called:
- A Blocking
- B Stratification
- C Single-blind
- D Double-blind
- E Matched pairs
Answer: D. When both the subjects and the evaluators are unaware of the treatment assignment, the study is double-blind. This is the gold standard for clinical trials. In a single-blind study only one of the two parties, typically the subjects, is unaware. Double-blinding guards against both the placebo effect (from subjects) and measurement bias (from evaluators).
Researchers randomly select 400 adults from a city's voter registration list and randomly assign each to one of two exercise programs. After 8 weeks, the group assigned to Program A shows significantly greater improvement in cardiovascular health. Which conclusion is best supported?
- A Program A causes greater cardiovascular improvement in all adults everywhere.
- B Program A is associated with greater improvement, but causation cannot be determined.
- C Program A causes greater cardiovascular improvement; results can be generalized to registered voters in the city.
- D Program A causes greater improvement only for the 400 adults in the study.
- E No conclusion is possible because the study was not blinded.
Answer: C. There was random assignment (experiment) → we can conclude causation.
There was random selection from voter registration list → we can generalize to registered voters in the city.
However, we cannot generalize to "all adults everywhere" because the sample was only from one city's voter rolls — not all adults. (A) overgeneralizes. (B) wrongly denies causation. (D) ignores the random selection.
Free Response Questions
Write your full solution before reading the model solution. Unit 3 FRQs often ask you to design a study or identify flaws in one.
FRQ 1 — Design an Experiment
~15 minutes
Model Solution
(a) Experimental Design:
Number the 60 volunteers 1–60. Use a random number generator (or random digit table) to randomly assign 30 volunteers to the treatment group (daily green tea) and the remaining 30 to the control group (non-green-tea beverage). Measure each participant's fasting blood sugar at the start and after 12 weeks. Compare the change in blood sugar between the two groups.
(b) Why a control group is necessary:
Without a control group, we cannot know if any change in blood sugar was due to the green tea itself or to other factors such as the placebo effect (participants believing they are being treated may change behavior), dietary changes during the study, natural changes in blood sugar over 12 weeks, or increased attention from researchers. The placebo control isolates the effect of green tea specifically.
(c) Blocking for age:
Divide the 60 volunteers into age blocks (e.g., under 50 and 50+). Within each age block, randomly assign half to the green tea group and half to the control group. This is a randomized block design. By blocking on age, we ensure both groups are balanced with respect to age, preventing age from confounding the results.
(d) Scope of inference:
Yes, because random assignment was used, we can conclude causation — the green tea caused the reduction in blood sugar for the subjects in this study. However, since the 60 volunteers were not randomly selected from all pre-diabetic adults (they were volunteers), we cannot generalize these results to all pre-diabetic adults. The conclusion is limited to people similar to those in the study.
✓ AP grading tips: (a) must mention randomization method and comparison. (b) must mention placebo effect. (c) must say "block" and explain the grouping. (d) must distinguish causation (yes — random assignment) from generalization (no — volunteers, not random sample).
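The random assignment described in part (a) can be sketched in Python. This is one reasonable way to carry it out, assuming the volunteers are simply numbered 1 to 60; the variable names are hypothetical.

```python
import random

rng = random.Random(2024)  # fixed seed so the sketch is reproducible

# Number the 60 volunteers 1 to 60.
volunteers = list(range(1, 61))

# Shuffle the numbers; the first 30 drink green tea, the rest are controls.
shuffled = volunteers[:]
rng.shuffle(shuffled)
green_tea = sorted(shuffled[:30])
control = sorted(shuffled[30:])

print("green tea:", green_tea)
print("control:  ", control)
```

Shuffling and splitting guarantees two groups of exactly 30, which matches the model solution. Flipping a separate coin for each volunteer would also be random but would usually give unequal group sizes.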
FRQ 2 — Identifying Flaws in a Study
~10 minutes
Model Solution
(a) Two sources of bias:
1. Undercoverage: The survey only reaches people who commute through a downtown subway station. Residents who drive, live in suburbs, or don't use this station are completely excluded. These excluded residents may have very different opinions about transit expansion. This would likely cause the 94% figure to be an overestimate of support among all city residents.
2. Convenience/Location Bias (or Undercoverage): Surveying only during morning rush hour misses residents who commute at different times or don't commute at all (e.g., retirees, work-from-home residents, night-shift workers). Transit commuters already use the system, so they are more likely to support its expansion — further inflating the estimate.
(b) Why sample size doesn't fix bias:
The official's argument is flawed because increasing sample size cannot correct for bias in the sampling method. If the method systematically over-samples transit supporters, then surveying 1,200 or even 12,000 people at the same location would still produce a biased estimate. A larger biased sample just gives you a more precise estimate of the wrong thing. Only a better sampling design can remove bias.
(c) Better sampling method:
Obtain a complete list of all city residents (e.g., from voter registration, utility billing records, or address database). Use a simple random sample to select residents, then contact them by phone, mail, or in person. To reduce nonresponse bias, make multiple attempts to contact each selected resident. This gives every resident an equal chance of being included, reducing undercoverage and location bias.
✓ For full credit on (a): name the bias AND explain the direction (over- or under-estimate). For (b): explicitly say larger samples don't fix bias. For (c): describe a probability sampling method with a complete sampling frame.