Chapter 1: Using Graphs to Describe Data

Why use graphs? Graphs turn complex datasets into intuitive visuals, making patterns, trends, and outliers easier to understand. How do they work? By mapping data to visual elements like bars, lines, or points, graphs leverage human perception to reveal insights that raw numbers obscure. This chapter introduces probability sampling, variable classification, and graph types, with interactive examples and a case study to illustrate their practical application.

1.1 Probability Sampling Methods

Why sample? Studying an entire population is often impractical, so we select a subset and generalize from it. How? Probability sampling gives every element a known, nonzero chance of selection, which limits selection bias and makes sampling error measurable. Below, we explore four methods, their mechanics, and the scenarios each one suits.

1.1.1 Simple Random Sampling

Why? It’s the simplest way to ensure unbiased selection. How? Each element (e.g., a person) is assigned a number, and a random number generator picks the sample, like drawing names from a hat. For example, to select 10 students from 80, number the students 1 to 80 and draw 10 distinct random numbers in that range. Pros: Minimal bias, easy to implement. Cons: No control over sample composition, so a sample can be unrepresentative by chance (e.g., all selected students coming from one class).
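The 10-from-80 example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure; the function name simple_random_sample and the seed value are assumptions made here so the draw is reproducible.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct IDs from 1..population_size, each equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # sample() draws without replacement, so no ID can appear twice
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# 10 students out of 80, as in the example above
students = simple_random_sample(population_size=80, sample_size=10, seed=1)
print(students)
```

Drawing without replacement matters here: picking 10 independent random numbers could select the same student twice.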

1.1.2 Systematic Sampling

Why? It’s efficient for large, ordered populations. How? Choose a sampling interval n, select a random starting point among the first n elements, then pick every nth element after it (e.g., every 10th person in a list of 100, starting at a random number between 1 and 10). The list order should be unrelated to the trait being studied; an alphabetical list usually qualifies, while a list grouped by rank or shift may not. Pros: Even coverage of the list, faster than simple random sampling. Cons: Bias risk if the list has a hidden pattern that matches the interval (e.g., every 10th person is a manager).
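A short Python sketch of the every-10th-person example follows. The function name systematic_sample and the seed are illustrative assumptions; positions are numbered from 1 to match the text.

```python
import random

def systematic_sample(population_size, interval, seed=None):
    """Pick a random start in 1..interval, then every interval-th position."""
    rng = random.Random(seed)
    start = rng.randint(1, interval)  # randint includes both endpoints
    return list(range(start, population_size + 1, interval))

# Every 10th person in a list of 100
picked = systematic_sample(population_size=100, interval=10, seed=7)
print(picked)
```

Only the starting point is random; every later position follows from it, which is why a pattern in the list that repeats at the same interval can bias the sample.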

1.1.3 Stratified Sampling

Why? It ensures representation of key subgroups. How? Divide the population into strata (e.g., gender, age groups) based on relevant traits, then randomly sample from each stratum proportionally. For a college with 80% female and 20% male students, a sample of 200 includes 160 females and 40 males. Pros: Accurate for diverse populations. Cons: Needs detailed population data, complex to design.
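The proportional allocation in the college example can be sketched as follows. The enrollment of 5,000 students and the ID labels are hypothetical values chosen only to match the 80%/20% split; the function names are likewise assumptions.

```python
import random

def proportional_allocation(stratum_sizes, sample_size):
    """Give each stratum a share of the sample equal to its population share."""
    total = sum(stratum_sizes.values())
    # Rounding can leave the quotas one or two off sample_size for awkward splits
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}

def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample of the allocated size inside each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    quotas = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, quotas[name])
            for name, members in strata.items()}

# Hypothetical college of 5,000 students: 80% female, 20% male
strata = {
    "female": [f"F{i}" for i in range(1, 4001)],
    "male": [f"M{i}" for i in range(1, 1001)],
}
sample = stratified_sample(strata, sample_size=200, seed=3)
print({name: len(chosen) for name, chosen in sample.items()})
# prints {'female': 160, 'male': 40}
```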

1.1.4 Cluster Sampling

Why? It’s cost-effective for geographically dispersed populations. How? Randomly select entire clusters (e.g., schools, zip codes), then either include all elements in those clusters or randomly sample within them. For example, select 5 zip codes from 50, then survey all households in those areas. Pros: Reduces travel and logistical costs. Cons: Higher sampling error when the members of a cluster resemble one another, because each additional member adds little new information about the population.
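The zip code example can be sketched in Python as a one-stage cluster sample. The sampling frame below (50 zip codes with 20 households each) is hypothetical, as are the function name and seed.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters, then keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    members = [member for name in chosen for member in clusters[name]]
    return chosen, members

# Hypothetical frame: 50 zip codes, 20 households in each
clusters = {
    f"zip{z:02d}": [f"zip{z:02d}-house{h}" for h in range(1, 21)]
    for z in range(1, 51)
}
chosen, households = cluster_sample(clusters, n_clusters=5, seed=11)
print(chosen)
print(len(households))  # 5 clusters x 20 households = 100
```

Randomness enters only at the cluster level here; a two-stage design would add a second random draw inside each chosen cluster.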

Figure 1: Animated Sampling Methods (8×10 Grid). Click a method to visualize its selection process.

1.2 Classification of Variables

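Why classify variables? Different data types require specific analytical and visualization techniques. How? Variables are split into categorical (qualitative, e.g., eye color) and numerical (quantitative, e.g., height). Measurement levels refine this split: nominal categories have no order, ordinal categories have an order but no fixed spacing, interval values have equal spacing but no true zero, and ratio values have a true zero, so ratios are meaningful: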

Table 1: Examples of Variable Classifications
Variable     | Example Values     | Measurement Level
Eye Color    | Blue, Brown, Green | Nominal
Satisfaction | Low, Medium, High  | Ordinal
Temperature  | 20°C, 25°C         | Interval
Weight       | 70 kg, 80 kg       | Ratio
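The sketch below shows, using the values from Table 1, which operations make sense at each measurement level. The variable names and the list of responses are illustrative assumptions.

```python
# Nominal: no order, so only counting and equality checks make sense
eye_colors = ["Blue", "Brown", "Green", "Brown"]
brown_count = eye_colors.count("Brown")

# Ordinal: ordered categories; the ranks encode order, not distance
satisfaction_rank = {"Low": 0, "Medium": 1, "High": 2}
responses = ["High", "Low", "Medium"]
ordered = sorted(responses, key=satisfaction_rank.get)

# Interval: differences are meaningful, ratios are not (0 °C is arbitrary)
temp_difference_c = 25 - 20                   # a 5-degree difference is meaningful
celsius_ratio = 25 / 20                       # 1.25, but physically meaningless
kelvin_ratio = (25 + 273.15) / (20 + 273.15)  # about 1.017 on an absolute scale

# Ratio: a true zero makes ratios meaningful
weight_ratio = 80 / 70                        # 80 kg is about 1.14 times 70 kg

print(ordered, round(kelvin_ratio, 3), round(weight_ratio, 2))
```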

1.3 Graphs to Describe Categorical Variables

Why use graphs for categorical data? They summarize frequencies or proportions clearly. How? Frequency tables and pie charts display category counts or shares, Pareto diagrams order the categories from most to least frequent, and cross tables relate two categorical variables. For instance, the results of a survey of 35 pet owners (20 with cats, 15 with dogs) can be shown as a pie chart of proportions (57% cats, 43% dogs) and as a cross table that breaks down ownership by gender.

Figure 2: Pie chart showing pet ownership distribution (57% cats, 43% dogs).

Table 2: Pet Ownership Cross Table by Gender
Pet Type | Male | Female | Total
Cats     | 8    | 12     | 20
Dogs     | 10   | 5      | 15
Total    | 18   | 17     | 35
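The margins of Table 2 and the pie chart percentages can be recomputed from the four inner cells. The sketch below also lists the categories in Pareto order (largest first, with a running cumulative percentage); the variable names are illustrative.

```python
# Inner cells of Table 2
cross_table = {
    "Cats": {"Male": 8, "Female": 12},
    "Dogs": {"Male": 10, "Female": 5},
}

# Row totals (per pet type) and column totals (per gender)
row_totals = {pet: sum(by_gender.values()) for pet, by_gender in cross_table.items()}
column_totals = {}
for by_gender in cross_table.values():
    for gender, count in by_gender.items():
        column_totals[gender] = column_totals.get(gender, 0) + count
grand_total = sum(row_totals.values())

# Pie chart shares, rounded to whole percentages
pie_shares = {pet: round(100 * total / grand_total) for pet, total in row_totals.items()}

# Pareto order: most frequent category first, with cumulative percentage
pareto = []
running = 0
for pet, total in sorted(row_totals.items(), key=lambda item: item[1], reverse=True):
    running += total
    pareto.append((pet, total, round(100 * running / grand_total, 1)))

print(row_totals, column_totals, grand_total)
print(pie_shares)  # {'Cats': 57, 'Dogs': 43}
print(pareto)      # [('Cats', 20, 57.1), ('Dogs', 15, 100.0)]
```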

1.4 Graphs to Describe Time-Series Data

Why use time-series graphs? They reveal trends and patterns over time. How? Line graphs plot observations in time order, with time on the horizontal axis, and connect consecutive points with line segments (e.g., monthly rainfall from January to July 2025). The rises and falls of the line show month-to-month changes and can hint at seasonal variation, which helps inform forecasts.

Figure 3: Line graph of monthly rainfall (Jan-Jul 2025).

1.5 Graphs to Describe Numerical Variables

Why graph numerical data? To visualize distributions and relationships. How? Histograms show frequency distributions (e.g., age groups), while scatter plots reveal correlations (e.g., height vs. weight). These graphs help identify data shapes (e.g., skewed, normal) and trends.

Figure 4: Histogram of age distribution in a sample population.

Figure 5: Scatter plot of height vs. weight with animated data points.
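The strength of the linear relationship that a scatter plot displays can be summarized with the Pearson correlation coefficient. The sketch below computes it from scratch; the five height and weight pairs are made-up illustrative values, not data from this chapter.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

# Hypothetical measurements, for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]

r = pearson_r(heights_cm, weights_kg)
print(round(r, 3))  # close to +1: taller people in this toy data weigh more
```

A value near +1 or -1 indicates points that fall close to a straight line; a value near 0 indicates no linear pattern, although a curved relationship may still exist.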

Table 3: Frequency Distribution of Ages
Age Group | Frequency
0-10      | 5
11-20     | 10
21-30     | 8
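A histogram is a frequency table drawn as bars. The sketch below turns Table 3 into relative frequencies and a simple text histogram; the bar character and layout are arbitrary choices made for illustration.

```python
# Frequencies from Table 3
age_frequencies = {"0-10": 5, "11-20": 10, "21-30": 8}
total = sum(age_frequencies.values())

# Relative frequency: each group's share of all observations
relative = {group: count / total for group, count in age_frequencies.items()}

# One row per age group, bar length equal to the frequency
lines = []
for group, count in age_frequencies.items():
    lines.append(f"{group:>5} | {'#' * count} {count} ({relative[group]:.0%})")
print("\n".join(lines))
```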

1.6 Data Presentation Errors

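Why avoid errors? Misleading graphs distort decision-making. How do errors occur? Uneven histogram bins (e.g., 0-10, 11-20, 21-50) exaggerate the wider ranges when bar heights show raw counts, because a wider bin collects more observations. Truncated or compressed y-axes in line graphs (e.g., a 0-20 mm range instead of 0-100 mm, or an axis that does not start at zero) make small changes look large. Consistent bin widths and honest, clearly labeled axis scales keep graphs accurate.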
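Both distortions can be quantified. In the sketch below, the counts for the uneven bins and the two rainfall readings are hypothetical values chosen for illustration, and the function names are assumptions.

```python
# Hypothetical counts for the uneven bins 0-10, 11-20, 21-50 (inclusive ages)
uneven_bins = {(0, 10): 5, (11, 20): 10, (21, 50): 12}

def frequency_density(bins):
    """Count per year of age, which corrects for unequal bin widths."""
    return {f"{low}-{high}": count / (high - low + 1)
            for (low, high), count in bins.items()}

raw_counts = {f"{low}-{high}": count for (low, high), count in uneven_bins.items()}
density = frequency_density(uneven_bins)

tallest_by_count = max(raw_counts, key=raw_counts.get)  # the wide bin wins
tallest_by_density = max(density, key=density.get)      # the dense bin wins
print(tallest_by_count, tallest_by_density)

def visual_ratio(first, second, axis_start):
    """How many times taller the second value looks when the axis starts at axis_start."""
    return (second - axis_start) / (first - axis_start)

# Hypothetical readings of 90 mm and 95 mm
print(visual_ratio(90, 95, axis_start=0))   # about 1.06: the honest impression
print(visual_ratio(90, 95, axis_start=80))  # 1.5: the truncated axis inflates it
```

Plotting density instead of raw counts is one common remedy for unequal bins; using equal-width bins, as the text recommends, avoids the issue altogether.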

Figure 6: Misleading histogram with uneven bins exaggerating certain age groups.

1.7 Story: The Data Detective’s Challenge

Why use a story? To show real-world application of concepts. How? In Numeropolis, data detective Mira evaluated park demand. She used simple random sampling (200 residents chosen via random numbers) and systematic sampling (every 15th visitor), but faced sampling errors (e.g., retirees made up 70% of the sample vs. 30% of the actual population) and nonsampling errors (e.g., respondents overstating their usage). Mira classified her variables as categorical (park support: yes/no) or numerical (park hours). Her pie chart showed 65% support, a Pareto diagram ranked the responses from most to least frequent, and a line graph tracked visits (Jan: 100, Dec: 180). For numerical data, a histogram of park hours (0-5: 50 people, 6-10: 30) and a scatter plot of age vs. hours revealed trends. She also corrected a competitor’s misleading histogram (uneven bins) and truncated line graph, persuading the council to approve the park.

Figure 7: Pareto diagram of park support in Numeropolis.

1.8 Questions

  1. Why are simple random and systematic sampling used, and how did Mira apply them? Give an example of her sampling error.
  2. How do categorical and numerical variables differ, with examples from Mira’s study? Why does this matter?
  3. How does a pie chart differ from a Pareto diagram, using Mira’s data? Why choose one over the other?
  4. Why did Mira use a line graph for park visits, and what trend did it show? How was it constructed?
  5. How did Mira use three graphs for numerical data, and what did each reveal? Why were they effective?
  6. How did Mira correct two presentation errors, and why could they mislead the council?
  7. How does a histogram differ from a frequency table, and why might Mira prefer one?
  8. Why use scatter plots for correlations, and how did Mira’s age vs. hours plot work?