Chapter 2: Using Numerical Measures to Describe Data

Why use numerical measures? Numerical measures summarize datasets concisely, revealing central tendencies, variability, and relationships. How? By calculating statistics like mean, median, variance, and correlations, we gain insights into data patterns that graphs alone cannot provide. This chapter explores measures of central tendency, variability, grouped data, and relationships, with interactive visualizations and a case study.

2.1 Measures of Central Tendency and Location

Why measure central tendency? To identify the typical value in a dataset. How? Mean, median, mode, and percentiles describe where data clusters or specific locations within it.

2.1.1 Mean, Median, and Mode

Mean: The average, calculated as the sum of values divided by their count (e.g., for test scores [85, 90, 95], mean = (85+90+95)/3 = 90). Median: The middle value when ordered (e.g., for [85, 90, 95], median = 90). Mode: The most frequent value (e.g., for [85, 85, 90], mode = 85). Pros: Mean uses all data; median resists outliers; mode highlights frequency. Cons: Mean is sensitive to outliers; mode may not exist or be unique.
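The three measures above can be checked with a short Python sketch. It assumes the standard library's statistics module; the variable names are illustrative, and the tied list passed to multimode is a hypothetical example of a non-unique mode.

```python
import statistics

scores = [85, 90, 95]
mean_score = statistics.mean(scores)      # (85 + 90 + 95) / 3
median_score = statistics.median(scores)  # middle value of the sorted list

repeated = [85, 85, 90]
mode_score = statistics.mode(repeated)    # most frequent value

# Hypothetical tied list: multimode returns every value tied for most frequent,
# which illustrates that the mode may not be unique.
all_modes = statistics.multimode([85, 85, 90, 90])

print(mean_score, median_score, mode_score, all_modes)
```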

2.1.2 Shape of a Distribution

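Why study shape? It affects which measure (mean, median, mode) is most representative. How? Distributions can be symmetric (for a single-peaked distribution, mean ≈ median ≈ mode), right-skewed (typically mean > median, because the long right tail pulls the mean upward), or left-skewed (typically mean < median). For example, income data is often right-skewed due to high earners, so the median is usually the more representative measure there.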
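A small sketch of the right-skewed case, assuming a hypothetical list of incomes (in thousands) invented purely for illustration: one very high earner pulls the mean well above the median.

```python
import statistics

# Hypothetical incomes in thousands; the last value is a single high earner.
incomes = [30, 35, 40, 45, 50, 250]

income_mean = statistics.mean(incomes)      # pulled upward by the 250
income_median = statistics.median(incomes)  # average of the two middle values

print(income_mean, income_median)
```

Here the mean exceeds the median, which is the pattern expected for right-skewed data.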

Figure 1: Histogram showing symmetric, right-skewed, and left-skewed distributions.

2.1.3 Geometric Mean

Why use geometric mean? It’s ideal for averaging growth rates or ratios (e.g., investment returns of 10%, 20%, and 30% expressed as growth factors [1.1, 1.2, 1.3]: geometric mean = (1.1 × 1.2 × 1.3)^(1/3) ≈ 1.197, an average growth of about 19.7% per period). How? Multiply the n values and take the nth root. Pros: Reflects compounding effects. Cons: Not suitable for negative or zero values.
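The calculation can be sketched in Python two ways: directly from the definition, and with the standard library's statistics.geometric_mean (available in Python 3.8 and later). The variable names are illustrative.

```python
import math
import statistics

growth_factors = [1.1, 1.2, 1.3]

# Definition: multiply the n values, then take the nth root.
gm_manual = math.prod(growth_factors) ** (1 / len(growth_factors))

# Library equivalent; it raises an error for zero or negative inputs.
gm_library = statistics.geometric_mean(growth_factors)

print(round(gm_manual, 3), round(gm_library, 3))
```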

2.1.4 Percentiles and Quartiles

Why use percentiles? To locate specific positions in ordered data (e.g., 75th percentile = value below which 75% of data lies). Quartiles: Divide ordered data into four equal parts (Q1 = 25th, Q2 = 50th, Q3 = 75th percentiles). For test scores [60, 70, 80, 90, 100], Q2 = 80, the median. Pros: Useful for ranking and spread. Cons: Requires sorted data; in small samples the result depends on the interpolation convention used, so Q1 and Q3 can differ between textbooks and software.
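As a sketch of how quartiles might be computed, the standard library's statistics.quantiles offers two interpolation conventions. The "inclusive" method treats the data as containing its own minimum and maximum; the "exclusive" method (the default) does not. Which convention a given textbook uses is an assumption to check.

```python
import statistics

scores = [60, 70, 80, 90, 100]

# n=4 asks for the three cut points that split the data into quartiles.
q_inclusive = statistics.quantiles(scores, n=4, method="inclusive")
q_exclusive = statistics.quantiles(scores, n=4, method="exclusive")

print(q_inclusive)  # [70.0, 80.0, 90.0]
print(q_exclusive)  # [65.0, 80.0, 95.0]
```

Both conventions agree on Q2 = 80 but differ on Q1 and Q3 for a sample this small.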

2.2 Measures of Variability

Why measure variability? To quantify data spread around the central tendency. How? Range, interquartile range, variance, and standard deviation describe dispersion.

2.2.1 Range and Interquartile Range

Range: Maximum minus minimum (e.g., for [60, 70, 80], range = 80 - 60 = 20). Interquartile Range (IQR): Q3 - Q1, the spread of the middle 50% of the data (e.g., for [60, 70, 80, 90, 100] with Q1 = 70 and Q3 = 90, IQR = 90 - 70 = 20). Pros: Range is simple; IQR resists outliers. Cons: Range uses only the two extremes; IQR ignores the lowest and highest quarters of the data.
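Both measures take only a few lines of Python. This sketch assumes the "inclusive" quartile convention of statistics.quantiles, which reproduces Q1 = 70 and Q3 = 90 for the five scores; other conventions would give a different IQR.

```python
import statistics

small = [60, 70, 80]
range_small = max(small) - min(small)  # 80 - 60

scores = [60, 70, 80, 90, 100]
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1                          # 90 - 70

print(range_small, iqr)
```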

2.2.2 Box-and-Whisker Plots

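Why use box plots? To visualize quartiles, IQR, and outliers. How? A box spans Q1 to Q3, with a line at Q2. In the simplest version the whiskers extend to the minimum and maximum; in the more common version they extend to the most extreme values lying within 1.5 × IQR of the box (no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR), and any points beyond those limits are plotted individually as outliers. For [60, 70, 80, 90, 100], Q1=70, Q2=80, Q3=90, so the limits are 40 and 120 and no score is an outlier.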
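The numbers behind a box plot can be sketched without any plotting library. The function name box_plot_summary is hypothetical, the 1.5 × IQR limits follow the common convention, and the extra score of 150 in the second call is an invented value used only to show an outlier being flagged.

```python
import statistics

def box_plot_summary(values):
    """Return quartiles, 1.5 x IQR limits, whisker ends, and outliers."""
    ordered = sorted(values)
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme values still inside the limits.
    inside = [v for v in ordered if lower_limit <= v <= upper_limit]
    outliers = [v for v in ordered if v < lower_limit or v > upper_limit]
    return {
        "quartiles": (q1, q2, q3),
        "limits": (lower_limit, upper_limit),
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }

plain = box_plot_summary([60, 70, 80, 90, 100])
with_outlier = box_plot_summary([60, 70, 80, 90, 100, 150])  # 150 is hypothetical

print(plain["limits"], plain["outliers"])
print(with_outlier["whiskers"], with_outlier["outliers"])
```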

Figure 2: Box-and-whisker plot of test scores.

2.2.3 Variance and Standard Deviation

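Why measure variance? To quantify the average squared deviation from the mean. How? Population variance = Σ(x - mean)²/n; standard deviation = √variance. For [60, 70, 80], mean = 70, variance = (100 + 0 + 100)/3 ≈ 66.67, standard deviation ≈ 8.16. When the data are a sample rather than the whole population, divide by n - 1 instead (here 200/2 = 100, standard deviation = 10). Pros: Uses all data. Cons: Variance units are squared; sensitive to outliers.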
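A short sketch contrasting the two divisors, using the standard library: pvariance and pstdev divide by n (the formula used in the text), while variance and stdev divide by n - 1 for sample data.

```python
import statistics

scores = [60, 70, 80]

pop_var = statistics.pvariance(scores)  # sum of squared deviations / n
pop_sd = statistics.pstdev(scores)      # square root of the population variance

samp_var = statistics.variance(scores)  # sum of squared deviations / (n - 1)
samp_sd = statistics.stdev(scores)

print(round(pop_var, 2), round(pop_sd, 2), samp_var, samp_sd)
```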

2.2.4 Coefficient of Variation

Why use CV? To compare variability across datasets with different means. How? CV = (standard deviation / mean) × 100%. For [60, 70, 80], CV ≈ (8.16 / 70) × 100% ≈ 11.66%. Pros: Unitless, comparable. Cons: Undefined if mean is zero.
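The CV formula can be sketched as a small helper. The function name coefficient_of_variation is hypothetical, and the choice of the population standard deviation (dividing by n) is an assumption made to match the 8.16 used in the text.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean * 100

cv_scores = coefficient_of_variation([60, 70, 80])
print(round(cv_scores, 2))
```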

2.2.5 Chebyshev’s Theorem and the Empirical Rule

Why use these rules? To estimate the proportion of data lying within k standard deviations of the mean. Chebyshev’s Theorem: At least (1 - 1/k²) of the data lies within k standard deviations of the mean, for any k > 1. For k=2, ≥75% of data is within ±2σ; for k=3, ≥88.9%. Empirical Rule: For approximately normal distributions, ~68% within ±1σ, ~95% within ±2σ, ~99.7% within ±3σ. Pros: Chebyshev applies to any distribution; Empirical is precise for normal data. Cons: Chebyshev gives only a conservative lower bound; Empirical requires normality.
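The two rules can be compared numerically. In this sketch the function names are hypothetical; the normal-distribution proportion uses the standard identity P(|Z| ≤ k) = erf(k/√2), which is where the Empirical Rule percentages come from.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations, for any distribution."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 1 - 1 / k**2

def normal_proportion_within(k):
    """Proportion of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 4), round(normal_proportion_within(k), 4))
```

For k = 2 the Chebyshev guarantee is 75%, while a normal distribution actually holds about 95% of its values in the same interval, which shows how conservative the general bound is.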

2.2.6 z-Score

Why use z-scores? To measure how far a value is from the mean in standard deviations. How? z = (x - mean)/σ. For x=85, mean=70, σ=8.16, z ≈ 1.84. Pros: Standardizes comparisons. Cons: Assumes known mean and σ.
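The z-score formula as a minimal Python sketch, using the mean and σ quoted in the example; the function name z_score is illustrative.

```python
def z_score(x, mean, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sigma

z = z_score(85, 70, 8.16)
print(round(z, 2))
```

A positive result means the value is above the mean; a negative result means it is below.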

2.3 Weighted Mean and Measures of Grouped Data

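Why use weighted mean? To account for varying importance: weighted mean = Σ(weight × value) / Σ(weights) (e.g., grades of 90 and 80 weighted by 4 and 3 credits: (4×90 + 3×80)/(4+3) = 600/7 ≈ 85.71). Grouped Data: When only class intervals and frequencies are available, use each class midpoint as the value and its frequency as the weight (e.g., class 60-70 has midpoint 65). Pros: Reflects weights or grouped structure. Cons: Approximations for grouped data lose precision.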
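Both calculations use the same formula, as this sketch shows. The function name weighted_mean is illustrative, and the class intervals and frequencies in the grouped example are hypothetical values chosen only to demonstrate the midpoint method.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Grades weighted by credits, as in the text.
gpa_style_mean = weighted_mean([90, 80], [4, 3])

# Hypothetical grouped data: (lower bound, upper bound, frequency).
classes = [(60, 70, 2), (70, 80, 5), (80, 90, 3)]
midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [freq for _, _, freq in classes]
grouped_mean = weighted_mean(midpoints, frequencies)

print(round(gpa_style_mean, 2), grouped_mean)
```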

2.4 Measures of Relationships Between Variables

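Why study relationships? To understand how two variables vary together. How? Covariance gives the direction of a linear relationship; the correlation coefficient r rescales it to the range -1 to +1, so it gives both direction and strength. For height vs. weight, a positive correlation (r ≈ 0.8) suggests taller people tend to weigh more. Pros: Quantifies relationships. Cons: Captures only linear association; correlation doesn’t imply causation.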
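Covariance and correlation can be sketched from their definitions. The heights (cm) and weights (kg) below are hypothetical values invented for illustration and are not meant to reproduce r ≈ 0.8; the function names are also illustrative, and the n - 1 divisor assumes sample data.

```python
import math

def sample_covariance(xs, ys):
    """Average product of paired deviations, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson_r(xs, ys):
    """Covariance rescaled by both standard deviations, so -1 <= r <= 1."""
    sd_x = math.sqrt(sample_covariance(xs, xs))
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

heights = [150, 160, 170, 180, 190]  # hypothetical, cm
weights = [52, 60, 63, 77, 80]       # hypothetical, kg

cov_hw = sample_covariance(heights, weights)
r_hw = pearson_r(heights, weights)
print(cov_hw, round(r_hw, 3))
```

The covariance depends on the units of both variables, whereas r is unitless, which is why r is the measure usually reported.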

Figure 3: Scatter plot of height vs. weight showing correlation.

2.5 Case Study: Mortgage Portfolio

Why analyze mortgages? To assess risk and performance. How? A bank’s portfolio of 100 loans has loan amounts (mean = $200,000, σ = $50,000) and interest rates (mean = 4%, σ = 0.5%). Box plots show the loan amount spread (Q1 = $175,000, Q3 = $225,000, IQR = $50,000), with outliers beyond the upper limit of Q3 + 1.5 × IQR = $300,000. z-scores flag unusually large, potentially high-risk loans (e.g., $275,000 has z = (275,000 - 200,000)/50,000 = 1.5). The correlation between amount and rate (r = -0.3) is weak and negative, suggesting larger loans tend to have slightly lower rates. A weighted mean accounts for loan size when computing the portfolio return.

Table 1: Summary Statistics for Mortgage Portfolio
Variable	Mean	Standard Deviation	Q1	Q3
Loan Amount	$200,000	$50,000	$175,000	$225,000
Interest Rate	4%	0.5%	3.75%	4.25%
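The case-study figures can be re-derived from the summary statistics in Table 1. This sketch uses only the stated values; the variable names are illustrative, and the 1.5 × IQR limits assume the common box-plot convention.

```python
# Summary statistics as stated in the case study and Table 1.
loan_mean, loan_sd = 200_000, 50_000
loan_q1, loan_q3 = 175_000, 225_000
rate_mean, rate_sd = 4.0, 0.5  # percent

# z-score of a $275,000 loan.
z_275k = (275_000 - loan_mean) / loan_sd

# Box-plot limits for loan amounts.
loan_iqr = loan_q3 - loan_q1
lower_limit = loan_q1 - 1.5 * loan_iqr
upper_limit = loan_q3 + 1.5 * loan_iqr

# Coefficient of variation for each variable, in percent.
cv_amount = loan_sd / loan_mean * 100
cv_rate = rate_sd / rate_mean * 100

print(z_275k, lower_limit, upper_limit, cv_amount, cv_rate)
```

On these figures, loan amounts (CV = 25%) vary twice as much relative to their mean as interest rates do (CV = 12.5%).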

2.6 Questions

Why are mean, median, and mode used, and how do they differ in the mortgage case study?
How does distribution shape affect the choice of central tendency measure? Give an example from the case study.
Why is the geometric mean useful for investment returns, and how is it calculated?
How are percentiles and quartiles computed, and why were they useful for loan amounts?
Why use box-and-whisker plots, and how did they reveal outliers in the mortgage portfolio?
How are variance and standard deviation calculated, and why are they important for loan risk?
Why use the coefficient of variation, and how does it compare loan amount vs. interest rate variability?
How do Chebyshev’s Theorem and the Empirical Rule differ, and why apply them to loan amounts?
How are z-scores used in the case study, and why do they help identify high-risk loans?
Why use a weighted mean for grouped data, and how was it applied to the portfolio?
How does correlation measure relationships, and why was the loan amount vs. rate correlation negative?