Intermediate · 18 minutes

Correlation Analysis

Learn how to measure and interpret the relationship between two variables using correlation coefficients, including Pearson's r.


What is Correlation?

Correlation measures the strength and direction of the linear relationship between two quantitative variables. It answers the question: “When one variable changes, does the other tend to change in a predictable way?”

Correlation is a fundamental concept in statistics used across:

  • Science (relationships between variables)
  • Finance (stock price movements)
  • Social sciences (behavior patterns)
  • Medicine (risk factors and outcomes)

Visualizing Relationships

Before calculating correlation, always visualize your data with a scatter plot:

  • X-axis: Independent variable (predictor)
  • Y-axis: Dependent variable (outcome)
  • Each point represents one observation

Scatter plots reveal:

  • Direction of the relationship (positive or negative)
  • Strength of the relationship (tight cluster vs. dispersed)
  • Form of the relationship (linear vs. curved)
  • Outliers that might affect the correlation
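A scatter plot like this can be sketched in a few lines of matplotlib (the study-time numbers below are made up for illustration; the `Agg` backend keeps the script runnable without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, not a window
import matplotlib.pyplot as plt

# Hypothetical study-time data (illustrative, not from a real study)
hours = [2, 3, 4, 5, 6, 7, 8]
scores = [60, 65, 70, 72, 78, 83, 88]

fig, ax = plt.subplots()
ax.scatter(hours, scores)                   # one point per observation
ax.set_xlabel("Hours studied (predictor)")  # x-axis: independent variable
ax.set_ylabel("Exam score (outcome)")       # y-axis: dependent variable
ax.set_title("Study time vs. exam score")
fig.savefig("scatter.png")
```

An upward-sloping, tightly clustered point cloud like this one suggests a strong positive linear relationship before you compute anything.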

Pearson Correlation Coefficient

The Pearson correlation coefficient (denoted r) is the most common measure of linear correlation.

Pearson Correlation Coefficient

r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}

Alternatively, using z-scores:

Pearson Correlation (z-score form)

r = \frac{1}{n-1} \sum z_x \cdot z_y

Where:

  • x_i, y_i = individual data points
  • \bar{x}, \bar{y} = means of x and y
  • z_x, z_y = standardized z-scores
  • n = number of observations
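With those definitions in hand, the formula translates almost line for line into code. A minimal sketch in plain Python (`pearson_r` is a name chosen here for illustration, not from any library):

```python
import math

def pearson_r(x, y):
    """Pearson's r: sum of deviation products over the product of deviation norms."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of summed squared deviations
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)
                    * sum((yi - mean_y) ** 2 for yi in y))
    return num / den

hours = [2, 3, 4, 5, 6]
scores = [65, 70, 75, 80, 85]
print(pearson_r(hours, scores))  # → 1.0
```

These are the same numbers used in the worked example later in this lesson, so the result of 1.0 can be checked by hand.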

Properties of the Correlation Coefficient

Range

-1 \leq r \leq +1

Interpretation of r:

| Value of r | Interpretation |
| --- | --- |
| r = +1 | Perfect positive linear relationship |
| 0.7 \leq r < 1 | Strong positive correlation |
| 0.3 \leq r < 0.7 | Moderate positive correlation |
| 0 < r < 0.3 | Weak positive correlation |
| r = 0 | No linear correlation |
| -0.3 < r < 0 | Weak negative correlation |
| -0.7 < r \leq -0.3 | Moderate negative correlation |
| -1 < r \leq -0.7 | Strong negative correlation |
| r = -1 | Perfect negative linear relationship |
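The cutoffs in this table can be wrapped in a small helper function. Note that 0.3 and 0.7 are the conventions used here; they are rules of thumb, not universal standards:

```python
def interpret_r(r):
    """Map a correlation coefficient to the verbal label used in the table above."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie in [-1, 1]")
    strength = abs(r)
    if strength == 1:
        label = "Perfect"
    elif strength >= 0.7:       # cutoff from the table above
        label = "Strong"
    elif strength >= 0.3:       # cutoff from the table above
        label = "Moderate"
    elif strength > 0:
        label = "Weak"
    else:
        return "No linear correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction}"

print(interpret_r(0.85))   # → Strong positive
print(interpret_r(-0.2))   # → Weak negative
```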

Key Properties:

  1. Unitless: r has no units, making it comparable across different measurements
  2. Symmetric: Correlation of X with Y equals correlation of Y with X
  3. Linear only: r measures only linear relationships
  4. Affected by outliers: Extreme values can greatly influence r

Positive vs. Negative Correlation

Positive Correlation (r > 0)

  • As one variable increases, the other tends to increase
  • Points slope upward from left to right
  • Example: Study time and test scores

Negative Correlation (r < 0)

  • As one variable increases, the other tends to decrease
  • Points slope downward from left to right
  • Example: Exercise frequency and resting heart rate

No Correlation (r \approx 0)

  • No clear linear relationship
  • Points appear randomly scattered
  • Example: Shoe size and intelligence

Computing Correlation by Hand

Calculate the correlation between hours studied and exam scores:

| Student | Hours (x) | Score (y) | x - \bar{x} | y - \bar{y} | (x - \bar{x})(y - \bar{y}) |
| --- | --- | --- | --- | --- | --- |
| A | 2 | 65 | -2 | -10 | 20 |
| B | 3 | 70 | -1 | -5 | 5 |
| C | 4 | 75 | 0 | 0 | 0 |
| D | 5 | 80 | 1 | 5 | 5 |
| E | 6 | 85 | 2 | 10 | 20 |

Step 1: Calculate means

  • \bar{x} = 4, \bar{y} = 75

Step 2: Calculate deviations and products (see table)

  • \sum (x - \bar{x})(y - \bar{y}) = 50

Step 3: Calculate sum of squared deviations

  • \sum (x - \bar{x})^2 = 4 + 1 + 0 + 1 + 4 = 10
  • \sum (y - \bar{y})^2 = 100 + 25 + 0 + 25 + 100 = 250

Step 4: Compute r

r = \frac{50}{\sqrt{10 \times 250}} = \frac{50}{\sqrt{2500}} = \frac{50}{50} = 1.0

Result: Perfect positive correlation! Every additional hour of study corresponds perfectly with a 5-point increase in score.
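The hand calculation is easy to double-check with NumPy: `corrcoef` returns the 2×2 correlation matrix of the two inputs, and the off-diagonal entry is r:

```python
import numpy as np

hours = np.array([2, 3, 4, 5, 6])
scores = np.array([65, 70, 75, 80, 85])

# corrcoef returns [[1, r], [r, 1]]; take the off-diagonal entry
r = np.corrcoef(hours, scores)[0, 1]
print(round(r, 4))  # → 1.0
```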

Coefficient of Determination (r^2)

The coefficient of determination is the square of the correlation coefficient:

r^2 = (\text{correlation coefficient})^2

Interpretation

r^2 represents the proportion of variance in one variable that is predictable from the other variable.

Interpreting r²

If r = 0.8 between hours studied and test score:

r^2 = (0.8)^2 = 0.64 = 64\%

Interpretation: 64% of the variation in test scores can be explained by hours studied. The remaining 36% is due to other factors (prior knowledge, test anxiety, etc.).

Testing Correlation Significance

Is the observed correlation statistically significant, or could it have occurred by chance?

Hypotheses:

  • H_0: \rho = 0 (no correlation in the population)
  • H_A: \rho \neq 0 (correlation exists in the population)

Where \rho (rho) is the population correlation coefficient.

Test Statistic:

t-test for Correlation

t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}

With degrees of freedom: df = n - 2

Testing Significance

Given r = 0.65 with n = 30 observations:

t = \frac{0.65\sqrt{30-2}}{\sqrt{1-0.65^2}} = \frac{0.65 \times 5.29}{\sqrt{0.5775}} = \frac{3.44}{0.76} = 4.53

With df = 28 and typical \alpha = 0.05, the critical value is approximately 2.048.

Decision: Since 4.53 > 2.048, we reject H_0.

Conclusion: The correlation is statistically significant at the 0.05 level.
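The same test can be sketched in a few lines of Python, using SciPy only for the critical value of the t distribution (assuming SciPy is available):

```python
import math
from scipy import stats  # used only for the t critical value

r, n = 0.65, 30
df = n - 2

# t statistic from the formula above: t = r * sqrt(n-2) / sqrt(1 - r^2)
t_stat = r * math.sqrt(df) / math.sqrt(1 - r**2)

# Two-sided critical value at alpha = 0.05 (upper 2.5% tail)
t_crit = stats.t.ppf(0.975, df)

print(f"t = {t_stat:.2f}, critical value = {t_crit:.3f}")  # t = 4.53, critical value = 2.048
print("reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
```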

Assumptions and Limitations

Assumptions for Pearson’s r:

  1. Linearity: The relationship must be linear
  2. Bivariate normality: Both variables should be approximately normally distributed
  3. Homoscedasticity: Variance should be constant across the range
  4. Independence: Observations should be independent
  5. No extreme outliers: Outliers can dramatically affect r

Common Pitfalls

1. Correlation ≠ Causation

Just because two variables are correlated doesn’t mean one causes the other.

Example

Ice cream sales and drowning deaths are positively correlated. Does ice cream cause drowning? No! Both are caused by a third variable: hot weather.

2. Outliers Can Mislead

A single outlier can create, destroy, or reverse a correlation.

3. Restricted Range

Correlations computed on a restricted range of values may underestimate the true relationship.

Example

If you only study students who study 8-10 hours, you might find little correlation between study time and grades. But looking at the full range (0-10 hours) might reveal a strong correlation.
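This effect is easy to demonstrate with simulated data. The slope and noise level below are invented for illustration; the point is only that truncating the x range shrinks the observed correlation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated students: scores rise ~5 points per hour studied, plus noise
hours = np.linspace(0, 10, 200)
scores = 50 + 5 * hours + rng.normal(0, 8, size=hours.size)

# Correlation over the full 0-10 hour range
r_full = np.corrcoef(hours, scores)[0, 1]

# Correlation restricted to students who studied 8-10 hours
mask = hours >= 8
r_restricted = np.corrcoef(hours[mask], scores[mask])[0, 1]

print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

The restricted-range coefficient comes out noticeably smaller even though the underlying relationship is identical across the whole range.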

4. Lurking Variables

A third variable might be influencing both variables, creating a spurious correlation.

5. Nonlinear Relationships

Pearson’s r only measures linear relationships. Curved relationships won’t be captured.

Other Correlation Measures

Spearman’s Rank Correlation (r_s)

  • Used for ordinal data or when data doesn’t meet assumptions
  • Based on ranks rather than raw values
  • More robust to outliers

Kendall’s Tau (\tau)

  • Another rank-based correlation
  • Preferred for small samples or many tied ranks

Point-Biserial Correlation

  • Used when one variable is continuous and one is binary (0/1)
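SciPy provides all of these measures (`pearsonr`, `spearmanr`, `kendalltau`, and `pointbiserialr`). A small sketch with made-up data containing one outlier shows why the rank-based measures are more robust:

```python
import numpy as np
from scipy import stats  # SciPy assumed available

x = np.array([1, 2, 3, 4, 5, 100])  # note the extreme outlier at 100
y = np.array([2, 4, 5, 7, 9, 12])   # strictly increasing with x

pearson, _ = stats.pearsonr(x, y)     # sensitive to the outlier's magnitude
spearman, _ = stats.spearmanr(x, y)   # rank-based: sees only the ordering
kendall, _ = stats.kendalltau(x, y)   # rank-based: counts concordant pairs

print(f"Pearson  r   = {pearson:.2f}")
print(f"Spearman r_s = {spearman:.2f}")   # → 1.00 (perfectly monotone)
print(f"Kendall  tau = {kendall:.2f}")    # → 1.00 (all pairs concordant)
```

Because y increases whenever x does, both rank-based measures report a perfect 1.0, while Pearson's r is dragged below 1 by the outlier's raw magnitude.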

Practical Applications

1. Finance

  • Correlation between stock prices
  • Portfolio diversification

2. Medicine

  • Relationship between risk factors and disease
  • Dose-response relationships

3. Education

  • Study habits and academic performance
  • Test score validation

4. Psychology

  • Personality traits and behavior
  • Cognitive abilities

Summary

In this lesson, you learned:

  • Correlation measures the strength and direction of linear relationships
  • Pearson’s r ranges from -1 to +1
  • r^2 represents the proportion of variance explained
  • Always visualize data before computing correlations
  • Correlation does not imply causation
  • Watch out for outliers, non-linearity, and restricted ranges
  • Statistical significance can be tested using a t-test
