Correlation Analysis
Learn how to measure and interpret the relationship between two variables using correlation coefficients, including Pearson's r.
What is Correlation?
Correlation measures the strength and direction of the linear relationship between two quantitative variables. It answers the question: “When one variable changes, does the other tend to change in a predictable way?”
Correlation is a fundamental concept in statistics used across:
- Science (relationships between variables)
- Finance (stock price movements)
- Social sciences (behavior patterns)
- Medicine (risk factors and outcomes)
Visualizing Relationships
Before calculating correlation, always visualize your data with a scatter plot:
- X-axis: Independent variable (predictor)
- Y-axis: Dependent variable (outcome)
- Each point represents one observation
Scatter plots reveal:
- Direction of the relationship (positive or negative)
- Strength of the relationship (tight cluster vs. dispersed)
- Form of the relationship (linear vs. curved)
- Outliers that might affect the correlation
Pearson Correlation Coefficient
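The Pearson correlation coefficient (denoted as $r$) is the most common measure of linear correlation.
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$$
Alternatively, using z-scores:
$$r = \frac{1}{n-1}\sum z_{x_i} z_{y_i}$$
Where:
- $x_i, y_i$ = individual data points
- $\bar{x}, \bar{y}$ = means of x and y
- $z_{x_i}, z_{y_i}$ = standardized z-scores (computed with the sample standard deviation)
- $n$ = number of observations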
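As a minimal sketch, both forms of Pearson's formula can be written in plain Python. The function names and the sample data are illustrative assumptions, not part of the lesson, and the z-score version assumes the sample standard deviation (divisor n - 1).

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation (sum-of-products) form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def pearson_r_z(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    products = [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 4), round(pearson_r_z(xs, ys), 4))
```

Both functions should agree up to floating-point rounding; for this small data set the value is about 0.77.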
Properties of the Correlation Coefficient
Range
$$-1 \le r \le 1$$
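Interpretation of r:
Cutoffs between weak, moderate, and strong vary by field; the ranges below follow one common rule of thumb.
| Value of r | Interpretation |
|---|---|
| $r = 1$ | Perfect positive linear relationship |
| $0.7 \le r < 1$ | Strong positive correlation |
| $0.3 \le r < 0.7$ | Moderate positive correlation |
| $0 < r < 0.3$ | Weak positive correlation |
| $r = 0$ | No linear correlation |
| $-0.3 < r < 0$ | Weak negative correlation |
| $-0.7 < r \le -0.3$ | Moderate negative correlation |
| $-1 < r \le -0.7$ | Strong negative correlation |
| $r = -1$ | Perfect negative linear relationship |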
Key Properties:
- Unitless: r has no units, making it comparable across different measurements
- Symmetric: Correlation of X with Y equals correlation of Y with X
- Linear only: r measures only linear relationships
- Affected by outliers: Extreme values can greatly influence r
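The symmetry and unitless properties can be checked numerically. The sketch below uses hypothetical height and weight values and a hand-written `pearson_r` helper (an assumption, not a library function). Converting units is a positive linear rescaling, so r should not change.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation (sum-of-products) form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements, for illustration only
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 88]

r = pearson_r(heights_cm, weights_kg)

# Symmetric: swapping the roles of the variables gives the same r
r_swapped = pearson_r(weights_kg, heights_cm)

# Unitless: converting cm to inches and kg to pounds leaves r unchanged
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]
r_rescaled = pearson_r(heights_in, weights_lb)

print(math.isclose(r, r_swapped), math.isclose(r, r_rescaled))
```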
Positive vs. Negative Correlation
Positive Correlation ($r > 0$)
- As one variable increases, the other tends to increase
- Points slope upward from left to right
- Example: Study time and test scores
Negative Correlation ($r < 0$)
- As one variable increases, the other tends to decrease
- Points slope downward from left to right
- Example: Exercise frequency and resting heart rate
No Correlation ($r \approx 0$)
- No clear linear relationship
- Points appear randomly scattered
- Example: Shoe size and intelligence
Calculate the correlation between hours studied and exam scores:
| Student | Hours (x) | Score (y) | $x - \bar{x}$ | $y - \bar{y}$ | $(x - \bar{x})(y - \bar{y})$ |
|---|---|---|---|---|---|
| A | 2 | 65 | -2 | -10 | 20 |
| B | 3 | 70 | -1 | -5 | 5 |
| C | 4 | 75 | 0 | 0 | 0 |
| D | 5 | 80 | 1 | 5 | 5 |
| E | 6 | 85 | 2 | 10 | 20 |
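Step 1: Calculate means
- $\bar{x} = 4$, $\bar{y} = 75$
Step 2: Calculate deviations and products (see table)
$$\sum (x_i - \bar{x})(y_i - \bar{y}) = 20 + 5 + 0 + 5 + 20 = 50$$
Step 3: Calculate sum of squared deviations
$$\sum (x_i - \bar{x})^2 = 4 + 1 + 0 + 1 + 4 = 10$$
$$\sum (y_i - \bar{y})^2 = 100 + 25 + 0 + 25 + 100 = 250$$
Step 4: Compute r
$$r = \frac{50}{\sqrt{10 \times 250}} = \frac{50}{50} = 1$$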
Result: Perfect positive correlation! Every additional hour of study corresponds perfectly with a 5-point increase in score.
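The four steps can be checked with a few lines of Python. This is only a sketch that mirrors the table; the variable names are illustrative.

```python
import math

# Data from the worked example table
hours = [2, 3, 4, 5, 6]
scores = [65, 70, 75, 80, 85]

# Step 1: means
mean_x = sum(hours) / len(hours)
mean_y = sum(scores) / len(scores)

# Step 2: sum of the products of deviations
sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))

# Step 3: sums of squared deviations
sum_sq_x = sum((x - mean_x) ** 2 for x in hours)
sum_sq_y = sum((y - mean_y) ** 2 for y in scores)

# Step 4: correlation coefficient
r = sum_products / math.sqrt(sum_sq_x * sum_sq_y)

print(mean_x, mean_y, sum_products, sum_sq_x, sum_sq_y, r)
# prints: 4.0 75.0 50.0 10.0 250.0 1.0
```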
Coefficient of Determination ($r^2$)
The coefficient of determination is the square of the correlation coefficient:
$$r^2 = (r)^2, \qquad 0 \le r^2 \le 1$$
Interpretation
$r^2$ represents the proportion of variance in one variable that is predictable from the other variable.
If $r = 0.8$ between hours studied and test score:
$$r^2 = (0.8)^2 = 0.64$$
Interpretation: 64% of the variation in test scores can be explained by hours studied. The remaining 36% is due to other factors (prior knowledge, test anxiety, etc.).
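One way to see the "proportion of variance" reading is to fit a least-squares line and compare the squared correlation with 1 - SSE/SST. The sketch below does this on hypothetical data; the least-squares fit is a step added here for illustration and is not part of the lesson.

```python
import math

# Hypothetical data, for illustration only
hours = [1, 2, 3, 4, 5, 6]
scores = [55, 70, 62, 80, 74, 90]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
sxx = sum((x - mean_x) ** 2 for x in hours)
syy = sum((y - mean_y) ** 2 for y in scores)  # total variation (SST)

r = sxy / math.sqrt(sxx * syy)

# Least-squares line: score = intercept + slope * hours
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Unexplained variation (SSE) around the fitted line
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(hours, scores))
explained_fraction = 1 - sse / syy

print(round(r ** 2, 4), round(explained_fraction, 4))
```

The two printed numbers should match, which is the sense in which $r^2$ is the share of variation accounted for by the linear relationship.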
Testing Correlation Significance
Is the observed correlation statistically significant, or could it have occurred by chance?
Hypotheses:
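- $H_0: \rho = 0$ (no correlation in the population)
- $H_1: \rho \ne 0$ (correlation exists in the population)
Where $\rho$ (rho) is the population correlation coefficient.
Test Statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$
With degrees of freedom:
$$df = n - 2$$
Given a sample correlation $r$ with $n = 30$ observations:
$$t = \frac{r\sqrt{28}}{\sqrt{1-r^2}}$$
With $df = 28$ and typical $\alpha = 0.05$ (two-tailed), the critical value is approximately 2.048.
Decision: If $|t| > 2.048$ (for $n = 30$, whenever $|r|$ exceeds about 0.361), we reject $H_0$.
Conclusion: The correlation is then statistically significant at the 0.05 level.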
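The test can be sketched in Python. The value r = 0.8 is an assumption borrowed from the $r^2$ example, n = 30 is the sample size consistent with 28 degrees of freedom, and 2.048 is the critical value quoted above rather than something the code computes.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.8              # assumed sample correlation (illustrative)
n = 30               # sample size, so df = 28
t_critical = 2.048   # two-tailed critical value at alpha = 0.05, df = 28

t_stat = correlation_t_statistic(r, n)
reject_h0 = abs(t_stat) > t_critical

# Smallest |r| that reaches significance for this n
r_critical = t_critical / math.sqrt(t_critical ** 2 + (n - 2))

print(round(t_stat, 3), reject_h0, round(r_critical, 3))
```

With these assumed inputs the t statistic is about 7.06, well above 2.048, so the null hypothesis would be rejected.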
Assumptions and Limitations
Assumptions for Pearson’s r:
- Linearity: The relationship must be linear
- Bivariate normality: For significance tests, the two variables should be approximately jointly (bivariate) normally distributed
- Homoscedasticity: The spread of one variable should be roughly constant across the range of the other
- Independence: Observations should be independent
- No extreme outliers: Outliers can dramatically affect r
Common Pitfalls
1. Correlation ≠ Causation
Just because two variables are correlated doesn’t mean one causes the other.
Ice cream sales and drowning deaths are positively correlated. Does ice cream cause drowning? No! Both are caused by a third variable: hot weather.
2. Outliers Can Mislead
A single outlier can create, destroy, or reverse a correlation.
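A small sketch with hypothetical numbers shows how one extreme point can reverse the sign of r. The `pearson_r` helper is hand-written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation (sum-of-products) form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a perfect negative linear trend
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 6, 4, 2]
r_clean = pearson_r(xs, ys)

# Add a single extreme observation far from the other points
xs_with_outlier = xs + [20]
ys_with_outlier = ys + [30]
r_outlier = pearson_r(xs_with_outlier, ys_with_outlier)

print(round(r_clean, 3), round(r_outlier, 3))
```

Here the five original points give r = -1, while the six points including the outlier give a strongly positive r of roughly 0.89.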
3. Restricted Range
Correlations computed on a restricted range of values may underestimate the true relationship.
If you only sample students who study 8-10 hours, you might find little correlation between study time and grades. Looking at the full range (0-10 hours) might reveal a strong correlation.
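The effect can be reproduced with made-up, deterministic data: scores rise 5 points per hour, plus an alternating +8 / -8 term standing in for noise. All numbers below are assumptions chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation (sum-of-products) form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours 0, 0.5, ..., 10 with an alternating +/-8 "noise" term
hours = [0.5 * i for i in range(21)]
scores = [50 + 5 * h + (8 if i % 2 == 0 else -8) for i, h in enumerate(hours)]

# Correlation over the full 0-10 hour range
r_full = pearson_r(hours, scores)

# Correlation using only students who study 8-10 hours
restricted = [(h, s) for h, s in zip(hours, scores) if h >= 8]
r_restricted = pearson_r([h for h, _ in restricted], [s for _, s in restricted])

print(round(r_full, 2), round(r_restricted, 2))
```

The same underlying trend produces a noticeably weaker correlation in the restricted sample, because the trend spans a much smaller range relative to the noise.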
4. Lurking Variables
A third variable might be influencing both variables, creating a spurious correlation.
5. Nonlinear Relationships
Pearson’s r only measures linear relationships. A strong curved relationship, such as a U-shape, can produce an r near 0.
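As a quick illustration, a perfectly predictable but U-shaped relationship can have zero linear correlation. The data and the `pearson_r` helper below are assumptions for this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation (sum-of-products) form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is an exact function of x, but the curve is U-shaped
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

The positive and negative halves of the curve cancel, so r comes out as 0 even though y is fully determined by x.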
Other Correlation Measures
Spearman’s Rank Correlation ($r_s$)
- Used for ordinal data or when the data do not meet Pearson’s assumptions
- Based on ranks rather than raw values
- More robust to outliers
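A minimal sketch of the rank-based idea: replace each value by its rank (averaging ranks for ties) and apply Pearson's formula to the ranks. The helper names and data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation (sum-of-products) form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_r(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Hypothetical data: monotonic but nonlinear (y = x cubed)
xs = [1, 2, 3, 4, 5]
ys = [1, 8, 27, 64, 125]
print(round(pearson_r(xs, ys), 3), spearman_r(xs, ys))
```

Because the relationship is perfectly monotonic, the rank correlation is 1 while Pearson's r is slightly below 1.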
Kendall’s Tau ($\tau$)
- Another rank-based correlation
- Preferred for small samples or many tied ranks
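The idea behind Kendall's tau can be sketched by counting concordant and discordant pairs. The version below is the simple tau-a variant, which ignores ties rather than correcting for them (tie-corrected variants such as tau-b need an adjusted denominator). The function name and data are assumptions.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move in the same direction
        elif direction < 0:
            discordant += 1      # the variables move in opposite directions
        # direction == 0 means a tie; tau-a counts it as neither
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
tau = kendall_tau_a(xs, ys)
print(tau)
```

Of the 10 possible pairs here, 8 are concordant and 2 are discordant, giving tau = 0.6.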
Point-Biserial Correlation
- Used when one variable is continuous and one is binary (0/1)
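The point-biserial coefficient is numerically the same as Pearson's r when the binary variable is coded 0/1. The sketch below checks this against the group-means formula on hypothetical data; both function names are assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the deviation (sum-of-products) form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(groups, values):
    """Point-biserial r from group means; groups must contain only 0 and 1."""
    ones = [v for g, v in zip(groups, values) if g == 1]
    zeros = [v for g, v in zip(groups, values) if g == 0]
    n = len(values)
    mean_all = sum(values) / n
    # Population standard deviation (divisor n) of the continuous variable
    sd_all = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_diff = sum(ones) / len(ones) - sum(zeros) / len(zeros)
    return mean_diff / sd_all * math.sqrt(len(ones) * len(zeros) / n ** 2)

# Hypothetical data: 0 = no tutoring, 1 = tutoring; scores are continuous
tutoring = [0, 0, 0, 1, 1, 1, 1]
scores = [60, 65, 70, 72, 78, 80, 85]

r_pb = point_biserial(tutoring, scores)
r_pearson = pearson_r(tutoring, scores)
print(round(r_pb, 4), round(r_pearson, 4))
```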
Practical Applications
1. Finance
- Correlation between stock prices
- Portfolio diversification
2. Medicine
- Relationship between risk factors and disease
- Dose-response relationships
3. Education
- Study habits and academic performance
- Test score validation
4. Psychology
- Personality traits and behavior
- Cognitive abilities
Summary
In this lesson, you learned:
- Correlation measures the strength and direction of linear relationships
- Pearson’s r ranges from -1 to +1
- $r^2$ represents the proportion of variance explained
- Always visualize data before computing correlations
- Correlation does not imply causation
- Watch out for outliers, non-linearity, and restricted ranges
- Statistical significance can be tested using a t-test
Next Steps
Build on your correlation knowledge:
- Simple Linear Regression - Use correlation to make predictions
- Correlation Calculator - Compute correlations quickly
- Scatter Plot Generator - Visualize relationships