Regression Diagnostics
Learn to evaluate regression assumptions and detect problems. Master residual analysis, influence measures, and multicollinearity detection.
Why Diagnostics Matter
Regression models make assumptions about the data. When these assumptions are violated:
- Coefficient estimates may be biased
- Standard errors may be wrong
- Predictions may be unreliable
- p-values may be misleading
The Four Key Assumptions (LINE)
| Letter | Assumption | What It Means |
|---|---|---|
| L | Linearity | Relationship between X and Y is linear |
| I | Independence | Observations are independent |
| N | Normality | Errors are normally distributed |
| E | Equal Variance | Errors have constant variance (homoscedasticity) |
Residual Analysis
Residuals are the key to diagnosing regression problems:
Types of Residuals
| Type | Formula | Use |
|---|---|---|
| Raw residuals | eᵢ = yᵢ - ŷᵢ | Basic analysis |
| Standardized | eᵢ/s | Compare across models |
| Studentized | eᵢ/(s√(1-hᵢᵢ)) | Identify outliers |
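The three residual types can be computed directly from the fitted model. Below is a minimal NumPy sketch with made-up data (the variable names and toy dataset are assumptions for illustration, not part of any particular library's API):

```python
import numpy as np

# Toy data (assumed for illustration): y is linear in x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=x.size)

# Fit OLS via the design matrix X = [1, x].
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Raw residuals: e_i = y_i - y_hat_i
e = y - y_hat

# Leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Residual standard error with p = 1 predictor
n, p = x.size, 1
s = np.sqrt(np.sum(e**2) / (n - p - 1))

standardized = e / s                    # compare across models
studentized = e / (s * np.sqrt(1 - h))  # flag outliers
```

Because √(1 − hᵢᵢ) ≤ 1, studentized residuals are always at least as large in magnitude as standardized ones, which is why they are the better outlier flag for high-leverage points.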
Checking Linearity
Plot: Residuals vs Fitted Values
What to look for:
- Good: Random scatter around zero
- Bad: Curved pattern suggests non-linear relationship
Good residual plot:
- Points scattered randomly above/below zero
- No obvious pattern
Bad residual plot:
- U-shaped or curved pattern
- Indicates need for polynomial terms or transformation
Remedy: Add squared term, use transformation (log, sqrt)
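A quick way to see the remedy at work: fit a line to deliberately quadratic data, and check how strongly the residuals still correlate with x². This is a sketch with assumed toy data; the correlation check is just a crude stand-in for eyeballing the residual plot:

```python
import numpy as np

# Assumed example: the true relationship is quadratic, but we fit a line.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = (x - 5) ** 2 + rng.normal(0, 1.0, size=x.size)

def fit_residuals(design, y):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

# Linear fit: residuals still carry the curvature (U-shaped pattern).
e_lin = fit_residuals(np.column_stack([np.ones_like(x), x]), y)

# Add a squared term: the curvature is absorbed by the model.
e_quad = fit_residuals(np.column_stack([np.ones_like(x), x, x**2]), y)

# Correlation of residuals with x^2 as a crude curvature check.
curv_lin = np.corrcoef(e_lin, x**2)[0, 1]
curv_quad = np.corrcoef(e_quad, x**2)[0, 1]
```

After adding the squared term, the residuals are orthogonal to x² by construction, so the leftover correlation drops to essentially zero.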
Checking Equal Variance (Homoscedasticity)
Plot: Residuals vs Fitted Values
What to look for:
- Good: Constant spread across all fitted values
- Bad: Funnel shape (spread increases or decreases)
Common patterns:
- Fan spreading out → variance increases with Y
- Fan narrowing → variance decreases with Y
Causes:
- Skewed dependent variable
- Missing variables
- Measurement issues
Remedies:
- Log transformation of Y
- Weighted least squares
- Robust standard errors
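The weighted least squares remedy can be sketched as follows, on assumed data where the error spread grows with x. Note the big caveat baked into the comments: WLS assumes you know the variance function, which in practice is usually estimated rather than known:

```python
import numpy as np

# Assumed example: error sd grows proportionally with x (heteroscedastic).
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(0, 0.5 * x)

X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares: still unbiased under heteroscedasticity,
# but inefficient, and its usual standard errors are wrong.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: multiply each row by 1/sd_i (here 1/x).
# This assumes the variance function is known, which is rarely exact
# in practice; weights are often estimated instead.
w = 1.0 / x
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
```

Multiplying each observation by 1/sdᵢ rescales the model so every transformed error has the same variance, restoring the equal-variance assumption.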
Checking Normality
Plot: Q-Q Plot (Normal probability plot) or Histogram of residuals
What to look for:
- Good: Points follow diagonal line on Q-Q plot
- Bad: Systematic deviations from line
| Pattern | Interpretation |
|---|---|
| S-curve | Heavy tails (outliers) |
| Concave up | Right-skewed residuals |
| Concave down | Left-skewed residuals |
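Both the Q-Q plot and a formal test are easy to run. A minimal SciPy sketch, comparing assumed near-normal residuals against deliberately right-skewed ones:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
resid_ok = rng.normal(0, 1, 200)          # roughly normal residuals
resid_skew = rng.exponential(1, 200) - 1  # right-skewed residuals

# Q-Q plot coordinates: plot osm (theoretical quantiles) against
# osr (ordered residuals); r close to 1 means points hug the diagonal.
(osm, osr), (slope, intercept, r) = stats.probplot(resid_ok)

# Shapiro-Wilk test: a small p-value is evidence against normality.
_, p_ok = stats.shapiro(resid_ok)
_, p_skew = stats.shapiro(resid_skew)
```

The skewed sample is rejected decisively, while the normal sample produces a near-straight Q-Q line.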
Checking Independence
When to worry:
- Time series data (autocorrelation)
- Clustered data (students in schools)
- Repeated measures (same subjects)
Diagnostic: Durbin-Watson test for autocorrelation
- DW ≈ 2: No autocorrelation
- DW < 2: Positive autocorrelation
- DW > 2: Negative autocorrelation
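The Durbin-Watson statistic is simple enough to compute by hand: DW = Σ(eₜ − eₜ₋₁)² / Σeₜ². A sketch with assumed simulated errors, one independent series and one with strong positive autocorrelation:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), ranging from 0 to 4."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(4)

# Independent errors: DW should sit near 2.
e_indep = rng.normal(0, 1, 500)

# AR(1) errors with rho = 0.8: strong positive autocorrelation,
# so DW should land well below 2 (roughly 2 * (1 - rho)).
e_ar = np.zeros(500)
for t in range(1, 500):
    e_ar[t] = 0.8 * e_ar[t - 1] + rng.normal()
```

With ρ = 0.8 the statistic comes out near 2(1 − 0.8) = 0.4, far below the "no autocorrelation" benchmark of 2.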
Detecting Outliers
Types of Unusual Points
| Type | Description | Effect |
|---|---|---|
| Outlier | Unusual Y value | May affect fit |
| Leverage point | Unusual X value | Has potential for influence |
| Influential point | Changes regression when removed | Affects conclusions |
Leverage
Leverage (hᵢᵢ) measures how far xᵢ is from the mean of X.
A point has high leverage if hᵢᵢ > 2(p + 1)/n, where p = number of predictors.
Cook’s Distance
Cook’s D measures overall influence on regression coefficients.
Rule of thumb: Dᵢ > 1 or Dᵢ > 4/n suggests influential point
Example: suppose point #47 has:
- Standardized residual = 3.2 (large outlier)
- Leverage = 0.15 (high)
- Cook’s D = 1.8 (very influential)
Action:
- Check for data entry error
- Run regression with and without point
- Report both results if conclusions differ
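Leverage and Cook's distance can be computed together from the hat matrix. A sketch with an assumed dataset containing one deliberately planted influential point (extreme x, pulled off the true line), using the closed form Dᵢ = eᵢ²hᵢᵢ / ((p+1)s²(1 − hᵢᵢ)²):

```python
import numpy as np

# Assumed example: 29 well-behaved points plus one planted point with
# an extreme x value that is also dragged off the true line.
rng = np.random.default_rng(5)
x = np.concatenate([np.linspace(0, 10, 29), [25.0]])
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, size=30)
y[-1] -= 20.0   # pull the high-leverage point away from the line

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverage h_ii

n, p = len(y), 1
s2 = np.sum(e**2) / (n - p - 1)

# Cook's distance: D_i = e_i^2 * h_ii / ((p+1) * s^2 * (1 - h_ii)^2)
D = e**2 * h / ((p + 1) * s2 * (1 - h) ** 2)

high_leverage = np.where(h > 2 * (p + 1) / n)[0]
influential = np.where(D > 4 / n)[0]
```

The planted point exceeds both the leverage threshold 2(p+1)/n and the Cook's D cutoffs, which is exactly the pattern the diagnostics are designed to flag.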
Multicollinearity
Multicollinearity: Predictors are highly correlated with each other.
Problems Caused
- Unstable coefficient estimates
- Large standard errors
- Coefficients may flip signs
- Difficulty interpreting individual effects
Detection
Compute the variance inflation factor (VIF) for each predictor:
VIFⱼ = 1 / (1 − Rⱼ²)
Where Rⱼ² is the R² from regressing Xⱼ on all other predictors.
Rule of thumb: VIF > 5 or 10 indicates problematic multicollinearity
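The VIF definition translates directly into code: regress each column on the others and invert 1 − R². A sketch with assumed house-price predictors where square footage and room count are built to move together:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Assumed example: square footage and room count are strongly tied.
rng = np.random.default_rng(6)
sqft = rng.normal(1500, 300, 100)
rooms = sqft / 250 + rng.normal(0, 0.3, 100)   # driven mostly by sqft
baths = rng.normal(2.0, 0.5, 100)              # mostly independent

vifs = vif(np.column_stack([sqft, rooms, baths]))
```

The two correlated predictors inflate each other's VIF well past the rule-of-thumb cutoff, while the independent one stays near 1.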
Example: predicting house price with:
- Square footage (X₁)
- Number of rooms (X₂)
- Number of bathrooms (X₃)
VIF results:
- VIF(X₁) = 8.5 ⚠️
- VIF(X₂) = 7.2 ⚠️
- VIF(X₃) = 2.1 ✓
Square footage and rooms are highly correlated.
Remedies:
- Remove one of the correlated variables
- Combine into single variable
- Use principal components
- Use regularization (ridge regression)
Diagnostic Summary Table
| Problem | Detection | Remedy |
|---|---|---|
| Non-linearity | Curved residual pattern | Transform X or Y, add polynomial |
| Heteroscedasticity | Funnel pattern | Transform Y, weighted regression |
| Non-normality | Q-Q plot deviation | Transform Y (usually not critical) |
| Autocorrelation | Durbin-Watson | Time series methods |
| Outliers | Studentized residuals > 3 | Investigate, robust regression |
| High leverage | hᵢᵢ > 2(p+1)/n | Check data, be cautious |
| Influential points | Cook’s D > 1 | Report sensitivity |
| Multicollinearity | VIF > 5-10 | Remove/combine predictors |
Common Transformations
| Problem | Transformation |
|---|---|
| Right-skewed Y | log(Y) or √Y |
| Increasing variance | log(Y) |
| Non-linear relationship | log(X), √X, or X² |
| Proportions (0-1) | logit(Y) |
| Counts | √Y or log(Y+1) |
Practical Workflow
1. Fit initial model
2. Check residual plots
- Residuals vs fitted → linearity, equal variance
- Residuals vs each X → linearity with individual predictors
- Q-Q plot → normality
3. Check for unusual points
- Standardized residuals > 3?
- High leverage points?
- Cook’s D > 1?
4. Check multicollinearity
- Correlation matrix
- VIF for each predictor
5. Apply remedies as needed
6. Recheck diagnostics after changes
Summary
In this lesson, you learned:
- Regression requires LINE assumptions: Linearity, Independence, Normality, Equal variance
- Residual plots reveal assumption violations
- Outliers have unusual Y; leverage points have unusual X; influential points affect conclusions
- Cook’s Distance measures overall influence
- VIF detects multicollinearity (VIF > 5-10 is problematic)
- Transformations can fix many problems
- Always check diagnostics before trusting results
Practice Problems
1. A residual plot shows points spreading out like a fan as fitted values increase. What assumption is violated? What remedy would you try?
2. A regression has three predictors with VIFs of 1.5, 12.3, and 11.8. What does this indicate? What would you do?
3. Point #15 has standardized residual = 0.5 and Cook’s D = 2.3. Should you be concerned? Why?
4. When is non-normality of residuals most concerning?
Answers
1. Heteroscedasticity (unequal variance) is violated.
Remedies:
- Log transformation of Y
- Weighted least squares
- Use robust standard errors
The fan shape indicates variance increases with the mean, often fixed by log(Y).
2. Multicollinearity between predictors 2 and 3.
VIFs > 10 indicate severe multicollinearity.
Actions:
- Examine correlation between predictors 2 and 3
- Consider removing one predictor
- Combine into single index
- Use ridge regression if both predictors needed
3. Yes, be concerned!
The small residual (0.5) but high Cook’s D (2.3) indicates this is a high leverage point—unusual X values that strongly influence the regression line.
Actions:
- Check if data is correct
- Run regression with and without point #15
- Report sensitivity of results
4. Non-normality is most concerning when:
- Sample size is small (n < 30)
- Making confidence intervals or testing hypotheses
- Residuals are severely skewed or have heavy tails
With large samples, the Central Limit Theorem helps, and the effect is minimal. Non-normality is the least important of the LINE assumptions—focus more on linearity and equal variance.
Next Steps
Apply your diagnostic skills:
- Multiple Regression - Complex models need careful checking
- Logistic Regression - Different diagnostics for binary outcomes
- Linear Regression - Review fundamentals