Advanced · 25 minutes

Regression Diagnostics

Learn to evaluate regression assumptions and detect problems. Master residual analysis, influence measures, and multicollinearity detection.


Why Diagnostics Matter

Regression models make assumptions about the data. When these assumptions are violated:

  • Coefficient estimates may be biased
  • Standard errors may be wrong
  • Predictions may be unreliable
  • p-values may be misleading

The Four Key Assumptions (LINE)

Letter | Assumption | What It Means
L | Linearity | The relationship between X and Y is linear
I | Independence | Observations are independent of one another
N | Normality | Errors are normally distributed
E | Equal Variance | Errors have constant variance (homoscedasticity)

Residual Analysis

Residuals are the key to diagnosing regression problems:

Residual

eᵢ = yᵢ − ŷᵢ = Observed − Predicted

Types of Residuals

Type | Formula | Use
Raw residuals | eᵢ = yᵢ − ŷᵢ | Basic analysis
Standardized | eᵢ / s | Compare across models
Studentized | eᵢ / (s√(1 − hᵢᵢ)) | Identify outliers
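All three residual types can be computed directly from the hat matrix. A minimal numpy sketch on a small illustrative dataset:

```python
import numpy as np

# Illustrative data: a simple linear trend with noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# Raw residuals: observed minus predicted.
e = y - y_hat

# Leverages h_ii: diagonal of the hat matrix X (X'X)^-1 X'.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Residual standard error with n - p - 1 degrees of freedom
# (p = 1 predictor, plus an intercept).
n, p = len(y), 1
s = np.sqrt(np.sum(e**2) / (n - p - 1))

standardized = e / s                      # compare across models
studentized = e / (s * np.sqrt(1 - h))    # flag outliers (|value| > 3)
```

Because 0 < 1 − hᵢᵢ < 1, the studentized residual is always at least as large in magnitude as the standardized one, which is why it is the better outlier screen.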

Checking Linearity

Plot: Residuals vs Fitted Values

What to look for:

  • Good: Random scatter around zero
  • Bad: A curved pattern suggests a non-linear relationship

Linearity Check

Good residual plot:

  • Points scattered randomly above/below zero
  • No obvious pattern

Bad residual plot:

  • U-shaped or curved pattern
  • Indicates need for polynomial terms or transformation

Remedy: Add squared term, use transformation (log, sqrt)
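As a numeric stand-in for eyeballing the residuals-vs-fitted plot, the sketch below fits a straight line to deliberately quadratic synthetic data, detects the curved pattern as a correlation between the residuals and the squared fitted values, then applies the suggested remedy of adding a squared term (all data and thresholds are illustrative):

```python
import numpy as np

# Synthetic data with a genuinely quadratic relationship.
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 0.5 * x + 0.3 * x**2

# Fit a straight line anyway.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# The U-shaped pattern shows up as a clear correlation between
# the residuals and the squared fitted values.
curvature_corr = np.corrcoef(resid, fitted**2)[0, 1]

# Remedy: add the squared term and refit; the curvature vanishes.
X2 = np.column_stack([np.ones_like(x), x, x**2])
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
resid2 = y - X2 @ beta2
```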


Checking Equal Variance (Homoscedasticity)

Plot: Residuals vs Fitted Values

What to look for:

  • Good: Constant spread across all fitted values
  • Bad: Funnel shape (spread increases or decreases)

Heteroscedasticity

Common patterns:

  • Fan spreading out → variance increases with Y
  • Fan narrowing → variance decreases with Y

Causes:

  • Skewed dependent variable
  • Missing variables
  • Measurement issues

Remedies:

  • Log transformation of Y
  • Weighted least squares
  • Robust standard errors
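A Breusch-Pagan-style numeric check for the funnel pattern, plus a weighted-least-squares remedy, can be sketched as follows (synthetic data whose noise deliberately grows with x; the weights assume variance ∝ x², purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = np.linspace(1.0, 10.0, n)
# Noise standard deviation grows with x: the classic fan shape.
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
fitted = X @ beta

# Breusch-Pagan-style check: regress squared residuals on the
# fitted values; a clearly positive slope signals heteroscedasticity.
Z = np.column_stack([np.ones(n), fitted])
gamma, *_ = np.linalg.lstsq(Z, e**2, rcond=None)
bp_slope = gamma[1]

# Remedy sketch: weighted least squares with weights 1/x^2,
# assuming (for illustration) that variance is proportional to x^2.
w_sqrt = 1.0 / x
beta_wls, *_ = np.linalg.lstsq(X * w_sqrt[:, None], y * w_sqrt, rcond=None)
```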

Checking Normality

Plot: Q-Q Plot (Normal probability plot) or Histogram of residuals

What to look for:

  • Good: Points follow diagonal line on Q-Q plot
  • Bad: Systematic deviations from the line

Pattern | Interpretation
S-curve | Heavy tails (outliers)
Concave up | Right-skewed residuals
Concave down | Left-skewed residuals
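One way to quantify "points follow the diagonal" is the correlation between the sorted residuals and the theoretical normal quantiles (a minimal sketch on synthetic stand-in residuals; values near 1 indicate approximate normality):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
resid = rng.normal(size=100)   # stand-in residuals for illustration

# Q-Q correlation: sorted residuals vs theoretical normal quantiles.
# Near 1.0 means the points hug the diagonal on a Q-Q plot.
n = len(resid)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
qq_corr = np.corrcoef(np.sort(resid), theoretical)[0, 1]
```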

Checking Independence

When to worry:

  • Time series data (autocorrelation)
  • Clustered data (students in schools)
  • Repeated measures (same subjects)

Diagnostic: Durbin-Watson test for autocorrelation

Durbin-Watson Statistic

DW = Σᵢ₌₂ⁿ (eᵢ − eᵢ₋₁)² / Σᵢ₌₁ⁿ eᵢ²

  • DW ≈ 2: No autocorrelation
  • DW substantially below 2: Positive autocorrelation
  • DW substantially above 2: Negative autocorrelation
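The statistic is easy to compute directly. A minimal numpy sketch, using a synthetic AR(1) series built to show positive autocorrelation pulling DW below 2:

```python
import numpy as np

def durbin_watson(e):
    """Sum of squared successive differences over the sum of squares."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(42)

# Independent residuals: DW lands near 2.
e_indep = rng.normal(size=500)

# AR(1) residuals with coefficient 0.8: positive autocorrelation
# drags DW well below 2 (roughly 2 * (1 - 0.8) = 0.4 in expectation).
e_ar = np.empty(500)
e_ar[0] = rng.normal()
for t in range(1, 500):
    e_ar[t] = 0.8 * e_ar[t - 1] + rng.normal()
```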

Detecting Outliers

Types of Unusual Points

Type | Description | Effect
Outlier | Unusual Y value | May affect fit
Leverage point | Unusual X value | Has potential for influence
Influential point | Changes regression when removed | Affects conclusions

Leverage

Leverage (hᵢᵢ) measures how far xᵢ is from the mean of X.

High Leverage

A point has high leverage if: hᵢᵢ > 2(p + 1)/n

Where p = number of predictors

Cook’s Distance

Cook’s D measures overall influence on regression coefficients.

Cook's Distance

Dᵢ = [eᵢ² / ((p + 1)·MSE)] · [hᵢᵢ / (1 − hᵢᵢ)²]

Here p + 1 counts the p predictors plus the intercept.

Rule of thumb: Dᵢ > 1 (or the stricter Dᵢ > 4/n) suggests an influential point

Influential Point Analysis

Point #47 has:

  • Standardized residual = 3.2 (large outlier)
  • Leverage = 0.15 (high)
  • Cook’s D = 1.8 (very influential)

Action:

  1. Check for data entry error
  2. Run regression with and without point
  3. Report both results if conclusions differ
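Leverage and Cook's Distance can be computed together from the hat matrix. A numpy sketch on an illustrative dataset whose last point sits far out in X (following the common convention that Cook's D divides by the number of model parameters, p + 1):

```python
import numpy as np

# Illustrative data: the last point has an unusual X value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 7.9, 9.2, 10.0])

X = np.column_stack([np.ones_like(x), x])
n, p = len(y), 1                       # p = number of predictors
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

# Leverages and the 2(p+1)/n cutoff.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
high_leverage = h > 2 * (p + 1) / n

# Cook's Distance, dividing by the number of model parameters (p + 1).
mse = np.sum(e**2) / (n - p - 1)
cooks_d = (e**2 / ((p + 1) * mse)) * (h / (1 - h) ** 2)
```

Running this flags only the last point as high leverage, and its Cook's D dwarfs the others: exactly the "run with and without the point" situation described above.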

Multicollinearity

Multicollinearity: Predictors are highly correlated with each other.

Problems Caused

  • Unstable coefficient estimates
  • Large standard errors
  • Coefficients may flip signs
  • Difficulty interpreting individual effects

Detection

Variance Inflation Factor

VIFⱼ = 1 / (1 − Rⱼ²)

Where Rⱼ² is R² from regressing Xⱼ on all other predictors.

Rule of thumb: VIF > 5 or 10 indicates problematic multicollinearity
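VIFs equal the diagonal of the inverse correlation matrix of the predictors, which makes them easy to compute without fitting p separate regressions. A sketch mimicking the house-price example below with synthetic, deliberately correlated square footage and room counts:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

# Synthetic stand-ins: room count is built to track square footage
# closely, while bathroom count is essentially independent.
sqft = rng.normal(2000.0, 400.0, n)
rooms = sqft / 300.0 + rng.normal(0.0, 0.4, n)
baths = rng.normal(2.0, 0.5, n)
X = np.column_stack([sqft, rooms, baths])

# VIFs are the diagonal of the inverse correlation matrix of the
# predictors -- algebraically the same as 1 / (1 - R_j^2).
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))
```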

Multicollinearity Example

Predicting house price with:

  • Square footage (X₁)
  • Number of rooms (X₂)
  • Number of bathrooms (X₃)

VIF results:

  • VIF(X₁) = 8.5 ⚠️
  • VIF(X₂) = 7.2 ⚠️
  • VIF(X₃) = 2.1 ✓

Square footage and rooms are highly correlated.

Remedies:

  • Remove one of the correlated variables
  • Combine into single variable
  • Use principal components
  • Use regularization (ridge regression)

Diagnostic Summary Table

Problem | Detection | Remedy
Non-linearity | Curved residual pattern | Transform X or Y, add polynomial
Heteroscedasticity | Funnel pattern | Transform Y, weighted regression
Non-normality | Q-Q plot deviation | Transform Y (usually not critical)
Autocorrelation | Durbin-Watson | Time series methods
Outliers | Studentized residuals > 3 | Investigate, robust regression
High leverage | hᵢᵢ > 2(p+1)/n | Check data, be cautious
Influential points | Cook's D > 1 | Report sensitivity
Multicollinearity | VIF > 5-10 | Remove/combine predictors

Common Transformations

Problem | Transformation
Right-skewed Y | log(Y) or √Y
Increasing variance | log(Y)
Non-linear relationship | log(X), √X, or X²
Proportions (0-1) | logit(Y)
Counts | √Y or log(Y + 1)

Practical Workflow

Complete Diagnostic Checklist

1. Fit initial model

2. Check residual plots

  • Residuals vs fitted → linearity, equal variance
  • Residuals vs each X → linearity with individual predictors
  • Q-Q plot → normality

3. Check for unusual points

  • Standardized residuals > 3?
  • High leverage points?
  • Cook’s D > 1?

4. Check multicollinearity

  • Correlation matrix
  • VIF for each predictor

5. Apply remedies as needed

6. Recheck diagnostics after changes
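The checklist can be bundled into a single numeric pass. A minimal numpy sketch (`diagnose` and its report keys are illustrative names; the cutoffs echo this lesson's rules of thumb):

```python
import numpy as np

def diagnose(X_raw, y):
    """One numeric pass over the checklist. X_raw: (n, p) predictors
    without an intercept column; returns headline diagnostics."""
    n, p = X_raw.shape
    X = np.column_stack([np.ones(n), X_raw])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
    mse = np.sum(e**2) / (n - p - 1)
    studentized = e / (np.sqrt(mse) * np.sqrt(1 - h))
    cooks = (e**2 / ((p + 1) * mse)) * (h / (1 - h) ** 2)
    dw = np.sum(np.diff(e) ** 2) / np.sum(e**2)
    if p > 1:
        vif = np.diag(np.linalg.inv(np.corrcoef(X_raw, rowvar=False)))
    else:
        vif = np.array([1.0])
    return {
        "durbin_watson": dw,                               # want near 2
        "max_abs_studentized": np.abs(studentized).max(),  # want below 3
        "max_cooks_d": cooks.max(),                        # want below 1
        "max_vif": vif.max(),                              # want below 5
    }
```

On well-behaved data every headline number sits inside its cutoff; any value outside points you back to the relevant section above.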


Summary

In this lesson, you learned:

  • Regression requires LINE assumptions: Linearity, Independence, Normality, Equal variance
  • Residual plots reveal assumption violations
  • Outliers have unusual Y; leverage points have unusual X; influential points affect conclusions
  • Cook’s Distance measures overall influence
  • VIF detects multicollinearity (VIF > 5-10 is problematic)
  • Transformations can fix many problems
  • Always check diagnostics before trusting results

Practice Problems

1. A residual plot shows points spreading out like a fan as fitted values increase. What assumption is violated? What remedy would you try?

2. A regression has three predictors with VIFs of 1.5, 12.3, and 11.8. What does this indicate? What would you do?

3. Point #15 has standardized residual = 0.5 and Cook’s D = 2.3. Should you be concerned? Why?

4. When is non-normality of residuals most concerning?

Answers

1. Heteroscedasticity (unequal variance) is violated.

Remedies:

  • Log transformation of Y
  • Weighted least squares
  • Use robust standard errors

The fan shape indicates variance increases with the mean, often fixed by log(Y).

2. Multicollinearity between predictors 2 and 3.

VIFs > 10 indicate severe multicollinearity.

Actions:

  • Examine correlation between predictors 2 and 3
  • Consider removing one predictor
  • Combine into single index
  • Use ridge regression if both predictors needed

3. Yes, be concerned!

The small residual (0.5) but high Cook’s D (2.3) indicates this is a high leverage point—unusual X values that strongly influence the regression line.

Actions:

  • Check if data is correct
  • Run regression with and without point #15
  • Report sensitivity of results

4. Non-normality is most concerning when:

  • Sample size is small (n < 30)
  • Making confidence intervals or testing hypotheses
  • Residuals are severely skewed or have heavy tails

With large samples, the Central Limit Theorem helps, and the effect is minimal. Non-normality is the least important of the LINE assumptions—focus more on linearity and equal variance.

Next Steps

Apply your diagnostic skills to your own regression models before trusting any of their conclusions.
