Multiple Regression
Extend regression to multiple predictors. Learn to interpret coefficients, assess multicollinearity, and build predictive models.
Introduction to Multiple Regression
Multiple regression extends simple regression to include multiple predictor variables.
ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
Where:
- ŷ = predicted value
- b₀ = intercept
- b₁, b₂, …, bₖ = regression coefficients
- x₁, x₂, …, xₖ = predictor variables
Predicting house price:
Price = 50,000 + 100(Square Feet) + 15,000(Bedrooms) + 8,000(Bathrooms)
For a house with 2000 sq ft, 3 bedrooms, 2 bathrooms:
Price = 50,000 + 100(2000) + 15,000(3) + 8,000(2) = 50,000 + 200,000 + 45,000 + 16,000 = $311,000
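To see the arithmetic end to end, here is a minimal Python sketch of the house-price example; `predict_price` is a hypothetical helper, and the coefficients are the illustrative ones from the lesson, not estimates from real data.

```python
def predict_price(square_feet, bedrooms, bathrooms):
    """Apply the illustrative house-price equation from the lesson."""
    return 50_000 + 100 * square_feet + 15_000 * bedrooms + 8_000 * bathrooms

print(predict_price(2000, 3, 2))  # 311000
```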
Interpreting Coefficients
In multiple regression, each coefficient is a partial effect—the effect of that variable holding all other variables constant.
Model: Salary = 30,000 + 2,000(Experience) + 5,000(Education)
Coefficient for Experience (2,000): “Holding education constant, each additional year of experience is associated with $2,000 higher salary.”
Coefficient for Education (5,000): “Holding experience constant, each additional year of education is associated with $5,000 higher salary.”
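A quick way to see partial effects in practice is to fit a model and read off the slopes. This sketch assumes statsmodels is available and uses simulated salary data built to match the example model; the variable names, seed, and noise level are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
experience = rng.uniform(0, 20, n)   # years of experience
education = rng.uniform(10, 20, n)   # years of education
salary = 30_000 + 2_000 * experience + 5_000 * education + rng.normal(0, 5_000, n)

X = sm.add_constant(np.column_stack([experience, education]))
fit = sm.OLS(salary, X).fit()
# Each slope estimates a partial effect: the change in salary per unit change
# in that predictor, holding the other predictor constant.
print(fit.params)  # roughly [30000, 2000, 5000]
```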
Adjusted R-Squared
As you add more predictors, R² always increases (or stays the same). Adjusted R² penalizes the model for adding variables that do not improve the fit enough.
R²_adj = 1 - [(1-R²)(n-1)] / (n-k-1)
Where:
- n = sample size
- k = number of predictors
| Comparison | Interpretation |
|---|---|
| R²_adj increases | Variable improves model |
| R²_adj decreases | Variable doesn’t help enough |
| R²_adj much smaller than R² | Possible overfitting |
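The formula translates directly into code. This small sketch uses the R² = 0.60, n = 100, k = 3 model from the F-test example below; `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for a model with n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.60, 100, 3))  # ≈ 0.5875, slightly below R² = 0.60
```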
F-Test for Overall Model
H₀: All slopes = 0 (model has no predictive value)
H₁: At least one slope ≠ 0
F = (R²/k) / [(1-R²)/(n-k-1)]
With df₁ = k and df₂ = n - k - 1
Model: R² = 0.60, n = 100, k = 3 predictors
F = (0.60/3) / (0.40/96) = 0.20 / 0.00417 = 48.0
With df₁ = 3, df₂ = 96, critical F ≈ 2.70
Since 48.0 is much greater than 2.70, the model is significant.
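The same calculation in Python, with a p-value from the F distribution via SciPy; `overall_f_test` is a hypothetical helper name.

```python
from scscipy import stats  # noqa: typo guard
from scipy import stats

def overall_f_test(r2, n, k):
    """F statistic and p-value for H0: all k slopes are zero."""
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)   # upper-tail probability
    return f, p

print(overall_f_test(0.60, 100, 3))  # F ≈ 48.0, p far below 0.05
```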
Individual Coefficient Tests
H₀: βᵢ = 0
H₁: βᵢ ≠ 0
t = bᵢ / SE(bᵢ)
With df = n - k - 1
A significant t-test means that variable contributes to the model beyond what the other variables already explain.
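As a sketch, the t statistic and its two-sided p-value can be computed directly from a coefficient and its standard error; the numbers below (b = 2.5, SE = 0.8, n = 100, k = 3) are hypothetical.

```python
from scipy import stats

def coefficient_t_test(b, se, n, k):
    """Two-sided t-test of H0: beta_i = 0 with df = n - k - 1."""
    t = b / se
    p = 2 * stats.t.sf(abs(t), n - k - 1)
    return t, p

print(coefficient_t_test(2.5, 0.8, 100, 3))  # t ≈ 3.13, p ≈ 0.002
```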
Multicollinearity
Multicollinearity occurs when predictors are highly correlated with each other.
Problems Caused
- Unstable coefficient estimates
- Large standard errors
- Coefficients may have “wrong” signs
- Hard to determine individual variable importance
Detection: Variance Inflation Factor (VIF)
VIFⱼ = 1 / (1 - Rⱼ²)
Where Rⱼ² is R² from regressing xⱼ on all other predictors
| VIF Value | Interpretation |
|---|---|
| 1 | No correlation with other predictors |
| 1-5 | Moderate correlation |
| 5-10 | High correlation (concerning) |
| >10 | Severe multicollinearity |
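The definition translates directly into code: regress each predictor on the others and apply 1 / (1 − Rⱼ²). This sketch assumes statsmodels; the `vif` helper is a hypothetical name (statsmodels also provides a `variance_inflation_factor` function that does the same job).

```python
import numpy as np
import statsmodels.api as sm

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j of X on all the other columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        others = sm.add_constant(np.delete(X, j, axis=1))
        r2 = sm.OLS(X[:, j], others).fit().rsquared
        out.append(1.0 / (1.0 - r2))
    return out
```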
Solutions
- Remove one of the correlated variables
- Combine variables (create index)
- Use principal component analysis
- Use ridge regression or LASSO
Model Building Strategies
1. Backward Elimination
Start with all variables and remove the least significant one at a time (see the sketch after this list).
2. Forward Selection
Start with no variables and add the most significant one at a time.
3. Stepwise Selection
A combination of both: add and remove variables based on significance at each step.
4. Theory-Based
Include variables based on domain knowledge, regardless of significance.
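Below is an illustrative sketch of backward elimination using p-values as the dropping criterion; it assumes statsmodels and a pandas DataFrame of candidate predictors, and the `backward_eliminate` helper and the 0.05 cutoff are hypothetical choices, not a standard API.

```python
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """Drop the predictor with the largest p-value until every remaining
    p-value is below alpha. X is a pandas DataFrame of candidate predictors."""
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            break
        cols.remove(worst)
    return cols
```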
Categorical Predictors (Dummy Variables)
Categorical variables need to be converted to dummy variables.
Region with 3 categories: North, South, West
Create k-1 = 2 dummy variables:
- D_South = 1 if South, 0 otherwise
- D_West = 1 if West, 0 otherwise
- North is the reference category (when both dummies = 0)
Model: Y = 100 + 5(D_South) + (-3)(D_West)
- North: Y = 100 + 5(0) + (-3)(0) = 100
- South: Y = 100 + 5(1) + (-3)(0) = 105
- West: Y = 100 + 5(0) + (-3)(1) = 97
Interpretation: South is 5 units higher than North; West is 3 units lower than North.
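In pandas, `get_dummies` with `drop_first=True` produces the k − 1 coding described above; the small DataFrame here is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"Region": ["North", "South", "West", "South"]})

# drop_first=True keeps k - 1 = 2 dummies, so North becomes the reference category.
dummies = pd.get_dummies(df["Region"], prefix="D", drop_first=True)
print(dummies)   # columns D_South and D_West, each 0/1
```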
Interaction Terms
Interaction occurs when the effect of one variable depends on another.
ŷ = b₀ + b₁x₁ + b₂x₂ + b₃(x₁ × x₂)
Income = 20,000 + 3,000(Education) + 2,000(Experience) + 500(Education × Experience)
The effect of education depends on experience:
- At Experience = 0: Effect of education = 3,000
- At Experience = 10: Effect of education = 3,000 + 500(10) = 8,000
More experience amplifies the benefit of education!
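One way to fit such a model is with the statsmodels formula interface, where `education:experience` adds the product term; the simulated data below is hypothetical and built to roughly follow the income equation above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "education": rng.uniform(10, 20, n),
    "experience": rng.uniform(0, 15, n),
})
df["income"] = (20_000 + 3_000 * df["education"] + 2_000 * df["experience"]
                + 500 * df["education"] * df["experience"]
                + rng.normal(0, 10_000, n))

# education:experience is just the interaction; education*experience would
# expand to both main effects plus the interaction.
fit = smf.ols("income ~ education + experience + education:experience", data=df).fit()
print(fit.params)
```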
Model Diagnostics
Check These:
- Linearity: Residuals vs. fitted values plot
- Normality: Q-Q plot of residuals
- Homoscedasticity: Constant variance in residual plot
- Independence: No patterns in residual sequence
- Influential points: Cook’s distance, leverage
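A minimal diagnostics sketch covering these checks, assuming `fit` is a fitted statsmodels OLS result (for example the interaction model above) and that matplotlib is available.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

residuals = fit.resid
fitted = fit.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, residuals)                     # linearity, constant variance
axes[0].axhline(0, color="gray")
axes[0].set(xlabel="Fitted values", ylabel="Residuals")
sm.qqplot(residuals, line="45", fit=True, ax=axes[1])  # normality of residuals
plt.show()

influence = fit.get_influence()
cooks_d = influence.cooks_distance[0]   # influential points
leverage = influence.hat_matrix_diag    # leverage of each observation
```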
Remedies for Violations:
- Non-linearity: Transform variables or add polynomial terms
- Non-normality: Transform Y variable
- Heteroscedasticity: Transform Y or use weighted least squares
- Outliers: Investigate, possibly remove with justification
Predictions and Confidence Intervals
Confidence interval for the mean (narrower): estimates the average Y at given X values.
Prediction interval (wider): predicts Y for a specific new observation; it must account for both the uncertainty in the estimated mean and the individual variation around it.
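With statsmodels, both intervals come from `get_prediction`; this sketch assumes `fit` is the formula-based interaction model fitted above, and the new education/experience values are hypothetical.

```python
import pandas as pd

new_data = pd.DataFrame({"education": [16], "experience": [5]})
pred = fit.get_prediction(new_data)

# mean_ci_* columns: the narrower confidence interval for the average Y
# obs_ci_* columns: the wider prediction interval for an individual new Y
print(pred.summary_frame(alpha=0.05))
```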
Summary
In this lesson, you learned:
- Multiple regression includes several predictor variables
- Coefficients are partial effects (controlling for other variables)
- Adjusted R² penalizes for adding variables
- F-test assesses overall model significance
- Multicollinearity causes unstable coefficients; check with VIF
- Dummy variables encode categorical predictors
- Interactions capture when effects depend on other variables
- Always check diagnostics and assumptions
Practice Problems
1. A model has R² = 0.70, n = 50, k = 4. Calculate adjusted R².
2. Model: GPA = 1.5 + 0.3(StudyHours) + 0.2(Sleep) - 0.1(Social)
a) Interpret the coefficient for StudyHours
b) Predict GPA for: StudyHours=5, Sleep=7, Social=3
3. Two predictors have correlation r = 0.95. What problem might this cause?
4. A variable has VIF = 8. What does this mean and what should you do?
Answers
1. R²_adj = 1 - (1-0.70)(50-1)/(50-4-1) = 1 - (0.30)(49)/45 = 1 - 14.7/45 = 1 - 0.327 = 0.673
2. a) "Holding sleep and social hours constant, each additional hour of study is associated with a 0.3 increase in GPA."
b) GPA = 1.5 + 0.3(5) + 0.2(7) - 0.1(3) = 1.5 + 1.5 + 1.4 - 0.3 = 4.1 (would cap at 4.0 in reality)
3. Multicollinearity - the predictors are highly correlated, which can cause:
- Unstable coefficient estimates
- Large standard errors
- Difficulty interpreting individual effects
4. VIF = 8 indicates high multicollinearity - this predictor is highly correlated with other predictors.
Options:
- Remove this variable or a correlated variable
- Combine correlated variables
- Use regularization (ridge regression)
- Be cautious interpreting this coefficient
Next Steps
Continue your regression journey:
- Regression Diagnostics - Deep dive into assumption checking
- Logistic Regression - Predicting categories
- Standard Deviation Calculator - Calculate statistics