Simple Linear Regression
Learn to fit and interpret linear regression models. Understand least squares, coefficients, predictions, and model assessment.
What is Linear Regression?
Linear regression finds the best-fitting straight line to describe the relationship between two variables and to make predictions.
$$\hat{y} = b_0 + b_1 x$$
Where:
- $\hat{y}$ = predicted value of Y
- $b_0$ = y-intercept (predicted value of Y when x = 0)
- $b_1$ = slope (change in predicted Y for each one-unit increase in X)
- x = value of the predictor variable
Question: How do study hours (X) predict exam scores (Y)?
If we find: Score = 50 + 5(Hours)
- Intercept (50): Predicted score with 0 study hours
- Slope (5): Each additional hour adds 5 points
Prediction: A student who studies 8 hours has a predicted score of 50 + 5(8) = 90 points.
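The example equation can be written as a small function. This is a minimal Python sketch; the name `predict_score` and its default arguments are illustrative choices, not part of the lesson.

```python
def predict_score(hours, intercept=50.0, slope=5.0):
    """Predicted exam score from Score = intercept + slope * hours."""
    return intercept + slope * hours


print(predict_score(8))  # 90.0
print(predict_score(0))  # 50.0 (the intercept)
```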
The Least Squares Method
The least squares method finds the line that minimizes the sum of squared errors (residuals).
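Minimize:
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
Where $e_i = y_i - \hat{y}_i$ is the residual (error) for observation i.
Slope:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
Intercept:
$$b_0 = \bar{y} - b_1 \bar{x}$$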
Step-by-Step Example
Data: Study hours (X) and exam scores (Y)
| X | Y | X - X̄ | Y - Ȳ | (X-X̄)(Y-Ȳ) | (X-X̄)² |
|---|---|---|---|---|---|
| 2 | 65 | -4 | -15 | 60 | 16 |
| 4 | 75 | -2 | -5 | 10 | 4 |
| 6 | 80 | 0 | 0 | 0 | 0 |
| 8 | 88 | 2 | 8 | 16 | 4 |
| 10 | 92 | 4 | 12 | 48 | 16 |
| Sum | | | | 134 | 40 |
Means: X̄ = 6, Ȳ = 80
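Calculate slope:
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{134}{40} = 3.35$$
Calculate intercept:
$$b_0 = \bar{y} - b_1 \bar{x} = 80 - 3.35(6) = 59.9$$
Regression equation:
$$\hat{y} = 59.9 + 3.35x$$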
Interpretation:
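- Each additional study hour is associated with a 3.35-point higher predicted score
- A student with 0 study hours is predicted to score about 60, but X = 0 lies outside the observed range of 2 to 10 hours, so this is an extrapolation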
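The hand calculation can be checked with a few lines of code. The sketch below uses only the Python standard library and the five data points from the table; the function name `fit_line` is an illustrative choice.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squared X deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = fit_line(hours, scores)
print(round(b0, 2), round(b1, 2))  # 59.9 3.35
```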
Interpreting Coefficients
Slope (b₁)
| Slope | Interpretation |
|---|---|
| Positive | Y increases as X increases |
| Negative | Y decreases as X increases |
| Near 0 | Weak or no linear relationship |
Units: The slope has units of Y per unit of X.
Income = 20,000 + 5,000(Years of Education)
- Slope = 5,000: Each additional year of education is associated with $5,000 higher income
- Note: This is association, not causation!
Intercept (b₀)
The predicted Y when X = 0. May or may not be meaningful depending on context.
Residuals and Model Fit
A residual is the actual value minus the predicted value:
$$e_i = y_i - \hat{y}_i$$
Properties of Residuals
- Sum of residuals = 0
- Mean of residuals = 0
- Residuals measure how far points are from the line
From our example, for X = 8:
Predicted: $\hat{y} = 59.9 + 3.35(8) = 86.7$
Actual: y = 88
Residual: e = 88 - 86.7 = 1.3 (point is above the line)
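To see the residual properties on the full data set, the following sketch computes every residual for the study-hours example, taking the fitted values 59.9 and 3.35 as given. Variable names are illustrative.

```python
hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # fitted intercept and slope from the worked example

predicted = [b0 + b1 * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

# Residual for X = 8 (index 3), matching the hand calculation
print(round(residuals[3], 1))      # 1.3

# The residuals sum to zero, up to floating-point rounding
print(abs(sum(residuals)) < 1e-9)  # True
```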
R-Squared (Coefficient of Determination)
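$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
Where:
- $SST = \sum (y_i - \bar{y})^2$ (total variation)
- $SSR = \sum (\hat{y}_i - \bar{y})^2$ (explained variation)
- $SSE = \sum (y_i - \hat{y}_i)^2$ (unexplained variation)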
Interpretation: R² is the proportion of variance in Y explained by X.
| R² | Interpretation |
|---|---|
| 0.0 | Model explains nothing |
| 0.5 | Model explains 50% of variation |
| 1.0 | Model explains everything (perfect fit) |
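As an illustration, R² for the study-hours example can be computed as 1 − SSE/SST, where SST is the total squared variation of Y around its mean and SSE is the sum of squared residuals. This is a sketch using the fitted line from the worked example; the variable names are assumptions.

```python
hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # fitted intercept and slope from the worked example

y_bar = sum(scores) / len(scores)
predicted = [b0 + b1 * x for x in hours]

sst = sum((y - y_bar) ** 2 for y in scores)                         # total
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(scores, predicted))  # unexplained
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)              # explained

r_squared = 1 - sse / sst
print(round(sst, 1), round(ssr, 1), round(sse, 1))  # 458.0 448.9 9.1
print(round(r_squared, 4))                          # 0.9801
```

Here about 98% of the variation in exam scores is explained by study hours.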
Standard Error of Estimate
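$$s_e = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$
Interpretation: Roughly the typical distance of observations from the regression line, measured in the units of Y.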
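For the study-hours example, the standard error of estimate can be computed as the square root of SSE divided by n − 2. A minimal sketch, assuming the fitted line from the worked example:

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # fitted intercept and slope from the worked example

n = len(hours)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))

# Divide by n - 2 because two coefficients (b0 and b1) were estimated
s_e = math.sqrt(sse / (n - 2))
print(round(s_e, 2))  # 1.74
```

So observed scores typically fall within about 1.7 points of the fitted line.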
Making Predictions
Using our equation: $\hat{y} = 59.9 + 3.35x$
Predict the score for a student who studies 7 hours:
$$\hat{y} = 59.9 + 3.35(7) = 83.35$$
Predicted score: about 83 points
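Because predictions are only trustworthy inside the range of the observed data, a prediction helper can also report when a request is an extrapolation. The sketch below is one possible design; the function name `predict` and the returned flag are illustrative, and the range of 2 to 10 hours comes from the example data.

```python
def predict(x, b0=59.9, b1=3.35, x_min=2, x_max=10):
    """Return (predicted score, extrapolated), where extrapolated is True
    when x lies outside the observed range [x_min, x_max]."""
    y_hat = b0 + b1 * x
    extrapolated = not (x_min <= x <= x_max)
    return y_hat, extrapolated


y_hat, flag = predict(7)
print(round(y_hat, 2), flag)   # 83.35 False

y_hat, flag = predict(15)
print(round(y_hat, 2), flag)   # 110.15 True
```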
Inference for Regression
Testing the Slope
H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship exists)
Test statistic:
$$t = \frac{b_1}{SE(b_1)}$$
With df = n - 2
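Confidence Interval for Slope
$$b_1 \pm t^* \cdot SE(b_1)$$
Where $t^*$ is the critical value from the t-distribution with df = n - 2.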
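The slope test and interval can be sketched for the study-hours data. This example assumes the usual formula for the standard error of the slope, SE(b₁) = s_e / √Σ(x − x̄)², which the lesson does not state. It also hard-codes the two-sided 95% critical value for df = 3 (3.182, from a standard t-table) rather than computing it.

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # fitted intercept and slope from the worked example

n = len(hours)
x_bar = sum(hours) / n
sxx = sum((x - x_bar) ** 2 for x in hours)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))

s_e = math.sqrt(sse / (n - 2))  # standard error of estimate
se_b1 = s_e / math.sqrt(sxx)    # standard error of the slope (assumed formula)
t_stat = b1 / se_b1             # test statistic for H0: beta1 = 0

T_CRIT_95_DF3 = 3.182           # t-table value: two-sided 95%, df = n - 2 = 3
lower = b1 - T_CRIT_95_DF3 * se_b1
upper = b1 + T_CRIT_95_DF3 * se_b1

print(round(t_stat, 2))                  # 12.17
print(round(lower, 2), round(upper, 2))  # 2.47 4.23
```

The t statistic far exceeds the critical value of 3.182 and the interval excludes 0, so these data are consistent with a positive linear relationship.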
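Assumptions of Linear Regression
Inference for regression relies on four assumptions: linearity, independence of observations, normally distributed residuals, and equal variance of residuals.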
Checking Assumptions
| Assumption | How to Check |
|---|---|
| Linearity | Scatter plot, residual plot |
| Independence | Study design, residual patterns |
| Normality | Histogram or Q-Q plot of residuals |
| Equal variance | Residuals vs fitted values plot |
Residual Plots
A residual plot (residuals vs. fitted values) should show:
- Random scatter around 0
- No patterns (curves, fans, clusters)
[Figure: three example residual plots. Good Residual Plot: random scatter around 0. Bad (Curved): residuals follow a curved pattern. Bad (Funnel): spread of residuals changes across fitted values.]
Summary
In this lesson, you learned:
- Linear regression predicts Y from X using $\hat{y} = b_0 + b_1 x$
- Least squares minimizes sum of squared residuals
- Slope (b₁): Change in Y for unit change in X
- Intercept (b₀): Predicted Y when X = 0
- R²: Proportion of variance explained (0 to 1)
- Residuals: Actual minus predicted values
- Assumptions: Linearity, independence, normality, equal variance
- Avoid extrapolation beyond data range
Practice Problems
1. A regression analysis yields: Price = 15,000 + 1,200(Age of Car)
   a) Interpret the slope
   b) Is the slope interpretation sensible?
   c) Predict price for a 5-year-old car
2. Given: r = 0.8, sx = 4, sy = 10, x̄ = 20, ȳ = 50
   Find the regression equation for predicting Y from X.
3. A model has R² = 0.64.
   a) What correlation (r) produced this?
   b) What percentage of variance is unexplained?
4. Residual plot shows a curved pattern. What does this indicate?
Answers
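1. a) Slope = 1,200: Each additional year of age is associated with a 1,200 higher predicted price.
   b) Probably not: cars typically lose value as they age, so a positive slope is suspicious.
   c) Price = 15,000 + 1,200(5) = 21,000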
2. b₁ = r(sy/sx) = 0.8(10/4) = 0.8(2.5) = 2.0
   b₀ = ȳ - b₁x̄ = 50 - 2.0(20) = 50 - 40 = 10
   Regression equation: Ŷ = 10 + 2X
3. a) r = √0.64 = ±0.8 (need context to know the sign)
   b) Unexplained = 1 - 0.64 = 0.36, or 36%
4. The linearity assumption is violated. The relationship is not linear. Consider:
- Transforming variables (log, square root)
- Using polynomial regression
- Using a different model type
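The answer to problem 2 can be checked numerically. The sketch below builds the regression line from summary statistics using b₁ = r(sy/sx) and b₀ = ȳ - b₁x̄; the function name `line_from_summary` is an illustrative choice.

```python
def line_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Return (intercept, slope) of the regression line from summary statistics."""
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return intercept, slope


b0, b1 = line_from_summary(r=0.8, s_x=4, s_y=10, x_bar=20, y_bar=50)
print(round(b0, 6), round(b1, 6))  # 10.0 2.0
```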
Next Steps
Expand your regression knowledge:
- Multiple Regression - Multiple predictors
- Regression Diagnostics - Checking assumptions
- Standard Deviation Calculator - Calculate variability