intermediate 30 minutes

Simple Linear Regression

Learn to fit and interpret linear regression models. Understand least squares, coefficients, predictions, and model assessment.

What is Linear Regression?

Simple linear regression fits the straight line that best describes the relationship between two variables, in the least squares sense, and uses that line to make predictions.

Simple Linear Regression Model

$$\hat{y} = b_0 + b_1 x$$

Where:

  • $\hat{y}$ = predicted value of Y
  • $b_0$ = y-intercept (predicted value of Y when x = 0)
  • $b_1$ = slope (change in predicted Y for each one-unit increase in X)
  • x = value of the predictor variable

Regression in Action

Question: How do study hours (X) predict exam scores (Y)?

If we find: Score = 50 + 5(Hours)

  • Intercept (50): Predicted score with 0 study hours
  • Slope (5): Each additional hour adds 5 points to the predicted score

Prediction: For a student who studies 8 hours, Score = 50 + 5(8) = 90 points.


The Least Squares Method

The least squares method finds the line that minimizes the sum of squared errors (residuals).

Minimizing Error

Minimize: $\sum(y_i - \hat{y}_i)^2 = \sum e_i^2$

Where $e_i = y_i - \hat{y}_i$ is the residual (error) for observation i

Least Squares Formulas

Slope: $b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r \cdot \frac{s_y}{s_x}$

Intercept: $b_0 = \bar{y} - b_1 \bar{x}$
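
The two forms of the slope formula agree because dividing the numerator and denominator by n - 1 turns them into the sample covariance and the sample variance of X. A short derivation, assuming r is the sample correlation, $s_x$ and $s_y$ are sample standard deviations, and $s_{xy}$ (a symbol not used elsewhere in this lesson) is the sample covariance:

```latex
\begin{aligned}
b_1 &= \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}
     = \frac{\frac{1}{n-1}\sum(x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum(x_i - \bar{x})^2}
     = \frac{s_{xy}}{s_x^2} \\
    &= \frac{r \, s_x s_y}{s_x^2}
     = r \cdot \frac{s_y}{s_x}
     \qquad \text{since } r = \frac{s_{xy}}{s_x s_y}
\end{aligned}
```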


Step-by-Step Example

Complete Regression Calculation

Data: Study hours (X) and exam scores (Y)

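| X | Y | X - X̄ | Y - Ȳ | (X - X̄)(Y - Ȳ) | (X - X̄)² |
|---|---|---|---|---|---|
| 2 | 65 | -4 | -15 | 60 | 16 |
| 4 | 75 | -2 | -5 | 10 | 4 |
| 6 | 80 | 0 | 0 | 0 | 0 |
| 8 | 88 | 2 | 8 | 16 | 4 |
| 10 | 92 | 4 | 12 | 48 | 16 |
| Sum | | | | 134 | 40 |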

Means: X̄ = 6, Ȳ = 80

Calculate slope: $b_1 = \frac{134}{40} = 3.35$

Calculate intercept: $b_0 = 80 - 3.35(6) = 80 - 20.1 = 59.9$

Regression equation: $\hat{Y} = 59.9 + 3.35X$

Interpretation:

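  • Each additional study hour is associated with a 3.35-point increase in the predicted score
  • A student with 0 study hours would be predicted to score about 60, but this is extrapolation: X = 0 lies outside the observed range of 2 to 10 hours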
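
The hand calculation above can be checked with a few lines of code. This is a minimal sketch in plain Python; the function name `fit_line` and the variable names are illustrative choices, not part of the lesson.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squared deviations of X
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]

b0, b1 = fit_line(hours, scores)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")  # b0 = 59.90, b1 = 3.35
```

The result matches the table: 134 / 40 = 3.35 for the slope and 80 - 3.35(6) = 59.9 for the intercept.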

Interpreting Coefficients

Slope (b₁)

| Slope | Interpretation |
|---|---|
| Positive | Y increases as X increases |
| Negative | Y decreases as X increases |
| Near 0 | Weak or no linear relationship |

Units: The slope has units of Y per unit of X.

Slope Interpretation

Income = 20,000 + 5,000(Years of Education)

  • Slope = 5,000: Each additional year of education is associated with $5,000 higher income
  • Note: This is association, not causation!

Intercept (b₀)

The predicted Y when X = 0. May or may not be meaningful depending on context.


Residuals and Model Fit

Residual

$$e_i = y_i - \hat{y}_i$$

Actual value minus predicted value

Properties of Residuals

  • Sum of residuals = 0 (for a least squares line that includes an intercept)
  • Mean of residuals = 0
  • Residuals measure how far points are from the line, measured vertically

Calculating Residuals

From our example, for X = 8:

Predicted: $\hat{y} = 59.9 + 3.35(8) = 86.7$

Actual: y = 88

Residual: e = 88 - 86.7 = 1.3 (point is above the line)
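
The same calculation can be repeated for every observation to confirm that the residuals sum to zero. A minimal sketch in plain Python, reusing the data and coefficients from the worked example (variable names are illustrative):

```python
hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # coefficients from the worked example

fitted = [b0 + b1 * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, fitted)]

for x, y, y_hat, e in zip(hours, scores, fitted, residuals):
    print(f"x={x:2d}  actual={y}  predicted={y_hat:.1f}  residual={e:+.1f}")

# For a least squares line with an intercept the residuals sum to zero,
# up to floating-point rounding.
print(f"sum of residuals = {sum(residuals):.6f}")
```

The residuals come out to -1.6, 1.7, 0, 1.3, and -1.4, which cancel exactly.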


R-Squared (Coefficient of Determination)

R-Squared

$$R^2 = \frac{SS_{regression}}{SS_{total}} = 1 - \frac{SS_{residual}}{SS_{total}}$$

Where:

  • $SS_{total} = \sum(y_i - \bar{y})^2$ (total variation)
  • $SS_{regression} = \sum(\hat{y}_i - \bar{y})^2$ (explained variation)
  • $SS_{residual} = \sum(y_i - \hat{y}_i)^2$ (unexplained variation)

Interpretation: R² is the proportion of variance in Y explained by X.

| R² | Interpretation |
|---|---|
| 0.0 | Model explains nothing |
| 0.5 | Model explains 50% of variation |
| 1.0 | Model explains everything (perfect fit) |
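
To see the three sums of squares in practice, they can be computed for the study-hours example. A minimal sketch in plain Python, assuming the data and coefficients from the worked example (variable names are illustrative):

```python
hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # coefficients from the worked example

y_bar = sum(scores) / len(scores)
fitted = [b0 + b1 * x for x in hours]

ss_total = sum((y - y_bar) ** 2 for y in scores)
ss_regression = sum((y_hat - y_bar) ** 2 for y_hat in fitted)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(scores, fitted))

r_squared = 1 - ss_residual / ss_total

print(f"SS_total      = {ss_total:.1f}")       # 458.0
print(f"SS_regression = {ss_regression:.1f}")  # 448.9
print(f"SS_residual   = {ss_residual:.1f}")    # 9.1
print(f"R^2           = {r_squared:.3f}")      # 0.980
```

Here 448.9 + 9.1 = 458, so both forms of the R² formula give the same answer: about 98% of the variation in exam scores is explained by study hours in this small data set.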

Standard Error of Estimate

$$s_e = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{SS_{residual}}{n-2}}$$

Interpretation: Roughly the typical distance of observations from the regression line, measured in the units of Y.
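
Applied to the study-hours example, the formula gives a standard error of under 2 points. A minimal sketch in plain Python, assuming the data and coefficients from the worked example (variable names are illustrative):

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # coefficients from the worked example

n = len(hours)
ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))

# Divide by n - 2 because two coefficients (b0 and b1) were estimated
s_e = math.sqrt(ss_residual / (n - 2))
print(f"s_e = {s_e:.2f}")  # s_e = 1.74
```

With SS_residual = 9.1 and n = 5, this is the square root of 9.1 / 3.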


Making Predictions

Prediction

$$\hat{y} = b_0 + b_1 x_{new}$$

Prediction Example

Using our equation: $\hat{Y} = 59.9 + 3.35X$

Predict the score for a student who studies 7 hours: $\hat{Y} = 59.9 + 3.35(7) = 59.9 + 23.45 = 83.35$

Predicted score: about 83 points
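
A prediction is only as trustworthy as the data behind it, so it can help to flag values of X that fall outside the observed range. The sketch below is one possible way to do that in plain Python; the `predict` helper and its range check are illustrative assumptions, not part of the lesson.

```python
def predict(x_new, b0, b1, x_min, x_max):
    """Return (prediction, is_extrapolation) for a fitted line."""
    y_hat = b0 + b1 * x_new
    is_extrapolation = not (x_min <= x_new <= x_max)
    return y_hat, is_extrapolation


b0, b1 = 59.9, 3.35   # coefficients from the worked example
x_min, x_max = 2, 10  # observed range of study hours

for h in (7, 0):
    y_hat, extrapolating = predict(h, b0, b1, x_min, x_max)
    note = " (extrapolation: outside observed range)" if extrapolating else ""
    print(f"{h} hours -> predicted score {y_hat:.2f}{note}")
```

The prediction for 7 hours is inside the data range, while the prediction for 0 hours is flagged because no student in the data studied fewer than 2 hours.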


Inference for Regression

Testing the Slope

Testing β₁ = 0

H₀: β₁ = 0 (no linear relationship)
H₁: β₁ ≠ 0 (linear relationship exists)

$$t = \frac{b_1 - 0}{SE_{b_1}}$$

Where $SE_{b_1}$ is the standard error of the slope, with df = n - 2

Confidence Interval for Slope

$$b_1 \pm t_{\alpha/2} \times SE_{b_1}$$
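
The test statistic and interval can be worked through for the study-hours example. The lesson does not give a formula for $SE_{b_1}$; the sketch below assumes the standard one for simple linear regression, $SE_{b_1} = s_e / \sqrt{\sum(x_i - \bar{x})^2}$, and hard-codes the two-sided 95% critical value for df = 3 (about 3.182, taken from a t table) rather than computing it.

```python
import math

hours = [2, 4, 6, 8, 10]
scores = [65, 75, 80, 88, 92]
b0, b1 = 59.9, 3.35  # coefficients from the worked example

n = len(hours)
x_bar = sum(hours) / n
s_xx = sum((x - x_bar) ** 2 for x in hours)
ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
s_e = math.sqrt(ss_residual / (n - 2))

# Assumed standard formula for the standard error of the slope
se_b1 = s_e / math.sqrt(s_xx)

# Test statistic for H0: beta_1 = 0, with df = n - 2 = 3
t_stat = (b1 - 0) / se_b1

# Two-sided 95% critical value for df = 3, looked up in a t table
t_crit = 3.182
ci_low = b1 - t_crit * se_b1
ci_high = b1 + t_crit * se_b1

print(f"SE(b1) = {se_b1:.3f}")
print(f"t = {t_stat:.2f} with df = {n - 2}")
print(f"95% CI for slope: ({ci_low:.2f}, {ci_high:.2f})")
print("Reject H0 at the 5% level" if abs(t_stat) > t_crit else "Fail to reject H0")
```

The t statistic is roughly 12, far above the critical value, and the interval of roughly 2.5 to 4.2 excludes 0, so both approaches point to a linear relationship. With only five observations, this is an illustration rather than a strong result.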


Assumptions of Linear Regression

Checking Assumptions

| Assumption | How to Check |
|---|---|
| Linearity | Scatter plot, residual plot |
| Independence | Study design, residual patterns |
| Normality | Histogram or Q-Q plot of residuals |
| Equal variance | Residuals vs fitted values plot |

Residual Plots

A residual plot (residuals vs. fitted values) should show:

  • Random scatter around 0
  • No patterns (curves, fans, clusters)

Good Residual Plot:        Bad (Curved):          Bad (Funnel):
      *                         *                      *
   *     *                    *   *                 *  *  *
*     *     *              *       *              *      *
   *     *                    *   *               *  *  *  *
      *                         *                        *
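
A curved pattern is easy to reproduce numerically. The sketch below fits a straight line to deliberately nonlinear, made-up data (y = x², not from the lesson) and prints the residuals in plain Python:

```python
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]  # made-up, deliberately nonlinear data

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals still sum to zero, but they are positive at both ends and negative in the middle. That U shape is the signature of a relationship a straight line cannot capture.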

Summary

In this lesson, you learned:

  • Linear regression predicts Y from X using $\hat{y} = b_0 + b_1 x$
  • Least squares minimizes the sum of squared residuals
  • Slope (b₁): Change in predicted Y for a one-unit change in X
  • Intercept (b₀): Predicted Y when X = 0
  • R²: Proportion of variance explained (0 to 1)
  • Residuals: Actual minus predicted values
  • Assumptions: Linearity, independence, normality, equal variance
  • Avoid extrapolation beyond data range

Practice Problems

1. A regression analysis yields: Price = 15,000 + 1,200(Age of Car)
   a) Interpret the slope
   b) Is the slope interpretation sensible?
   c) Predict price for a 5-year-old car

2. Given: r = 0.8, sx = 4, sy = 10, x̄ = 20, ȳ = 50. Find the regression equation for predicting Y from X.

3. A model has R² = 0.64.
   a) What correlation (r) produced this?
   b) What percentage of variance is unexplained?

4. Residual plot shows a curved pattern. What does this indicate?

Answers

1. a) Slope = 1,200: Each additional year of age is associated with a \$1,200 increase in price
   b) Not sensible! Cars typically decrease in value with age. This might be antique/classic cars.
   c) Price = 15,000 + 1,200(5) = \$21,000

2. b₁ = r(sy/sx) = 0.8(10/4) = 0.8(2.5) = 2.0
   b₀ = ȳ - b₁x̄ = 50 - 2.0(20) = 50 - 40 = 10

   Ŷ = 10 + 2X

3. a) r = ±√0.64 = ±0.8 (the sign matches the sign of the slope, so context is needed)
   b) Unexplained = 1 - 0.64 = 0.36 or 36%

4. The linearity assumption is violated. The relationship is not linear. Consider:

  • Transforming variables (log, square root)
  • Using polynomial regression
  • Using a different model type
