
Hypothesis Testing: A Beginner's Complete Guide

Learn hypothesis testing step by step, from formulating hypotheses to interpreting p-values and making conclusions.

StatsMasters Team

Hypothesis testing is the backbone of scientific research. It’s how we move from “I wonder if…” to “The evidence suggests…” This guide will take you through the process step by step.

What Is Hypothesis Testing?

Hypothesis testing is a formal procedure for using data to decide between two competing claims about a population. It answers questions like:

  • Does this new drug actually work?
  • Is there a difference between these two groups?
  • Has the average changed from the expected value?

The Logic

We start by assuming nothing special is happening (the null hypothesis), then see if the data provides enough evidence to reject that assumption.

The Five Steps of Hypothesis Testing

Step 1: State Your Hypotheses

Every hypothesis test involves two competing statements:

Null Hypothesis (H₀)

  • The “nothing happening” statement
  • The status quo
  • What we assume is true unless proven otherwise
  • Always contains an equality (=, ≤, or ≥)

Alternative Hypothesis (H₁ or Hₐ)

  • What you’re trying to demonstrate
  • The “something is happening” statement
  • Contains a strict inequality (<, >, or ≠)

Examples of Hypothesis Pairs

Testing a drug:

  • H₀: The drug has no effect (μ = 0)
  • H₁: The drug has an effect (μ ≠ 0)

Comparing two groups:

  • H₀: Groups are equal (μ₁ = μ₂)
  • H₁: Groups are different (μ₁ ≠ μ₂)

Testing if something increased:

  • H₀: No increase (μ ≤ 100)
  • H₁: There is an increase (μ > 100)

Step 2: Choose Your Significance Level (α)

The significance level is your threshold for “surprising enough.”

Common choices:

  • α = 0.05 (5%): The standard choice in most fields
  • α = 0.01 (1%): More stringent; common in medical research
  • α = 0.10 (10%): More lenient; sometimes used in exploratory research

What α means: If you set α = 0.05, you’re willing to accept a 5% chance of incorrectly rejecting a true null hypothesis (Type I error).

Step 3: Collect Data and Calculate Test Statistic

Your test statistic measures how far your sample result is from what the null hypothesis predicts.

Common test statistics:

  • z-statistic: When population σ is known or n is large
  • t-statistic: When population σ is unknown and n is smaller
  • χ² statistic: For categorical data
  • F-statistic: For comparing multiple groups (ANOVA)

General formula: Test statistic = (Sample value - Null hypothesis value) / Standard error

The larger the test statistic (in absolute value), the more your data differs from what H₀ predicts.
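
To make the general formula concrete, here's a minimal Python sketch with made-up numbers (the sample values are purely illustrative):

```python
import math

# Illustrative numbers only: a sample of 36 with mean 105,
# testing against a null value of 100
sample_mean = 105.0   # x̄
null_value = 100.0    # the value H₀ claims
sample_sd = 15.0      # s
n = 36

# Test statistic = (sample value - null value) / standard error
standard_error = sample_sd / math.sqrt(n)        # 15 / 6 = 2.5
t_stat = (sample_mean - null_value) / standard_error
print(f"SE = {standard_error:.2f}, t = {t_stat:.2f}")  # SE = 2.50, t = 2.00
```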

Step 4: Find the p-value

The p-value is the probability of getting a result as extreme as (or more extreme than) what you observed, IF the null hypothesis were true.

Interpreting p-values:

  • Small p-value (e.g., 0.02): Your result would be unlikely if H₀ were true
  • Large p-value (e.g., 0.35): Your result is quite possible if H₀ were true

Common misunderstandings:

  • p-value is NOT the probability that H₀ is true
  • p-value is NOT the probability of making an error
  • p-value IS the probability of getting data at least as extreme as yours, assuming H₀ is true
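
In practice you rarely look p-values up in a table by hand. A quick sketch using scipy, continuing the hypothetical t = 2.0 with 35 degrees of freedom from the snippet above:

```python
from scipy import stats

t_stat, df = 2.0, 35  # from the illustrative snippet above

# Two-tailed p-value: probability of a t at least this extreme
# in either direction, if H₀ were true
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
print(f"p ≈ {p_two_tailed:.3f}")  # ≈ 0.053
```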

Step 5: Make a Decision

Compare p-value to α:

  • If p ≤ α: Reject H₀ (result is “statistically significant”)
  • If p > α: Fail to reject H₀ (result is “not statistically significant”)

Important language:

  • We never “accept” H₀—we only “fail to reject” it
  • Lack of evidence against H₀ ≠ proof that H₀ is true

Types of Hypothesis Tests

One-Tailed vs. Two-Tailed

Two-tailed test (≠)

  • Tests for any difference from null value
  • H₁: μ ≠ μ₀
  • Use when you don’t predict the direction

One-tailed test (< or >)

  • Tests for difference in a specific direction
  • H₁: μ > μ₀ (right-tailed) or H₁: μ < μ₀ (left-tailed)
  • Use when theory predicts a specific direction

One-tailed is more powerful but risky

  • Easier to find significance in predicted direction
  • But you’ll miss effects in the opposite direction
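
The difference is easy to see in code. A scipy sketch, using the t-statistic from the light bulb example in the next section:

```python
from scipy import stats

t_stat, df = -2.83, 49  # from the light bulb example below

p_left  = stats.t.cdf(t_stat, df)           # left-tailed,  H₁: μ < μ₀  (≈ 0.0033)
p_right = stats.t.sf(t_stat, df)            # right-tailed, H₁: μ > μ₀  (≈ 0.9967)
p_two   = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed,   H₁: μ ≠ μ₀  (≈ 0.0067)
```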

A Complete Example

Research question: A company claims their light bulbs last 1000 hours on average. You suspect they last less.

Step 1: State hypotheses

  • H₀: μ = 1000 hours (or μ ≥ 1000)
  • H₁: μ < 1000 hours (one-tailed)

Step 2: Set significance level

  • α = 0.05

Step 3: Collect data and calculate

  • Sample: n = 50 bulbs
  • Sample mean: x̄ = 980 hours
  • Sample SD: s = 50 hours
  • Standard error: SE = 50/√50 = 7.07
  • t-statistic: t = (980 - 1000)/7.07 = -2.83

Step 4: Find p-value

  • With df = 49 and t = -2.83
  • p-value ≈ 0.0033 (from t-table or calculator)

Step 5: Decision

  • p = 0.0033 < α = 0.05
  • Reject H₀
  • Conclusion: There is significant evidence that the bulbs last less than 1000 hours on average.
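
Here is the same example as a Python sketch. Since we only have summary statistics, the calculation is done by hand with scipy's t distribution; with the raw lifetimes you could instead call scipy.stats.ttest_1samp with alternative='less'.

```python
import math
from scipy import stats

# Light bulb example: H₀: μ = 1000, H₁: μ < 1000 (left-tailed)
n, xbar, s, mu0, alpha = 50, 980.0, 50.0, 1000.0, 0.05

se = s / math.sqrt(n)                 # ≈ 7.07
t_stat = (xbar - mu0) / se            # ≈ -2.83
p_value = stats.t.cdf(t_stat, n - 1)  # left-tail area with df = 49, ≈ 0.0033

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value <= alpha:
    print("Reject H₀: evidence the bulbs last less than 1000 hours on average")
else:
    print("Fail to reject H₀")
```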

Understanding Errors

Type I Error (False Positive)

  • Rejecting H₀ when it’s actually true
  • Probability = α
  • Example: Concluding a drug works when it doesn’t

Type II Error (False Negative)

  • Failing to reject H₀ when it’s actually false
  • Probability = β
  • Example: Concluding a drug doesn’t work when it does

The Trade-off

  • Lowering α increases β (and vice versa)
  • More stringent ≠ always better
  • Balance depends on consequences of each error
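
You can watch the Type I error rate in action with a small simulation sketch. The code below repeatedly samples from a world where H₀ really is true; about 5% of runs still come out "significant" at α = 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

# Simulate experiments where H₀ is TRUE (the population mean really is 0)
false_positives = 0
for _ in range(trials):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    result = stats.ttest_1samp(sample, popmean=0.0)
    if result.pvalue <= alpha:
        false_positives += 1

print(f"Type I error rate ≈ {false_positives / trials:.3f}")  # close to 0.05
```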

Power

  • Power = 1 - β
  • Probability of correctly rejecting a false H₀
  • Aim for power ≥ 0.80
  • Increase power with larger samples
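
If you have the statsmodels library installed, a sketch of a pre-study power analysis might look like this (assuming a one-sample t-test and a hypothesized medium effect of Cohen's d = 0.5):

```python
from statsmodels.stats.power import TTestPower

# Sample size needed to detect d = 0.5 with 80% power at α = 0.05
n_required = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Required n ≈ {n_required:.0f}")  # about 34
```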

Common Misconceptions

1. “p = 0.05 means 95% chance the effect is real”

Wrong. The p-value doesn’t tell you the probability that your hypothesis is true.

2. “Not significant means no effect”

Wrong. It means you didn’t find sufficient evidence. The effect might exist but be too small to detect with your sample.

3. “p = 0.049 is very different from p = 0.051”

Wrong. They’re essentially the same. Don’t treat α as a magical cutoff.

4. “A smaller p-value means a bigger effect”

Wrong. Small p-values can come from small effects with large samples. Always report effect sizes.

5. “Hypothesis testing proves things”

Wrong. It only provides evidence. Science requires replication.

Best Practices

Before the Study

  • Determine sample size based on power analysis
  • Pre-register your hypotheses and analysis plan
  • Set α before collecting data

During Analysis

  • Check assumptions before running tests
  • Report exact p-values (not just “p < 0.05”)
  • Calculate effect sizes (Cohen’s d, η², etc.)
  • Include confidence intervals

When Reporting

  • Be precise: “We rejected/failed to reject H₀”
  • Avoid overstatement: “Significant evidence” ≠ “proof”
  • Discuss practical significance: Is the effect meaningful?
  • Acknowledge limitations: Sample size, generalizability

Beyond p-values

Modern statistics emphasizes moving beyond simple p-value testing:

Confidence Intervals

Show the range of plausible values, not just yes/no.
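
For instance, a 95% confidence interval for the light bulb example (a scipy sketch using the summary statistics from the worked example above):

```python
import math
from scipy import stats

n, xbar, s = 50, 980.0, 50.0
se = s / math.sqrt(n)
low, high = stats.t.interval(0.95, df=n - 1, loc=xbar, scale=se)
print(f"95% CI: ({low:.1f}, {high:.1f})")  # roughly (965.8, 994.2)
```

The claimed 1000 hours falls outside the interval, which tells you more than the bare "reject H₀": plausible mean lifetimes run from about 966 to 994 hours.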

Effect Sizes

Quantify how large the effect is, not just whether it exists.
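
Cohen's d for a one-sample test is simply the difference in means expressed in standard-deviation units. For the light bulb example:

```python
# Cohen's d = (x̄ - μ₀) / s
xbar, mu0, s = 980.0, 1000.0, 50.0
d = (xbar - mu0) / s
print(f"Cohen's d = {d:.2f}")  # -0.40: a small-to-medium effect
```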

Bayesian Methods

Directly calculate probability of hypotheses given data.
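
As a taste, here's a minimal Bayesian sketch (a beta-binomial model for a coin, with made-up data): unlike a p-value, the output is the probability of the hypothesis itself.

```python
from scipy import stats

# 60 heads in 100 flips, flat Beta(1, 1) prior on the heads rate
heads, flips = 60, 100
posterior = stats.beta(1 + heads, 1 + flips - heads)  # Beta(61, 41)
print(f"P(rate > 0.5 | data) ≈ {posterior.sf(0.5):.2f}")  # ≈ 0.98
```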

Replication

One significant result isn’t enough—findings need to replicate.

Summary Checklist

When doing hypothesis testing, make sure you:

  • Clearly state H₀ and H₁
  • Choose α before looking at data
  • Check test assumptions
  • Calculate appropriate test statistic
  • Find exact p-value
  • Compare p to α for decision
  • Calculate effect size
  • Report confidence interval
  • Interpret in context
  • Acknowledge limitations

Hypothesis testing is a powerful tool, but it’s just one part of statistical inference. Use it wisely, report it fully, and always think critically about what your results really mean.
