Hypothesis Testing: A Beginner's Complete Guide
Learn hypothesis testing step by step, from formulating hypotheses to interpreting p-values and making conclusions.
Hypothesis testing is the backbone of scientific research. It’s how we move from “I wonder if…” to “The evidence suggests…” This guide will take you through the process step by step.
What Is Hypothesis Testing?
Hypothesis testing is a formal procedure for using data to decide between two competing claims about a population. It answers questions like:
- Does this new drug actually work?
- Is there a difference between these two groups?
- Has the average changed from the expected value?
The Logic
We start by assuming nothing special is happening (the null hypothesis), then see if the data provides enough evidence to reject that assumption.
The Five Steps of Hypothesis Testing
Step 1: State Your Hypotheses
Every hypothesis test involves two competing statements:
Null Hypothesis (H₀)
- The “nothing happening” statement
- The status quo
- What we assume is true unless proven otherwise
- Always contains an equality (=, ≤, or ≥)
Alternative Hypothesis (H₁ or Hₐ)
- What you’re trying to demonstrate
- The “something is happening” statement
- Contains inequality (<, >, or ≠)
Examples of Hypothesis Pairs
Testing a drug:
- H₀: The drug has no effect (μ = 0)
- H₁: The drug has an effect (μ ≠ 0)
Comparing two groups:
- H₀: Groups are equal (μ₁ = μ₂)
- H₁: Groups are different (μ₁ ≠ μ₂)
Testing if something increased:
- H₀: No increase (μ ≤ 100)
- H₁: There is an increase (μ > 100)
Step 2: Choose Your Significance Level (α)
The significance level is your threshold for “surprising enough.”
Common choices:
- α = 0.05 (5%): Standard in most fields
- α = 0.01 (1%): More stringent; common in medical research
- α = 0.10 (10%): More lenient; used in exploratory research
What α means: If you set α = 0.05, you’re willing to accept a 5% chance of incorrectly rejecting a true null hypothesis (Type I error).
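To see what that means in practice, here is a minimal simulation sketch (Python with NumPy and SciPy, my assumption since the guide names no tools): when H₀ really is true, a test run at α = 0.05 rejects it about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_trials = 10_000
rejections = 0

for _ in range(n_trials):
    # Draw a sample from a population where H0 (mu = 0) is actually true
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    res = stats.ttest_1samp(sample, popmean=0.0)
    if res.pvalue <= alpha:
        rejections += 1

# Prints roughly 0.05: the Type I error rate matches alpha
print(rejections / n_trials)
```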
Step 3: Collect Data and Calculate Test Statistic
Your test statistic measures how far your sample result is from what the null hypothesis predicts.
Common test statistics:
- z-statistic: When population σ is known or n is large
- t-statistic: When population σ is unknown and n is small
- χ² statistic: For categorical data
- F-statistic: For comparing multiple groups (ANOVA)
General formula: Test statistic = (Sample value - Null hypothesis value) / Standard error
The larger the test statistic (in absolute value), the more your data differs from what H₀ predicts.
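As a concrete sketch of that formula (Python assumed; the data below is made up for illustration), here is a one-sample t-statistic computed by hand:

```python
import numpy as np

sample = np.array([102.0, 98.5, 101.2, 97.8, 103.4, 99.1])  # hypothetical data
mu_0 = 100.0                   # value the null hypothesis predicts

x_bar = sample.mean()          # sample value
s = sample.std(ddof=1)         # sample standard deviation
se = s / np.sqrt(len(sample))  # standard error of the mean

# (Sample value - Null hypothesis value) / Standard error
t_stat = (x_bar - mu_0) / se
print(t_stat)
```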
Step 4: Find the p-value
The p-value is the probability of getting a result as extreme as (or more extreme than) what you observed, IF the null hypothesis were true.
Interpreting p-values:
- Small p-value (e.g., 0.02): Your result would be unlikely if H₀ were true
- Large p-value (e.g., 0.35): Your result is quite possible if H₀ were true
Common misunderstandings:
- p-value is NOT the probability that H₀ is true
- p-value is NOT the probability of making an error
- p-value IS the probability of data at least as extreme as yours, given that H₀ is true
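In practice you rarely look p-values up in a table. A sketch of how one can be computed from a t-statistic (SciPy assumed; the numbers are hypothetical):

```python
from scipy import stats

t_stat = 2.1  # hypothetical t-statistic
df = 24       # degrees of freedom (n - 1 for a one-sample test)

# Two-tailed: probability of a |t| at least this large if H0 were true
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)

# One-tailed (right): probability of a t at least this large if H0 were true
p_right = stats.t.sf(t_stat, df)

print(p_two_tailed, p_right)
```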
Step 5: Make a Decision
Compare p-value to α:
- If p ≤ α: Reject H₀ (result is “statistically significant”)
- If p > α: Fail to reject H₀ (result is “not statistically significant”)
Important language:
- We never “accept” H₀—we only “fail to reject” it
- Lack of evidence against H₀ ≠ proof that H₀ is true
Types of Hypothesis Tests
One-Tailed vs. Two-Tailed
Two-tailed test (≠)
- Tests for any difference from null value
- H₁: μ ≠ μ₀
- Use when you don’t predict the direction
One-tailed test (< or >)
- Tests for difference in a specific direction
- H₁: μ > μ₀ (right-tailed) or H₁: μ < μ₀ (left-tailed)
- Use when theory predicts a specific direction
One-tailed tests are more powerful but riskier
- Easier to find significance in predicted direction
- But you’ll miss effects in the opposite direction
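Most statistics libraries let you choose the direction explicitly rather than doubling or halving p-values by hand. A sketch using SciPy's ttest_1samp (its alternative keyword requires SciPy 1.6 or later; the data is hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=101.0, scale=5.0, size=40)  # hypothetical data
mu_0 = 100.0

# Two-tailed: H1 is mu != 100
p_two = stats.ttest_1samp(sample, mu_0, alternative="two-sided").pvalue

# Right-tailed: H1 is mu > 100
p_right = stats.ttest_1samp(sample, mu_0, alternative="greater").pvalue

# Left-tailed: H1 is mu < 100
p_left = stats.ttest_1samp(sample, mu_0, alternative="less").pvalue

print(p_two, p_right, p_left)
```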
A Complete Example
Research question: A company claims their light bulbs last 1000 hours on average. You suspect they last less.
Step 1: State hypotheses
- H₀: μ = 1000 hours (or μ ≥ 1000)
- H₁: μ < 1000 hours (one-tailed)
Step 2: Set significance level
- α = 0.05
Step 3: Collect data and calculate
- Sample: n = 50 bulbs
- Sample mean: x̄ = 980 hours
- Sample SD: s = 50 hours
- Standard error: SE = 50/√50 = 7.07
- t-statistic: t = (980 - 1000)/7.07 = -2.83
Step 4: Find p-value
- With df = 49 and t = -2.83
- p-value ≈ 0.0033 (from t-table or calculator)
Step 5: Decision
- p = 0.0033 < α = 0.05
- Reject H₀
- Conclusion: There is significant evidence that the bulbs last less than 1000 hours on average.
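The same calculation as a runnable sketch (Python with SciPy assumed), working from the summary statistics above:

```python
import math
from scipy import stats

n, x_bar, s = 50, 980.0, 50.0  # sample size, mean, standard deviation
mu_0 = 1000.0                  # the manufacturer's claim
alpha = 0.05

se = s / math.sqrt(n)          # standard error: 50 / sqrt(50) ≈ 7.07
t_stat = (x_bar - mu_0) / se   # ≈ -2.83

# Left-tailed test: P(T <= t) under H0, with df = n - 1 = 49
p_value = stats.t.cdf(t_stat, df=n - 1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p ≈ 0.0033, as in Step 4

if p_value <= alpha:
    print("Reject H0: evidence that the bulbs last less than 1000 hours")
else:
    print("Fail to reject H0")
```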
Understanding Errors
Type I Error (False Positive)
- Rejecting H₀ when it’s actually true
- Probability = α
- Example: Concluding a drug works when it doesn’t
Type II Error (False Negative)
- Failing to reject H₀ when it’s actually false
- Probability = β
- Example: Concluding a drug doesn’t work when it does
The Trade-off
- Lowering α increases β (and vice versa)
- More stringent ≠ always better
- Balance depends on consequences of each error
Power
- Power = 1 - β
- Probability of correctly rejecting a false H₀
- Aim for power ≥ 0.80
- Increase power with larger samples
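A sketch of how a power analysis might look before the study, using statsmodels (my assumption; effect_size here is Cohen's d):

```python
import math
from statsmodels.stats.power import TTestPower

analysis = TTestPower()  # one-sample (or paired) t-test

# How many subjects are needed to detect a medium effect (d = 0.5)
# with 80% power at alpha = 0.05?
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05,
                                power=0.80, alternative="two-sided")
print(math.ceil(n_needed))  # about 34

# Conversely: what power does n = 20 give for the same effect?
achieved = analysis.solve_power(effect_size=0.5, nobs=20, alpha=0.05)
print(round(achieved, 2))   # roughly 0.56
```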
Common Misconceptions
1. “p = 0.05 means 95% chance the effect is real”
Wrong. The p-value doesn’t tell you the probability that your hypothesis is true.
2. “Not significant means no effect”
Wrong. It means you didn’t find sufficient evidence. The effect might exist but be too small to detect with your sample.
3. “p = 0.049 is very different from p = 0.051”
Wrong. They’re essentially the same. Don’t treat α as a magical cutoff.
4. “A smaller p-value means a bigger effect”
Wrong. Small p-values can come from small effects with large samples. Always report effect sizes.
5. “Hypothesis testing proves things”
Wrong. It only provides evidence. Science requires replication.
Best Practices
Before the Study
- Determine sample size based on power analysis
- Pre-register your hypotheses and analysis plan
- Set α before collecting data
During Analysis
- Check assumptions before running tests
- Report exact p-values (not just “p < 0.05”)
- Calculate effect sizes (Cohen’s d, η², etc.)
- Include confidence intervals
When Reporting
- Be precise: “We rejected/failed to reject H₀”
- Avoid overstatement: “Significant evidence” ≠ “proof”
- Discuss practical significance: Is the effect meaningful?
- Acknowledge limitations: Sample size, generalizability
Beyond p-values
Modern statistics emphasizes moving beyond simple p-value testing:
Confidence Intervals
Show the range of plausible values, not just yes/no.
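A sketch of a 95% confidence interval for a mean (SciPy assumed; hypothetical data):

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 11.8, 12.5, 11.9, 12.3, 12.0, 12.4])  # hypothetical

# t-based interval: mean +/- t_crit * standard error
ci_low, ci_high = stats.t.interval(0.95, df=len(data) - 1,
                                   loc=data.mean(), scale=stats.sem(data))
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```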
Effect Sizes
Quantify how large the effect is, not just whether it exists.
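For instance, Cohen's d for a one-sample comparison is just the mean difference expressed in standard-deviation units (a sketch with hypothetical numbers):

```python
import numpy as np

sample = np.array([980.0, 1010.0, 955.0, 990.0, 1000.0, 970.0])  # hypothetical
mu_0 = 1000.0

# Cohen's d: how many SDs the sample mean sits from the null value
d = (sample.mean() - mu_0) / sample.std(ddof=1)
print(round(d, 2))
```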
Bayesian Methods
Directly calculate probability of hypotheses given data.
Replication
One significant result isn’t enough—findings need to replicate.
Summary Checklist
When doing hypothesis testing, make sure you:
- Clearly state H₀ and H₁
- Choose α before looking at data
- Check test assumptions
- Calculate appropriate test statistic
- Find exact p-value
- Compare p to α for decision
- Calculate effect size
- Report confidence interval
- Interpret in context
- Acknowledge limitations
Hypothesis testing is a powerful tool, but it’s just one part of statistical inference. Use it wisely, report it fully, and always think critically about what your results really mean.