Introduction to Hypothesis Testing
Learn the framework for hypothesis testing: null and alternative hypotheses, test statistics, p-values, and drawing conclusions.
What is Hypothesis Testing?
Hypothesis testing is a formal procedure for using sample data to evaluate claims about population parameters. It helps answer questions such as:
- Does this new drug lower blood pressure more than the placebo?
- Is there a difference in test scores between two teaching methods?
- Has the average customer satisfaction changed after the redesign?
- Is the coin fair, or is it biased toward heads?
The Hypothesis Testing Framework
Step 1: State the Hypotheses
Every test has two hypotheses:
Null Hypothesis (H₀): The “no effect” or “no difference” claim
- Status quo assumption
- What we test against
- Contains “=” (or ≤ or ≥)
Alternative Hypothesis (H₁ or Hₐ): The research claim
- What we’re trying to find evidence for
- Contains ≠, <, or >
Claim: The average height of students is greater than 170 cm.
- H₀: μ ≤ 170 (or μ = 170)
- H₁: μ > 170
Claim: A new treatment changes recovery time from 10 days.
- H₀: μ = 10
- H₁: μ ≠ 10
Claim: The defect rate is less than 5%.
- H₀: p ≥ 0.05
- H₁: p < 0.05
Step 2: Choose Significance Level (α)
The significance level α is the probability of rejecting H₀ when it’s actually true (Type I error).
| Common α | Confidence Level |
|---|---|
| 0.10 | 90% |
| 0.05 | 95% (most common) |
| 0.01 | 99% |
Step 3: Collect Data and Calculate Test Statistic
The test statistic measures how far your sample result is from the null hypothesis value, in standard error units.
Test Statistic = (Sample Statistic - Null Value) / Standard Error
For a mean: z = (x̄ - μ₀) / (σ/√n) or t = (x̄ - μ₀) / (s/√n)
For a proportion: z = (p̂ - p₀) / √(p₀(1-p₀)/n)
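The three formulas above translate directly into code. This is a minimal sketch using only the Python standard library; the function names and the sample numbers are hypothetical and chosen only to exercise the formulas.

```python
import math

def z_stat_mean(sample_mean, null_mean, sigma, n):
    """z statistic for a mean when the population sigma is known."""
    return (sample_mean - null_mean) / (sigma / math.sqrt(n))

def t_stat_mean(sample_mean, null_mean, s, n):
    """t statistic for a mean, using the sample standard deviation s."""
    return (sample_mean - null_mean) / (s / math.sqrt(n))

def z_stat_proportion(p_hat, p0, n):
    """z statistic for a proportion; the standard error uses the null value p0."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical numbers, not taken from the lesson
print(z_stat_mean(103, 100, 15, 25))        # (103 - 100) / (15 / 5) = 1.0
print(z_stat_proportion(0.56, 0.50, 100))   # 0.06 / 0.05, about 1.2
```

The z and t formulas for a mean are arithmetically identical. The difference is which standard deviation goes in (σ or s) and which distribution the result is compared against.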
Step 4: Find the P-Value
The p-value is the probability of getting a test statistic as extreme as (or more extreme than) the observed value, assuming H₀ is true.
| P-value | Evidence against H₀ |
|---|---|
| p > 0.10 | Weak or none |
| 0.05 < p ≤ 0.10 | Moderate |
| 0.01 < p ≤ 0.05 | Strong |
| p ≤ 0.01 | Very strong |
Step 5: Make a Decision
- If p-value ≤ α: Reject H₀ (statistically significant)
- If p-value > α: Fail to reject H₀ (not statistically significant)
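The decision rule is simple enough to express as a small function. This is a sketch; the name `decide` and the example p-value are hypothetical.

```python
def decide(p_value, alpha=0.05):
    """Apply the Step 5 rule: reject H0 when the p-value is at most alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.03))          # reject H0 at alpha = 0.05
print(decide(0.03, 0.01))    # fail to reject H0 at alpha = 0.01
```

The same p-value can lead to different decisions at different significance levels, which is why α should be chosen before looking at the data.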
Types of Tests
One-Tailed vs Two-Tailed Tests
Two-tailed (H₁: μ ≠ value)
- Evidence in either direction
- Critical region split between both tails
Left-tailed (H₁: μ < value)
- Evidence only in left tail
- Critical region entirely in left tail
Right-tailed (H₁: μ > value)
- Evidence only in right tail
- Critical region entirely in right tail
“Different from” → Two-tailed
- H₀: μ = 100, H₁: μ ≠ 100
“Less than” / “Decreased” → Left-tailed
- H₀: μ ≥ 100, H₁: μ < 100
“Greater than” / “Increased” → Right-tailed
- H₀: μ ≤ 100, H₁: μ > 100
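The choice of tail changes how the p-value is computed from the same test statistic. The sketch below assumes a z statistic and uses `statistics.NormalDist` from the Python standard library; the function name and the value z = -1.5 are hypothetical.

```python
from statistics import NormalDist

def p_value_from_z(z, tail):
    """P-value for a z statistic; tail is 'left', 'right', or 'two'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":
        return std_normal.cdf(z)
    if tail == "right":
        return 1 - std_normal.cdf(z)
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")

z = -1.5  # hypothetical test statistic
for tail in ("left", "right", "two"):
    print(tail, round(p_value_from_z(z, tail), 4))
```

For a negative z, the left-tailed p-value is small and the right-tailed one is large, while the two-tailed p-value is twice the smaller tail area.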
Complete Example: One-Sample Z-Test
Scenario: A company claims batteries last 500 hours on average. A consumer group tests 36 batteries and finds mean = 490 hours. Population σ = 30 hours. At α = 0.05, is there evidence the true mean is less than claimed?
Step 1: Hypotheses
- H₀: μ ≥ 500 (or μ = 500)
- H₁: μ < 500 (left-tailed)
Step 2: Significance Level
- α = 0.05
Step 3: Test Statistic
- z = (x̄ - μ₀) / (σ/√n) = (490 - 500) / (30/√36) = -10 / 5 = -2.0
Step 4: P-Value
- P(Z < -2.0) = 0.0228
Step 5: Decision
- Since p-value (0.0228) < α (0.05), we reject H₀.
Conclusion: There is significant evidence at the 0.05 level that the mean battery life is less than 500 hours.
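The battery example can be reproduced in a few lines. This sketch uses the numbers from the scenario and the Python standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

claimed_mean = 500   # mu_0 from the company's claim
sample_mean = 490
sigma = 30           # population standard deviation (given)
n = 36
alpha = 0.05

standard_error = sigma / math.sqrt(n)              # 30 / 6 = 5
z = (sample_mean - claimed_mean) / standard_error  # -10 / 5 = -2.0
p_value = NormalDist().cdf(z)                      # left tail: P(Z < z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")     # z = -2.00, p-value = 0.0228
print("reject H0" if p_value <= alpha else "fail to reject H0")
```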
Type I and Type II Errors
When making decisions, we can make two types of errors:
| Decision | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct Decision (Power) |
| Fail to reject H₀ | Correct Decision | Type II Error (β) |
Type I Error (α): Rejecting H₀ when it’s true (“false positive”)
- Probability = α (significance level)
Type II Error (β): Failing to reject H₀ when it’s false (“false negative”)
- Probability = β
Power = 1 - β: Probability of correctly rejecting a false H₀
H₀: Patient is healthy
H₁: Patient has disease
Type I Error: Telling a healthy person they have the disease (false positive)
- Causes unnecessary worry, treatment, cost
Type II Error: Telling a sick person they’re healthy (false negative)
- Disease goes untreated, potentially dangerous
Which error is worse depends on context!
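One way to see that the Type I error probability equals α is to simulate many tests in which H₀ really is true and count how often it is rejected. The sketch below assumes a left-tailed z-test with normally distributed data; the parameter values are borrowed from the battery example purely for illustration, and the seed is arbitrary.

```python
import math
import random
from statistics import NormalDist, fmean

def type_i_error_rate(n_trials=20000, n=36, mu0=500, sigma=30, alpha=0.05, seed=0):
    """Fraction of simulated left-tailed z-tests that reject a true H0."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(alpha)   # about -1.645 for alpha = 0.05
    se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(n_trials):
        # Draw the sample with true mean mu0, so H0 is true in every trial
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / se
        if z < critical:
            rejections += 1
    return rejections / n_trials

print(type_i_error_rate())  # should land close to alpha = 0.05
```

Every rejection in this simulation is a false positive, so the printed fraction estimates the Type I error rate.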
Factors Affecting Power
| Factor | Effect on Power |
|---|---|
| ↑ Sample size (n) | ↑ Power |
| ↑ Significance level (α) | ↑ Power |
| ↑ True effect size | ↑ Power |
| ↓ Population variability (σ) | ↑ Power |
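Each row of the table can be checked numerically for a left-tailed one-sample z-test, where power has a closed form. In this sketch the true mean of 490 is a hypothetical assumption (the lesson only gives a sample mean), and the function name is illustrative.

```python
import math
from statistics import NormalDist

def power_left_tailed_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power of a left-tailed one-sample z-test when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(alpha)                # reject when z < z_crit
    shift = (mu_true - mu0) / (sigma / math.sqrt(n))  # mean of z under mu_true
    return std_normal.cdf(z_crit - shift)

base = power_left_tailed_z(500, 490, 30, 36)
print(round(base, 3))
print(power_left_tailed_z(500, 490, 30, 72) > base)              # larger n
print(power_left_tailed_z(500, 490, 30, 36, alpha=0.10) > base)  # larger alpha
print(power_left_tailed_z(500, 485, 30, 36) > base)              # larger effect
print(power_left_tailed_z(500, 490, 20, 36) > base)              # smaller sigma
```

All four comparisons print True, matching the direction of each arrow in the table.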
Critical Value Approach (Alternative)
Instead of p-values, you can use critical values:
Same battery example with α = 0.05, left-tailed:
Critical value: -z₀.₀₅ = -1.645 (the z-value with 5% of the area to its left)
Decision rule: Reject H₀ if z < -1.645
Our z-statistic: z = -2.0
Since -2.0 < -1.645, we reject H₀.
(Same conclusion as p-value approach!)
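The critical value does not have to come from a table. As a sketch, `NormalDist.inv_cdf` from the Python standard library returns the z-value with a given area to its left; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
z_observed = -2.0   # from the battery example

# Left-tailed test: the critical value cuts off area alpha in the left tail
z_critical = NormalDist().inv_cdf(alpha)
print(round(z_critical, 3))   # -1.645

print("reject H0" if z_observed < z_critical else "fail to reject H0")
```

For a right-tailed test the critical value would be `NormalDist().inv_cdf(1 - alpha)`, and for a two-tailed test `NormalDist().inv_cdf(1 - alpha / 2)` applied to |z|.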
Statistical vs Practical Significance
A study of 100,000 people finds that a new drug lowers blood pressure by 0.5 mmHg compared to placebo.
- p-value = 0.001 (highly significant!)
- But 0.5 mmHg is clinically meaningless
Always consider effect size alongside significance!
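To see why a tiny effect can still be highly significant, hold the effect fixed and let the sample size grow. The sketch below assumes a two-sample z-test with equal group sizes and a hypothetical standard deviation of 20 mmHg. That value is not given in the example, so the printed p-values illustrate the trend rather than reproduce the study.

```python
import math
from statistics import NormalDist

def two_sample_z_p_value(mean_diff, sigma, n_per_group):
    """Two-tailed p-value for a difference in means, equal groups, known sigma."""
    se = sigma * math.sqrt(2 / n_per_group)
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

effect = 0.5   # mmHg, the difference from the example
sigma = 20     # hypothetical standard deviation, not given in the lesson
for n_per_group in (50, 5_000, 50_000):
    p = two_sample_z_p_value(effect, sigma, n_per_group)
    print(f"n per group = {n_per_group:>6}: p = {p:.4f}")
```

The effect size never changes, yet the p-value shrinks as n grows, because the standard error gets smaller.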
Common Mistakes in Hypothesis Testing
Summary
In this lesson, you learned:
- Null hypothesis (H₀): No effect/difference; what we test against
- Alternative hypothesis (H₁): Research claim we seek evidence for
- Test statistic: Measures how far the sample statistic is from the H₀ value, in standard errors
- P-value: Probability of data at least this extreme if H₀ is true
- Decision rule: Reject H₀ if p ≤ α
- Type I error (α): False positive (rejecting true H₀)
- Type II error (β): False negative (failing to reject false H₀)
- Power = 1 - β: Ability to detect real effects
- Statistical significance ≠ practical importance
Practice Problems
1. State the null and alternative hypotheses for:
   a) Testing if average commute time differs from 30 minutes
   b) Testing if a new process reduces defects below 2%
   c) Testing if customer satisfaction increased above 4.0
2. A sample of 49 has mean 85 and s = 14. Test H₀: μ = 80 vs H₁: μ ≠ 80 at α = 0.05.
3. A z-test yields z = 1.8 for a right-tailed test.
   a) What is the p-value?
   b) At α = 0.05, what’s the decision?
   c) At α = 0.01, what’s the decision?
4. For the battery example, what type of error could we have made with our decision to reject H₀?
Answers
1. a) H₀: μ = 30, H₁: μ ≠ 30 (two-tailed)
   b) H₀: p ≥ 0.02, H₁: p < 0.02 (left-tailed)
   c) H₀: μ ≤ 4.0, H₁: μ > 4.0 (right-tailed)
2. t = (85 - 80)/(14/√49) = 5/2 = 2.5
   df = 48, critical t₀.₀₂₅ ≈ 2.01
   Since |2.5| > 2.01, reject H₀
   P-value ≈ 0.016 < 0.05
3. a) P(Z > 1.8) = 1 - 0.9641 = 0.0359
   b) 0.0359 < 0.05, reject H₀
   c) 0.0359 > 0.01, fail to reject H₀
4. Since we rejected H₀, the only error we could have made is a Type I error (rejecting a true H₀). If H₀ is actually true, the probability of this error is α = 0.05.
Next Steps
Continue learning about hypothesis testing:
- T-Tests - One-sample and two-sample tests
- Chi-Square Tests - Tests for categorical data
- T-Test Calculator - Practice hypothesis testing
- T-Table - Critical t-values reference
Frequently Asked Questions
What is hypothesis testing in simple terms?
Hypothesis testing is a method to determine if observed data provides enough evidence to support a claim about a population. You start with a null hypothesis (no effect), collect data, and determine if the evidence is strong enough to reject the null hypothesis.
What is the difference between null and alternative hypothesis?
The null hypothesis (H₀) states there is no effect or difference; it’s the default assumption. The alternative hypothesis (H₁ or Hₐ) is the research claim you are seeking evidence for. Hypothesis testing determines if there’s enough evidence to reject H₀ in favor of H₁; it never proves H₁ outright.
What does p-value mean?
The p-value is the probability of obtaining results at least as extreme as your observed results, assuming the null hypothesis is true. A small p-value (typically below 0.05) suggests the observed data is unlikely under H₀, providing evidence to reject it.
What is Type I and Type II error?
Type I error (false positive) occurs when you reject a true null hypothesis: you claim an effect exists when it doesn’t. Type II error (false negative) occurs when you fail to reject a false null hypothesis: you miss a real effect. α is the probability of a Type I error when H₀ is true; β is the probability of a Type II error when H₀ is false.
What significance level should I use?
α = 0.05 (5%) is most common, a conventional balance between the risks of false positives and false negatives. Use α = 0.01 for stricter standards when false positives are costly (e.g., medical trials). Use α = 0.10 for exploratory research where missing effects is more concerning.
What is the difference between one-tailed and two-tailed tests?
A two-tailed test checks for any difference (H₁: μ ≠ μ₀), while a one-tailed test checks for a difference in a specific direction (H₁: μ > μ₀ or H₁: μ < μ₀). Use two-tailed unless you have strong prior justification for a direction.