Data Collection Methods

Why Data Collection Matters

The quality of your statistical conclusions is only as good as the data you collect. Poor data collection leads to:

Biased estimates that systematically miss the truth
Invalid conclusions that don’t apply to your target population
Wasted resources on studies that can’t answer your questions
Misleading results that could harm decision-making

Populations and Samples

Key Definitions

Term	Definition	Example
Population	The entire group you want to study	All U.S. adults
Sample	A subset of the population you actually study	1,500 surveyed adults
Parameter	A number describing the population (usually unknown)	True average income of all U.S. adults
Statistic	A number calculated from the sample	Average income of 1,500 surveyed adults

Why Sample?

Studying an entire population is often:

Impossible (can’t test every light bulb without destroying them all)
Impractical (surveying 330 million Americans is costly)
Unnecessary (a well-designed sample can provide excellent estimates)

The goal: Use sample statistics to estimate population parameters.
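To make the parameter/statistic distinction concrete, here is a minimal sketch. It assumes a made-up population of 100,000 incomes drawn from a log-normal distribution (the distribution and its settings are illustrative assumptions, not real income data) and compares the population mean with the mean of a 1,500-person simple random sample.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 invented incomes (not real data).
population = [rng.lognormvariate(10.8, 0.6) for _ in range(100_000)]

# Parameter: computed from the whole population (usually unknowable in practice).
parameter = statistics.mean(population)

# Statistic: computed from a random sample of 1,500 drawn without replacement.
sample = rng.sample(population, 1_500)
statistic = statistics.mean(sample)

print(f"Population mean (parameter): {parameter:,.0f}")
print(f"Sample mean (statistic):     {statistic:,.0f}")
print(f"Relative error:              {abs(statistic - parameter) / parameter:.2%}")
```

In a real study only the statistic is available; the simulation can show both because the population was generated in code.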

Sampling Methods

Probability Sampling

Every member of the population has a known, non-zero probability of being selected. These methods allow for valid statistical inference.

1. Simple Random Sampling (SRS)

Every possible sample of size $n$ has an equal chance of being selected.

Simple Random Sampling Example

Scenario: Survey 100 students from a university of 10,000.

Method:

Get a list of all 10,000 students
Assign each a number (1-10,000)
Use a random number generator to select 100 numbers
Survey those 100 students

Pros: Simple, unbiased, easy to analyze
Cons: Need complete list, may miss subgroups
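The four steps above can be sketched in a few lines. This assumes the student list is simply the ID numbers 1 to 10,000; the seed is arbitrary and only there to make the draw repeatable.

```python
import random

rng = random.Random(2024)  # arbitrary seed for a repeatable draw

# Sampling frame: one ID per student, numbered 1..10,000.
student_ids = range(1, 10_001)

# Draw 100 IDs without replacement; every set of 100 is equally likely.
selected = rng.sample(student_ids, 100)

print(sorted(selected)[:10])  # first few selected IDs
```

`random.sample` draws without replacement, so no student can be selected twice.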

2. Stratified Sampling

Divide the population into strata (groups), then sample from each stratum.

Stratified Sampling Example

Scenario: Survey students about campus dining.

Method:

Divide students into strata: Freshmen, Sophomores, Juniors, Seniors
Randomly sample from each group proportionally (or equally)

Stratum	Population	Sample (10%)
Freshmen	3,000	300
Sophomores	2,500	250
Juniors	2,300	230
Seniors	2,200	220
Total	10,000	1,000

Pros: Ensures representation of all groups, more precise estimates
Cons: Must know stratum membership in advance
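A minimal sketch of proportional allocation, using the stratum sizes from the table above. The roster of student labels is hypothetical and built in code purely for illustration.

```python
import random

rng = random.Random(7)  # arbitrary seed for repeatability

strata_sizes = {"Freshmen": 3_000, "Sophomores": 2_500,
                "Juniors": 2_300, "Seniors": 2_200}
sampling_fraction = 0.10

# Hypothetical roster: one label per student, grouped by stratum.
roster = {name: [f"{name}-{i}" for i in range(1, size + 1)]
          for name, size in strata_sizes.items()}

# Draw a separate simple random sample inside each stratum.
sample = {}
for name, members in roster.items():
    n_h = round(len(members) * sampling_fraction)  # proportional allocation
    sample[name] = rng.sample(members, n_h)

for name, chosen in sample.items():
    print(name, len(chosen))          # 300, 250, 230, 220
print("Total", sum(len(c) for c in sample.values()))  # 1000
```

For equal allocation instead, `n_h` would be set to the same number in every stratum, and the analysis would then need weights to reflect the different stratum sizes.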

3. Cluster Sampling

Divide the population into clusters (often geographic), randomly select whole clusters, then survey everyone (or a random subsample) within the selected clusters.

Cluster Sampling Example

Scenario: Survey household income across a large city.

Method:

Divide city into 500 neighborhoods (clusters)
Randomly select 25 neighborhoods
Survey all (or some) households in those 25 neighborhoods

Pros: Practical when no population list exists, reduces travel costs
Cons: Usually less precise than an SRS of the same size (households within a cluster tend to be similar)
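The sketch below illustrates both variants mentioned above: surveying every household in the chosen neighborhoods (one-stage) and surveying a random subsample within each (two-stage). The city layout, the household counts per neighborhood, and the choice of 20 households per cluster are all assumptions made for illustration.

```python
import random

rng = random.Random(11)  # arbitrary seed for repeatability

# Hypothetical city: 500 neighborhoods with 80-300 households each.
neighborhoods = {
    f"N{k:03d}": [f"N{k:03d}-H{h}" for h in range(1, rng.randint(80, 300) + 1)]
    for k in range(1, 501)
}

# Stage 1: randomly select 25 whole clusters.
chosen_clusters = rng.sample(sorted(neighborhoods), 25)

# One-stage design: survey every household in the chosen clusters.
one_stage = [hh for c in chosen_clusters for hh in neighborhoods[c]]

# Two-stage design: survey a random 20 households in each chosen cluster.
two_stage = [hh for c in chosen_clusters
             for hh in rng.sample(neighborhoods[c], 20)]

print(len(chosen_clusters), len(one_stage), len(two_stage))
```

Note that only a list of neighborhoods is needed up front; household lists are required only for the 25 selected clusters.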

4. Systematic Sampling

Select every $k^{th}$ individual from a list, starting from a random point.

Systematic Sampling Example

Scenario: Quality control on a production line.

Method:

The line produces 1,000 items per day, and we want to test 50
Calculate the interval: $k = 1000/50 = 20$
Randomly pick a starting point between 1 and 20 (say, item #7)
Test items 7, 27, 47, 67, 87, … (every 20th item after the start)

Pros: Easy to implement, spreads sample across time/space
Cons: Can be biased if there’s a periodic pattern
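The interval arithmetic above translates directly into code. This is a minimal sketch that assumes items are numbered 1 to 1,000 in production order; the random start is drawn from 1 to $k$ so that every item has the same chance of being tested.

```python
import random

rng = random.Random(3)  # arbitrary seed for repeatability

N, n = 1_000, 50
k = N // n                      # sampling interval: 1000 / 50 = 20
start = rng.randint(1, k)       # random starting item in 1..20

# Every k-th item from the random start.
tested = list(range(start, N + 1, k))
print(start, len(tested), tested[:5])

# With the starting point used in the example above (item #7):
print(list(range(7, N + 1, k))[:5])   # [7, 27, 47, 67, 87]
```

If the line had a fault that recurred every 20 items, this sample would either catch it every time or miss it every time, which is the periodic-pattern risk noted above.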

Non-Probability Sampling

Selection is not random. These methods are useful for exploratory research, but their results cannot be reliably generalized to the population because selection probabilities are unknown.

1. Convenience Sampling

Sample whoever is easily available.

Convenience Sampling Examples

Surveying shoppers at one mall
Using students in your class for a psychology study
Polling followers on social media

Problem: Your sample may not represent the population at all.

2. Voluntary Response Sampling

People choose to participate (e.g., online polls, call-in surveys).

3. Snowball Sampling

Existing participants recruit future participants. Useful for hard-to-reach populations.

Snowball Sampling Example

Scenario: Study experiences of undocumented immigrants.

Method:

Find initial participants through community organizations
Ask them to refer others who might participate
Continue until the desired sample size is reached

Useful when: Population is hidden or hard to identify

Types of Studies

Observational Studies

Researchers observe and measure without intervening. No manipulation of variables.

Cross-Sectional Studies

Collect data at one point in time.

Cross-Sectional Study Example

Research question: Is coffee consumption associated with anxiety levels?

Method: Survey 1,000 adults today about their coffee consumption and anxiety symptoms.

Limitation: Can’t determine which came first (temporal ambiguity)

Longitudinal Studies

Follow the same subjects over time.

Longitudinal Study Example

Research question: Does childhood obesity predict adult health outcomes?

Method:

1990: Measure BMI of 5,000 children (age 10)
2000: Follow up (age 20)
2010: Follow up (age 30)
2020: Follow up (age 40)

Advantage: Can track changes over time
Challenge: Expensive, participant dropout

Case-Control Studies

Compare people with a condition (cases) to those without (controls).

Case-Control Study Example

Research question: Is smoking associated with lung cancer?

Method:

Cases: 500 people with lung cancer
Controls: 500 similar people without lung cancer
Compare smoking history between groups

Advantage: Efficient for rare diseases
Limitation: Relies on recall (potential bias)

Experimental Studies

Researchers actively intervene by assigning treatments to subjects.

Key Components of Experiments

1. Treatments: The conditions being compared (drug vs. placebo, new method vs. standard)

2. Experimental Units: The individuals receiving treatments (people, animals, plots of land)

3. Response Variable: The outcome being measured

4. Control Group: Group receiving no treatment or a placebo (for comparison)

5. Randomization: Random assignment of units to treatments (reduces bias)

6. Replication: Multiple units in each treatment group (allows estimation of variability)

Randomized Controlled Trial (RCT)

Research question: Does a new drug lower blood pressure?

Design:

Recruit 200 patients with high blood pressure
Randomly assign: 100 to drug, 100 to placebo
Neither patients nor doctors know who gets what (double-blind)
After 3 months, compare blood pressure between groups

Why this works:

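Randomization makes the groups similar at baseline on average, for both measured and unmeasured characteristics
Blinding keeps patient and researcher expectations from affecting the groups differently (any placebo effect applies equally to both groups)
A difference in outcomes larger than chance alone would explain can be attributed to the drug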
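The random-assignment step of the design can be sketched as follows. The patient IDs are hypothetical placeholders; shuffling the full list and splitting it in half gives every patient the same chance of landing in either group.

```python
import random

rng = random.Random(99)  # arbitrary seed; a real trial would document its procedure

# Hypothetical IDs for the 200 recruited patients.
patients = [f"P{i:03d}" for i in range(1, 201)]

# Shuffle a copy, then split: first 100 get the drug, the rest get placebo.
shuffled = patients[:]
rng.shuffle(shuffled)
drug_group = shuffled[:100]
placebo_group = shuffled[100:]

print(len(drug_group), len(placebo_group))
print(drug_group[:5], placebo_group[:5])
```

Because assignment depends only on the shuffle, neither patients nor doctors can steer sicker or healthier patients into a particular group.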

Observational vs. Experimental

Aspect	Observational	Experimental
Intervention	No	Yes
Causation	Cannot establish	Can establish
Ethics	Often more ethical	May raise ethical concerns
Real-world	More natural	More artificial
Confounding	Possible	Controlled through randomization
Example	Survey about diet and health	Clinical trial of new drug

Sources of Bias

Bias is systematic error that causes results to deviate from the truth in a consistent direction.

Selection Bias

The sample doesn’t represent the population.

Selection Bias Examples

1936 Literary Digest Poll: About 2.4 million people responded, and the poll predicted Landon would beat Roosevelt. It was wrong: the mailing list came from phone books and car registrations, which overrepresented wealthy voters.
Survivor bias: Studying successful entrepreneurs while ignoring failed ones
Self-selection: Only people interested in a topic volunteer to participate

Response Bias

Participants don’t answer truthfully or accurately.

Causes include:

Social desirability: People underreport drinking, overreport exercise
Leading questions: “Don’t you agree that taxes are too high?”
Question wording: “Assistance to the poor” vs. “Welfare” get different responses
Recall bias: People misremember past events

Measurement Bias

The measurement process itself introduces errors.

Measurement Bias Examples

Using bathroom scales that consistently read 2 lbs too high
Researchers who know the hypothesis unconsciously interpreting results favorably
Survey questions that are confusing or ambiguous

Non-Response Bias

People who don’t respond differ systematically from those who do.

Non-Response Bias Example

A phone survey about political opinions:

Only 30% of people answer the phone
Those who answer may be older (more landlines) or have more free time
Political opinions of responders may differ from non-responders
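A small simulation can show how this plays out. Every number below is invented for illustration: the population is half older and half younger, older people are assumed more likely both to answer the phone and to support a policy, and the overall response rate works out to roughly 30%.

```python
import random

rng = random.Random(5)  # arbitrary seed for repeatability

# Hypothetical population: (older?, supports policy?, answers phone?)
people = []
for _ in range(100_000):
    older = rng.random() < 0.5
    supports = rng.random() < (0.60 if older else 0.40)
    answers = rng.random() < (0.45 if older else 0.15)
    people.append((older, supports, answers))

# True support across everyone vs. support among those who answered.
true_support = sum(s for _, s, _ in people) / len(people)
responders = [p for p in people if p[2]]
observed_support = sum(s for _, s, _ in responders) / len(responders)
response_rate = len(responders) / len(people)

print(f"Response rate:    {response_rate:.1%}")
print(f"True support:     {true_support:.1%}")
print(f"Observed support: {observed_support:.1%}")
```

Under these assumptions the survey overstates support even with tens of thousands of responses, because the responders are disproportionately older. Collecting more responses the same way would not remove the gap.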

Survey Design Best Practices

Writing Good Questions

DO:

Use clear, simple language
Ask about one thing at a time
Provide balanced response options
Pilot test your survey

DON’T:

Use leading questions
Use double-barreled questions (“Do you like the taste and price?”)
Use jargon or technical terms
Make assumptions (“When did you stop smoking?”)

Good vs. Bad Survey Questions

Bad: “Don’t you think the government wastes too much money?”

Leading, suggests expected answer

Good: “How well do you think the government manages taxpayer money?”

Very well / Well / Neither well nor poorly / Poorly / Very poorly

Bad: “How often do you exercise and eat healthy?”

Double-barreled (two questions in one)

Good: Two separate questions about exercise and diet

Response Formats

Type	When to Use	Example
Multiple choice	Finite, known options	“What is your major?”
Likert scale	Measure attitudes/opinions	“Strongly disagree” to “Strongly agree”
Rating scale	Measure intensity	“Rate your pain from 0-10”
Open-ended	Exploratory, want details	“What could we improve?”
Ranking	Relative preferences	“Rank these 5 features by importance”

Designing Your Study: A Checklist

Define your research question clearly
Identify the population of interest
Choose study type (observational vs. experimental)
Select sampling method (probability preferred)
Determine sample size (a larger sample reduces sampling variability, but it does not correct bias)
Design instruments (surveys, measurements)
Consider potential biases and how to minimize them
Plan data collection logistics
Pilot test before full implementation
Document everything for reproducibility
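For the sample-size item in the checklist, the sketch below illustrates what “more precise” means. It assumes a made-up, roughly normal population and repeatedly draws simple random samples of three sizes, then measures how much the sample means vary from draw to draw.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed for repeatability

# Hypothetical population: 50,000 values centered near 50 with spread near 10.
population = [rng.gauss(50, 10) for _ in range(50_000)]

def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the sample mean across repeated samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

spreads = {n: spread_of_sample_means(n) for n in (100, 400, 1_600)}
for n, s in spreads.items():
    print(f"n = {n:>5}: spread of sample means = {s:.2f}")
```

Each fourfold increase in sample size roughly halves the spread, so precision improves with the square root of the sample size rather than in direct proportion. None of this helps if the sampling method itself is biased.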

Summary

In this lesson, you learned:

Populations are entire groups; samples are subsets we actually study
Probability sampling (SRS, stratified, cluster, systematic) allows valid inference
Non-probability sampling (convenience, voluntary response, snowball) does not support reliable generalization to populations
Observational studies measure without intervention; on their own they cannot establish causation
Experiments with randomization can establish causation
Bias (selection, response, measurement, non-response) threatens validity
Good survey design uses clear, neutral, focused questions

Practice Problems

1. A researcher wants to study smartphone usage among teenagers. She surveys students at her local high school during lunch. What type of sampling is this, and what are potential problems?

2. A pharmaceutical company tests a new pain medication by randomly assigning 100 patients to receive the drug and 100 to receive a sugar pill. Neither patients nor researchers know who received what. What type of study is this? What features make it rigorous?

3. An online news site asks readers to vote on whether they support a new policy. 75% vote “No.” Can we conclude that 75% of the population opposes the policy? Why or why not?

Answers

1. This is convenience sampling. Problems:

Only represents one school (not all teenagers)
Students at lunch may differ from those elsewhere
Results cannot be generalized to all teenagers

2. This is a randomized controlled trial (RCT) that is double-blind. Rigorous features:

Random assignment (guards against selection bias in who receives the drug)
Control group (provides comparison)
Double-blinding (keeps patient and researcher expectations from affecting the groups differently)
Replication (100 per group allows statistical analysis)

3. No. This is voluntary response sampling. People with strong opinions (often negative ones) are more likely to respond, and the site’s readers may not resemble the general population. The 75% figure may overstate true opposition, and the poll gives no reliable estimate of it.

Next Steps

Now that you understand data collection:

Summarizing Data with Tables - Organize your collected data
Sampling Distributions - Understand sampling variability
Sampling Methods - Deep dive into sampling techniques
