Data Collection Methods
Master data collection: sampling techniques, surveys, experiments, and observational studies. Learn to design research.
Why Data Collection Matters
The quality of your statistical conclusions is only as good as the data you collect. Poor data collection leads to:
- Biased estimates that systematically miss the truth
- Invalid conclusions that don’t apply to your target population
- Wasted resources on studies that can’t answer your questions
- Misleading results that could harm decision-making
Populations and Samples
Key Definitions
| Term | Definition | Example |
|---|---|---|
| Population | The entire group you want to study | All U.S. adults |
| Sample | A subset of the population you actually study | 1,500 surveyed adults |
| Parameter | A number describing the population (usually unknown) | True average income of all U.S. adults |
| Statistic | A number calculated from the sample | Average income of 1,500 surveyed adults |
Why Sample?
Studying an entire population is often:
- Impossible (can’t test every light bulb without destroying them all)
- Impractical (surveying 330 million Americans is costly)
- Unnecessary (a well-designed sample can provide excellent estimates)
The goal: Use sample statistics to estimate population parameters.
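As an illustration of a statistic estimating a parameter, the sketch below builds a made-up population of incomes, computes its mean (the parameter), and compares it with the mean of a random sample of 1,500 (the statistic). The lognormal shape, its parameters, and the population size are assumptions chosen only for the demonstration, not real income data.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 incomes from a right-skewed (lognormal)
# distribution. These are made-up values, not real income data.
population = [rng.lognormvariate(10.8, 0.6) for _ in range(100_000)]
parameter = statistics.fmean(population)  # population mean (usually unknown)

# Draw a simple random sample of 1,500 and compute the same summary.
sample = rng.sample(population, 1_500)
statistic = statistics.fmean(sample)  # sample mean (what we can observe)

print(f"Parameter (population mean): {parameter:,.0f}")
print(f"Statistic (sample mean):     {statistic:,.0f}")
```

In practice only the statistic is available; the simulation can print both because the whole population was generated.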
Sampling Methods
Probability Sampling
Every member of the population has a known, non-zero probability of being selected. These methods allow for valid statistical inference.
1. Simple Random Sampling (SRS)
Every possible sample of size n has an equal chance of being selected.
Scenario: Survey 100 students from a university of 10,000.
Method:
- Get a list of all 10,000 students
- Assign each a number (1-10,000)
- Use a random number generator to select 100 numbers
- Survey those 100 students
Pros: Simple, unbiased, easy to analyze
Cons: Need complete list, may miss subgroups
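The university scenario can be sketched in a few lines of Python. The list of ID numbers stands in for a real enrollment list, and the function name `simple_random_sample` is a hypothetical helper, not part of any library.

```python
import random

# Hypothetical sampling frame: student ID numbers 1..10,000.
student_ids = list(range(1, 10_001))

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)  # sampling without replacement

srs = simple_random_sample(student_ids, 100, seed=42)
print(len(srs), "students selected; lowest five IDs:", sorted(srs)[:5])
```

`random.sample` draws without replacement, so no student can be selected twice.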
2. Stratified Sampling
Divide the population into strata (groups), then sample from each stratum.
Scenario: Survey students about campus dining.
Method:
- Divide students into strata: Freshmen, Sophomores, Juniors, Seniors
- Randomly sample from each group proportionally (or equally)
| Stratum | Population | Sample (10%) |
|---|---|---|
| Freshmen | 3,000 | 300 |
| Sophomores | 2,500 | 250 |
| Juniors | 2,300 | 230 |
| Seniors | 2,200 | 220 |
| Total | 10,000 | 1,000 |
Pros: Ensures representation of all groups, more precise estimates
Cons: Must know stratum membership in advance
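A minimal sketch of proportional allocation, using the stratum sizes from the table. The rosters are placeholders (plain index ranges), and `stratified_sample` is a hypothetical helper written for this example.

```python
import random

# Stratum sizes from the table; the rosters themselves are placeholders.
strata_sizes = {
    "Freshmen": 3000,
    "Sophomores": 2500,
    "Juniors": 2300,
    "Seniors": 2200,
}

def stratified_sample(sizes, fraction, seed=None):
    """Take a simple random sample of the same fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, size in sizes.items():
        roster = range(size)              # stand-in for the real class roster
        n_h = round(size * fraction)      # proportional allocation
        sample[name] = rng.sample(roster, n_h)
    return sample

sample = stratified_sample(strata_sizes, fraction=0.10, seed=1)
for name, members in sample.items():
    print(f"{name}: {len(members)} sampled")
```

For equal allocation instead, `n_h` would be set to the same fixed number in every stratum.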
3. Cluster Sampling
Divide population into clusters (often geographic), randomly select entire clusters, then sample everyone (or a random sample) within selected clusters.
Scenario: Survey household income across a large city.
Method:
- Divide city into 500 neighborhoods (clusters)
- Randomly select 25 neighborhoods
- Survey all (or some) households in those 25 neighborhoods
Pros: Practical when no population list exists, reduces travel costs
Cons: Usually less precise than an SRS of the same size (households within a cluster tend to be similar)
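The neighborhood scenario might be sketched as follows. The household counts per neighborhood are invented for the example, and `cluster_sample` is a hypothetical helper that covers both variants: surveying every household in the chosen clusters, or a random subset within each.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical city: 500 neighborhoods, each with 40 to 200 household IDs.
neighborhoods = {
    c: [f"N{c}-H{h}" for h in range(rng.randint(40, 200))]
    for c in range(500)
}

def cluster_sample(clusters, n_clusters, rng, per_cluster=None):
    """Randomly pick whole clusters, then take all or some units in each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = []
    for c in chosen:
        households = clusters[c]
        if per_cluster is None or per_cluster >= len(households):
            units.extend(households)                           # one-stage
        else:
            units.extend(rng.sample(households, per_cluster))  # two-stage
    return chosen, units

chosen, one_stage = cluster_sample(neighborhoods, 25, rng)
_, two_stage = cluster_sample(neighborhoods, 25, rng, per_cluster=10)
print(len(chosen), "neighborhoods;", len(one_stage), "households (one-stage)")
print(len(two_stage), "households (two-stage, 10 per neighborhood)")
```

Only the list of clusters is needed up front; household lists are required only for the 25 selected neighborhoods.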
4. Systematic Sampling
Select every k-th individual from a list, starting from a random point.
Scenario: Quality control on a production line.
Method:
- Production makes 1,000 items per day, want to test 50
- Calculate interval: k = 1,000 / 50 = 20
- Randomly pick a starting point among the first 20 items (say, item #7)
- Test items: 7, 27, 47, 67, 87, … (every 20th)
Pros: Easy to implement, spreads sample across time/space
Cons: Can be biased if there’s a periodic pattern
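The production-line example can be reproduced with a short function. `systematic_sample` is a hypothetical helper; it assumes the population size is a multiple of the sample size, as it is here (1,000 / 50 = 20).

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return item numbers start, start + k, start + 2k, ... (1-indexed)."""
    k = population_size // sample_size  # sampling interval
    if start is None:
        # Random starting point within the first interval, 1..k.
        start = random.Random(seed).randint(1, k)
    return list(range(start, population_size + 1, k))[:sample_size]

items = systematic_sample(1000, 50, start=7)
print(items[:5])   # [7, 27, 47, 67, 87]
print(len(items))  # 50
```

If the line had a defect recurring every 20 items, every start would either always hit it or always miss it, which is the periodic-pattern risk noted in the cons.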
Non-Probability Sampling
Selection is not random. Useful for exploratory research, but results cannot be reliably generalized to the population.
1. Convenience Sampling
Sample whoever is easily available.
- Surveying shoppers at one mall
- Using students in your class for a psychology study
- Polling followers on social media
Problem: Your sample may not represent the population at all.
2. Voluntary Response Sampling
People choose to participate (e.g., online polls, call-in surveys).
3. Snowball Sampling
Existing participants recruit future participants. Useful for hard-to-reach populations.
Scenario: Study experiences of undocumented immigrants.
Method:
- Find initial participants through community organizations
- Ask them to refer others who might participate
- Continue until desired sample size reached
Useful when: Population is hidden or hard to identify
Types of Studies
Observational Studies
Researchers observe and measure without intervening. No manipulation of variables.
Cross-Sectional Studies
Collect data at one point in time.
Research question: Is coffee consumption associated with anxiety levels?
Method: Survey 1,000 adults today about their coffee consumption and anxiety symptoms.
Limitation: Can’t determine which came first (temporal ambiguity)
Longitudinal Studies
Follow the same subjects over time.
Research question: Does childhood obesity predict adult health outcomes?
Method:
- 1990: Measure BMI of 5,000 children (age 10)
- 2000: Follow up (age 20)
- 2010: Follow up (age 30)
- 2020: Follow up (age 40)
Advantage: Can track changes over time
Challenge: Expensive, participant dropout
Case-Control Studies
Compare people with a condition (cases) to those without (controls).
Research question: Is smoking associated with lung cancer?
Method:
- Cases: 500 people with lung cancer
- Controls: 500 similar people without lung cancer
- Compare smoking history between groups
Advantage: Efficient for rare diseases
Limitation: Relies on recall (potential bias)
Experimental Studies
Researchers actively intervene by assigning treatments to subjects.
Key Components of Experiments
1. Treatments: The conditions being compared (drug vs. placebo, new method vs. standard)
2. Experimental Units: The individuals receiving treatments (people, animals, plots of land)
3. Response Variable: The outcome being measured
4. Control Group: Group receiving no treatment or a placebo (for comparison)
5. Randomization: Random assignment of units to treatments (reduces bias)
6. Replication: Multiple units in each treatment group (allows estimation of variability)
Research question: Does a new drug lower blood pressure?
Design:
- Recruit 200 patients with high blood pressure
- Randomly assign: 100 to drug, 100 to placebo
- Neither patients nor doctors know who gets what (double-blind)
- After 3 months, compare blood pressure between groups
Why this works:
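- Randomization makes the groups similar on average at baseline, for both measured and unmeasured characteristics
- The placebo and blinding keep expectation effects equal in both groups and stop researchers’ expectations from influencing measurements
- A difference in outcomes larger than chance alone would produce can therefore be attributed to the drug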
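Random assignment for the blood pressure trial could look like the sketch below: shuffle the patient list, then split it into equal groups. The patient IDs and the `randomize` helper are hypothetical, and real trials often use more elaborate schemes (such as blocked randomization) that this example does not attempt.

```python
import random

def randomize(units, treatments, seed=None):
    """Shuffle the units and split them into equal-sized treatment groups."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    group_size = len(shuffled) // len(treatments)
    return {
        t: shuffled[i * group_size:(i + 1) * group_size]
        for i, t in enumerate(treatments)
    }

# Hypothetical patient identifiers P001..P200.
patients = [f"P{i:03d}" for i in range(1, 201)]
assignment = randomize(patients, ["drug", "placebo"], seed=2024)

for treatment, group in assignment.items():
    print(treatment, len(group))
```

Because chance alone decides each patient's group, neither patients nor researchers can steer healthier patients toward the drug.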
Observational vs. Experimental
| Aspect | Observational | Experimental |
|---|---|---|
| Intervention | No | Yes |
| Causation | Cannot establish | Can establish |
| Ethics | Often more ethical | May raise ethical concerns |
| Real-world | More natural | More artificial |
| Confounding | Possible | Controlled through randomization |
| Example | Survey about diet and health | Clinical trial of new drug |
Sources of Bias
Bias is systematic error that causes results to deviate from the truth in a consistent direction.
Selection Bias
The sample doesn’t represent the population.
- 1936 Literary Digest Poll: Based on about 2.4 million responses, the magazine predicted Landon would beat Roosevelt; Roosevelt won in a landslide. The mailing list came from phone books and car registrations, which overrepresented wealthy voters.
- Survivor bias: Studying successful entrepreneurs while ignoring failed ones
- Self-selection: Only people interested in a topic volunteer to participate
Response Bias
Participants don’t answer truthfully or accurately.
Causes include:
- Social desirability: People underreport drinking, overreport exercise
- Leading questions: “Don’t you agree that taxes are too high?”
- Question wording: “Assistance to the poor” vs. “Welfare” get different responses
- Recall bias: People misremember past events
Measurement Bias
The measurement process itself introduces errors.
- Using bathroom scales that consistently read 2 lbs too high
- Researchers who know the hypothesis unconsciously interpreting results favorably
- Survey questions that are confusing or ambiguous
Non-Response Bias
People who don’t respond differ systematically from those who do.
A phone survey about political opinions:
- Only 30% of people answer the phone
- Those who answer may be older (more landlines) or have more free time
- Political opinions of responders may differ from non-responders
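A small calculation shows how this can distort an estimate. All numbers below are invented for illustration: two age groups with different support for a policy and different response rates, chosen so that the overall response rate comes out at 30%.

```python
# Hypothetical population shares, true support rates, and response rates.
groups = {
    "younger": {"share": 0.60, "support": 0.40, "response_rate": 0.20},
    "older":   {"share": 0.40, "support": 0.60, "response_rate": 0.45},
}

# True support: weight each group's support by its population share.
true_support = sum(g["share"] * g["support"] for g in groups.values())

# Fraction of the whole population that responds.
response_rate = sum(g["share"] * g["response_rate"] for g in groups.values())

# Support among responders: each group is weighted by share * response rate.
observed_support = sum(
    g["share"] * g["response_rate"] * g["support"] for g in groups.values()
) / response_rate

print(f"Overall response rate:    {response_rate:.0%}")     # 30%
print(f"True support:             {true_support:.0%}")      # 48%
print(f"Support among responders: {observed_support:.0%}")  # 52%
```

Under these assumed numbers the survey would report majority support even though most of the population is opposed, and collecting more responses at the same response rates would not fix it.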
Survey Design Best Practices
Writing Good Questions
DO:
- Use clear, simple language
- Ask about one thing at a time
- Provide balanced response options
- Pilot test your survey
DON’T:
- Use leading questions
- Use double-barreled questions (“Do you like the taste and price?”)
- Use jargon or technical terms
- Make assumptions (“When did you stop smoking?”)
Bad: “Don’t you think the government wastes too much money?”
- Leading, suggests expected answer
Good: “How well do you think the government manages taxpayer money?”
- Excellent / Good / Fair / Poor / Very Poor
Bad: “How often do you exercise and eat healthy?”
- Double-barreled (two questions in one)
Good: Two separate questions about exercise and diet
Response Formats
| Type | When to Use | Example |
|---|---|---|
| Multiple choice | Finite, known options | “What is your major?” |
| Likert scale | Measure attitudes/opinions | “Strongly disagree to Strongly agree” |
| Rating scale | Measure intensity | “Rate your pain from 0-10” |
| Open-ended | Exploratory, want details | “What could we improve?” |
| Ranking | Relative preferences | “Rank these 5 features by importance” |
Designing Your Study: A Checklist
- Define your research question clearly
- Identify the population of interest
- Choose study type (observational vs. experimental)
- Select sampling method (probability preferred)
- Determine sample size (larger samples give more precise estimates but do not correct bias)
- Design instruments (surveys, measurements)
- Consider potential biases and how to minimize them
- Plan data collection logistics
- Pilot test before full implementation
- Document everything for reproducibility
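For the sample-size step, a rough sense of "larger = more precise" comes from the usual approximate 95% margin of error for a proportion, z * sqrt(p(1 - p) / n). The sketch below assumes a simple random sample and uses the worst case p = 0.5; other designs (stratified, cluster) need different formulas.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from an SRS of size n."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 400, 1_500, 6_000):
    print(f"n = {n:>5}: +/- {margin_of_error(n):.1%}")
```

Precision improves with the square root of n, so quadrupling the sample size only halves the margin of error.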
Summary
In this lesson, you learned:
- Populations are entire groups; samples are subsets we actually study
- Probability sampling (SRS, stratified, cluster, systematic) allows valid inference
- Non-probability sampling (convenience, voluntary response, snowball) does not support generalizing to populations
- Observational studies measure without intervention; cannot prove causation
- Experiments with randomization can establish causation
- Bias (selection, response, measurement, non-response) threatens validity
- Good survey design uses clear, neutral, focused questions
Practice Problems
1. A researcher wants to study smartphone usage among teenagers. She surveys students at her local high school during lunch. What type of sampling is this, and what are potential problems?
2. A pharmaceutical company tests a new pain medication by randomly assigning 100 patients to receive the drug and 100 to receive a sugar pill. Neither patients nor researchers know who received what. What type of study is this? What features make it rigorous?
3. An online news site asks readers to vote on whether they support a new policy. 75% vote “No.” Can we conclude that 75% of the population opposes the policy? Why or why not?
Answers
1. This is convenience sampling. Problems:
- Only represents one school (not all teenagers)
- Students at lunch may differ from those elsewhere
- Results cannot be generalized to all teenagers
2. This is a randomized controlled trial (RCT) that is double-blind. Rigorous features:
- Random assignment (guards against selection bias and balances confounders on average)
- Control group (provides comparison)
- Double-blinding (keeps placebo effects equal across groups and prevents researcher bias)
- Replication (100 per group allows statistical analysis)
3. No! This is voluntary response sampling. People with strong opinions (especially negative) are more likely to respond. The 75% likely overestimates true opposition in the population.
Next Steps
Now that you understand data collection:
- Summarizing Data with Tables - Organize your collected data
- Sampling Distributions - Understand sampling variability
- Sampling Methods - Deep dive into sampling techniques