Question:
Sampling and Population - Let's create some population data to work with. Be aware that we're "playing god" here: in reality we almost never have the full population of interest. The point is just to understand what is really happening when we collect a sample of data and try to draw conclusions about a broader population based on that sample.
Hypothetical scenario. Right now, you don't need to focus on understanding this code.
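library(tidyverse)
# Draw n Bernoulli(p) values (0/1); p may be a vector of per-person probabilities
rbern = \(n, p) rbinom(n, 1, p)
set.seed(1)  # make the simulated population reproducible
n = 5e5      # population size
population = tibble(
  white = rbern(n, 0.55),
  # preschool participation depends on race
  preschool = rbern(n, ifelse(white, 0.05, 0.15)),
  # college attendance depends on both race and preschool
  college = rbern(n, case_when(
    white & preschool ~ 0.5,
    white & !preschool ~ 0.4,
    !white & preschool ~ 0.3,
    !white & !preschool ~ 0.1
  ))
)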
After running this code, the resulting population data will be saved in a dataframe called population. This population represents all adults in a particular US state. There are three variables of interest:
white: does the individual identify as white
preschool: was this person a participant in a government-funded preschool program
college: did this person end up attending any college
Let's do some analysis. Share the R code. Submit your code, plot, and description.
1. Use the population data to calculate the probability of attending college given preschool.
2(a). Sample 200 random people from the population into a new dataframe. From that sample, calculate the proportion of people who attended college among those who attended preschool.
(b) Repeat this sampling process 1,000 times and record that sample proportion each time. Consider this our "population" of all possible survey results.
(c) Generate a histogram of all the sample proportions across different samples. In a sentence, describe what this plot looks like compared to the true population proportion of interest. Is the average survey result close to what you know the population proportion to be? In a rough sense, how variable are the survey results?
3. Consider a survey result "accurate" if it's within 0.1 of the known, true value that you calculated in Question 1. Treating the many repeated survey results as the "population" of all possible survey results, calculate the probability that a survey result differs from the true value by more than 0.1 (i.e. the probability that the result is not "accurate").
Presume that nonwhite individuals are less likely to respond to our survey. We can simulate taking a single random sample that is nonrepresentative of the population in this way using the following code:
sample = population %>%
  slice_sample(n = 200, weight_by = ifelse(white, 0.5, 0.2))
If you look at the resulting data, you will see that the proportion of white individuals is much higher than in the population data (though by how much will vary since the sample is random).
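To see this overrepresentation concretely, here is a minimal sketch that rebuilds the population and compares the share of white individuals in the population with the share in one weighted sample. The names biased_sample, prop_white_population, and prop_white_sample are hypothetical, chosen only for this illustration. Given the weights 0.5 and 0.2 and a population that is about 55% white, the sample share should land somewhere around 0.275 / (0.275 + 0.09), roughly 0.75, though any single sample will differ.

```r
library(tidyverse)

# Rebuild the population exactly as in the setup code.
rbern = \(n, p) rbinom(n, 1, p)
set.seed(1)
n = 5e5
population = tibble(
  white = rbern(n, 0.55),
  preschool = rbern(n, ifelse(white, 0.05, 0.15)),
  college = rbern(n, case_when(
    white & preschool ~ 0.5,
    white & !preschool ~ 0.4,
    !white & preschool ~ 0.3,
    !white & !preschool ~ 0.1
  ))
)

# One nonrepresentative sample: white individuals get sampling weight 0.5,
# nonwhite individuals get weight 0.2 (the scheme from the question).
# biased_sample is a hypothetical name used only in this sketch.
biased_sample = population %>%
  slice_sample(n = 200, weight_by = ifelse(white, 0.5, 0.2))

# Share of white individuals in the full population vs. in the sample.
prop_white_population = mean(population$white)
prop_white_sample = mean(biased_sample$white)
print(c(population = prop_white_population, sample = prop_white_sample))
```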
4. Repeat Questions 2 and 3, but using the nonrepresentative sampling scheme described above (this may take a few minutes to run). Comment on how the results differ from those in Questions 2 and 3.
5. Let's say you ran a real survey with n = 200 that you are sure was a fair random sample (you paid project members to go door-to-door and make sure they got responses). A rival research team was less careful in their recruitment and as a result was able to afford more responses (n = 400) but possibly from a non-representative (e.g. whiter) population. Both produce an estimate of the unknown population quantity P(college | preschool). Which is better?
6. In a sentence or two, give a reason for why you would prefer to trust the smaller, more representative survey. Now give a reason for why you would prefer to trust the larger, less representative survey.
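As a starting point, the representative-sampling workflow in Questions 1 to 3 could be sketched as follows. This is one possible approach rather than the intended solution, and the names p_true, survey_estimate, estimates, histogram, and p_inaccurate are hypothetical. From the generating probabilities, P(college | preschool) = 0.034 / 0.095, about 0.358, so the simulated population value should fall close to that.

```r
library(tidyverse)

# Rebuild the population exactly as in the setup code.
rbern = \(n, p) rbinom(n, 1, p)
set.seed(1)
n = 5e5
population = tibble(
  white = rbern(n, 0.55),
  preschool = rbern(n, ifelse(white, 0.05, 0.15)),
  college = rbern(n, case_when(
    white & preschool ~ 0.5,
    white & !preschool ~ 0.4,
    !white & preschool ~ 0.3,
    !white & !preschool ~ 0.1
  ))
)

# Question 1: P(college | preschool) computed on the full population.
p_true = population %>%
  filter(preschool == 1) %>%
  summarize(p = mean(college)) %>%
  pull(p)

# Question 2(a): one simple random sample, returning the proportion who
# attended college among sampled preschool participants.
# (Returns NaN in the unlikely event a sample has no preschool participants.)
survey_estimate = function(n_survey = 200) {
  population %>%
    slice_sample(n = n_survey) %>%
    filter(preschool == 1) %>%
    summarize(p = mean(college)) %>%
    pull(p)
}

# Question 2(b): repeat the survey 1,000 times.
estimates = map_dbl(1:1000, \(i) survey_estimate())

# Question 2(c): histogram of the estimates, with the true value marked.
histogram = ggplot(tibble(estimate = estimates), aes(x = estimate)) +
  geom_histogram(bins = 30) +
  geom_vline(xintercept = p_true, linetype = "dashed")

# Question 3: share of surveys that miss the true value by more than 0.1.
p_inaccurate = mean(abs(estimates - p_true) > 0.1, na.rm = TRUE)

print(c(p_true = p_true,
        mean_estimate = mean(estimates, na.rm = TRUE),
        sd_estimate = sd(estimates, na.rm = TRUE),
        p_inaccurate = p_inaccurate))
```

Printing histogram displays the plot. For Question 4, the same sketch could be reused by adding weight_by = ifelse(white, 0.5, 0.2) to the slice_sample() call inside survey_estimate.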