Open RStudio (or RStudio Cloud) to get started.
We will be using the diamonds dataset, which ships with the ggplot2 package and is loaded as part of the tidyverse. So start by running library(tidyverse)
Open the diamonds data by running the code: View(diamonds). Each row represents one diamond from a collection of almost 54,000 (53,940 rows).
Take a look at the documentation for diamonds by running the code: ?diamonds
Question 1: Create a histogram of the price variable. (For all histograms in this assignment, use the base R function hist.) Also calculate the mean and standard deviation of this variable.
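As a minimal sketch of Question 1, assuming the tidyverse is installed: the names true_mean and true_sd are illustrative choices, not names the assignment requires, and the plot title and axis label are optional extras.

```r
library(tidyverse)  # attaches ggplot2, which provides the diamonds data

# Histogram of every price in the dataset, using base R's hist()
hist(diamonds$price,
     main = "Diamond prices",
     xlab = "Price (US dollars)")

# Mean and standard deviation of the full price column.
# Later questions treat these as the "true" values.
true_mean <- mean(diamonds$price)  # illustrative name
true_sd   <- sd(diamonds$price)    # illustrative name

true_mean
true_sd
```

Note that sd() divides by n - 1 rather than n. With this many rows the difference is negligible.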
Question 2: Take a random sample of 50 diamond prices from this dataset and name this vector fifty_diam. (If saved properly, you will see this vector of length 50 in your global environment!) Sample without replacement (this is the default behavior of the sample function). Create a histogram of your sample, and then calculate the mean and standard deviation of this sample.
Include the image of your histogram in your report
Include the mean and standard deviation values
How much (absolute) error is there in your sample mean as an estimate of the true mean?
How much (absolute) error is there in your sample standard deviation (SD) as an estimate of the true SD?
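One way Question 2 could be approached is sketched below. The set.seed call is an assumption added only so the sketch is reproducible; the assignment does not ask for it. Apart from fifty_diam, the variable names are illustrative.

```r
library(tidyverse)  # attaches ggplot2, which provides the diamonds data

set.seed(1)  # assumption: fixed seed, only for reproducibility of this sketch

# Draw 50 prices; sample() uses replace = FALSE unless told otherwise
fifty_diam <- sample(diamonds$price, size = 50)

hist(fifty_diam,
     main = "Sample of 50 diamond prices",
     xlab = "Price (US dollars)")

sample_mean <- mean(fifty_diam)  # illustrative name
sample_sd   <- sd(fifty_diam)    # illustrative name

# Absolute error = |estimate - true value|
abs_error_mean <- abs(sample_mean - mean(diamonds$price))
abs_error_sd   <- abs(sample_sd - sd(diamonds$price))

abs_error_mean
abs_error_sd
```

Your own numbers will differ from run to run unless you also fix a seed.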
Question 3: Next, set up a for loop to simulate taking a sample of size 50 at least 10,000 times. Inside your loop, calculate the mean price and save it to a vector called means. Here are two tips:
Remember before the loop to define means = NULL so that your loop knows where to save the means.
Remember inside the loop to index your means vector with the loop counter (for example, means[i]) so that each iteration of the loop fills the next position in the vector.
Try running the loop 10 times to ensure it works. This should be instantaneous. Then try running it 10,000 times. The loop should only take a few seconds to complete at 10,000 simulations, so if you wait more than a minute, click the stop button and see if something is defined incorrectly.
After successfully running your simulation, create a histogram of your means vector. Just use the hist() function rather than ggplot.
Include the image of your histogram in your report
Include the R code you used to generate this loop
Briefly describe the shape of your histogram. Is this a symmetric distribution or would you say it’s skewed? How does this relate to the Central Limit Theorem we learned in class?
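The two tips in Question 3 can be put together roughly as follows. This is a sketch rather than the only valid loop; n_sims and one_sample are illustrative names.

```r
library(tidyverse)  # attaches ggplot2, which provides the diamonds data

n_sims <- 10000  # illustrative name; try 10 first to check the loop works

means <- NULL    # defined before the loop so there is somewhere to save results

for (i in 1:n_sims) {
  # Fresh sample of 50 prices on every iteration (without replacement)
  one_sample <- sample(diamonds$price, size = 50)

  # The index i stores this iteration's mean in position i
  means[i] <- mean(one_sample)
}

hist(means,
     main = "Sample means, n = 50",
     xlab = "Mean price (US dollars)")
```

Growing a vector from NULL is fine at this scale. For much larger simulations, preallocating with numeric(n_sims) would be the more common idiom.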
Question 4: As you should notice from your histogram, our sample means will vary with each sample we take. Calculate the standard deviation of the means vector.
Report the standard deviation of the simulated means
We completed a finite number of simulations (10,000), but what is this value approximating? Report the name of this measure and calculate the true value for this measure too.
Question 5: Repeat question 3, but with a sample size of 10 instead of 50. Call your vector of sample means means_ten. After successfully running your simulation, create a histogram of your means_ten vector using the hist function again.
Include the image of your histogram in your report
Include the R code you used to generate this loop
Briefly describe the shape of your histogram. Is this a symmetric distribution or would you say it’s skewed? How does this relate to the Central Limit Theorem we learned in class?
Is the standard deviation of the simulated means higher or lower than it was for n = 50?
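For the n = 10 versus n = 50 comparison in Question 5, one option is to wrap the Question 3 loop in a small function so the sample size becomes an argument. The helper simulate_means below is hypothetical and not part of the assignment; it is only a sketch.

```r
library(tidyverse)  # attaches ggplot2, which provides the diamonds data

# Hypothetical helper: repeats the Question 3 loop for any sample size n
simulate_means <- function(x, n, n_sims = 10000) {
  out <- NULL
  for (i in 1:n_sims) {
    out[i] <- mean(sample(x, size = n))
  }
  out
}

means_ten <- simulate_means(diamonds$price, n = 10)
means     <- simulate_means(diamonds$price, n = 50)

hist(means_ten,
     main = "Sample means, n = 10",
     xlab = "Mean price (US dollars)")

# Spread of the simulated means at each sample size
sd(means_ten)
sd(means)
```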
Question 6: We spent some time exploring the behavior of the sample mean, but now let’s look at the sample median! Redo question 3 with a sample size of 50, but now calculate the sample median inside your loop. Call your vector of sample medians medians_fifty. After successfully running your simulation, create a histogram of your medians_fifty vector using the hist function again.
Include the image of your histogram in your report
Include the R code you used to generate this loop
Briefly describe the shape of your histogram. Is this a symmetric distribution or would you say it’s skewed? Do you have any predictions for what would happen if we repeated this simulation again, but with a much larger sample size?
Calculate and report the standard deviation of the medians_fifty vector. This approximates the typical error in the median of a random sample of 50 as an estimate of the true median.
Question 7: Repeat question 6, but with a sample size of 500.
Include the image of your histogram in your report
Include the R code you used to generate this loop
Briefly describe the shape of your histogram. How has the shape changed in comparison to the distribution of sample medians when we took samples of size 50?
Calculate and report the standard deviation of your newest vector of medians. How does this expected error compare to when we had samples of size 50? Is this expected or surprising to you?
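Questions 6 and 7 differ from Question 3 only in the statistic and the sample size, so a sketch can pass both in as arguments. The helper simulate_stat and the name medians_five_hundred are hypothetical; the assignment only names medians_fifty.

```r
library(tidyverse)  # attaches ggplot2, which provides the diamonds data

# Hypothetical helper: applies any summary function `stat` to repeated samples
simulate_stat <- function(x, n, stat, n_sims = 10000) {
  out <- NULL
  for (i in 1:n_sims) {
    out[i] <- stat(sample(x, size = n))
  }
  out
}

medians_fifty        <- simulate_stat(diamonds$price, n = 50,  stat = median)
medians_five_hundred <- simulate_stat(diamonds$price, n = 500, stat = median)

hist(medians_fifty,
     main = "Sample medians, n = 50",
     xlab = "Median price (US dollars)")
hist(medians_five_hundred,
     main = "Sample medians, n = 500",
     xlab = "Median price (US dollars)")

# Spread of the simulated medians at each sample size
sd(medians_fifty)
sd(medians_five_hundred)
```

Passing the statistic as a function keeps one loop for both questions. Writing two separate loops, as in Question 3, works equally well.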