The file DVD Movies.xlsx contains a large data set of 10,000 customer transactions for a fictional chain of video stores in the United States. Each row corresponds to a different customer and lists
(1) A customer ID number (1–10,000),
(2) The state where the customer lives,
(3) The city where the customer lives,
(4) The customer’s gender,
(5) The customer’s favorite type of movie (drama, comedy, science fiction, or action),
(6) The customer’s next favorite type of movie,
(7) The number of times the customer has rented movies in the past year, and
(8) The total dollar amount the customer has spent on movie rentals during the past year.
The data are sorted by state, then city, then gender. We assume that this data set represents the entire population of customers for this video chain. Imagine that only the data in columns A through D are readily available for this population. The company is interested in summary statistics of the data in columns E through H, such as the percentage of customers whose favorite movie type is drama or the average amount spent annually per customer, but it will have to do some work to obtain the data in columns E through H for any particular customer. Therefore, the company wants to perform sampling. The question is: What form—simple random sampling, systematic sampling, stratified sampling, cluster sampling, or even some type of multistage sampling—is most appropriate?
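As a starting point, three of the candidate designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame below is a made-up list of (customer ID, state) pairs standing in for the readily available columns, and the function names are hypothetical rather than part of any provided tool. The stratified version uses proportional allocation by state, which is one reasonable choice among several.

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw n rows without replacement; every subset of size n is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Take every k-th row after a random start, with k = len(frame) // n."""
    k = len(frame) // n
    start = rng.randrange(k)
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(frame, n, key, rng):
    """Proportional allocation: sample each stratum in proportion to its size.

    Rounding can make the total differ slightly from n when stratum
    sizes do not divide evenly.
    """
    strata = {}
    for row in frame:
        strata.setdefault(key(row), []).append(row)
    sample = []
    for rows in strata.values():
        size = round(n * len(rows) / len(frame))
        sample.extend(rng.sample(rows, min(size, len(rows))))
    return sample

# Hypothetical frame: 1,000 customers spread evenly over four states,
# already sorted by state as in the case description.
states = ["CA", "IL", "NY", "TX"]
frame = [(i + 1, states[i // 250]) for i in range(1000)]

rng = random.Random(1)
srs = simple_random_sample(frame, 100, rng)
sys_sample = systematic_sample(frame, 100, rng)
strat = stratified_sample(frame, 100, key=lambda row: row[1], rng=rng)
print(len(srs), len(sys_sample), len(strat))
```

Because the frame is sorted by state, the systematic sample automatically spreads itself across states, which is worth keeping in mind when comparing it with stratified sampling.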
Your job is to investigate the possibilities and to write a report on your findings. For any sampling method, any sample size, and any quantity of interest (such as average dollar amount spent annually), you should be concerned with sampling cost and accuracy.
One way to judge the latter is to generate several random samples from a particular method and calculate the mean and standard deviation of your point estimates from these samples. For example, you might generate 10 systematic samples, calculate the average amount spent (an X̄) for each sample, and then calculate the mean and standard deviation of these 10 X̄s. If your sampling method is accurate, the mean of the X̄s should be close to the population average, and the standard deviation should be small. By doing this for several sampling methods and possibly several sample sizes, you can experiment to see what is most cost-efficient for the company. You can make any reasonable assumptions about the cost of sampling with any particular method.
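The repeated-sampling check described above might be set up as follows. This is a minimal sketch, not a prescribed solution: the population is synthetic (gamma-distributed amounts generated here as a stand-in for the dollar-amount column), and the names srs_mean, systematic_mean, and evaluate are hypothetical helpers. With the real file, the population list would be read from the spreadsheet instead.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic stand-in for annual amount spent; NOT the real DVD Movies data.
population = [round(rng.gammavariate(2.0, 50.0), 2) for _ in range(10_000)]
pop_mean = statistics.mean(population)

def srs_mean(pop, n, rng):
    """Sample mean from one simple random sample of size n."""
    return statistics.mean(rng.sample(pop, n))

def systematic_mean(pop, n, rng):
    """Sample mean from one systematic sample: random start, then every k-th row."""
    k = len(pop) // n
    start = rng.randrange(k)
    return statistics.mean(pop[start::k][:n])

def evaluate(estimator, pop, n, reps, rng):
    """Repeat the sampling reps times; return mean and std dev of the estimates."""
    estimates = [estimator(pop, n, rng) for _ in range(reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)

print(f"population mean: {pop_mean:.2f}")
for name, estimator in [("simple random", srs_mean), ("systematic", systematic_mean)]:
    for n in (50, 200):
        mean_est, sd_est = evaluate(estimator, population, n, 10, rng)
        print(f"{name:>13}, n={n:>3}: mean of estimates={mean_est:.2f}, sd={sd_est:.2f}")
```

A method whose estimates average close to the population mean with a small standard deviation is accurate in the sense described above; attaching an assumed cost per sampled customer to each method and sample size then lets you weigh accuracy against cost.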