Question: Q1: For this question, use the OJ data set from the ISLR package (you can install this package from the CRAN repository directly from R).


Q1: For this question, use the OJ data set from the ISLR package (you can install this package from the CRAN repository directly from R). This represents data from sales of two brands of orange juice. Each of the n = 1070 observations represents a single sales transaction. We will make use of the variables:

Purchase: Purchased brand was either Citrus Hill (CH) or Minute Maid (MM).
StoreID: ID of store at which purchase was made (StoreID = 1, 2, 3, 4, 7).
LoyalCH: Customer brand loyalty score for CH on a scale of 0 to 1.

The objective is to determine whether or not customer loyalty differs significantly between stores.

(a) Construct side-by-side boxplots of LoyalCH using Purchase as the group variable. Interpret what you see. Use a Wilcoxon rank sum test to determine if there is a significant difference in the median of LoyalCH score between purchase groups.

(b) Construct side-by-side boxplots of LoyalCH using StoreID as the group variable. Fit an ANOVA model using LoyalCH as response and StoreID as the treatment variable. Is there evidence that mean loyalty score varies by store? (At this point, you need not consider any transformation of the response variable.)

(c) Using Tukey's pairwise procedure, what can be said about the rankings of the mean loyalty scores, using a family-wise error rate of $\alpha_{FWE} = 0.05$? You can use the function TukeyHSD.

(d) Noting that LoyalCH is constrained to be between 0 and 1, it is important to assess whether or not the distributional properties of the responses permit the reporting of accurate observed significance levels. Construct a normal quantile plot of the residuals from your ANOVA fit. Then apply the empirical rule to assess the normality of the residuals. What do you conclude? Why should this be done using the model residuals instead of the response variable directly?

(e) We can use simulation methods to judge the accuracy of the observed significance levels. Suppose the true mean value of LoyalCH does not vary by store. Then a response from any store can be modeled as $y = \mu + \epsilon$, where $\epsilon$ is a zero mean error term. The distribution of $\epsilon$ can then be estimated using the residuals. We can do this using a bootstrap procedure (Section 10.2 of lecture notes, Section 5.2 of ISLR). Suppose y is the response vector of length n, and x is the factor variable identifying the store. We have already fit the model y ~ x. Suppose we then let y.boot be a random sample of size n (with replacement) of the residuals from some fitted model. This is equivalent to simulating a sample from $y = \mu + \epsilon$, where $\mu = 0$. This suffices for our purpose, since the actual value of $\mu$ will play no role in the procedure (and so can be zero with no loss of generality). If we then fit the ANOVA model y.boot ~ x, the null hypothesis of equal treatment means will hold, therefore the P-value of the F-test should possess a uniform distribution on (0, 1). Of course, this depends on the correctness of the distributional assumptions (that $\epsilon$ is normally distributed), and so provides a means of assessing whether or not those hold (or at least that any deviation from normality does not significantly affect the accuracy of the reported level of significance). To carry out the procedure, simulate M bootstrap samples, capturing the P-value from the F-test for each one. If the distributional assumptions required for the F-test hold, then the replicated P-value distribution should be approximately uniform.

(i) Suppose X is a continuous random variable with CDF $F(x) = P(X \le x)$. Verify that $F(X)$ and $1 - F(X)$ have a uniform distribution on (0, 1). How does this verify the claim made above that the P-value is uniformly distributed under the null hypothesis?

(ii) Carry out this bootstrap procedure for the ANOVA model fit in Part (b). Use M = 100,000. Draw a histogram of the replicated P-values, using the nclass = 25 option. Report the proportion of the replicated P-values, say $\hat{\alpha}$, below $\alpha$ = 0.001, 0.01, 0.05, 0.1. In addition, for each value of $\alpha$, report $Z = (\hat{\alpha} - \alpha)/SE$, where $SE = \sqrt{\alpha(1 - \alpha)/M}$ is the standard error of $\hat{\alpha}$. Interpret your results.

(iii) What is the standard error of $\hat{\alpha}$ for $\alpha = 0.1$ and the M used in (ii)? What does this tell you about the overall accuracy of the bootstrap procedure?

(f) We will next carry out an experiment to assess the sensitivity of the bootstrap method. Consider the transformation $y' = 1/(1 - y)$, and apply it to the responses of store StoreID = 7. Repeat the bootstrap procedure just described, except that the replicated response vector y.boot will be constructed by sampling n responses with replacement from the transformed responses $y'$. Note that although we only sample responses from store StoreID = 7, responses for all original stores are being simulated. Would the observed significance levels using data with this distribution be accurate?

(g) Repeat Part (f), but apply a Box-Cox transformation to the replicated samples (see Section 10.8.3 of the CSC/DSC 462 lecture notes in the COURSE MATERIALS folder on the Blackboard course website). You will need to load the MASS package. Under this transformation, would the observed significance levels be accurate?
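
The uniformity claim in part (e) and the Z-score check on the replicated P-values can be illustrated without the OJ data. The assignment itself is meant to be done in R; what follows is only a minimal Python sketch using the standard library, and it does not run the ANOVA bootstrap. As an assumption made purely for illustration, X is taken to be exponential so that F is available in closed form, and the function names null_p_values and level_check are hypothetical. The sketch draws M values of 1 - F(X), then for each level computes the proportion of P-values below it, its standard error sqrt(alpha(1 - alpha)/M), and the corresponding Z.

```python
import math
import random


def null_p_values(m, rate=1.0, seed=0):
    """Simulate m P-values of the form 1 - F(X) with X ~ Exponential(rate).

    The exponential distribution is an illustrative assumption; any
    continuous distribution with a known CDF would serve the same purpose.
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(m):
        x = rng.expovariate(rate)
        cdf = 1.0 - math.exp(-rate * x)  # F(x) for the exponential
        p_values.append(1.0 - cdf)       # upper-tail P-value
    return p_values


def level_check(p_values, alphas=(0.001, 0.01, 0.05, 0.1)):
    """Return (alpha, alpha_hat, SE, Z) for each nominal level alpha."""
    m = len(p_values)
    rows = []
    for alpha in alphas:
        alpha_hat = sum(p < alpha for p in p_values) / m  # observed proportion
        se = math.sqrt(alpha * (1.0 - alpha) / m)         # SE under uniformity
        rows.append((alpha, alpha_hat, se, (alpha_hat - alpha) / se))
    return rows


if __name__ == "__main__":
    for alpha, alpha_hat, se, z in level_check(null_p_values(100_000)):
        print(f"alpha={alpha:<6} alpha_hat={alpha_hat:.5f} SE={se:.6f} Z={z:+.2f}")
```

In the actual exercise, the list passed to level_check would instead hold the M replicated F-test P-values from the bootstrap. Values of Z far outside roughly plus or minus 2 would suggest that the reported significance levels are not accurate at that level.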
