Question:

Q1: For this question, use the OJ data set from the ISLR package (you can install this package from the CRAN repository directly from R). This represents data from sales of two brands of orange juice. Each of the $n = 1070$ observations represents a single sales transaction. We will make use of the variables:

Purchase: Purchased brand, either Citrus Hill (CH) or Minute Maid (MM).
StoreID: ID of the store at which the purchase was made (StoreID = 1, 2, 3, 4, 7).
LoyalCH: Customer brand loyalty score for CH on a scale of 0 to 1.

The objective is to determine whether or not customer loyalty differs significantly between stores.

(a) Construct side-by-side boxplots of LoyalCH using Purchase as the group variable. Interpret what you see. Use a Wilcoxon rank sum test to determine if there is a significant difference in the median LoyalCH score between purchase groups.

(b) Construct side-by-side boxplots of LoyalCH using StoreID as the group variable. Fit an ANOVA model using LoyalCH as the response and StoreID as the treatment variable. Is there evidence that the mean loyalty score varies by store? (At this point, you need not consider any transformation of the response variable.)

(c) Using Tukey's pairwise procedure, what can be said about the rankings of the mean loyalty scores, using a family-wise error rate of $\alpha_{FWE} = 0.05$? You can use the function TukeyHSD.

(d) Noting that LoyalCH is constrained to be between 0 and 1, it is important to assess whether or not the distributional properties of the responses permit the reporting of accurate observed significance levels. Construct a normal quantile plot of the residuals from your ANOVA fit. Then apply the empirical rule to assess the normality of the residuals. What do you conclude? Why should this be done using the model residuals instead of the response variable directly?

(e) We can use simulation methods to judge the accuracy of the observed significance levels. Suppose the true mean value of LoyalCH does not vary by store. Then a response from any store can be modeled as $y = \mu + \epsilon$, where $\epsilon$ is a zero-mean error term. The distribution of $\epsilon$ can then be estimated using the residuals. We can do this using a bootstrap procedure (Section 10.2 of lecture notes, Section 5.2 of ISLR). Suppose y is the response vector of length $n$, and x is the factor variable identifying the store. We have already fit the model y ~ x. Suppose we then let y.boot be a random sample of size $n$ (with replacement) of the residuals from some fitted model. This is equivalent to simulating a sample from $y = \mu + \epsilon$, where $\mu = 0$. This suffices for our purpose, since the actual value of $\mu$ will play no role in the procedure (and so can be zero with no loss of generality). If we then fit the ANOVA model y.boot ~ x, the null hypothesis of equal treatment means will hold, and therefore the P-value of the F-test should possess a uniform distribution on $[0, 1]$. Of course, this depends on the correctness of the distributional assumptions (that $\epsilon$ is normally distributed), and so provides a means of assessing whether or not those hold (or at least that any deviation from normality does not significantly affect the accuracy of the reported level of significance). To carry out the procedure, simulate $M$ bootstrap samples, capturing the P-value from the F-test for each one. If the distributional assumptions required for the F-test hold, then the replicated P-value distribution should be approximately uniform.

(i) Suppose $X$ is a continuous random variable with CDF $F(x) = P(X \le x)$. Verify that $F(X)$ and $1 - F(X)$ have a uniform distribution on $[0, 1]$. How does this verify the claim made above that the P-value is uniformly distributed under the null hypothesis?

(ii) Carry out this bootstrap procedure for the ANOVA model fit in Part (b). Use $M = 100{,}000$. Draw a histogram of the replicated P-values, using the nclass = 25 option. Report the proportion of the replicated P-values, say $\hat{\alpha}$, below $\alpha = 0.001, 0.01, 0.05, 0.1$. In addition, for each value of $\alpha$, report $Z = (\hat{\alpha} - \alpha)/SE$, where $SE = \sqrt{\alpha(1 - \alpha)/M}$ is the standard error of $\hat{\alpha}$. Interpret your results.

(iii) What is the standard error of $\hat{\alpha}$ for $\alpha = 0.1$ and $M$? What does this tell you about the overall accuracy of the bootstrap procedure?

(f) We will next carry out an experiment to assess the sensitivity of the bootstrap method. Consider the transformation $y' = 1/(1 - y)$, and apply it to the responses of store StoreID = 7. Repeat the bootstrap procedure just described, except that the replicated response vector y.boot will be constructed by sampling $n$ responses with replacement from the transformed responses $y'$. Note that although we only sample responses from store StoreID = 7, responses for all original stores are being simulated. Would the observed significance levels using data with this distribution be accurate?

(g) Repeat Part (f), but apply a Box-Cox transformation to the replicated samples (see Section 10.8.3 of the CSC/DSC 462 lecture notes in the COURSE MATERIALS folder on the Blackboard course website). You will need to load the MASS package. Under this transformation, would the observed significance levels be accurate?
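
The residual bootstrap in part (e) can be sketched in R roughly as follows. This is a minimal illustration, not a reference solution. It assumes the ISLR package is installed, and the helper names boot_pvalues and summarise_pvalues are hypothetical (they do not come from the assignment). M is set to 1000 only to keep the run short; the question asks for M = 100,000, which takes considerably longer with this straightforward lm-based loop.

```r
# Residual bootstrap for the one-way ANOVA F-test (sketch of part (e)).
# boot_pvalues() and summarise_pvalues() are hypothetical helper names.

boot_pvalues <- function(res, x, M = 1000) {
  n <- length(res)
  replicate(M, {
    # Resample residuals with replacement; equal treatment means hold by construction
    y.boot <- sample(res, size = n, replace = TRUE)
    # P-value of the F-test for the model y.boot ~ x
    anova(lm(y.boot ~ x))[["Pr(>F)"]][1]
  })
}

summarise_pvalues <- function(p, alpha = c(0.001, 0.01, 0.05, 0.1)) {
  M <- length(p)
  # Proportion of replicated P-values below each nominal level
  alpha.hat <- sapply(alpha, function(a) mean(p < a))
  # Standard error of alpha.hat if the P-values are truly uniform
  se <- sqrt(alpha * (1 - alpha) / M)
  data.frame(alpha = alpha, alpha.hat = alpha.hat, SE = se,
             Z = (alpha.hat - alpha) / se)
}

# Apply to the OJ data only if ISLR is available (assumption: it is installed)
if (requireNamespace("ISLR", quietly = TRUE)) {
  data("OJ", package = "ISLR")
  OJ$StoreID <- factor(OJ$StoreID)            # treat store as a factor
  fit <- aov(LoyalCH ~ StoreID, data = OJ)    # ANOVA model from part (b)
  set.seed(1)                                 # arbitrary seed for reproducibility
  p <- boot_pvalues(resid(fit), OJ$StoreID, M = 1000)
  hist(p, nclass = 25, main = "Bootstrap P-values", xlab = "P-value")
  print(summarise_pvalues(p))
}
```

Values of Z well outside roughly plus or minus 2 would suggest that the proportion of small P-values differs from the nominal level by more than simulation error alone explains. For parts (f) and (g), the same helpers could be reused by passing a different vector in place of resid(fit).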
