MGMT 490/590: Predictive Analytics, Fall 2016
Instructor: Matthew A. Lanham
Homework #3 [100 possible points - every question worth 5 points]
Due: Thursday, October 27th, 2016 by midnight

Late submissions for any reason will be subject to a 15% late penalty per day past due. No Blackboard excuses.

Submission instructions: Submissions should be typed and turned in to Blackboard. No emails, please.

In this homework you follow step-by-step instructions similar to the previous homework. Although I provide the R code to perform the analysis, you will have to type it yourself, which will help you learn. As you go along you will need to answer questions about the output and show your output. Toward the end of the assignment, you will have to perform some analysis completely on your own by modifying code I have already provided you. I spent some time creating this assignment because I wanted to give you something that will really help you do PA in practice. You have a good feel for E-miner, and this assignment should get you right on par using RStudio.

This problem is similar to the chapter 4 lab on the Smarket data beginning on page 154. However, we'll be using the Weekly data set, which is also part of the ISLR package. This data set consists of percentage returns for the S&P 500 stock index over 1,089 weeks, from the beginning of 1990 until the end of 2010. For each week, we have recorded the percentage returns for each of the five previous trading weeks, Lag1 through Lag5. We have also recorded Volume (the number of shares traded). This data is similar to the Smarket data but is recorded at the week level rather than by day.

As you work through the assignment you might ask yourself why I didn't use all the years up to the latest year as my training set and the last year as my test set. I did it this way because I wanted you all to use the same partitioning we have covered in class. Best of luck, and don't wait until the day before the assignment is due to complete it.

[Steps 1-11: the step-by-step R code was provided as screenshots in the original handout and is not reproduced here.]

1) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns (Step 2)?

2) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

3) Make a plot of Volume by Year and another plot of Today by Year (Step 3). Describe what you see.

4) Report the percentage of 'Up' in both the training and test sets (Step 5). Does it appear that the target class distributions in the training and test sets are similar?

5) Train a logistic regression model (i.e. logit) using the lag and volume variables (Step 6). Report which feature parameter coefficients are statistically significant at the 10% alpha level.

6) Score and assess the overall accuracy of the logit model (Step 7). Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression. Also, what is our overall accuracy?

7) Using your output from Step 7, show which values in the confusion matrix are used to calculate the Sensitivity and Specificity statistics. Explain what these probabilities mean in terms of the problem.

8) Using your output from Step 7, is our logit model better at classifying 'Up's or 'Down's correctly? Explain.
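The questions above reference Steps 1 through 11, whose R code is not reproduced in this copy of the handout. Purely as a hedged sketch of what the workflow behind Steps 1-7 might look like (not the instructor's actual code): the split proportion, the seed, and the object names train, test, logitFit, logitClasses, and logitProbs are all assumptions.

library(ISLR)      # Weekly data set
library(caret)     # partitioning, train(), confusionMatrix()

data(Weekly)
summary(Weekly)    # numerical summaries for question 1
pairs(Weekly)      # graphical summaries for question 1

# Partition into training and test sets (the 70/30 split and seed are
# placeholders; use whatever partitioning scheme the course covered in class)
set.seed(2016)
inTrain <- createDataPartition(Weekly$Direction, p = 0.70, list = FALSE)
train <- Weekly[inTrain, ]
test  <- Weekly[-inTrain, ]

# Target class distribution in each partition (question 4)
prop.table(table(train$Direction))
prop.table(table(test$Direction))

# Logistic regression on the lag and volume variables (question 5)
set.seed(2016)
logitFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                  data = train, method = "glm")
summary(logitFit)  # coefficient table for significance at the 10% level

# Score the test set and build the confusion matrix (questions 6-8)
logitClasses <- predict(logitFit, newdata = test)
logitProbs   <- predict(logitFit, newdata = test, type = "prob")[, "Up"]
confusionMatrix(data = logitClasses, reference = test$Direction)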
9) Many times overall accuracy is not an ideal performance measure to evaluate your model. Generating a Receiver Operating Characteristic (ROC) curve and measuring the area under the curve (AUC) provides a better idea of how good your model really is, both in isolation and compared to other models. Using Step 8, calculate an ROC curve.

10) A binary classification model that is poor and was not really able to learn will have an AUC around 0.50 (i.e. the 45-degree line in an ROC curve plot). A perfect classifier will have an AUC of 1. Calculate the AUC (Step 9) for our model and explain whether you think this model is a good or poor classifier.

11) In addition to looking at overall accuracy, sensitivity, specificity, and AUC, which provide an idea of how well your model classifies the response, sometimes you might be more interested in the probabilities generated. A probability calibration plot can be used in such an instance. The x-axis groups your probabilities into bins of equal width; the y-axis shows the proportion of target class observations calculated for each bin. Ideally, we would like to see the target class proportions be close to the midpoint probability of each bin, so that the points follow a 45-degree line. Generate a probability calibration plot (Step 10) and explain whether you believe the probabilities are properly calibrated.

12) Like any plot, you should avoid making it misleading to others. When no probabilities exist in a range, that range will show up as zero, and it can be unclear whether the proportion is really 0% or whether there were simply no observations, so I would recommend plotting only those ranges where probability values actually exist. Also, based on the size of your test set, having few observations should discourage you from using many bins (e.g. 20). A conservative number of bins is 10 (e.g. 0-9.99, 10-19.99, 20-29.99, ..., 90-100). You should try different sized bins because one bin size might lead you to a different conclusion than another (Step 11). Based on the three plots, do you think the probabilities are properly calibrated?

13) Using set.seed(2016), train an LDA model and show your code.

14) Using set.seed(2016), train a QDA model and show your code.

15) Using set.seed(2016), train a KNN model and show your code. Include tuneLength=20 as an argument in the train() function.

16) Calculate a confusion matrix for each model and show your code.

17) Calculate an ROC curve for each model and display them in one plot. Below is some code you might tweak to get that to work. Show your plot.

# plot ROC curves
par(mfrow=c(1,1))  # reset graphics parameter to 1 plot
plot(rocCurve, legacy.axes=T, col="red"
     , main="Receiver Operating Characteristic (ROC) Curve")
lines(ldaCurve, col="blue")
lines(qdaCurve, col="orange")
lines(knnCurve, col="light green")
legend("bottomright", inset=0, title="Model", border="white", bty="n", cex=.8
       , legend=c("Logit","LDA","QDA","kNN")
       , fill=c("red","blue","orange","light green"))
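The ROC-plot code in question 17 above and the table code in question 18 below assume that fitted models, predicted classes (ldaFitClasses, qdaFitClasses, knnFitClasses), and ROC objects (rocCurve, ldaCurve, qdaCurve, knnCurve) already exist. A minimal sketch of how those might be produced with caret and pROC, using the same assumed train/test objects as in the earlier sketch (again, names and settings are assumptions, not the instructor's Step code):

library(pROC)

# ROC curve and AUC for the logit model (questions 9-10)
rocCurve <- roc(response = test$Direction, predictor = logitProbs)
auc(rocCurve)

# LDA, QDA, and kNN models (questions 13-15)
set.seed(2016)
ldaFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = train, method = "lda")
set.seed(2016)
qdaFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = train, method = "qda")
set.seed(2016)
knnFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = train, method = "knn", tuneLength = 20)

# Predicted classes and 'Up' probabilities on the test set (questions 16-17)
ldaFitClasses <- predict(ldaFit, newdata = test)
qdaFitClasses <- predict(qdaFit, newdata = test)
knnFitClasses <- predict(knnFit, newdata = test)
ldaProbs <- predict(ldaFit, newdata = test, type = "prob")[, "Up"]
qdaProbs <- predict(qdaFit, newdata = test, type = "prob")[, "Up"]
knnProbs <- predict(knnFit, newdata = test, type = "prob")[, "Up"]

# Confusion matrices (question 16)
confusionMatrix(data = ldaFitClasses, reference = test$Direction)
confusionMatrix(data = qdaFitClasses, reference = test$Direction)
confusionMatrix(data = knnFitClasses, reference = test$Direction)

# ROC objects used by the plotting code in question 17
ldaCurve <- roc(response = test$Direction, predictor = ldaProbs)
qdaCurve <- roc(response = test$Direction, predictor = qdaProbs)
knnCurve <- roc(response = test$Direction, predictor = knnProbs)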
models = c("Logit","LDA","QDA","kNN") stats = c("Accuracy","Sensitivity","Specificity","AUC") m1 = cbind(confusionMatrix(data=logitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=logitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=logitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(rocCurve)[1]) m2 = cbind(confusionMatrix(data=ldaFitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=ldaFitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=ldaFitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(ldaCurve)[1]) m3 = cbind(confusionMatrix(data=qdaFitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=qdaFitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=qdaFitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(qdaCurve)[1]) m4 = cbind(confusionMatrix(data=knnFitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=knnFitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=knnFitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(knnCurve)[1]) results <- data.frame(rbind(m1,m2,m3,m4)) row.names(results) <- models names(results) <- c(stats) results 19) Based on your table in 18) which model would you conclude to be the best amongst the four? 20) There are many different machine learning methodologies that the caret package can use here (https://topepo.github.io/caret/modelList.html). Try to find one that either leads to a better overall accuracy OR potentially a better AUC. Show your code and your results (Hint: I found one on my first try having \"Forest\" in its model name. I used a tuneLength=5 argument too). 6 ################################################################################ # Customized calibration plot of probabilities from a binary class model # # Matthew A. Lanham # updated: 03/14/2014 ################################################################################ probCalPlot <- function(Y, Yprobs, numBins=NULL){ library(caret) library(pROC) # determine a reasonable cut if (is.null(numBins)){ if (length(test$Direction) < 600){ numBins=10 labelSize=0.85 } else if (length(test$Direction) >= 600){ numBins=20 labelSize=0.75 } } else { numBins=numBins labelSize=0.75 } # generate statistics for calibration plot calCurve <- calibration(Y ~ Yprobs, cuts=numBins)$data calCurve <- data.frame(calCurve[calCurve$Count>0,]) # generate custom calibration plot mypar = par(bg="white") plot(y=calCurve$Percent/100, x=calCurve$midpoint/100 , xaxt="n", yaxt="n", main="Probability Calibration Plot" , xlab="Bin Midpoint", ylab="Observed Event %", xlim=c(0,1) , ylim=c(0,1), col="blue", bg="orange", pch=21, type="b", cex.lab=1.2) # add x labels axis(1, at=c(0,round(calCurve$midpoint/100,2),1) , labels=format(round(c(0.00,signif(calCurve$midpoint/100,2),1), 2) , nsmall=2) , tick=T, lwd=0.5, cex.axis=labelSize) # add y labels axis(2, at=c(0,round(calCurve$Percent/100,2),1) , labels=format(round(c(0.00,signif(calCurve$Percent/100,2),1), 2) , nsmall=2) , tick=T, lwd=0.5, cex.axis=labelSize) # add 45-degree target line abline(a=0, b=1, col="black") }