MGMT 490/590: Predictive Analytics, Fall 2016
Instructor: Matthew A. Lanham
Homework #3 [100 possible points - every question worth 5 points]
Due: Thursday, October 27th, 2016 by midnight

Late submissions for any reason will be subject to a 15% late penalty per day past due. No Blackboard excuses.

Submission instructions: Submissions should be typed and turned in to Blackboard. No emails, please.

In this homework you follow step-by-step instructions similar to the previous homework. Although I provide the R code to perform the analysis, you will have to type it yourself, which will help you learn. As you go along you will need to answer questions about the output and show your output. Toward the end of the assignment, you will have to perform some analysis completely on your own by modifying code I have already provided you. I spent some time creating this assignment because I wanted to give you something that will really help you do PA in practice. You have a good feel for E-miner, and this assignment should get you right on par using RStudio.

This problem is similar to the chapter 4 lab on the Smarket data beginning on page 154. However, we'll be using the Weekly data set, which is also part of the ISLR package. This data set consists of percentage returns for the S&P 500 stock index over 1,089 weeks, from the beginning of 1990 until the end of 2010. For each week, we have recorded the percentage returns for each of the five previous trading weeks, Lag1 through Lag5. We have also recorded Volume (the number of shares traded). This data is similar to the Smarket data but is recorded at the week level rather than by day.

As you work through the assignment you might ask yourself why I didn't use all the years up to the latest year as my training set and the last year as my test set. I did it this way because I wanted you all to use the same partitioning we have covered in class. Best of luck, and don't wait until the day before the assignment is due to complete it.

[Steps 1-11: the step-by-step R code was provided as screenshots in the original handout and is not reproduced here.]

1) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns (Step 2)?

2) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

3) Make a plot of Volume by Year and another plot of Today by Year (Step 3). Describe what you see.

4) Report the percentage of 'Up' in both the training and test sets (Step 5). Does it appear that the target class distributions in the training and test sets are similar?

5) Train a logistic regression model (i.e. logit) using the lag and volume variables (Step 6). Report which feature parameter coefficients are statistically significant at the 10% alpha level.

6) Score and assess the overall accuracy of the logit model (Step 7). Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression. Also, what is our overall accuracy?

7) Using your output from Step 7, show which values in the confusion matrix are used to calculate the Sensitivity and Specificity statistics. Explain what these probabilities mean in terms of the problem.

8) Using your output from Step 7, is our logit model better at classifying 'Up's or 'Down's correctly? Explain.
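The questions above reference Steps 1 through 11, whose R code is not reproduced in this copy of the handout. Purely as a hedged sketch of what the workflow behind Steps 1-7 might look like (not the instructor's actual code): the split proportion, the seed, and the object names train, test, logitFit, logitClasses, and logitProbs are all assumptions.

library(ISLR)      # Weekly data set
library(caret)     # partitioning, train(), confusionMatrix()

data(Weekly)
summary(Weekly)    # numerical summaries for question 1
pairs(Weekly)      # graphical summaries for question 1

# Partition into training and test sets (the 70/30 split and seed are
# placeholders; use whatever partitioning scheme the course covered in class)
set.seed(2016)
inTrain <- createDataPartition(Weekly$Direction, p = 0.70, list = FALSE)
train <- Weekly[inTrain, ]
test  <- Weekly[-inTrain, ]

# Target class distribution in each partition (question 4)
prop.table(table(train$Direction))
prop.table(table(test$Direction))

# Logistic regression on the lag and volume variables (question 5)
set.seed(2016)
logitFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                  data = train, method = "glm")
summary(logitFit)  # coefficient table for significance at the 10% level

# Score the test set and build the confusion matrix (questions 6-8)
logitClasses <- predict(logitFit, newdata = test)
logitProbs   <- predict(logitFit, newdata = test, type = "prob")[, "Up"]
confusionMatrix(data = logitClasses, reference = test$Direction)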
9) Many times overall accuracy is not an ideal performance measure to evaluate your model. Generating a Receiver Operating Characteristic (ROC) curve and measuring the area under the curve (AUC) provides a better idea of how good your model really is, both in isolation and compared to other models. Using Step 8, calculate an ROC curve.

10) A binary classification model that is poor and was not really able to learn will have an AUC around 0.50 (i.e. the 45-degree line in an ROC curve plot). A perfect classifier will have an AUC of 1. Calculate the AUC (Step 9) for our model and explain whether you think this model is a good or poor classifier.

11) In addition to looking at overall accuracy, sensitivity, specificity, and AUC, which provide an idea of how well your model classifies the response, sometimes you might be more interested in the probabilities generated. A probability calibration plot can be used in such an instance. The x-axis groups your probabilities into bins of equal width; the y-axis shows the proportion of target class observations calculated for each bin. Ideally, we would like to see the target class proportions be close to the midpoint probability of each bin, so that the points follow a 45-degree line. Generate a probability calibration plot (Step 10) and explain whether you believe the probabilities are properly calibrated.

12) Like any plot, you should avoid making it misleading to others. When no probabilities exist in a range, that range will show up as zero, and it can be unclear whether the proportion is really 0% or whether there were simply no observations, so I would recommend plotting only those ranges where probability values actually exist. Also, based on the size of your test set, having few observations should discourage you from using many bins (e.g. 20). A conservative number of bins is 10 (e.g. 0-9.99, 10-19.99, 20-29.99, ..., 90-100). You should try different sized bins because one bin size might lead you to a different conclusion than another (Step 11). Based on the three plots, do you think the probabilities are properly calibrated?

13) Using set.seed(2016), train an LDA model and show your code.

14) Using set.seed(2016), train a QDA model and show your code.

15) Using set.seed(2016), train a KNN model and show your code. Include tuneLength=20 as an argument in the train() function.

16) Calculate a confusion matrix for each model and show your code.

17) Calculate an ROC curve for each model and display them in one plot. Below is some code you might tweak to get that to work. Show your plot.

# plot ROC curves
par(mfrow=c(1,1))  # reset graphics parameter to 1 plot
plot(rocCurve, legacy.axes=T, col="red"
     , main="Receiver Operating Characteristic (ROC) Curve")
lines(ldaCurve, col="blue")
lines(qdaCurve, col="orange")
lines(knnCurve, col="light green")
legend("bottomright", inset=0, title="Model", border="white", bty="n", cex=.8
       , legend=c("Logit","LDA","QDA","kNN")
       , fill=c("red","blue","orange","light green"))
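The ROC-plot code in question 17 above and the table code in question 18 below assume that fitted models, predicted classes (ldaFitClasses, qdaFitClasses, knnFitClasses), and ROC objects (rocCurve, ldaCurve, qdaCurve, knnCurve) already exist. A minimal sketch of how those might be produced with caret and pROC, using the same assumed train/test objects as in the earlier sketch (again, names and settings are assumptions, not the instructor's Step code):

library(pROC)

# ROC curve and AUC for the logit model (questions 9-10)
rocCurve <- roc(response = test$Direction, predictor = logitProbs)
auc(rocCurve)

# LDA, QDA, and kNN models (questions 13-15)
set.seed(2016)
ldaFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = train, method = "lda")
set.seed(2016)
qdaFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = train, method = "qda")
set.seed(2016)
knnFit <- train(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
                data = train, method = "knn", tuneLength = 20)

# Predicted classes and 'Up' probabilities on the test set (questions 16-17)
ldaFitClasses <- predict(ldaFit, newdata = test)
qdaFitClasses <- predict(qdaFit, newdata = test)
knnFitClasses <- predict(knnFit, newdata = test)
ldaProbs <- predict(ldaFit, newdata = test, type = "prob")[, "Up"]
qdaProbs <- predict(qdaFit, newdata = test, type = "prob")[, "Up"]
knnProbs <- predict(knnFit, newdata = test, type = "prob")[, "Up"]

# Confusion matrices (question 16)
confusionMatrix(data = ldaFitClasses, reference = test$Direction)
confusionMatrix(data = qdaFitClasses, reference = test$Direction)
confusionMatrix(data = knnFitClasses, reference = test$Direction)

# ROC objects used by the plotting code in question 17
ldaCurve <- roc(response = test$Direction, predictor = ldaProbs)
qdaCurve <- roc(response = test$Direction, predictor = qdaProbs)
knnCurve <- roc(response = test$Direction, predictor = knnProbs)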
models = c("Logit","LDA","QDA","kNN") stats = c("Accuracy","Sensitivity","Specificity","AUC") m1 = cbind(confusionMatrix(data=logitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=logitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=logitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(rocCurve)[1]) m2 = cbind(confusionMatrix(data=ldaFitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=ldaFitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=ldaFitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(ldaCurve)[1]) m3 = cbind(confusionMatrix(data=qdaFitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=qdaFitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=qdaFitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(qdaCurve)[1]) m4 = cbind(confusionMatrix(data=knnFitClasses, test$Direction)$overall["Accuracy"][[1]] ,confusionMatrix(data=knnFitClasses, test$Direction)$byClass["Sensitivity"][[1]] ,confusionMatrix(data=knnFitClasses, test$Direction)$byClass["Specificity"][[1]] ,auc(knnCurve)[1]) results <- data.frame(rbind(m1,m2,m3,m4)) row.names(results) <- models names(results) <- c(stats) results 19) Based on your table in 18) which model would you conclude to be the best amongst the four? 20) There are many different machine learning methodologies that the caret package can use here (https://topepo.github.io/caret/modelList.html). Try to find one that either leads to a better overall accuracy OR potentially a better AUC. Show your code and your results (Hint: I found one on my first try having \"Forest\" in its model name. I used a tuneLength=5 argument too). 6 ################################################################################ # Customized calibration plot of probabilities from a binary class model # # Matthew A. Lanham # updated: 03/14/2014 ################################################################################ probCalPlot <- function(Y, Yprobs, numBins=NULL){ library(caret) library(pROC) # determine a reasonable cut if (is.null(numBins)){ if (length(test$Direction) < 600){ numBins=10 labelSize=0.85 } else if (length(test$Direction) >= 600){ numBins=20 labelSize=0.75 } } else { numBins=numBins labelSize=0.75 } # generate statistics for calibration plot calCurve <- calibration(Y ~ Yprobs, cuts=numBins)$data calCurve <- data.frame(calCurve[calCurve$Count>0,]) # generate custom calibration plot mypar = par(bg="white") plot(y=calCurve$Percent/100, x=calCurve$midpoint/100 , xaxt="n", yaxt="n", main="Probability Calibration Plot" , xlab="Bin Midpoint", ylab="Observed Event %", xlim=c(0,1) , ylim=c(0,1), col="blue", bg="orange", pch=21, type="b", cex.lab=1.2) # add x labels axis(1, at=c(0,round(calCurve$midpoint/100,2),1) , labels=format(round(c(0.00,signif(calCurve$midpoint/100,2),1), 2) , nsmall=2) , tick=T, lwd=0.5, cex.axis=labelSize) # add y labels axis(2, at=c(0,round(calCurve$Percent/100,2),1) , labels=format(round(c(0.00,signif(calCurve$Percent/100,2),1), 2) , nsmall=2) , tick=T, lwd=0.5, cex.axis=labelSize) # add 45-degree target line abline(a=0, b=1, col="black") }