
Using the Auto data set and the scikit-learn library:
2. Create and add a binary variable column called mpg_high_low to the dataset that is set to High if mpg is above 30 and to Low if mpg is less than or equal to 30. Make sure the mpg_high_low column is of type category.
3. Check if the auto data is imbalanced with respect to mpg_high_low. Report the percentage of the data that belong to the two classes (High and Low).
4. Split the dataset into 75% training and 25% test sets, and use 10-fold cross-validation for the models below.
5. Fit a logistic regression model to the training set to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Predict the mpg_high_low using the test dataset and report the Accuracy, Precision, Recall, Specificity, and F1 measure.
6. Alter the threshold for classifying a Low to 0.6 and report the changes in the test performance metrics from those reported in Question 5.
7. Find the optimal threshold by drawing the ROC curve. Change the threshold to the optimal value you found from the ROC curve and report the changes in the test performance metrics from those reported in Question 5.
8. Fit a Naive Bayes model to the training data to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Predict the mpg_high_low using the test dataset. Plot the ROC curve and report the best threshold on the ROC curve plot. Report the AUC on the curve plot as well. Report the accuracy, precision, recall, specificity and F1 score.
9. Fit a KNN model to the training data to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Use a grid search between 3 and 10 to find the best value of k. Report the accuracy, precision, recall, specificity, F1 score and AUC.
10. Fit an LDA model to the training data to predict mpg_high_low using all the other features/variables except mpg, year, origin, and name. Report the accuracy, precision, recall, specificity and F1 score.
11. Summarize the performance of all the above models by creating a dataframe with 6 columns: Model_Name, Accuracy, Precision, Recall, Specificity, F1 Score. The dataframe should contain one row for each model you built above, with each of the columns filled in with the appropriate metric. Print out the dataframe. Which model performed best from an accuracy point of view, and which model performed best from a recall point of view, without adjusting the threshold?
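
A minimal sketch of steps 2 to 6 follows. It is not a worked answer: the Auto file is not shown in the question, so the data frame below is a synthetic stand-in with the same column names, and the numbers it prints mean nothing for the real data. Treating High as the positive class, scaling the predictors, stratifying the split, and reading "threshold for classifying a Low" as "predict Low when P(Low) >= 0.6" are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in for the Auto data; load the real file instead.
rng = np.random.default_rng(0)
n = 400
weight = rng.uniform(1600, 5000, n)
auto = pd.DataFrame({
    "mpg": 46 - 0.007 * weight + rng.normal(0, 3, n),
    "cylinders": rng.choice([4, 6, 8], n),
    "displacement": weight / 12 + rng.normal(0, 20, n),
    "horsepower": weight / 30 + rng.normal(0, 10, n),
    "weight": weight,
    "acceleration": rng.normal(15.5, 2.5, n),
    "year": rng.integers(70, 83, n),
    "origin": rng.choice([1, 2, 3], n),
    "name": "car",
})

# Step 2: binary column of dtype category (High if mpg > 30, else Low).
auto["mpg_high_low"] = pd.Categorical(
    np.where(auto["mpg"] > 30, "High", "Low"), categories=["Low", "High"]
)

# Step 3: class balance as percentages.
class_pct = auto["mpg_high_low"].value_counts(normalize=True) * 100
print(class_pct)

# Step 4: 75/25 split on all predictors except mpg, year, origin, name.
X = auto.drop(columns=["mpg", "year", "origin", "name", "mpg_high_low"])
y = (auto["mpg_high_low"] == "High").astype(int)  # ASSUMPTION: 1 = High
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 5: logistic regression with 10-fold cross-validation on the training set.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_acc = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
model.fit(X_train, y_train)


def report(y_true, y_pred):
    """Accuracy, precision, recall, specificity and F1 from the confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "Specificity": specificity,
        "F1 Score": f1,
    }


p_high = model.predict_proba(X_test)[:, 1]
default_metrics = report(y_test, (p_high >= 0.5).astype(int))

# Step 6: predict Low (0) only when P(Low) >= 0.6, otherwise High (1).
p_low = 1 - p_high
shifted_metrics = report(y_test, np.where(p_low >= 0.6, 0, 1))

print("10-fold CV accuracy:", cv_acc.mean())
print(pd.DataFrame({"threshold 0.5": default_metrics, "Low at 0.6": shifted_metrics}))
```

Specificity is computed by hand from the confusion matrix because scikit-learn has no dedicated specificity scorer; it equals the recall of the negative class.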

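Steps 7 to 11 could be approached along the lines below. Again this is only a sketch under stated assumptions: `make_classification` generates placeholder data in place of the Auto predictors, the "optimal" ROC threshold is taken to be the one maximising Youden's J (TPR minus FPR), which is one common choice among several, and Gaussian Naive Bayes is assumed because the predictors are numeric. Plotting is left out; the `fpr` and `tpr` arrays can be passed to any plotting library.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: placeholder data standing in for the five Auto predictors;
# class 1 plays the role of High (roughly 20% of rows).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=1,
    n_clusters_per_class=1, class_sep=1.5, weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def metrics_row(name, y_true, y_pred):
    """One summary row: accuracy, precision, recall, specificity, F1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Model_Name": name,
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "Specificity": tn / (tn + fp) if tn + fp else 0.0,
        "F1 Score": f1,
    }


# Steps 7-8: ROC curve, AUC and a Youden's J threshold for Naive Bayes.
nb = GaussianNB().fit(X_train, y_train)
p_nb = nb.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, p_nb)
best_threshold = thresholds[np.argmax(tpr - fpr)]
auc_nb = roc_auc_score(y_test, p_nb)
print("Naive Bayes AUC:", auc_nb, "threshold by Youden's J:", best_threshold)

# Step 9: KNN with a 10-fold grid search over k = 3..10.
knn = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    {"kneighborsclassifier__n_neighbors": list(range(3, 11))},
    cv=10, scoring="accuracy",
).fit(X_train, y_train)
best_k = knn.best_params_["kneighborsclassifier__n_neighbors"]
print("Best k:", best_k)

# Step 10 (plus logistic regression so the summary covers every model).
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logit.fit(X_train, y_train)

# Step 11: one row per model at the default 0.5 threshold.
summary = pd.DataFrame([
    metrics_row("Logistic Regression", y_test, logit.predict(X_test)),
    metrics_row("Naive Bayes", y_test, nb.predict(X_test)),
    metrics_row(f"KNN (k={best_k})", y_test, knn.predict(X_test)),
    metrics_row("LDA", y_test, lda.predict(X_test)),
])
print(summary)
```

Sorting `summary` by the Accuracy or Recall column then answers the closing comparison directly for whatever data is actually used.
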