# For R, partition the data sets into 60% training and 40% validation and implement the 10-fold cross-validation.

## Question:

For R, partition the data sets into 60% training and 40% validation and implement 10-fold cross-validation. Use the statement `set.seed(1)` to specify the random seed for data partitioning and cross-validation. When searching for the optimal value of *k*, search within possible *k* values from 1 to 10. If the predictor variable values are in the character format, then treat the predictor variable as a categorical variable. Otherwise, treat the predictor variable as a numerical variable.

Online retailers often use a recommendation system to suggest new products to consumers. Consumers are compared to others with similar characteristics such as past purchases, age, income, and education level. A data set, such as the one shown in the accompanying table, is often used as part of a product recommendation system in the retail industry. The variables used in the system include whether or not the consumer eventually purchases the suggested item (Purchase = 1 if purchased, 0 otherwise), the consumer’s age (Age, in years), income (Income, in $1,000s), and the number of similar items previously purchased (PastPurchase).

| Purchase | Age | Income | PastPurchase |
|---|---|---|---|
| 1 | 48 | 99 | 21 |
| 1 | 47 | 32 | 0 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 0 | 34 | 110 | 2 |

**a-1.** Perform KNN analysis on the Retail_Data worksheet to determine the optimal *k*. Use 0.5 as the cutoff value for this analysis. Enter the optimal *k* in the box below:
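The setup described in the question (60/40 partition, `set.seed(1)`, 10-fold cross-validation, *k* from 1 to 10) can be sketched with the `caret` package as below. This is only an illustration: the data frame is a synthetic stand-in for the Retail_Data worksheet, the rule that generates `Purchase` is made up, and standardizing the predictors before partitioning is an assumption that the question does not state.

```r
library(caret)

# Synthetic stand-in for Retail_Data (all values are hypothetical)
set.seed(1)
n <- 200
myData <- data.frame(
  Age          = round(runif(n, 18, 70)),
  Income       = round(runif(n, 20, 150)),
  PastPurchase = rpois(n, 5)
)
# Made-up rule so that the target carries some signal
score <- myData$Income + 5 * myData$PastPurchase + rnorm(n, 0, 30)
myData$Purchase <- factor(ifelse(score > 110, "1", "0"), levels = c("0", "1"))

# Assumption: standardize numerical predictors, since KNN is distance-based
predictors <- c("Age", "Income", "PastPurchase")
myData[, predictors] <- scale(myData[, predictors])

# 60% training / 40% validation partition
set.seed(1)
myIndex <- createDataPartition(myData$Purchase, p = 0.6, list = FALSE)
trainSet <- myData[myIndex, ]
validationSet <- myData[-myIndex, ]

# 10-fold cross-validation over k = 1, ..., 10
myCtrl <- trainControl(method = "cv", number = 10)
myGrid <- expand.grid(k = 1:10)
set.seed(1)
KNN_fit <- train(Purchase ~ ., data = trainSet, method = "knn",
                 trControl = myCtrl, tuneGrid = myGrid)
best_k <- KNN_fit$bestTune$k  # k with the highest cross-validated accuracy

# Predicted class-1 probabilities on the validation set, cutoff 0.5
KNN_prob <- predict(KNN_fit, newdata = validationSet, type = "prob")
KNN_class <- factor(ifelse(KNN_prob[, 2] > 0.5, "1", "0"), levels = c("0", "1"))
table(Predicted = KNN_class, Actual = validationSet$Purchase)
```

With the real worksheet, `myData` would instead be read from the file, and new records (such as those in Retail_Score) would need the same preprocessing before being passed to `predict()`.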

**a-2.** Score the records of 8 new consumers in the Retail_Score worksheet, using 0.5 as the cutoff value. Enter the predicted values for the first new consumers below:

**b.** What is the misclassification rate at the optimal *k* for the training data set? Enter the misclassification rate in the box below: **(Report the misclassification rate in percentage. Round your answer to 2 decimal places.)**

**c-1.** Report the accuracy, specificity, sensitivity, and precision rates (in proportions) for the validation data set. **(Round your answers to 2 decimal places.)**
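For reference, the four rates (and the misclassification rate) follow directly from the cells of a confusion matrix. The sketch below uses hypothetical counts, not results from Retail_Data, with class 1 (purchased) treated as the positive class.

```r
# Hypothetical confusion-matrix counts (positive class = 1, purchased)
TP <- 40  # actual 1, predicted 1
FN <- 10  # actual 1, predicted 0
FP <- 15  # actual 0, predicted 1
TN <- 35  # actual 0, predicted 0

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # share of all cases classified correctly
sensitivity <- TP / (TP + FN)                   # share of actual 1s classified as 1
specificity <- TN / (TN + FP)                   # share of actual 0s classified as 0
precision   <- TP / (TP + FP)                   # share of predicted 1s that are actual 1s
misclassification <- 1 - accuracy               # error rate

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        misclassification = misclassification), 2)
```

Note that the error rate and the accuracy always sum to 1, which is a useful check when comparing statements about the same model.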

**c-2.** Which of the following statements is least accurate?

**Multiple choice**

A. The error rate of the KNN model is 30%.

B. The KNN model is able to correctly classify 74% of the customers who purchased.

C. The KNN model is able to correctly classify 67% of the customers who did not purchase.

D. Overall, the KNN model is able to correctly classify 72% of the customers.

**d.** Obtain the decile-wise chart. What is the lift of the leftmost bar of the decile-wise chart? **(Round your answer to 2 decimal places.)**
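The lift of the leftmost bar is the proportion of actual class-1 cases among the 10% of records with the highest predicted probabilities, divided by the overall proportion of class-1 cases. A minimal base-R sketch on 20 hypothetical records (not the Retail_Data results):

```r
# Hypothetical predicted class-1 probabilities and actual outcomes
prob   <- (19:0) / 20
actual <- c(1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
            0, 0, 1, 0, 0, 0, 0, 1, 0, 0)

# Sort records by predicted probability, highest first
ord <- order(prob, decreasing = TRUE)

# Top decile: the first 10% of the sorted records
top_n <- ceiling(0.1 * length(prob))
top_decile <- ord[seq_len(top_n)]

# Lift = response rate in the top decile / overall response rate
lift_first_decile <- mean(actual[top_decile]) / mean(actual)
lift_first_decile
```

A lift above 1 means the top decile contains a higher share of purchasers than a random selection of the same size would. Packages such as `gains` report the same quantity for every decile.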

**e.** Obtain the ROC curve. What is the AUC value of the ROC curve? **(Round your answer to 4 decimal places.)**
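The AUC equals the probability that a randomly chosen class-1 case receives a higher predicted probability than a randomly chosen class-0 case, so it can be computed from ranks without drawing the curve. The sketch below uses eight hypothetical records; in practice a package such as `pROC` (`roc()` followed by `auc()`) is commonly used to plot the curve and report the value.

```r
# Hypothetical predicted class-1 probabilities and actual outcomes
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
actual <- c(1,   1,   0,   1,   0,    1,   0,   0)

n1 <- sum(actual == 1)  # number of class-1 cases
n0 <- sum(actual == 0)  # number of class-0 cases

# Rank-based (Mann-Whitney) formula; ties receive average ranks
r <- rank(prob)
auc_value <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc_value
```

A random classifier has an AUC of 0.5, so values above 0.5 indicate that the model ranks purchasers ahead of non-purchasers more often than chance.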

**f.** Which of the following statements is the least accurate?

**Multiple choice**

A. Using 0.5 as the cutoff value, the KNN classifier has higher accuracy than the naïve rule (classifying all cases into the predominant class).

B. The cumulative lift chart shows that the KNN classifier performs better than the baseline model (random classifier).

C. The ROC curve shows that the KNN classifier performs better than the baseline model in terms of sensitivity but not in terms of specificity.

D. The leftmost bar of the decile-wise chart suggests a lift of more than 1 for the top 10% of the test cases with the highest predicted target class probabilities.
