Question:

A consumer advocacy agency, Equitable Ernest, is interested in providing a service in which an individual can estimate their own credit score (a continuous measure used by banks, insurance companies, and other businesses when granting loans, quoting premiums, and issuing credit). The file CreditScore contains data on an individual's credit score and other variables.
Partition the data into training (50 percent), validation (30 percent), and test (20 percent) sets. Predict the individuals' credit scores using k-nearest neighbors with up to k = 20. Use CreditScore as the output variable and all the other variables as input variables. In Step 2 of XLMiner's k-Nearest Neighbors Prediction procedure, be sure to Normalize input data and to Score on best k between 1 and specified value. Generate a Detailed Scoring for all three sets of data.
a. For k = 1, why is the root mean squared error greater than zero on the training set? Why would we expect the root mean squared error to be zero for k = 1 on the training set?
b. What value of k minimizes the root mean squared error (RMSE) on the validation data?
c. How does the RMSE on the test set compare to the RMSE on the validation set?
d. What is the average error on the test set? Analyze the output in the KNNP_TestScore1 worksheet, paying particular attention to the observations that had the largest overprediction (large negative residuals) and the largest underprediction (large positive residuals). Explain what may be contributing to the inaccurate predictions and suggest possible ways to improve the k-NN approach.
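
For readers working outside XLMiner, the workflow described above (50/30/20 partition, normalized inputs, k-NN prediction for k = 1 to 20, best k chosen by validation RMSE) can be sketched in plain Python. This is only an illustrative sketch, not XLMiner's implementation: the data are synthetic stand-ins for the CreditScore file, the input names `income` and `age` are hypothetical, and z-score normalization with Euclidean distance is an assumption about what "Normalize input data" does.

```python
import math
import random

def partition(rows, fractions=(0.5, 0.3, 0.2), seed=0):
    """Shuffle rows, then split into training, validation, and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_train = round(fractions[0] * len(rows))
    n_valid = round(fractions[1] * len(rows))
    return rows[:n_train], rows[n_train:n_train + n_valid], rows[n_train + n_valid:]

def split_xy(part):
    """Separate (inputs, output) pairs into a list of inputs and a list of outputs."""
    return [x for x, _ in part], [y for _, y in part]

def fit_scaler(X):
    """Column means and standard deviations, computed on the training inputs only."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def normalize(X, means, stds):
    """Apply z-score normalization (assumed here; XLMiner's exact method may differ)."""
    return [[(v - m) / s for v, m, s in zip(x, means, stds)] for x in X]

def knn_predict(train_X, train_y, x, k):
    """Average the outputs of the k training rows closest to x (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x))[:k]
    return sum(y for _, y in nearest) / k

def rmse(train_X, train_y, X, y, k):
    """Root mean squared error of k-NN predictions for the rows in X."""
    errors = [actual - knn_predict(train_X, train_y, x, k) for x, actual in zip(X, y)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Synthetic stand-in data: NOT the CreditScore file, and the inputs are hypothetical.
rng = random.Random(1)
rows = []
for _ in range(200):
    income = rng.uniform(20, 150)
    age = rng.uniform(18, 80)
    score = 500 + 1.5 * income + 0.8 * age + rng.gauss(0, 15)
    rows.append(([income, age], score))

train, valid, test = partition(rows)
train_X, train_y = split_xy(train)
valid_X, valid_y = split_xy(valid)
test_X, test_y = split_xy(test)

means, stds = fit_scaler(train_X)
train_Xn = normalize(train_X, means, stds)
valid_Xn = normalize(valid_X, means, stds)
test_Xn = normalize(test_X, means, stds)

# Score every k from 1 to 20 on the validation set and keep the best one.
valid_rmse = {k: rmse(train_Xn, train_y, valid_Xn, valid_y, k) for k in range(1, 21)}
best_k = min(valid_rmse, key=valid_rmse.get)

test_rmse = rmse(train_Xn, train_y, test_Xn, test_y, best_k)
train_rmse_k1 = rmse(train_Xn, train_y, train_Xn, train_y, 1)

print("best k on validation:", best_k)
print("validation RMSE at best k:", valid_rmse[best_k])
print("test RMSE at best k:", test_rmse)
print("training RMSE at k = 1:", train_rmse_k1)
```

In this sketch every training row has distinct inputs, so with k = 1 each row is its own nearest neighbor and the training RMSE is zero. If two training rows shared identical inputs but had different outputs, at least one of them would be predicted with the other's value, and the k = 1 training RMSE would then be greater than zero.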