Supervised learning (Classification) and Cross validation using R
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset's objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Here, we will use k-nearest neighbours as the classification method, but you may work through this section using any classifier you wish. We have provided code for the initial processing of the data.
library(tidyverse)
library(class)
library(cvTools)
library(randomForest)

pima = read_csv("data/pima.csv")
glimpse(pima)

# convert the outcome to a factor and standardise the numeric predictors
pima_scaled = pima %>%
  mutate(y = factor(y)) %>%
  mutate_if(is.numeric, .funs = scale)

X = pima_scaled %>% select(-y) %>% scale()   # predictor matrix
y = pima_scaled %>% select(y) %>% pull()     # outcome vector
n = length(y)

1.1
(a) Perform k-nearest neighbours on the data with k = 5 and calculate an independent test-set accuracy (see the sketch after part (b)).
(b) Write a function that calculates the estimated accuracy from a vector of the true labels and a vector of the predicted labels.
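One possible approach to parts (a) and (b) is sketched below. It assumes a random 70/30 train/test split; the split proportion, the seed, and the helper name accuracy are arbitrary choices, not specified in the question.

# Part (b): a simple accuracy helper (hypothetical name `accuracy`)
accuracy = function(true, predicted) {
  mean(true == predicted)   # proportion of correct predictions
}

# Part (a): random 70/30 split, then 5-nearest-neighbour prediction
set.seed(1)
train_id = sample(1:n, size = round(0.7 * n))
knn_pred = knn(train = X[train_id, ],
               test  = X[-train_id, ],
               cl    = y[train_id],
               k     = 5)
accuracy(y[-train_id], knn_pred)   # independent test-set accuracy

Because the split is random, the test-set accuracy will change from run to run, which motivates the cross-validation in 1.2.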
1.2
Perform k-nearest neighbours on the data with k = 5, this time estimating the accuracy with your own cross-validation (CV) code. You are encouraged to write the CV loop yourself rather than use a package.
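A minimal sketch of a hand-written CV loop follows, assuming 10 folds and reusing the accuracy helper from 1.1(b); the number of folds and the random fold assignment are assumptions, not part of the question.

set.seed(1)
n_folds = 10
# randomly assign each observation to one of the 10 folds
fold_id = sample(rep(1:n_folds, length.out = n))

fold_acc = numeric(n_folds)
for (j in 1:n_folds) {
  test_idx = which(fold_id == j)
  pred = knn(train = X[-test_idx, ],
             test  = X[test_idx, ],
             cl    = y[-test_idx],
             k     = 5)
  fold_acc[j] = accuracy(y[test_idx], pred)
}
mean(fold_acc)   # CV estimate of the accuracy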
1.3
What happens if we repeat the CV calculation 50 times? In practice we don't expect identical CV estimates, but how different are they? To answer this question, put an additional for loop around your CV loop from 1.2 in order to repeat the CV procedure 50 times, and visualise your results.
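One way to sketch this, assuming the same 10-fold CV loop and accuracy helper as in 1.2; the histogram at the end is just one possible visualisation.

set.seed(1)
n_repeats = 50
n_folds = 10
cv_estimates = numeric(n_repeats)

for (r in 1:n_repeats) {
  # new random fold assignment for every repeat
  fold_id = sample(rep(1:n_folds, length.out = n))
  fold_acc = numeric(n_folds)
  for (j in 1:n_folds) {
    test_idx = which(fold_id == j)
    pred = knn(train = X[-test_idx, ], test = X[test_idx, ],
               cl = y[-test_idx], k = 5)
    fold_acc[j] = accuracy(y[test_idx], pred)
  }
  cv_estimates[r] = mean(fold_acc)
}

# visualise the spread of the 50 CV estimates
ggplot(tibble(cv_accuracy = cv_estimates), aes(x = cv_accuracy)) +
  geom_histogram(bins = 15) +
  labs(x = "10-fold CV accuracy estimate", y = "count")

The spread of these 50 estimates gives a sense of how much the CV estimate varies due to the random fold assignment alone.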
