Question: Assignment 5
Use Apache Spark with Scala, NOT Java.
Please download the Covtype data set from http://bit.ly/1KiJRfg.
First, split the data set into 80% training and 20% testing. Then run a clustering analysis on the 80% training set. Find the optimal K (the number of clusters) by calculating the average entropy of the clustering result for each candidate K value.
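The split and the search for K could be sketched roughly as follows. This is a minimal sketch, not a reference solution: it assumes the RDD-based MLlib API, a local file named covtype.data (a hypothetical path), candidate K values of 10 to 50, natural-log entropy, and an average weighted by cluster size. The object and function names (ChooseK, entropy, parse, averageEntropy) are made up for illustration. The raw features are not normalized here, which may distort the distances K-means relies on.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object ChooseK {
  // Shannon entropy (natural log) of a collection of per-label counts.
  def entropy(counts: Iterable[Long]): Double = {
    val positive = counts.filter(_ > 0)
    val total = positive.sum.toDouble
    positive.map { c => val p = c / total; -p * math.log(p) }.sum
  }

  // Parse one covtype CSV row: 54 numeric features followed by the cover type (1 to 7).
  def parse(line: String): (Double, Vector) = {
    val values = line.split(',').map(_.toDouble)
    (values.last, Vectors.dense(values.init))
  }

  // Average entropy over clusters, each cluster weighted by its share of the points.
  def averageEntropy(data: RDD[(Double, Vector)], k: Int): Double = {
    // maxIterations and seed are arbitrary choices for this sketch.
    val model = new KMeans().setK(k).setMaxIterations(20).setSeed(42L).run(data.values)
    val labelCountsPerCluster = data
      .map { case (label, features) => ((model.predict(features), label), 1L) }
      .reduceByKey(_ + _)
      .map { case ((cluster, _), count) => (cluster, count) }
      .groupByKey()
      .values
      .collect()
    val n = labelCountsPerCluster.map(_.sum).sum.toDouble
    labelCountsPerCluster.map(c => c.sum / n * entropy(c)).sum
  }

  def main(args: Array[String]): Unit = {
    // local[*] is assumed for a single-machine run.
    val sc = new SparkContext(new SparkConf().setAppName("ChooseK").setMaster("local[*]"))
    val parsed = sc.textFile("covtype.data").map(parse) // hypothetical path
    // 80% training, 20% testing; the test part is held out for the later steps.
    val Array(train, test) = parsed.randomSplit(Array(0.8, 0.2), seed = 42L)
    train.cache()
    val scores = (10 to 50 by 10).map(k => (k, averageEntropy(train, k)))
    scores.foreach { case (k, e) => println(f"k=$k%d averageEntropy=$e%.4f") }
    println(s"lowest average entropy at k=${scores.minBy(_._2)._1}")
    sc.stop()
  }
}
```

Average entropy tends to keep falling as K grows, so it may be more informative to look for the K where the curve levels off than to take the minimum blindly.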
Then, for EACH generated cluster, build a random forest model.
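One possible way to train a forest per cluster is sketched below. It assumes a KMeansModel has already been fitted on the training set and that the labels have been shifted to start at 0 (covtype's 1 to 7 becoming 0 to 6), which MLlib's classifier expects. The object name ForestPerCluster and the hyperparameters (20 trees, depth 10, 32 bins, seed 42) are illustrative assumptions, not values given by the assignment.

```scala
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

object ForestPerCluster {
  // Train one random forest per non-empty cluster; the result is keyed by cluster index.
  // Labels must already be 0-based (0 until numClasses).
  def train(training: RDD[LabeledPoint],
            kmeans: KMeansModel,
            numClasses: Int = 7): Map[Int, RandomForestModel] = {
    // Tag every training point with the index of its nearest cluster center.
    val byCluster = training.map(p => (kmeans.predict(p.features), p)).cache()
    val clusters = byCluster.keys.distinct().collect().sorted

    val forests = clusters.map { c =>
      val members = byCluster.filter(_._1 == c).values
      val forest = RandomForest.trainClassifier(
        members,
        numClasses,
        Map[Int, Int](), // treat every feature as continuous (an assumption)
        20,              // numTrees
        "auto",          // featureSubsetStrategy
        "gini",          // impurity
        10,              // maxDepth
        32,              // maxBins
        42)              // seed
      c -> forest
    }.toMap

    byCluster.unpersist()
    forests
  }
}
```

Clusters that receive no training points get no forest in this sketch, so a lookup for such a cluster needs separate handling.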
Then, for each instance in the testing data set, first determine whether it is an outlier/anomaly by calculating its distance to each cluster center. If it is an outlier/anomaly, mark it as such; otherwise, determine which cluster it belongs to and use the random forest built for that cluster to classify it.
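The assignment does not say how far from a center an instance must be to count as an outlier, so the sketch below makes an assumption: a test point is an outlier when its distance to the nearest center exceeds the largest distance seen for that cluster in the training data. Other rules (a percentile, or a multiple of the mean distance) would fit the same structure. The object and function names are hypothetical, and the forests map is assumed to hold one model per cluster index.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

object OutlierThenClassify {
  // Index of the nearest center and the Euclidean distance to it.
  def nearest(centers: Array[Vector], point: Vector): (Int, Double) =
    centers.zipWithIndex
      .map { case (center, i) => (i, math.sqrt(Vectors.sqdist(center, point))) }
      .minBy(_._2)

  // Assumed threshold rule: per cluster, the largest training distance to its center.
  // Clusters with no training points get a threshold of 0.0.
  def maxTrainingDistances(centers: Array[Vector], training: RDD[Vector]): Array[Double] = {
    val maxByCluster = training
      .map(p => nearest(centers, p))
      .reduceByKey((a, b) => math.max(a, b))
      .collectAsMap()
    Array.tabulate(centers.length)(i => maxByCluster.getOrElse(i, 0.0))
  }

  // Some(cluster) for a normal point, None when the point is marked as an outlier.
  def assign(centers: Array[Vector], thresholds: Array[Double], point: Vector): Option[Int] = {
    val (cluster, distance) = nearest(centers, point)
    if (distance > thresholds(cluster)) None else Some(cluster)
  }

  // None marks an outlier; otherwise the label predicted by the forest of the point's cluster.
  // Assumes forests contains an entry for every cluster index.
  def classify(centers: Array[Vector],
               thresholds: Array[Double],
               forests: Map[Int, RandomForestModel],
               point: Vector): Option[Double] =
    assign(centers, thresholds, point).map(c => forests(c).predict(point))
}
```

Applied to the 20% test set, classify could be mapped over each feature vector, counting the None results as outliers and comparing the remaining predictions with the true labels.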
Copy and paste the code for each of the above tasks, along with screenshots of your outputs, into your report.
