Question: Assignment 5

Use Apache Spark with Scala, NOT Java.

Please download the Covtype data set from http://bit.ly/1KiJRfg.

First, split the data set into 80% training and 20% testing. Then conduct clustering analysis on the 80% training data set. You need to find the optimal K (the number of clusters) by calculating the average entropy of the clustering result for each candidate K value.
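The entropy criterion above can be sketched in plain Scala. This is only an illustration under assumptions: the assignment does not define "average entropy", so the sketch uses the cluster-size-weighted mean of the Shannon entropy of the class labels inside each cluster. The names `ClusterEntropy`, `entropy`, `averageEntropy`, and `bestK` are hypothetical. In a Spark job the `(cluster, label)` pairs would presumably come from mapping each training point to the cluster index predicted by the fitted K-means model together with its cover-type label.

```scala
object ClusterEntropy {
  /** Shannon entropy (natural log) of a collection of class counts. */
  def entropy(counts: Iterable[Long]): Double = {
    val positive = counts.filter(_ > 0).toSeq // zero counts contribute nothing
    val total = positive.sum.toDouble
    positive.map { c =>
      val p = c / total
      -p * math.log(p)
    }.sum
  }

  /** Size-weighted average entropy over all clusters.
    * Each pair is (cluster index, class label); this weighting is an assumption.
    */
  def averageEntropy(pairs: Seq[(Int, Int)]): Double = {
    require(pairs.nonEmpty, "need at least one (cluster, label) pair")
    val clusters = pairs.groupBy(_._1).values.toSeq
    val weighted = clusters.map { members =>
      val labelCounts = members.groupBy(_._2).values.toSeq.map(_.size.toLong)
      entropy(labelCounts) * members.size
    }
    weighted.sum / pairs.size
  }

  /** Returns the K whose clustering has the lowest average entropy. */
  def bestK(scoresByK: Map[Int, Double]): Int = scoresByK.minBy(_._2)._1
}
```

Lower values mean purer clusters. Taking the plain minimum is a simplification: average entropy usually keeps falling as K grows, so it may be more reasonable to look for the K after which the improvement levels off.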

Then, for EACH generated cluster, build a random forest model.
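One possible shape for this step, using Spark's RDD-based MLlib API, is sketched below. Everything specific here is an assumption rather than part of the assignment: the file path `covtype.data`, the label being the last CSV column with values 1 to 7, the placeholder `k = 5` (it should be the K chosen by the entropy analysis), and the forest hyperparameters. The object name `PerClusterForests` is hypothetical.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.sql.SparkSession

object PerClusterForests {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CovtypePerClusterForests").getOrCreate()
    val sc = spark.sparkContext

    // Assumed layout: comma-separated numeric columns, cover type (1..7) last.
    val data = sc.textFile("covtype.data").map { line =>
      val values = line.split(',').map(_.toDouble)
      // MLlib classifiers expect labels starting at 0, so shift 1..7 to 0..6.
      LabeledPoint(values.last - 1, Vectors.dense(values.init))
    }

    // 80% training, 20% testing; the fixed seed keeps the split reproducible.
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    train.cache()

    val k = 5 // placeholder: use the K selected by the entropy analysis
    val kmeans = new KMeans().setK(k).setSeed(42L).run(train.map(_.features))

    // Train one forest per non-empty cluster, keyed by cluster index.
    val forests: Map[Int, RandomForestModel] = (0 until k).flatMap { c =>
      val members = train.filter(p => kmeans.predict(p.features) == c).cache()
      if (members.isEmpty()) None
      else {
        val model = RandomForest.trainClassifier(
          members,
          7,              // numClasses
          Map[Int, Int](), // categoricalFeaturesInfo: treat all features as numeric
          20,             // numTrees (assumed)
          "auto",         // featureSubsetStrategy
          "gini",         // impurity
          10,             // maxDepth (assumed)
          32,             // maxBins
          42)             // seed
        members.unpersist()
        Some(c -> model)
      }
    }.toMap

    forests.foreach { case (c, m) => println(s"cluster $c: ${m.numTrees} trees") }
    spark.stop()
  }
}
```

The sketch clusters the raw feature values. Because the Covtype columns have very different scales, standardizing the features before running K-means may give more meaningful clusters.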

Then, for each instance in the testing data set, first identify whether the instance is an outlier/anomaly by calculating its distance to each cluster center. If it is an outlier/anomaly, mark it as such; otherwise, determine which cluster it belongs to and use the random forest built for that cluster to classify it.
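The outlier test and cluster assignment can be sketched in plain Scala as follows. The assignment does not say how far from a center an instance must be to count as an outlier, so the per-cluster `thresholds` array is an assumption; one option would be a high percentile of the training distances within each cluster. The names `OutlierCheck`, `distance`, and `assign` are hypothetical, and the centers are taken to be the cluster centers produced by the K-means model, converted to arrays.

```scala
object OutlierCheck {
  /** Euclidean distance between two points of equal dimension. */
  def distance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "points must have the same dimension")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  /** Returns None when the point is an outlier, otherwise Some(index of the nearest cluster).
    * thresholds(i) is the assumed maximum allowed distance to the center of cluster i.
    */
  def assign(point: Array[Double],
             centers: Array[Array[Double]],
             thresholds: Array[Double]): Option[Int] = {
    require(centers.nonEmpty && centers.length == thresholds.length)
    // Distance to every center, then keep the closest one.
    val (nearest, dist) =
      centers.indices.map(i => (i, distance(point, centers(i)))).minBy(_._2)
    if (dist > thresholds(nearest)) None else Some(nearest)
  }
}
```

A result of `None` marks the test instance as an outlier. For `Some(c)`, the instance's features would be passed to the random forest trained for cluster `c`.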

Copy and paste the code to achieve each of the above tasks and the screenshots of your outputs in your report.

