You can use any software to plot and/or to calculate values/data, but if you do, provide (copy/paste) here the code.

Data sets relevant for this HW can be found at the UCI Machine Learning Repository, at http://archive.ics.uci.edu/ml

This project assignment will help you understand and apply some of the recent concepts we learned in class. Once you finish this assignment, you can apply the ideas here to your team project.

Download the Iris data set from http://archive.ics.uci.edu/ml/datasets/iris and calculate the following:

(a) (5 points) the average (mean) value for each of the four features

(b) (5 points) the standard deviation for each of the features

(c) (5 points) the covariance matrix between the features (a 4x4 matrix)
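As a starting point for the three statistics above, here is a minimal sketch in Python with NumPy. It assumes the downloaded file is the comma-separated `iris.data` from the repository (four numeric columns followed by the class name). The helper names `load_iris` and `summarize` are hypothetical, and the sample (n-1) normalization is an assumption, since the assignment does not say which normalization to use.

```python
import csv
import numpy as np

# Column order in the UCI file (all measurements in cm).
FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]

def load_iris(path):
    """Read a UCI-style iris.data file: four numeric columns, then the class name."""
    X, y = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) == 5:  # skip blank lines at the end of the file
                X.append([float(v) for v in row[:4]])
                y.append(row[4])
    return np.array(X), np.array(y)

def summarize(X):
    """Return per-feature mean, sample standard deviation, and covariance matrix."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)     # assumption: sample std (divide by n-1)
    cov = np.cov(X, rowvar=False)   # rows are samples; also divides by n-1
    return mean, std, cov
```

A call such as `mean, std, cov = summarize(load_iris("iris.data")[0])` would give two length-4 vectors and a 4x4 matrix. Pass `ddof=0` to both calls instead if the population formulas are wanted.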

(d) (12 points) draw 6 scatter plots, one for each pair of features: C(4,2) = 6.

Plot different classes with different markers and colors. Plot the class means in the plots with a thicker marker. Properly label your axes in all plots. Basically, you will be creating half of the scatter plots in slide 8 of the Chapter 6 slides (uploaded to the course shell).
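One possible way to lay out the six plots is sketched below, assuming Matplotlib is available and that `X` is a NumPy array of shape (150, 4) with a matching label array `y`. The function names, marker shapes, and colors are illustrative choices, not requirements of the assignment.

```python
from itertools import combinations
import numpy as np

FEATURES = ["sepal length", "sepal width", "petal length", "petal width"]

def feature_pairs(n_features=4):
    """All unordered pairs of feature indices: C(4,2) = 6 for Iris."""
    return list(combinations(range(n_features), 2))

def plot_pairs(X, y, names=FEATURES):
    """Draw one scatter plot per feature pair, with class means emphasized."""
    import matplotlib.pyplot as plt  # imported here so the pair helper works without it

    markers = ["o", "s", "^"]
    colors = ["tab:blue", "tab:orange", "tab:green"]
    fig, axes = plt.subplots(2, 3, figsize=(15, 9))
    for ax, (i, j) in zip(axes.ravel(), feature_pairs(X.shape[1])):
        for k, cls in enumerate(np.unique(y)):
            pts = X[y == cls]
            ax.scatter(pts[:, i], pts[:, j], s=25, alpha=0.6,
                       marker=markers[k % 3], color=colors[k % 3], label=str(cls))
            m = pts.mean(axis=0)
            # Class mean: same marker and color, larger, with a thick black outline.
            ax.scatter(m[i], m[j], s=220, marker=markers[k % 3],
                       facecolors=colors[k % 3], edgecolors="black", linewidths=2)
        ax.set_xlabel(f"{names[i]} (cm)")
        ax.set_ylabel(f"{names[j]} (cm)")
    axes[0, 0].legend()
    fig.tight_layout()
    return fig
```

Calling `plot_pairs(X, y).savefig("pairs.png")` would write the 2x3 grid to a file; the file name is only an example.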

You can create your own functions for calculating the mean and standard deviation, but you may find it easier to use built-in functions from Matlab, Python, R, etc.

(e) (10 points) Using the Euclidean distance, find the two most similar (closest) setosa flowers, the two closest versicolor flowers, and the two closest virginica flowers. Provide your code, but the output can simply be three pairs of indices of different flowers (e.g. flowers 5 and 37).
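A brute-force sketch of the closest-pair search follows. It assumes `X` and `y` are NumPy arrays and reports 0-based row indices into the full data set; the function names are hypothetical. With only 50 flowers per class, computing the full pairwise distance matrix is cheap.

```python
import numpy as np

def closest_pair(X):
    """Return (i, j, distance) for the two closest distinct rows of X."""
    diff = X[:, None, :] - X[None, :, :]      # all pairwise differences
    D = np.sqrt((diff ** 2).sum(axis=2))      # Euclidean distance matrix
    np.fill_diagonal(D, np.inf)               # ignore each point's distance to itself
    i, j = np.unravel_index(np.argmin(D), D.shape)
    return int(i), int(j), float(D[i, j])

def closest_pair_per_class(X, y):
    """Map each class label to its closest pair, using row indices of the full X."""
    out = {}
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        i, j, d = closest_pair(X[idx])
        out[str(cls)] = (int(idx[i]), int(idx[j]), d)
    return out
```

If a class contains two identical rows, the reported distance is 0, which is a valid answer rather than a bug. Add 1 to the indices if flowers are to be numbered from 1.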

(f) (10 points) Provide the same calculations as in (e), but instead of the Euclidean distance, use the Mahalanobis distance.
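The Mahalanobis distance between two flowers x and z is sqrt((x - z)^T S^-1 (x - z)), where S is a covariance matrix. The assignment does not say which S to use, so the sketch below assumes the covariance of the rows passed in (for example, one class at a time) unless another matrix is supplied. The function name is hypothetical.

```python
import numpy as np

def mahalanobis_closest_pair(X, cov=None):
    """Return (i, j, distance) for the closest pair under the Mahalanobis distance."""
    if cov is None:
        cov = np.cov(X, rowvar=False)          # assumption: covariance of these rows
    VI = np.linalg.inv(cov)                    # inverse covariance matrix
    diff = X[:, None, :] - X[None, :, :]
    # Quadratic form diff^T VI diff for every pair (i, j).
    D2 = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)
    D = np.sqrt(np.maximum(D2, 0.0))           # guard against tiny negative round-off
    np.fill_diagonal(D, np.inf)
    i, j = np.unravel_index(np.argmin(D), D.shape)
    return int(i), int(j), float(D[i, j])
```

With `cov` set to the identity matrix this reduces to the Euclidean case, which gives a quick sanity check. Because S rescales and decorrelates the features, the closest pair can differ from the Euclidean one.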

(g) (10 points) Do a sequential forward search for the features that provide the highest classification accuracy, using a 25/25 split per class, i.e. select 25 random samples for training and use the remaining 25 for validation, and using the shortest Euclidean distance to the class mean as the classifier decision. Note that each student's split will be different, so each student will get slightly different answers. Since there are not many features, start with 1 feature and go all the way up to 4 features. Note the set of features that provides the highest accuracy.
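The procedure in (g) can be sketched as follows, assuming NumPy arrays `X` and `y`. The function names and the fixed `seed` are illustrative; change the seed to get a different random split. At each step the search adds whichever remaining feature gives the best validation accuracy together with the features already chosen.

```python
import numpy as np

def split_per_class(y, n_train=25, seed=0):
    """Randomly pick n_train rows of each class for training; the rest validate."""
    rng = np.random.default_rng(seed)
    train, val = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:])
    return np.array(train), np.array(val)

def nearest_mean_accuracy(X, y, train, val, feats):
    """Validation accuracy of a nearest-class-mean classifier on the given features."""
    feats = list(feats)
    classes = np.unique(y[train])
    means = np.array([X[train][y[train] == c][:, feats].mean(axis=0) for c in classes])
    d = np.linalg.norm(X[val][:, feats][:, None, :] - means[None, :, :], axis=2)
    pred = classes[d.argmin(axis=1)]
    return float((pred == y[val]).mean())

def forward_search(X, y, train, val):
    """Greedily add one feature at a time; return [(feature tuple, accuracy), ...]."""
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {f: nearest_mean_accuracy(X, y, train, val, selected + [f])
                  for f in remaining}
        best = max(scores, key=scores.get)   # ties go to the lowest feature index
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), scores[best]))
    return history
```

The returned history has one entry per subset size, so the best-scoring entry is the feature set to report. Because the search is greedy, it is not guaranteed to find the overall best subset.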

(h) (10 points) Do an exhaustive search for selecting features which provide the highest classification accuracy. Note that since we do not have many features (d=4), there are only 2^d=16 such combinations and this is computationally possible. List the classification accuracy results for each of the 16 combinations in a neat table. See whether the best feature combination you found in (g) agrees with the best feature combination here.
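A sketch of the exhaustive search is given below, assuming the training and validation arrays have already been split. The function names are hypothetical. Note that one of the 2^d = 16 subsets is the empty set, which cannot classify anything, so this sketch evaluates only the 15 non-empty subsets; how to report the empty set in the table is left as a choice.

```python
from itertools import combinations
import numpy as np

def nearest_mean_accuracy(Xtr, ytr, Xva, yva):
    """Validation accuracy of a nearest-class-mean (Euclidean) classifier."""
    classes = np.unique(ytr)
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(Xva[:, None, :] - means[None, :, :], axis=2)
    return float((classes[d.argmin(axis=1)] == yva).mean())

def exhaustive_search(Xtr, ytr, Xva, yva):
    """Score every non-empty feature subset; return results sorted best first."""
    n = Xtr.shape[1]
    results = []
    for k in range(1, n + 1):
        for feats in combinations(range(n), k):
            cols = list(feats)
            acc = nearest_mean_accuracy(Xtr[:, cols], ytr, Xva[:, cols], yva)
            results.append((feats, acc))
    return sorted(results, key=lambda r: r[1], reverse=True)

def format_table(results, names=None):
    """Render the results as a plain-text table with one row per subset."""
    lines = [f"{'features':<50}accuracy"]
    for feats, acc in results:
        label = ", ".join(names[i] if names else str(i) for i in feats)
        lines.append(f"{label:<50}{acc:.3f}")
    return "\n".join(lines)
```

Printing `format_table(exhaustive_search(Xtr, ytr, Xva, yva))` gives the table. Reusing the same split as in the forward search makes the two results directly comparable.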

(i) (14 points) Do a PCA on the data. Find the eigenvectors and eigenvalues. Then perform classification using the Euclidean distance to the class mean based on the eigenvector with the largest eigenvalue, repeat the classification using the two eigenvectors with the largest eigenvalues, and report the classification accuracy results. Do a random 25/25 split into training and validation. Calculate the proportion of variance (PoV) for the largest 1, 2, and 3 eigenvalues. Plot the projection onto the first eigenvector against the projection onto the second eigenvector; mark the three classes with different markers and colors, and plot the class means with a thicker marker.
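The PCA step can be sketched with an eigendecomposition of the covariance matrix, as below. The function names are hypothetical. Whether to fit the PCA on all 150 flowers or only on the training half is not stated in the assignment, so that choice is left to the caller. PoV for the k largest eigenvalues is taken here as their sum divided by the sum of all eigenvalues.

```python
import numpy as np

def pca(X):
    """Return (mean, eigenvalues, eigenvectors), sorted by decreasing eigenvalue."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]        # reorder so the largest comes first
    return mean, eigvals[order], eigvecs[:, order]

def proportion_of_variance(eigvals):
    """PoV for k = 1..d: cumulative sum of eigenvalues over their total."""
    return np.cumsum(eigvals) / eigvals.sum()

def project(X, mean, eigvecs, k):
    """Project centered data onto the k leading eigenvectors (columns of eigvecs)."""
    return (X - mean) @ eigvecs[:, :k]
```

For example, `Z = project(X, mean, eigvecs, 2)` gives two columns that can be plotted against each other, and the same nearest-class-mean rule can be applied to `Z[:, :1]` and then to `Z` for the two classification runs. The sign of each eigenvector is arbitrary, so a plot may appear mirrored between runs or libraries.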
