Question: You can use any software to plot and/or to calculate values/data, but if you do, provide (copy/paste) the code here.

Data sets relevant for this HW can be found at the UCI Machine Learning Repository, at http://archive.ics.uci.edu/ml

This project assignment will help you to understand and apply some of the recent concepts we learned in the class. Once you finish this assignment, you can apply the ideas here to your team project.

Download the Iris data set from http://archive.ics.uci.edu/ml/datasets/iris and calculate the following:

(a) (5 points) the average (mean) value for each of the four features

(b) (5 points) the standard deviation for each of the features

(c) (5 points) repeat steps (a) and (b) but separately for each type of flower

(d) (5 points) calculate the covariance matrix between the features, 4x4 matrix.
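A minimal sketch of parts (a) through (d) in Python with NumPy could look like the following. It assumes the downloaded file is the comma-separated `iris.data` file (four numeric columns followed by the class name); the function names `load_iris` and `summarize` are made up for illustration, and the sample standard deviation (`ddof=1`) is an assumption, since the assignment does not say which estimator to use.

```python
import csv
import numpy as np

def load_iris(path):
    """Read a CSV with four numeric columns followed by a class name."""
    X, y = [], []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) == 5:  # skip blank lines at the end of the file
                X.append([float(v) for v in row[:4]])
                y.append(row[4])
    return np.array(X), np.array(y)

def summarize(X, y):
    """Overall mean/std, per-class mean/std, and the feature covariance matrix."""
    stats = {
        "mean": X.mean(axis=0),              # (a) one mean per feature
        "std": X.std(axis=0, ddof=1),        # (b) sample std (assumption: ddof=1)
        "cov": np.cov(X, rowvar=False),      # (d) 4x4 when X has four columns
        "per_class": {},
    }
    for label in np.unique(y):               # (c) repeat (a) and (b) per flower type
        Xc = X[y == label]
        stats["per_class"][label] = (Xc.mean(axis=0), Xc.std(axis=0, ddof=1))
    return stats
```

With the file saved locally, `X, y = load_iris("iris.data")` followed by `summarize(X, y)` would return all four results in one dictionary.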

(d) (12 points) Draw 6 scatter plots, one for each pair of features: C(4,2) = 6.

Plot different classes with different markers and colors. Plot the class means in the plots with a thicker marker. Properly label your axes in all plots. Basically, you will be creating half of the scatter plots shown on slide 8 of the Chapter 6 slides (uploaded to the course shell).
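One possible way to produce the six scatter plots is sketched below with Matplotlib. It assumes `X` is an (n, 4) NumPy array with the features in the order sepal length, sepal width, petal length, petal width, and `y` is an array of class names; `plot_pairs` and the output file name are hypothetical.

```python
from itertools import combinations
import numpy as np

FEATURES = ["sepal length (cm)", "sepal width (cm)",
            "petal length (cm)", "petal width (cm)"]
PAIRS = list(combinations(range(4), 2))  # the C(4,2) = 6 feature pairs

def plot_pairs(X, y, out_path="iris_pairs.png"):
    """Draw one scatter plot per feature pair and save the figure."""
    import matplotlib.pyplot as plt  # imported here so PAIRS is usable without Matplotlib
    markers = ["o", "s", "^"]
    colors = ["tab:blue", "tab:orange", "tab:green"]
    fig, axes = plt.subplots(2, 3, figsize=(15, 9))
    for ax, (i, j) in zip(axes.ravel(), PAIRS):
        for k, label in enumerate(np.unique(y)):
            Xc = X[y == label]
            m, c = markers[k % 3], colors[k % 3]
            ax.scatter(Xc[:, i], Xc[:, j], marker=m, color=c, alpha=0.6, label=str(label))
            # Class mean: same marker and color, larger with a thick black edge.
            ax.scatter(Xc[:, i].mean(), Xc[:, j].mean(), marker=m, color=c,
                       s=250, edgecolors="black", linewidths=2)
        ax.set_xlabel(FEATURES[i])
        ax.set_ylabel(FEATURES[j])
    axes[0, 0].legend()
    fig.tight_layout()
    fig.savefig(out_path)
    return out_path
```

The 2-by-3 grid is only a layout choice; six separate figures would satisfy the task equally well.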

You can create your own functions for calculating the mean and standard deviation, but you may find it easier to use built-in functions from Matlab, Python, R, etc.

(e) (10 points) Using the Euclidean distance, find the two most similar (closest) setosa flowers, the two closest versicolor flowers, and the two closest virginica flowers. Provide your code, but the output can simply be three pairs of indices of different flowers (e.g. flowers 5 and 37, flowers ...).
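The closest-pair search can be sketched as a brute-force comparison of all rows within one class. The function name `closest_pair` is hypothetical, `Xc` is assumed to hold only the rows of a single class, and the returned indices are 0-based positions within that class, not row numbers in the full file.

```python
import numpy as np

def closest_pair(Xc):
    """Return (i, j, distance) for the two closest distinct rows of Xc."""
    diff = Xc[:, None, :] - Xc[None, :, :]        # all pairwise differences
    D = np.sqrt((diff ** 2).sum(axis=2))          # Euclidean distance matrix
    np.fill_diagonal(D, np.inf)                   # a flower is not its own neighbor
    i, j = np.unravel_index(np.argmin(D), D.shape)
    return int(i), int(j), float(D[i, j])
```

Calling `closest_pair(X[y == label])` once per class gives the three pairs. If a class contains identical rows, the reported distance is 0.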

(f) (10 points) Provide the same calculations as in (e), but use the Mahalanobis distance instead of the Euclidean distance.
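For the Mahalanobis version, the distance between two flowers x and z is sqrt((x - z)^T S^-1 (x - z)), where S is a covariance matrix. The assignment does not say which S to use, so the sketch below makes it a parameter and, as an assumption, falls back to the covariance of the rows passed in. The name `closest_pair_mahalanobis` is hypothetical.

```python
import numpy as np

def closest_pair_mahalanobis(Xc, cov=None):
    """Return (i, j, distance) for the closest pair under the Mahalanobis distance."""
    if cov is None:
        cov = np.cov(Xc, rowvar=False)            # assumption: covariance of these rows
    VI = np.linalg.pinv(cov)                      # pseudo-inverse tolerates a singular S
    diff = Xc[:, None, :] - Xc[None, :, :]
    # Quadratic form diff^T VI diff for every pair (i, j).
    D2 = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)
    D = np.sqrt(np.maximum(D2, 0.0))              # clip tiny negative rounding errors
    np.fill_diagonal(D, np.inf)
    i, j = np.unravel_index(np.argmin(D), D.shape)
    return int(i), int(j), float(D[i, j])
```

With `cov` set to the identity matrix this reduces to the Euclidean result, which is a convenient sanity check.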

(g) (10 points) Do a sequential forward search for the features which provide the highest classification accuracy, using a 25/25 split (per class), i.e. select a random 25 for training and use the remaining 25 for validation, and using the shortest Euclidean distance to the class mean as the classifier decision. Note that each student's split will be different, so each student will get slightly different answers. Since there are not many features, start with 1 feature and go all the way up to 4 features. Note the set of features which provides the highest accuracy.
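A sequential forward search with a nearest-class-mean classifier might be sketched as follows. All function names are hypothetical, the split is assumed to be 25 training rows per class, and ties between candidate features are broken arbitrarily (here in favor of the larger column index).

```python
import numpy as np

def split_per_class(y, n_train=25, seed=0):
    """Boolean mask that marks n_train randomly chosen rows of each class as training."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(len(y), dtype=bool)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        mask[idx[:n_train]] = True
    return mask

def nearest_mean_accuracy(X, y, train, feats):
    """Validation accuracy of the nearest-class-mean rule on the given feature columns."""
    feats = list(feats)
    labels = np.unique(y)
    means = np.array([X[train & (y == c)][:, feats].mean(axis=0) for c in labels])
    Xv = X[~train][:, feats]
    d = np.linalg.norm(Xv[:, None, :] - means[None, :, :], axis=2)
    return float((labels[d.argmin(axis=1)] == y[~train]).mean())

def forward_search(X, y, train):
    """Greedily add the feature that gives the best accuracy; return each step."""
    selected, remaining, history = [], list(range(X.shape[1])), []
    while remaining:
        acc, best = max((nearest_mean_accuracy(X, y, train, selected + [f]), f)
                        for f in remaining)
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), acc))
    return history
```

The returned history lists the selected feature set and its validation accuracy after each of the four steps, so the best set is simply the entry with the highest accuracy.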

(h) (10 points) Do an exhaustive search for the feature subset which provides the highest classification accuracy. Note that since we do not have many features (d=4), there are only 2^d = 16 such combinations, so this is computationally feasible. List the classification accuracy results for each of the 16 combinations in a neat table. See whether the best feature combination you found in (g) agrees with the best feature combination here.
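The exhaustive search can be sketched by enumerating feature subsets with `itertools.combinations`. Note that 2^4 = 16 counts the empty subset, which cannot be classified on, so this sketch evaluates the 15 non-empty subsets. The function names are hypothetical, and `train` is assumed to be a boolean mask marking the training rows.

```python
from itertools import combinations
import numpy as np

def nearest_mean_accuracy(X, y, train, feats):
    """Validation accuracy of the nearest-class-mean rule on the given feature columns."""
    feats = list(feats)
    labels = np.unique(y)
    means = np.array([X[train & (y == c)][:, feats].mean(axis=0) for c in labels])
    Xv = X[~train][:, feats]
    d = np.linalg.norm(Xv[:, None, :] - means[None, :, :], axis=2)
    return float((labels[d.argmin(axis=1)] == y[~train]).mean())

def exhaustive_search(X, y, train):
    """Score every non-empty feature subset, best accuracy first."""
    d = X.shape[1]
    results = [(feats, nearest_mean_accuracy(X, y, train, feats))
               for k in range(1, d + 1)
               for feats in combinations(range(d), k)]
    return sorted(results, key=lambda r: r[1], reverse=True)

def format_table(results):
    """Plain-text table with one row per feature subset."""
    lines = ["features      accuracy"]
    for feats, acc in results:
        lines.append(f"{str(feats):<14}{acc:.3f}")
    return "\n".join(lines)
```

Printing `format_table(exhaustive_search(X, y, train))` gives the table the task asks for; using the same `train` mask as in the forward search keeps the two results comparable.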

(i) (14 points) Do a PCA on the data. Find the eigenvectors and eigenvalues. Then perform classification using the Euclidean distance to the class mean based on the projection onto the eigenvector with the largest eigenvalue, then repeat the classification using the top two eigenvectors, and report the classification accuracy for both. Use a random 25/25 split into training and validation. Calculate the PoV (proportion of variance) for the largest 1, 2, and 3 eigenvalues. Plot the data projected onto the first eigenvector against its projection onto the second, marking the three classes with different markers and colors and the class means with a thicker marker.
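A minimal PCA sketch based on the eigendecomposition of the covariance matrix is given below. The function names are hypothetical, PoV is taken to mean the cumulative proportion of variance explained by the largest k eigenvalues, and whether to fit the PCA on all rows or only on the training half is left to the caller, since the assignment does not specify it.

```python
import numpy as np

def pca(X):
    """Return (mean, eigenvalues, eigenvectors), sorted by decreasing eigenvalue."""
    mu = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))  # eigh: symmetric input
    order = np.argsort(evals)[::-1]                         # eigh sorts ascending
    return mu, evals[order], evecs[:, order]

def proportion_of_variance(evals):
    """Cumulative PoV: entry k-1 is the share explained by the largest k eigenvalues."""
    return np.cumsum(evals) / evals.sum()

def project(X, mu, evecs, k):
    """Coordinates of the centered rows of X along the top k eigenvectors."""
    return (X - mu) @ evecs[:, :k]

def nearest_mean_predict(Z_train, y_train, Z_val):
    """Assign each validation row to the class whose training mean is closest."""
    labels = np.unique(y_train)
    means = np.array([Z_train[y_train == c].mean(axis=0) for c in labels])
    d = np.linalg.norm(Z_val[:, None, :] - means[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]
```

Projecting the training and validation rows with `k=1` and then `k=2` and passing the results to `nearest_mean_predict` gives the two accuracies requested; the two columns of the `k=2` projection are also the coordinates for the scatter plot.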
