Question: 3 [50 pts] Dimensionality reduction For this question, you will use the Cloud Dataset from the UCI ML repository: https://archive.ics.uci.edu/ml/datasets/Cloud . Read about it to

3 [50 pts] Dimensionality reduction

For this question, you will use the Cloud Dataset from the UCI ML repository: https://archive.ics.uci.edu/ml/datasets/Cloud. Read about it to get familiar with what is measured. Within the data, there are two datasets: DB #1 and DB #2. For this homework, just use the 1024 vectors in DB #2. Use python for all your programming. You will have to submit your code in LastnameFirstnameCode.zip together with the relevant write-up in the main solution file LastnameFirstnameSolution.pdf.

(a) [5 pts] Load the data into a python program and center it. Note: there should be a function called center() in your code that achieves this.

(b) [5 pts] Compute the covariance matrix of the data . Hint: by using the definition of sample covariance, as a matrix product or as a sum of outer products. See book for details. Use numpy for linear algebra computations (https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.linalg.html). As a result you should have a function covar() in your code which does not use the built-in covariance functions.

(c) [5 pts] Compute the eigenvectors and eigenvalues of . The numpy linear algebra module referenced above has a function that can help.

(d) [10 pts] Determine the number of principal components (PCs) r that will ensure 90% retained variance. How did you compute this? Provide a function in your code that determines r based on an arbitrary percentage of retained variance.

(e) [10 pts] Plot the first two components in a figure with a horizontal axis (x) corresponding to the dimensions and a vertical axis (y) corresponding to the magnitude of the component in this dimension. There will be 2 traces with d points in this figure. Include the figure in your LastnameFirstnameSolution.pdf. Also save the top two components in a text file "Components.txt", with each component on a separated line and represented as d comma separated numbers (i.e. the file should have two lines with d numbers separated by commas). Include "Components.txt" in your LastnameFirstnameCode.zip.

(f) [10 pts] Compute the reduced dimension data matrix A with two dimensions by projection on the first two PCs. Plot the points using a scatter plot (two-dimensional diagram that places each sample i according to its new dimensions ai1; ai2). Discuss the observations. Are there clusters of close-by points? What is the retained variance for r = 2? Argue for or against whether these are sufficient dimensions.

(g) [5 pts] Study the PCA implementation in pythons' sklearn library https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA. Do PCA using the library on the same data. Do the eigenvalues approximately match what you computed above?

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!