Question: Big Data Analysis, Academic Year 2023-2024, Excel Application 2 - PCA

Big Data Analysis
Academic Year 2023-2024
Excel Application 2- PCA
In 2010, New York Magazine published an article on the most livable neighborhoods in New York. It evaluated 50 neighborhoods on 12 variables, each a broad category:
Affordability (as measured on a price-per-square-foot basis, for both renters and buyers)
Transit and proximity (commute times to lower Manhattan and midtown, the density of subway coverage)
Shopping and services (the number of neighborhood amenities, especially supermarkets)
Safety (as measured by violent- and nonviolent-crime rates)
Food and restaurants (judged by density and quality of options)
Public schools (test scores and parent satisfaction)
Diversity (in terms of both race and income)
Creative capital (arts venues as well as the number of residents engaged in the arts)
Housing quality (historic districts, code violations, cockroaches)
Green space (park and waterfront access, street trees)
Wellness (noise, air quality, overall cleanliness)
Bars and nightlife
Goal: The goal of this exercise is to understand how these different variables are related, and to better understand the similarities between all observations by making use of Principal Component Analysis.
Required files/data (see K2):
The Excel macro: "pca_macro.xlsm"
The Excel data-file: "NY_Neighborhoods.xlsx"
Course slides
Step 1: Load data (NY Neighborhoods) into PCA macro (pca_macro.xlsm).
Open the "NY_Neighborhoods.xlsx" dataset and the "pca_macro.xlsm" file.
Copy the complete New York Neighborhoods dataset (labels + values) into the "données" tab in the PCA macro (start pasting in cell B3).
If you have done this correctly, the "nb lignes" variable should be equal to 50 and the "nb colonnes" variable should be equal to 12.
From here on, we continue working in the PCA macro.
Step 2: Interpreting the correlation matrix.
Go to the "var_vp" tab in the pca_macro.xlsm file.
The first matrix displayed in the "var_vp" tab is the correlation matrix. This matrix presents the correlation coefficients between the different variables. Each cell in the table shows the correlation between two variables.
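As a rough illustration of what this matrix contains, the sketch below computes a correlation matrix in Python with NumPy. It is not the macro's own code. The array X is random placeholder data with the same shape as the exercise (50 rows, 12 columns), not the real neighborhood scores.

```python
import numpy as np

# Placeholder data: random numbers standing in for the 50 x 12 table of
# neighborhood scores. These are NOT the values from NY_Neighborhoods.xlsx.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))

# Standardize each column: subtract its mean, divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Correlation matrix: averaged products of the standardized columns.
R = Z.T @ Z / (X.shape[0] - 1)

# Cross-check against NumPy's built-in routine (rowvar=False: columns are variables).
assert np.allclose(R, np.corrcoef(X, rowvar=False))

print(R.shape)               # one row and one column per variable
print(np.allclose(R, R.T))   # the matrix is symmetric
```

Each entry R[i, j] is the correlation between variable i and variable j, so the matrix has one row and one column per variable.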
Question 2.1: Why do we only find "1" values on the diagonal of this matrix? (.../1)
Question 2.2: What are the three largest (positive or negative) correlations (apart from those on the diagonal)? Could you have predicted these relationships? Justify your answer. (.../1.5)
Step 3: Eigenvectors and eigenvalues.
Go to the "var_vp" tab in the pca_macro.xlsm file.
Because this dataset has many variables, it is difficult to grasp all the relationships between them from the correlation matrix alone. Therefore, we will now use principal component analysis (PCA) to reduce the dimensionality of the data.
Below the correlation matrix displayed in the "var_vp" tab, the eigenvalues and eigenvectors are presented.
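For readers who want to see where such numbers can come from, here is a minimal sketch of an eigendecomposition of a correlation matrix in NumPy. It assumes the macro performs a standard PCA on the correlation matrix, which the source does not state explicitly. X is again random placeholder data, so the printed values will not match the macro's output.

```python
import numpy as np

# Placeholder data (NOT the real dataset): 50 observations, 12 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
R = np.corrcoef(X, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(R)

# Reorder so the largest eigenvalue (first principal component) comes first.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]   # column k holds the eigenvector for eigvals[k]

# Share of total variance carried by each component, and the running total.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

print(np.round(explained[:2], 3))   # shares of the first two components
print(round(cumulative[1], 3))      # share of the first two together
```

For a correlation matrix of 12 variables the eigenvalues sum to 12, so each share is simply the eigenvalue divided by 12.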
Question 3.1: What do the eigenvectors and eigenvalues represent? (.../2)
Question 3.2: How much of the variance in the original 12 variables is explained by the first principal component? And by the second? And by the first two principal components together? (.../2)
Question 3.3: Why does the first principal component always account for most of the total variance? (.../1)
Question 3.4: How many principal components will we retain? On what does this depend? (.../1)
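Two rules of thumb that are often used when deciding how many components to keep can be sketched as follows. They are offered only as common heuristics. The course slides may prescribe a different criterion. The function names, the 0.8 threshold, and the example eigenvalues are all made up for illustration.

```python
import numpy as np

def kaiser_count(eigvals):
    """Count eigenvalues strictly greater than 1 (Kaiser rule of thumb)."""
    return int(np.sum(np.asarray(eigvals, dtype=float) > 1.0))

def count_for_share(eigvals, share=0.8):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches `share` (0.8 is an arbitrary example threshold)."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    # Small tolerance guards against floating-point rounding at the boundary.
    return int(np.argmax(cumulative >= share - 1e-12)) + 1

# Made-up eigenvalues that sum to 12; they are NOT the macro's output.
example = [4.0, 3.0, 1.5, 1.0, 0.8, 0.7, 0.4, 0.3, 0.15, 0.1, 0.03, 0.02]

print(kaiser_count(example))
print(count_for_share(example, share=0.8))
```

The two rules can disagree, which is one reason the choice also depends on how interpretable the retained components are.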
Step 4: Analyzing the correlation circle (mono-plot).
Go to the "cercle_variables" tab in the pca_macro.xlsm file and click on the "Étiquettes" button.
In this tab you can find a correlation circle for the coefficients of the first two principal components (i.e., axe 1 and axe 2).
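One common way to obtain the coordinates plotted in a correlation circle is to scale each eigenvector by the square root of its eigenvalue, which gives the correlation between each original variable and each component. The sketch below assumes the macro follows this convention, which the source does not confirm. It uses random placeholder data rather than the real dataset.

```python
import numpy as np

# Placeholder data (NOT the real dataset): 50 observations, 12 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# coords[j, k] = correlation between variable j and component k (k = 0, 1).
coords = eigvecs[:, :2] * np.sqrt(eigvals[:2])

# Cross-check: correlate each standardized variable with the component scores.
scores = Z @ eigvecs[:, :2]
direct = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1] for k in range(2)]
                   for j in range(Z.shape[1])])
assert np.allclose(coords, direct)

# Distance of each variable's point from the origin of the circle.
radius = np.sqrt((coords ** 2).sum(axis=1))
print(np.round(radius, 3))
```

Under this convention every point lies inside or on the unit circle, and a variable plotted close to the circle is well represented by the first two components.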
Question 4.1: What does this correlation circle represent? (.../1)
Question 4.2: Based on the correlation circle, which variables seem to influence the first principal component the most? Which variables seem to influence the second principal component the most? (.../2)
Question 4.3: What can you say about neighborhoods that are positioned on the left of axis 1? (.../2)
Question 4.4: Recast the data along the principal components' axes. What are the main characteristics of each quadrant? (Hint: You may recall the mineral waters example that you analyzed in the lecture, as in the diagram below.) (.../4)
Mineral waters diagram (PC2 labeled at the top), by quadrant:
Upper left: Low concentration of Dry Residue, Potassium, Fluoride, Bicarbonates, Sodium. High concentration of Calcium, Sulphates, Magnesium, and high price.
Upper right: High concentration of all components and high price.
Lower left: Low concentration of all components and low price.
Lower right: High concentration of Dry Residue, Potassium, Fluoride, Bicarbonates, Sodium. Low concentration of ...
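To make the idea of recasting the data concrete, the following sketch projects standardized observations onto the first two components and labels each one by quadrant. It is an illustration under assumptions, not the macro's procedure. The data are random placeholders, the quadrant function is a hypothetical helper, and the sign of each axis is arbitrary in PCA, so left and right (or up and down) may be flipped relative to the macro's plot.

```python
import numpy as np

# Placeholder data (NOT the real dataset): 50 observations, 12 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 12))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Coordinates of every observation on axis 1 (PC1) and axis 2 (PC2).
scores = Z @ eigvecs[:, :2]

def quadrant(pc1, pc2):
    """Hypothetical helper: name the quadrant from the signs of the two scores."""
    vertical = "upper" if pc2 >= 0 else "lower"
    horizontal = "right" if pc1 >= 0 else "left"
    return f"{vertical} {horizontal}"

labels = [quadrant(pc1, pc2) for pc1, pc2 in scores]

# How many observations fall in each quadrant.
for name in sorted(set(labels)):
    print(name, labels.count(name))
```

Interpreting a quadrant then amounts to reading off which variables point toward it in the correlation circle, as in the mineral waters diagram.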