Question: Write Python code for the following problem:

The dataset on American College and University Rankings (available from www.dataminingbook.com) contains information on 1302 American colleges and universities offering an undergraduate program. For each university, there are 17 measurements, including continuous measurements (such as tuition and graduation rate) and categorical measurements (such as location by state and whether it is a private or public school). Note that many records are missing some measurements. Our first goal is to estimate these missing values from similar records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.
A. Data Cleaning: Remove all records with missing measurements from the dataset.
B. Hierarchical Clustering: For all continuous measurements, apply hierarchical clustering using complete linkage and Euclidean distance. Be sure to normalize the data before clustering. Based on the dendrogram, how many clusters seem appropriate for this dataset?
C. Cluster Characterization: Compare the summary statistics (e.g., mean or median) for each cluster. Describe the characteristics of each cluster in context (e.g., "Universities with high tuition, low acceptance rates..."). Hint: You can use the pandas groupby(clusterlabel) method along with aggregation methods like mean or median to summarize each cluster.
D. Categorical Analysis: Use the categorical variables (State and Private/Public) that were not part of the clustering to describe each cluster. Is there any noticeable relationship between the clusters and these categorical variables?
E. External Information: What other external information could help explain the characteristics of some or all of the clusters?
F. Missing Data Imputation: Consider Harvard University, which has missing data. Compute the Euclidean distance between Harvard and each of the clusters you identified earlier, using only the available measurements. Which cluster is Harvard closest to? Impute the missing values for Harvard by taking the average of that cluster's corresponding measurements.
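Parts A through D could be sketched roughly as follows. This is a minimal illustration, not a verified solution: the helper name `cluster_complete_records`, the toy column names, and the toy values are all made up here, and the real column names in Universities.csv would need to be substituted. It assumes pandas, NumPy, and SciPy are installed.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_complete_records(df, categorical_cols, n_clusters):
    """Parts A-B: keep complete rows, z-score the continuous columns,
    and cut a complete-linkage tree into n_clusters groups."""
    complete = df.dropna().copy()                       # Part A: drop partial records
    continuous = complete.drop(columns=categorical_cols)
    # Normalize so that no single measurement dominates the distance.
    normalized = (continuous - continuous.mean()) / continuous.std()
    tree = linkage(normalized.to_numpy(), method="complete", metric="euclidean")
    complete["Cluster"] = fcluster(tree, t=n_clusters, criterion="maxclust")
    return complete, tree


# Toy stand-in for the real data (hypothetical columns, made-up values).
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e", "f", "g"],
    "Type": ["public", "public", "public",
             "private", "private", "private", "private"],
    "Tuition": [1.0, 1.1, 0.9, 10.0, 10.2, 9.8, np.nan],
    "GradRate": [50.0, 52.0, 48.0, 90.0, 91.0, 89.0, 95.0],
})
clustered, tree = cluster_complete_records(toy, ["Name", "Type"], n_clusters=2)

# Part C: per-cluster summary of the continuous measurements.
summary = clustered.groupby("Cluster")[["Tuition", "GradRate"]].mean()

# Part D: cross-tabulate cluster membership against a categorical variable.
type_by_cluster = pd.crosstab(clustered["Cluster"], clustered["Type"])
```

To choose the number of clusters for Part B, the returned `tree` can be passed to `scipy.cluster.hierarchy.dendrogram` (plotting requires matplotlib) and the cut chosen where the merge heights jump. The value `n_clusters=2` above fits only the toy data; on the real dataset it should come from inspecting the dendrogram.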
Hint: df = dmba.load_data('Universities.csv')
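For Part F, one possible reading is to measure the distance from the partial record to each cluster's centroid in normalized units, using only the measurements that are present. The sketch below follows that reading; treating "distance to a cluster" as distance to its centroid is an assumption, and the function name `impute_from_nearest_cluster` and the toy values are hypothetical. With the real data, `record` would be Harvard's row of continuous measurements taken from the frame that `dmba.load_data('Universities.csv')` returns.

```python
import numpy as np
import pandas as pd


def impute_from_nearest_cluster(record, complete, labels):
    """Fill NaNs in `record` (a Series) with the mean of the closest cluster.

    `complete` holds the continuous columns of the complete records in
    original units; `labels` gives the cluster of each complete row.
    """
    mean, std = complete.mean(), complete.std()
    # Cluster centroids in the same z-score space used for clustering.
    centroids = ((complete - mean) / std).groupby(labels).mean()
    z_record = (record - mean) / std
    observed = record.index[record.notna()]
    # Euclidean distance to each centroid over the observed columns only.
    diffs = centroids[observed] - z_record[observed]
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = distances.idxmin()
    # Impute in original units from the nearest cluster's averages.
    cluster_means = complete.groupby(labels).mean().loc[nearest]
    return record.fillna(cluster_means), nearest, distances


# Toy example (made-up values): the second measurement is missing.
complete = pd.DataFrame({
    "Tuition": [1.0, 1.1, 0.9, 10.0, 10.2, 9.8],
    "GradRate": [50.0, 52.0, 48.0, 90.0, 91.0, 89.0],
})
labels = pd.Series([1, 1, 1, 2, 2, 2], name="Cluster")
partial = pd.Series({"Tuition": 9.5, "GradRate": np.nan})
filled, nearest, distances = impute_from_nearest_cluster(partial, complete, labels)
```

The record is normalized with the mean and standard deviation of the complete records so that its distances are comparable to those used when the clusters were formed.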
