Question: Write a python code for the following problem: The dataset on American College and University Rankings ( available from www . dataminingbook.com ) contains information
Write a python code for the following problem:
The dataset on American College and University Rankings available from
wwwdataminingbook.com contains information on American colleges and universities offering an undergraduate program. For each university, there are measurements, including continuous measurements such as tuition and graduation rate and categorical measurements such as location by state and whether it is a private or public school Note that many records are missing some measurements. Our first goal is to estimate these missing values from similar records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.
A Data Cleaning: Remove all records with missing measurements from the dataset.
B Hierarchical Clustering: For all continuous measurements, apply hierarchical clustering using complete linkage and Euclidean distance. Be sure to normalize the data before clustering. Based on the dendrogram, how many clusters seem appropriate for this dataset?
C Cluster Characterization: Compare the summary statistics eg mean or median for each cluster. Describe the characteristics of each cluster in context eg "Universities with high tuition, low acceptance rates..." Hint: You can use the pandas groupbyclusterlabel method along with aggregation methods like mean or median to summarize each cluster.
D Categorical Analysis: Use the categorical variables State and PrivatePublic that were not part of the clustering to describe each cluster. Is there any noticeable relationship between the clusters and these categorical variables?
E External Information: What other external information could help explain the characteristics of some or all of the clusters?
F Missing Data Imputation: Consider Harvard University, which has missing data. Compute the Euclidean distance between Harvard and each of the clusters you identified earlier, using only the available measurements. Which cluster is Harvard closest to Impute the missing values for Harvard by taking the average of that cluster's corresponding measurements.
Hint: df dmba.loaddataUniversitiescsv
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
