Write a python code for the following problem The dataset on American College and University Rankings ( available from www dataminingbook com ) contains information on 1 3 0 2 American colleges and universities offering an undergraduate program For each university, there are 1 7 measurements, including continuous measurements ( such as tuition and graduation rate ) and categorical measurements ( such as location by state and whether it is a private or public school ) Note that many records are missing some measurements Our first goal is to estimate these missing values from similar records This will be done by clustering the complete records and then finding the closest cluster for each of the partial records The missing values will be imputed from the information in that cluster A Data Cleaning Remove all records with missing measurements from the dataset B Hierarchical Clustering For all continuous measurements, apply hierarchical clustering using complete linkage and Euclidean distance Be sure to normalize the data before clustering Based on the dendrogram, how many clusters seem appropriate for this dataset C Cluster Characterization Compare the summary statistics ( e g , mean or median ) for each cluster Describe the characteristics of each cluster in context ( e g , Universities with high tuition, low acceptance rates ) Hint You can use the pandas groupby ( clusterlabel ) method along with aggregation methods like mean or median to summarize each cluster D Categorical Analysis Use the categorical variables ( State and Private Public ) that were not part of the clustering to describe each cluster Is there any noticeable relationship between the clusters and these categorical variables E External Information What other external information could help explain the characteristics of some or all of the clusters F Missing Data Imputation Consider Harvard University, which has missing data Compute the Euclidean distance between Harvard and each of the clusters you identified earlier, using only the available measurements Which cluster is Harvard closest to Impute the missing values for Harvard by taking the average of that cluster's corresponding measurements Hint df dmba load data ( ' Universities csv ' )

The Answer is in the image, click to view ...

Question: Write a python code for the following problem: The dataset on American College and University Rankings ( available from www . dataminingbook.com ) contains information

Write a python code for the following problem:

The dataset on American College and University Rankings

(

available from

www

.

dataminingbook.com

)

contains information on

1302

American colleges and universities offering an undergraduate program. For each university, there are

17

measurements, including continuous measurements

(

such as tuition and graduation rate

)

and categorical measurements

(

such as location by state and whether it is a private or public school

) .

Note that many records are missing some measurements. Our first goal is to estimate these missing values from

similar

records. This will be done by clustering the complete records and then finding the closest cluster for each of the partial records. The missing values will be imputed from the information in that cluster.

.

Data Cleaning: Remove all records with missing measurements from the dataset.

.

Hierarchical Clustering: For all continuous measurements, apply hierarchical clustering using complete linkage and Euclidean distance. Be sure to normalize the data before clustering. Based on the dendrogram, how many clusters seem appropriate for this dataset?

.

Cluster Characterization: Compare the summary statistics

(

.

.,

mean or median

)

for each cluster. Describe the characteristics of each cluster in context

(

.

.,

"Universities with high tuition, low acceptance rates..."

) .

Hint: You can use the pandas groupby

(

clusterlabel

)

method along with aggregation methods like mean or median to summarize each cluster.

.

Categorical Analysis: Use the categorical variables

(

State and Private

/

Public

)

that were not part of the clustering to describe each cluster. Is there any noticeable relationship between the clusters and these categorical variables?

.

External Information: What other external information could help explain the characteristics of some or all of the clusters?

.

Missing Data Imputation: Consider Harvard University, which has missing data. Compute the Euclidean distance between Harvard and each of the clusters you identified earlier, using only the available measurements. Which cluster is Harvard closest to

?

Impute the missing values for Harvard by taking the average of that cluster's corresponding measurements.

Hint: df

=

dmba.load

_

data

('

Universities

.

csv

')

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!

.Download churn dataset from blackboard and perform the followings . Explore whether there are missing values for any of the variables - If they are, correct them by replacing with new data - Use a...

2. University Rankings. The dataset on American college and university rankings (available from blackborad) contains information on 1302 American colleges and universities offering an undergraduate...

Return to the Ethics Problems section of this chapter and review the examples of ethical situations that arise in HCOs. Then return to Exhibit 11.5 and review approaches manag- ers can use to create...

uetion1 OECD Health Statistics 2015 Definitions, Sources and Methods HEALTH STATUS (HEALTH_STAT) Access the dataset on Health Status in OECD.Stat:...

Download the most recent financial statements of Medtronic and Boston Scientific.From the attached ratio sheet pick 6 relevant ratios and compare the two companies.Then from the ratios you selected...

this is a python program please can anyone help me thank you Introduction In problem set 5, you will build a program to monitor news feeds over the Internet. Your program will filter the news,...

RMIT UNIVERSITY Programming Fundamentals (COSC2531) Assignment 2 Individual assignment (no group work). Submit online via Canvas/Assignments/Assignment 2. Marks are awarded per rubric (please see the...

I need some help with both Case 1 and Case 2 of this assignment. See attached document for full assignment. Here is an excerpt from each. Case 1: Purchase Point Media Corporation (PPMC) Media...

Need help with the document attached. The company that this paper is about is Nvidia. Project Descriptions SEC 10-K Paper You will be asked to select a company that is publically traded. You must...

Assignment USE THE HARVARD REFERENCING SYSTEM THROUGHOUT THE REPORT Problem 1 (25 pts) The cost of 14 models of digitals cameras at a camera specialty store during 2014 was as follows: 340 450 450...

How can healthcare organizations better manage their leaders, who themselves function to manage their organizations, to avert failure? Can healthcare organizations manage their leaders? What systems...

DHL Global Delivery Service Summary DHL started in 1969 in San Francisco with 3,000 investment. Today, the company delivers packages to destinations is 200 , bringing is over billion in revenues. is...

A good internal control is to make changes to employee withholding allowances based only on a Form properly completed and signed by the employee.

If Cultural Relativism is true, then what happens when someone is a member of more than one culture, and their cultures disagree about what is right in a given situation?