Description
This dataset is extracted from the original source (shown above) to include weather information
observed on July 15, 2020 only. The features include:
date: month, day, and year when the observation is made
station id: ID of a location where the observation is made
lat: latitude coordinate
lon: longitude coordinate
mean temp: average temperature
wind speed
precipitation
The observations come from the 50 states of the United States. Our goal is to apply clustering
methods to group locations with similar weather conditions.
Pre-processing
1. Import data from weather.csv, save it as myData. Hint: use read.csv().
2. Use summary() to check if there are columns with NA values.
3. Remove rows with NA station ID.
4. Check if data contains duplicated station ID.
5. There are still NA values in columns wind speed and precipitation. We will fill those empty
values using these steps:
(a) Fill NA precipitation with value 0. Hint: is.na() returns TRUE/FALSE for NA values.
(b) Fill NA values of wind speed with the national median value.
(c) Can you think of other ways of filling in the missing values?
6. Remove observations from Alaska and Hawaii.
7. Let us visualize the data. Make a scatter plot using ggplot. Use longitude (x-axis) and
latitude (y-axis) for coordinates, and use column mean temp to color points.
(Optional) We can change the color gradient by adding
+ scale_color_gradient(low = "color1", high = "color2"),
where color1 and color2 are color names (e.g., gold, red, etc.).
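Steps 2 to 7 can be sketched as follows. This is a minimal illustration, not a reference solution: it runs on a small made-up data frame instead of weather.csv, and the column names (station_id, lat, lon, mean_temp, wind_speed, precipitation) are assumptions about how the file is laid out. The feature list has no state column, so the sketch removes Alaska and Hawaii with a rough latitude/longitude bounding box, which is also an assumption.

```r
# Toy stand-in for weather.csv (hypothetical values and column names).
# In the exercise this would be: myData <- read.csv("weather.csv")
myData <- data.frame(
  station_id    = c("A1", "A2", NA, "A4", "A4", "AK1", "HI1"),
  lat           = c(40.7, 34.1, 41.9, 29.8, 29.8, 61.2, 21.3),
  lon           = c(-74.0, -118.2, -87.6, -95.4, -95.4, -149.9, -157.9),
  mean_temp     = c(78, 75, 80, 88, 88, 60, 81),
  wind_speed    = c(5.2, NA, 7.1, 4.0, 4.0, 6.3, 9.8),
  precipitation = c(0.1, NA, 0.0, NA, NA, 0.2, 0.0)
)

# Step 2: summary() reports NA counts for numeric columns;
# colSums(is.na()) also covers character columns such as the ID.
summary(myData)
na_counts <- colSums(is.na(myData))

# Step 3: drop rows whose station ID is missing.
myData <- myData[!is.na(myData$station_id), ]

# Step 4: check (but do not remove) duplicated station IDs.
has_dups <- any(duplicated(myData$station_id))

# Step 5(a): missing precipitation becomes 0.
myData$precipitation[is.na(myData$precipitation)] <- 0

# Step 5(b): missing wind speed becomes the median over all stations.
national_median <- median(myData$wind_speed, na.rm = TRUE)
myData$wind_speed[is.na(myData$wind_speed)] <- national_median

# Step 6: keep only points inside an approximate box around the
# contiguous United States (assumed bounds, not from the exercise).
contiguous <- myData$lat > 24 & myData$lat < 50 &
  myData$lon > -125 & myData$lon < -66
myData <- myData[contiguous, ]

# Step 7: scatter plot colored by temperature, if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(myData, aes(x = lon, y = lat, color = mean_temp)) +
    geom_point() +
    scale_color_gradient(low = "gold", high = "red")
  print(p)
}
```

For step 5(c), one alternative would be to fill a missing value from nearby stations (for example, the median of stations within a small latitude/longitude window) rather than from a single national value.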
Clustering: We are going to apply clustering algorithms using weather features (average
temperature, wind speed, and precipitation).
8. Save a subset of myData containing the above 3 columns only.
9. Apply the K-means clustering algorithm to group the data points into 5 clusters.
(a) Report the number of data points in each cluster.
(b) For each cluster, report the average temperature, wind speed, and precipitation.
(c) Repeat Question 7 to make a scatter plot. This time, we will use cluster membership
(e.g., 1, 2, 3, 4, 5) to show different colors. Use as.factor() to convert the cluster
membership before providing it to ggplot.
(Optional) We can manually choose colors for each group by adding
+ scale_color_manual(values = c("color1", "color2", "color3", "color4", "color5")).
10. Apply hierarchical clustering (use complete link method) on the subset data created in
Question 8.
(a) Make a dendrogram.
(b) Trim the clustering to get 3 clusters.
(c) Repeat Question 7 to visualize the clustering output.
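Steps 8 to 10 might look like the sketch below. It uses randomly generated data in place of the cleaned myData, and the column names and the object names weather, km, and hc are illustrative assumptions. K-means starts from random centers, so the seed is fixed to make the run repeatable.

```r
set.seed(1)  # fixed seed so the toy data and k-means result are repeatable

# Synthetic stand-in for the cleaned myData (hypothetical values).
n <- 60
myData <- data.frame(
  lat           = runif(n, 25, 49),
  lon           = runif(n, -124, -67),
  mean_temp     = rnorm(n, mean = 78, sd = 8),
  wind_speed    = runif(n, 0, 15),
  precipitation = rexp(n, rate = 4)
)

# Step 8: keep only the three weather features.
weather <- myData[, c("mean_temp", "wind_speed", "precipitation")]

# Step 9: k-means with 5 clusters; nstart tries several random starts.
km <- kmeans(weather, centers = 5, nstart = 25)
cluster_sizes <- table(km$cluster)                     # 9(a)
cluster_means <- aggregate(weather,                    # 9(b)
                           by = list(cluster = km$cluster),
                           FUN = mean)
myData$km_cluster <- as.factor(km$cluster)

# Step 10: complete-linkage hierarchical clustering on Euclidean distances.
hc <- hclust(dist(weather), method = "complete")
plot(hc, labels = FALSE)                               # 10(a) dendrogram
myData$hc_cluster <- as.factor(cutree(hc, k = 3))      # 10(b) cut to 3 clusters

# Steps 9(c) and 10(c): color the map by cluster, if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(myData, aes(x = lon, y = lat, color = km_cluster)) +
          geom_point())
  print(ggplot(myData, aes(x = lon, y = lat, color = hc_cluster)) +
          geom_point())
}
```

Both methods here work on raw Euclidean distances, so the feature with the largest numeric range tends to dominate. The exercise does not ask for it, but passing scale(weather) instead of weather is a common way to put the three features on a comparable footing.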
