Question: Classify each point in the testing dataset based on the fitted cluster centers. Report the number of points classified into each center and the corresponding MSE. Use two pie charts to convey the same information (one for training, one for testing). Does the model fitted on the training data fit the test data well? Draw conclusions about model performance. Answer in Python, referring to the code below:
import sys

import matplotlib.pyplot as plt
import numpy as np
from pyspark import SparkConf, SparkContext
from pyspark.mllib.clustering import KMeans


# Helpers
def parse_vector(line, sep=','):
    """Parses a line.

    Returns: numpy array of the latitude and longitude
    """
    fields = line.strip().split(sep)
    latitude = float(fields[1])
    longitude = float(fields[2])
    return np.array([latitude, longitude])


# Main
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kmeans ", file=sys.stderr)
        sys.exit(-1)

    # Configure Spark
    conf = (SparkConf().setMaster("local")
            .setAppName("Earthquake Clustering")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)

    # Create training RDD of (lat, long) vectors; cache it because it is reused for every k
    earthquakes_file = sys.argv[1]
    training = sc.textFile(earthquakes_file).map(parse_vector).cache()
    n_training = training.count()

    # Train KMeans models for different values of k
    k_values = range(2, 11)
    mse_values = []
    for k in k_values:
        model = KMeans.train(training, k, maxIterations=10, initializationMode="random")
        # computeCost returns the sum of squared distances to the nearest center,
        # so divide by the number of points to get the mean squared error
        mse = model.computeCost(training) / n_training
        mse_values.append(mse)

    # Find the optimal k using the elbow method
    plt.plot(k_values, mse_values)
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Mean Squared Error (MSE)')
    plt.title('Elbow Method for Optimal k')
    plt.show()

    # Train the model with the optimal k
    optimal_k = 3  # You can change this based on the elbow plot
    model = KMeans.train(training, optimal_k, maxIterations=10, initializationMode="random")

    # Print the cluster centers
    print("Earthquake cluster centers:")
    print(model.clusterCenters)

    # Plot the cluster centers on a map (using a library like Folium)
    # This part requires an additional library (Folium) and code for map visualization
    sc.stop()
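
The classification step the question asks for can be sketched roughly as follows. This is a minimal pure-Python stand-in, not the Spark implementation: the function names (`nearest_center`, `cluster_summary`, `plot_cluster_pies`) and the toy data are hypothetical. In the script above, the centers would presumably come from `model.clusterCenters`, and the assignment could instead be done with `model.predict` on each parsed test vector. How the testing file is loaded is not shown in the question, so it is left out here.

```python
def nearest_center(point, centers):
    """Return the index of the center closest to `point` (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return distances.index(min(distances))


def cluster_summary(points, centers):
    """Return (counts, mse), each a list with one entry per center.

    mse[i] is the mean squared distance from the points assigned to
    center i to that center, or None if no point was assigned to it.
    """
    counts = [0] * len(centers)
    sq_sums = [0.0] * len(centers)
    for point in points:
        i = nearest_center(point, centers)
        counts[i] += 1
        sq_sums[i] += sum((p - c) ** 2 for p, c in zip(point, centers[i]))
    mse = [s / n if n else None for s, n in zip(sq_sums, counts)]
    return counts, mse


def plot_cluster_pies(train_counts, test_counts):
    """Draw side-by-side pie charts of cluster sizes for training and testing data."""
    import matplotlib.pyplot as plt  # imported lazily so the summary code runs without it

    labels = ["Cluster %d" % i for i in range(len(train_counts))]
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    axes[0].pie(train_counts, labels=labels, autopct="%1.1f%%")
    axes[0].set_title("Training")
    axes[1].pie(test_counts, labels=labels, autopct="%1.1f%%")
    axes[1].set_title("Testing")
    plt.show()


if __name__ == "__main__":
    # Hypothetical toy data standing in for the fitted centers and the test set
    centers = [(0.0, 0.0), (10.0, 10.0)]
    test_points = [(1.0, 0.0), (0.0, 2.0), (10.0, 11.0)]
    counts, mse = cluster_summary(test_points, centers)
    print("Points per center:", counts)
    print("MSE per center:", mse)
```

Running the same summary on the training points and on the testing points gives two sets of counts and per-center MSE values to compare. If the cluster proportions in the two pie charts are similar and the testing MSE is close to the training MSE for each center, that would suggest the fitted centers generalize reasonably well; a much larger testing MSE would suggest they do not.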
