Problem 3 (70 points):

a) Download the seeds dataset from https://archive.ics.uci.edu/ml/datasets/seeds

b) Remove the class attribute (the rightmost column) from the dataset, and perform any other pre-processing steps that you consider necessary. Then justify why each of those pre-processing steps is needed. If you consider that no pre-processing is needed, you must justify that decision.
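One possible reading of part (b), sketched in Python under stated assumptions: the data file is whitespace-separated with the class label as the last field, and z-score standardization is chosen as the pre-processing step because K-means relies on Euclidean distance, so attributes with larger numeric ranges would otherwise dominate. The function names `load_features` and `standardize` are hypothetical, and whether standardization is actually warranted is a decision the assignment asks you to justify yourself.

```python
import math

def load_features(lines):
    """Parse whitespace-separated rows and drop the last column (assumed to be the class label)."""
    rows = []
    for line in lines:
        fields = line.split()  # split() tolerates repeated tabs or spaces
        if not fields:
            continue  # skip blank lines
        rows.append([float(v) for v in fields[:-1]])
    return rows

def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    # A constant column has std 0; divide by 1 instead so it maps to all zeros.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

A typical call would be `standardize(load_features(open(path)))`, where `path` points to the downloaded data file (its exact name is not given in the problem statement).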

c) Using the programming language of your choice (e.g. Java, C/C++, C#, Python, R, etc.), implement the K-means algorithm from scratch (see note at the bottom of this problem about this implementation). Using your own implementation, cluster the instances of the dataset with different cluster numbers K = 2, 3, 4, 5, and 6. For each value of K that you run K-means with, print out the total sum of squared errors, the sum of squared errors for each cluster, the cluster mean, and for each cluster, also print out the cluster ID and the IDs of all the instances belonging to that cluster. Then, using the elbow/knee method, make a plot of the total sum of squared errors vs. K and select an adequate K value. Justify why you choose that K value.
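A minimal sketch of what part (c) describes, written in standard-library Python so that no existing K-means routine is called. It assumes random initialization from the data points, squared Euclidean distance, and 1-based row positions as instance IDs; none of these choices is fixed by the problem statement, and the names `kmeans`, `sq_dist`, and `report` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style K-means; returns (labels, centroids, per-cluster SSE)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct rows sampled at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    cluster_sse = [0.0] * k
    for p, l in zip(points, labels):
        cluster_sse[l] += sq_dist(p, centroids[l])
    return labels, centroids, cluster_sse

def report(points, ks=(2, 3, 4, 5, 6)):
    """Print the quantities listed in part (c); returns [(K, total SSE), ...]."""
    totals = []
    for k in ks:
        labels, centroids, cluster_sse = kmeans(points, k)
        print(f"K = {k}: total SSE = {sum(cluster_sse):.4f}")
        for c in range(k):
            ids = [i + 1 for i, l in enumerate(labels) if l == c]  # 1-based IDs
            print(f"  cluster {c}: SSE = {cluster_sse[c]:.4f}, mean = {centroids[c]}")
            print(f"    instance IDs: {ids}")
        totals.append((k, sum(cluster_sse)))
    return totals
```

The list returned by `report` can be passed to any plotting tool to draw total SSE against K for the elbow method. Because the result depends on the random start, running several seeds per K and keeping the lowest total SSE is a common precaution.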

d) Using the agglomerative hierarchical clustering algorithm available in R (do not implement this algorithm from scratch yourself; instead, use the R packages/functions that already implement it), cluster the instances of the dataset using Complete Link inter-cluster similarity (MAX) and Ward's inter-cluster similarity. Then, for each of the two algorithms, using the K value you chose in Problem (3.c), obtain K clusters by cutting the resulting dendrogram at level K (the root of a dendrogram is at level 1). For each clustering, print out the cut dendrogram and the total sum of squared errors, and for each cluster in the clustering, print out the cluster ID, the IDs of all the instances belonging to that cluster, and the cluster's sum of squared errors.
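For reference, the two merge criteria named in part (d) and the error measure to report can be written as follows. This is a sketch that assumes Euclidean distance between instances, which the problem statement does not specify; $A$ and $B$ denote two clusters, $\mu_C$ the mean of cluster $C$, and $C_1, \dots, C_K$ the clusters obtained after cutting the dendrogram.

```latex
\begin{align*}
d_{\text{complete}}(A, B) &= \max_{x \in A,\; y \in B} \lVert x - y \rVert \\
\Delta_{\text{Ward}}(A, B) &= \frac{|A|\,|B|}{|A| + |B|}\, \lVert \mu_A - \mu_B \rVert^2
  = \mathrm{SSE}(A \cup B) - \mathrm{SSE}(A) - \mathrm{SSE}(B) \\
\mathrm{SSE}(C) &= \sum_{x \in C} \lVert x - \mu_C \rVert^2, \qquad
\mathrm{SSE}_{\text{total}} = \sum_{k=1}^{K} \mathrm{SSE}(C_k)
\end{align*}
```

Complete link merges the pair of clusters whose farthest members are closest, while Ward's criterion merges the pair whose union increases the total SSE the least. In base R these appear to correspond to `hclust` with `method = "complete"` and `method = "ward.D2"` on a `dist` object, with `cutree(tree, k = K)` returning the cluster labels; check the `hclust` help page before relying on that mapping.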

e) Write a report that presents an in-depth comparative analysis, describing the advantages and disadvantages of K-means and of the agglomerative hierarchical clustering algorithm with different similarity measures (MAX and Ward's inter-cluster similarities) for clustering the chosen dataset. The analysis must be comprehensive, not a single sentence; it should not simply report results, but also explain why you get such results (e.g. the statement "K-means with K = X is the best for this dataset because it achieves the smallest total sum of squared errors" is NOT a complete analysis; you need to explain why this occurs).

Notes on the implementation of K-means: In your code you cannot call any function/package that already implements this algorithm or a part of it.
