Problem 3 (70 points):

a) Download the seeds dataset from https://archive.ics.uci.edu/ml/datasets/seeds

b) Remove the class attribute (the rightmost column) from the dataset, and perform any other pre-processing steps that you consider necessary. Then justify each pre-processing step you apply. If you decide that no pre-processing is needed, you must justify that decision.
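One pre-processing step that is often justified for distance-based clustering is z-score standardization, because attributes measured on larger numeric scales otherwise dominate the Euclidean distance. The sketch below is only an illustration under stated assumptions: the data file is assumed to be whitespace-separated with the class label in the last column, and the function names `load_features` and `standardize` are hypothetical. Whether standardization is actually appropriate for this dataset is part of the justification the problem asks for.

```python
import math

def load_features(path):
    """Read whitespace-separated rows and drop the last (class) column.

    Assumption: the file has one instance per line, numeric fields only.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()  # tolerates repeated tabs or spaces
            if fields:
                rows.append([float(value) for value in fields[:-1]])
    return rows

def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    # A constant column has std 0; map it to 0.0 rather than divide by zero.
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)]
            for row in rows]
```

A typical call would be `features = standardize(load_features(path))`, where `path` points at the downloaded data file.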

c) Using the programming language of your choice (e.g. Java, C/C++, C#, Python, R, etc.), implement the K-means algorithm from scratch (see the note at the bottom of this problem about this implementation). Using your own implementation, cluster the instances of the dataset with K = 2, 3, 4, 5, and 6 clusters. For each value of K, print out the total sum of squared errors and, for each cluster, the cluster ID, the cluster mean, the cluster's sum of squared errors, and the IDs of all the instances belonging to that cluster. Then, using the elbow/knee method, make a plot of the total sum of squared errors vs. K and select an adequate K value. Justify why you chose that K value.
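As a rough illustration of what part (c) asks for, here is a minimal sketch of Lloyd's K-means iteration in plain Python, with no clustering library involved. Several choices are assumptions rather than requirements of the problem: initial means are K randomly sampled instances, an empty cluster keeps its previous mean, instance IDs are 1-based row numbers, and the names `kmeans`, `squared_distance`, and `report` are hypothetical.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids, per-cluster SSE)."""
    rng = random.Random(seed)
    # Assumption: use k randomly chosen instances as the initial means.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # no assignment changed, so the iteration has converged
        labels = new_labels
        # Update step: each mean moves to the average of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous mean
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = [0.0] * k
    for p, lab in zip(points, labels):
        sse[lab] += squared_distance(p, centroids[lab])
    return labels, centroids, sse

def report(points, k):
    """Print the quantities requested in part (c) and return the total SSE."""
    labels, centroids, sse = kmeans(points, k)
    print(f"K={k}  total SSE={sum(sse):.4f}")
    for c in range(k):
        ids = [i + 1 for i, lab in enumerate(labels) if lab == c]  # 1-based IDs
        print(f"  cluster {c + 1}: SSE={sse[c]:.4f}  mean={centroids[c]}  instances={ids}")
    return sum(sse)
```

Calling `report(points, k)` for each K from 2 to 6 and collecting the returned totals gives the values to plot against K for the elbow method. Because K-means only finds a local optimum, the result depends on the seed; running several seeds per K and keeping the lowest total is a common safeguard.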

d) Using the agglomerative hierarchical clustering algorithm available in R (do not implement this algorithm from scratch yourself; instead, use the R packages/functions that already implement it), cluster the instances of the dataset using Complete Link inter-cluster similarity (MAX) and Ward's inter-cluster similarity. Then, for each of the two algorithms, using the K value you chose in Problem (3.c), get the K clusters by cutting the resulting dendrogram at the K level (the root of a dendrogram is at level 1). For each clustering, print out the cut dendrogram and the total sum of squared errors, and for each cluster in the clustering, print out the cluster ID, the IDs of all the instances belonging to that cluster, and the cluster's sum of squared errors.
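The clustering in part (d) must be done in R, where `hclust` (with `method = "complete"` or one of its Ward variants such as `"ward.D2"`) builds the tree and `cutree(tree, k = K)` returns one cluster label per instance. The per-cluster error computation itself does not depend on which algorithm produced the labels, so the sketch below illustrates it in Python for consistency with the other examples here; it would need to be translated to R for the actual submission. The function name `sse_by_cluster` and the 1-based instance IDs are assumptions.

```python
from collections import defaultdict

def sse_by_cluster(points, labels):
    """Return {cluster ID: (member IDs, mean, SSE)} for any hard clustering.

    `labels[i]` is the cluster ID of `points[i]`, for example a vector of
    labels obtained by cutting a dendrogram into K clusters.
    """
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    summary = {}
    for label, members in sorted(groups.items()):
        columns = list(zip(*(points[i] for i in members)))
        mean = [sum(col) / len(members) for col in columns]
        # SSE: squared Euclidean distance of every member to the cluster mean.
        sse = sum((v - m) ** 2
                  for i in members
                  for v, m in zip(points[i], mean))
        summary[label] = ([i + 1 for i in members], mean, sse)  # 1-based IDs
    return summary
```

Summing the third element over all clusters gives the total sum of squared errors, which puts the hierarchical results on the same scale as the K-means results when the same (pre-processed) features are used for both.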

e) Write a report that presents an in-depth comparative analysis, describing the advantages and disadvantages of K-means and of the agglomerative hierarchical clustering algorithm with different similarity measures (MAX and Ward's inter-cluster similarities) for classifying the chosen dataset. The analysis must be comprehensive, not a single sentence; it should not simply report results, but also explain why you get such results (e.g. the statement "K-means with K = X is the best for this dataset because it achieves the smallest total sum of squared errors" is NOT a complete analysis; you need to explain why this occurs).

Notes on the implementation of K-means: In your code you cannot call any function/package that already implements this algorithm or a part of it.
