Question:
Please provide R Code:
Traditional k-means initialization is based on choosing values from a uniform distribution. In this question,
you are asked to improve k-means through initialization. k-means++ is an extension of the k-means clustering
algorithm that induces a non-uniform distribution over the data, from which the initial centroids are drawn. Read the
paper and discuss the idea in a paragraph. Implement this idea to improve your k-means program. Run
your program, Ck++, against the Diabetes and New York Times Comments data sets. Report the total error rates for k = 2,...,5 over 20 runs each for both data sets. Also compare the run times of Ck, CkSSE and Ck++ for k = 2,...,5 over 20 runs on both data sets. Present the results so that they are easy to understand. Plots, e.g., box plots, are generally a good way to convey complex ideas quickly. Discuss your results.
Paper Link: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
Diabetes Dataset: https://archive.ics.uci.edu/ml/datasets/Diabetes+130US+hospitals+for+years+1999-2008
New York Times Comments Data Sets: https://www.kaggle.com/datasets/benjaminawd/new-york-times-articles-comments-2020?select=nyt-comments-2020.csv
R script:
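A minimal sketch of the k-means++ seeding step, not the asker's actual Ck++ program. The first centroid is a data point drawn uniformly at random; each later centroid is a data point drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name kmeanspp_seed and the synthetic data are assumptions made for illustration, and the refinement step is delegated to base R's kmeans() rather than a hand-written Lloyd loop.

```r
# k-means++ seeding (illustrative sketch; `kmeanspp_seed` is a hypothetical name).
# X: numeric matrix with one observation per row; k: number of clusters.
kmeanspp_seed <- function(X, k) {
  X <- as.matrix(X)
  n <- nrow(X)
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(X))

  # First centroid: one data point chosen uniformly at random.
  centers[1, ] <- X[sample.int(n, 1), ]

  # d2[i] = squared distance from point i to its nearest chosen centroid.
  d2 <- rowSums(sweep(X, 2, centers[1, ])^2)

  for (j in seq_len(k)[-1]) {
    # Next centroid: sample a point with probability proportional to d2.
    # Fall back to uniform sampling if every point coincides with a centroid.
    idx <- if (sum(d2) > 0) sample.int(n, 1, prob = d2) else sample.int(n, 1)
    centers[j, ] <- X[idx, ]
    d2 <- pmin(d2, rowSums(sweep(X, 2, centers[j, ])^2))
  }
  centers
}

# Usage on synthetic data (assumption: two Gaussian blobs, not the real data sets).
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
init <- kmeanspp_seed(X, 2)
fit <- kmeans(X, centers = init, iter.max = 100)
fit$tot.withinss  # total within-cluster sum of squares
```

For the Diabetes and New York Times Comments data, X would first have to be reduced to numeric features (for example, selected numeric columns or a text embedding); that preprocessing is not shown here.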
Discussion of Findings:
Plots:
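One possible way to produce the requested box plots is sketched below. It runs a clustering function 20 times for each k = 2,...,5, records the total within-cluster SSE and the elapsed time, and draws one box per k. The helper name run_experiment is hypothetical, the data are synthetic placeholders, base R's kmeans() with random starts stands in for Ck, CkSSE or Ck++, and total within-cluster SSE is assumed to be the "total error" the question refers to.

```r
# Hypothetical harness: repeat a clustering function `runs` times per k and
# collect the total within-cluster SSE and the elapsed wall-clock time.
# cluster_fn must accept (X, k) and return an object with `tot.withinss`.
run_experiment <- function(X, cluster_fn, ks = 2:5, runs = 20) {
  res <- expand.grid(run = seq_len(runs), k = ks)
  res$sse <- NA_real_
  res$seconds <- NA_real_
  for (i in seq_len(nrow(res))) {
    elapsed <- system.time(fit <- cluster_fn(X, res$k[i]))[["elapsed"]]
    res$sse[i] <- fit$tot.withinss
    res$seconds[i] <- elapsed
  }
  res
}

# Synthetic placeholder data (assumption: not the Diabetes or NYT data).
set.seed(42)
X <- matrix(rnorm(400), ncol = 2)

# Stand-in algorithm: base kmeans() with random initial centroids.
res <- run_experiment(X, function(X, k) kmeans(X, centers = k, iter.max = 100))

# One box per k: spread of the error and of the run time over the 20 runs.
boxplot(sse ~ k, data = res, xlab = "k", ylab = "Total within-cluster SSE")
boxplot(seconds ~ k, data = res, xlab = "k", ylab = "Elapsed time (s)")
```

Calling run_experiment once per algorithm and per data set, then adding an algorithm column before plotting, would give side-by-side boxes for comparing the three programs.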