Question: Implement internal validation methods to determine the number of clusters in a given data set automatically.
Implement internal validation methods to determine the number of clusters in a given data set automatically. Perhaps the most significant drawback of k-means is that it requires the user to supply the number of clusters, K. In many applications, this is impossible or impractical. In this phase, you will implement two internal validation methods that will help k-means automatically determine the number of clusters. An internal validation method quantifies the match between an automatically generated partition of a data set and the data set itself. Internal validation is often accomplished using an internal validity index: a function that takes a partition, the data set itself, and possibly some additional parameters as input, and gives a numerical value indicating the quality of the partition as output. In this phase, you will implement the Calinski-Harabasz (CH) and Silhouette Width (SW) internal validity indices.
CH, SW, and D are maximization indices, whereas DB is a minimization index. The way these indices are used is quite simple. For a given data set, we first decide Kmin and Kmax, the minimum and maximum number of clusters that might be present in the data set, respectively. For example, for the iris data set, Kmin and Kmax could be and respectively. For any data set, Kmin is typically whereas a rule of thumb for the maximum possible Kmax value is the nearest integer to sqrt(N), where N is the number of points in the data set. Assuming that we have a randomized initialization method, we then run k-means R times for K = Kmin and compute the internal validity index (say, CH) on the partition generated in the best run, that is, the run that produced the smallest SSE. We do the same for each subsequent K value and, finally, for K = Kmax. Since CH is a maximization criterion, we then find the K value that produced the largest CH value. That K value is the estimated number of clusters in the data set. For a numerical example, refer to the example given below and to Example in the aforementioned book. Note that these indices all give estimates, so you may not find the correct number of clusters for iris or for any other real-world data set.
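The selection procedure described above can be sketched in Java as follows. This is only a minimal sketch under stated assumptions: the class name ClusterCountEstimator and its methods are hypothetical, and the indexForK callback is assumed to run k-means R times for the given K, keep the run with the smallest SSE, and return the validity index (CH or SW) of that partition.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical helper: picks the K in [kMin, kMax] that maximizes a validity index. */
final class ClusterCountEstimator {

    /** Rule of thumb from the text: Kmax is the nearest integer to sqrt(N). */
    static int ruleOfThumbKMax(int numPoints) {
        return (int) Math.round(Math.sqrt(numPoints));
    }

    /**
     * indexForK is an assumed callback: for a given K it should run k-means
     * R times, keep the run with the smallest SSE, and return the index value
     * (for example CH or SW) computed on that best partition.
     */
    static int estimateK(int kMin, int kMax, IntToDoubleFunction indexForK) {
        int bestK = kMin;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (int k = kMin; k <= kMax; k++) {
            double value = indexForK.applyAsDouble(k);
            if (value > bestValue) { // maximization index: larger is better
                bestValue = value;
                bestK = k;
            }
        }
        return bestK;
    }
}
```

For a minimization index such as DB, the same loop would apply with the comparison reversed and bestValue starting at positive infinity.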
Numerical Example for SW: SW values for the iris data set, only for K and clusters (in your experiments, the Kmax value for this data set should be round(sqrt(N))), are given below. Note that this is just to give you an example of what reasonable SW values should look like; you may not get the exact same numbers, depending on your initialization. In addition, these numbers were obtained on the raw (unnormalized) iris data set. In your experiments, you must normalize the data sets using the min-max method from the earlier phase. There are real clusters in iris, but two of those clusters overlap, so K gives a much better SW value than K. So if we were to try just K and clusters, based on the SW index, we would select K as the best estimate for the number of clusters in this particular data set. Remember that if you are getting silhouette values outside the valid range, there must be a bug in your program.
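A minimal Java sketch of the SW computation is given here for orientation. It assumes the common definition: for each point i, a(i) is the mean distance to the other points in its own cluster, b(i) is the smallest mean distance to the points of any other cluster, s(i) = (b(i) - a(i)) / max(a(i), b(i)), and SW is the mean of s(i) over all points. By this definition every s(i) lies in [-1, 1]. Euclidean distance, the convention s(i) = 0 for singleton clusters, and the class and method names are assumptions; the definition given in your course material takes precedence if it differs.

```java
import java.util.Arrays;

/** Hypothetical helper: mean silhouette width of a hard partition. */
final class SilhouetteWidth {

    static double euclidean(double[] p, double[] q) {
        double sum = 0.0;
        for (int d = 0; d < p.length; d++) {
            double diff = p[d] - q[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** labels[i] must be in 0..k-1; returns the mean of s(i) over all points. */
    static double compute(double[][] data, int[] labels, int k) {
        int n = data.length;
        int[] size = new int[k];
        for (int label : labels) size[label]++;

        double total = 0.0;
        double[] distSum = new double[k]; // distance from point i to each cluster, summed
        for (int i = 0; i < n; i++) {
            Arrays.fill(distSum, 0.0);
            for (int j = 0; j < n; j++) {
                if (j != i) distSum[labels[j]] += euclidean(data[i], data[j]);
            }
            int own = labels[i];
            if (size[own] <= 1) continue; // singleton cluster: s(i) = 0 (assumed convention)
            double a = distSum[own] / (size[own] - 1);
            double b = Double.POSITIVE_INFINITY;
            for (int c = 0; c < k; c++) {
                if (c != own && size[c] > 0) b = Math.min(b, distSum[c] / size[c]);
            }
            double denom = Math.max(a, b);
            // Skip points with no other non-empty cluster or with a = b = 0.
            if (b != Double.POSITIVE_INFINITY && denom > 0.0) total += (b - a) / denom;
        }
        return total / n;
    }
}
```

This direct version computes all pairwise distances, so it takes O(N^2) distance evaluations per partition; that is usually acceptable for small data sets such as iris.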
test datairisbezdek.txt
Iteration : SSE
Iteration : SSE
Iteration : SSE
Iteration : SSE
Iteration : SSE
Iteration : SSE
Iteration : SSE
SW
test datairisbezdek.txt
Iteration : SSE
Iteration : SSE
Iteration : SSE
Iteration : SSE
Iteration : SSE
SW
The trace of the within-cluster scatter matrix, tr(Sw), is the same as SSE, which is already calculated by k-means. Therefore, you do not need to calculate tr(Sw) separately. Also, for tr(Sb), you do not need to calculate the entire Sb matrix. The trace of a square matrix is given by the sum of its diagonal elements. Hence, all you need to do is calculate the diagonal elements of Sb and then add them up.
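The shortcut above can be sketched in Java as follows, assuming the usual form of the index, CH = [tr(Sb) / (K - 1)] / [tr(Sw) / (N - K)], with tr(Sb) equal to the sum over clusters of n_c times the squared distance between the cluster center and the overall data mean. The class name, method name, and parameter layout are hypothetical. The sketch also assumes K > 1 and a nonzero SSE.

```java
/** Hypothetical helper: Calinski-Harabasz index from k-means output. */
final class CalinskiHarabasz {

    /**
     * centers[c] is the centroid of cluster c, sizes[c] is its number of points,
     * and sse is the final SSE reported by k-means, used directly as tr(Sw).
     */
    static double compute(double[][] data, double[][] centers, int[] sizes, double sse) {
        int n = data.length;
        int dims = data[0].length;
        int k = centers.length;

        // Overall mean of the data set, one attribute at a time.
        double[] mean = new double[dims];
        for (double[] point : data) {
            for (int d = 0; d < dims; d++) mean[d] += point[d];
        }
        for (int d = 0; d < dims; d++) mean[d] /= n;

        // tr(Sb): sum only the diagonal terms sizes[c] * (centers[c][d] - mean[d])^2.
        double traceSb = 0.0;
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dims; d++) {
                double diff = centers[c][d] - mean[d];
                traceSb += sizes[c] * diff * diff;
            }
        }
        return (traceSb / (k - 1)) / (sse / (n - k));
    }
}
```

Because only the diagonal of Sb is accumulated, the cost is proportional to K times the number of attributes, plus one pass over the data for the overall mean.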
Output: Microsoft Excel or Apache OpenOffice Calc tables of each internal validity index for various K values for each data set. For each data set and validity index, show the estimated number of clusters in bold. Discuss which index seems to estimate the number of clusters more accurately. For initialization, use the random partition method with R runs. All data sets must be normalized using the min-max method from the earlier phase.
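As a reminder of the normalization step, a minimal Java sketch of min-max normalization is shown here. It assumes each attribute is rescaled to [0, 1] via (x - min) / (max - min), and that a constant attribute is mapped to 0; the class name is hypothetical, and the handling specified in the earlier phase takes precedence if it differs.

```java
/** Hypothetical helper: in-place min-max normalization of each attribute. */
final class MinMaxNormalizer {

    /** Rescales every column of data to [0, 1]; constant columns become 0. */
    static void normalize(double[][] data) {
        int dims = data[0].length;
        for (int d = 0; d < dims; d++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] point : data) {
                min = Math.min(min, point[d]);
                max = Math.max(max, point[d]);
            }
            double range = max - min;
            for (double[] point : data) {
                // Guard against division by zero when all values are equal.
                point[d] = range > 0.0 ? (point[d] - min) / range : 0.0;
            }
        }
    }
}
```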
Language: Java. Only use the built-in facilities of this language.
