Question: Implement internal validation methods to determine the number of clusters in a given data set automatically. Perhaps the most significant drawback of k - means

Implement internal validation methods to determine the number of clusters in a given data set automatically. Perhaps the most significant drawback of k-means is that it requires that the user supply the number of clusters (K). In many applications, this is impossible or impractical. In this phase, you will implement two internal validation methods that will help k-means automatically determine the number of clusters. An internal validation method quantifies the match between an automatically generated partition of a data set and the data set itself. Internal validation is often accomplished using an internal validity index, a function that takes a partition, the data set itself, and possibly some additional parameters as input and gives a numerical value indicating the quality of the partition as output. In this phase, you will implement the CalinskiHarabasz (CH) and Silhouette Width (SW) internal validity indices.
CH, SW, and D are maximization indices, whereas DB is a minimization index. The way these indices are used is quite simple. For a given data set, we first decide Kmin and Kmax, the minimum and maximum number of clusters that might be present in the data set, respectively. For example, for the iris data set, Kmin and Kmax could be 2 and 9, respectively. For any data set, Kmin is typically 2, whereas, a rule of thumb on the maximum possible Kmax value is the nearest integer to \sqrt{N/2}, where N is the number points in the data set. Assuming that we have a randomized initialization method, we then run k-means R times for K =2 and compute the internal validity index (say, CH) on the partition generated in the best run (that is, the run that produced the smallest SSE). We do the same for K =3,4,...,8, and, finally, K =9. Since CH is a maximization criterion, we then find the K value that produced the largest CH value. That K value is the estimated number of clusters in the data set. For a numerical example, refer to the example given below and to Example 17.8 in the aforementioned book. Note that, these indices all give estimates; so, you may not find the correct(3) number of clusters for iris (or, for any other real-world data set).
Numerical Example for SW: SW values for the iris data set (only for K =2 and 3 clustersin your experiments, the Kmax value for this data set should be \round{\sqrt{150/2}}=9) are given below. Note that this is just to give you an example of how reasonable SW values should look like; you may not get the exact same numbers depending on your initialization. In addition, these numbers were obtained on the raw (unnormalized) iris data set. In your experiments, you must normalize the data sets using the min-max method (Phase 3). There are 3 real clusters in iris, but two of those clusters overlap, so K =2 gives a much better SW value than K =3. So, if we were to try just K =2 and 3 clusters, based on the SW index, we would select K =2 as the best estimate for the number of clusters in this particular data set. Remember that if you are getting silhouette values outside the [1,1] range, there must be a bug in your program.
./test data/iris_bezdek.txt 2
Iteration 1: SSE =539.413
Iteration 2: SSE =366.075
Iteration 3: SSE =237.899
Iteration 4: SSE =175.767
Iteration 5: SSE =154.339
Iteration 6: SSE =152.513
Iteration 7: SSE =152.348
SW(2)=0.850351
./test data/iris_bezdek.txt 3
Iteration 1: SSE =230.163
Iteration 2: SSE =81.9806
Iteration 3: SSE =79.3943
Iteration 4: SSE =78.9101
Iteration 5: SSE =78.8514
SW(3)=0.73566
The trace of the within-clusters scatter matrix, tr(Sw), is the same as SSE, which is already calculated by k-means. Therefore, you do not need to calculate tr(Sw) separately. Also, for tr(Sb), you do not need to calculate the entire Sb matrix. The trace of a square matrix is given by the sum of its diagonal elements. Hence, all you need to do is to calculate the diagonal elements of Sb and then add them up.
Output: Microsoft Excel or Apache OpenOffice Calc table(s) of each internal validity index for various K values for each data set. For each data set and validity index, show the estimated number of clusters in bold. Discuss which index seems to estimate the number of clusters more accurately. For initialization, use the random partition method (R =100), All data sets must be normalized using the minmax(Phase 3) method.
Language: Java. Only use the built-in facilities of this language

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Finance Questions!