Question: Objective: Implement internal validation methods to determine the number of clusters in a given data set automatically. Perhaps the most significant drawback of k -

Objective: Implement internal validation methods to determine the number of clusters in a given
data set automatically.
Perhaps the most significant drawback of k-means is that it requires that the user supply the
number of clusters (K). In many applications, this is impossible or impractical.
In this phase, you will implement two internal validation methods that will help k-means
automatically determine the number of clusters. An internal validation method quantifies the
match between an automatically generated partition of a data set and the data set itself. Internal
validation is often accomplished using an internal validity index, a function that takes a partition,
the data set itself, and possibly some additional parameters as input and gives a numerical value
indicating the quality of the partition as output.
In this phase, you will implement the CalinskiHarabasz (CH) and Silhouette Width (SW)
internal validity indices. These indices are described in many resources, see for example, the
following recent book by Zaki and Meira Jr.
CH, SW, and D are maximization indices, whereas DB is a minimization index. The way these
indices are used is quite simple. For a given data set, we first decide Kmin and Kmax, the minimum and maximum number of clusters that might be present in the data set, respectively. For example, for the iris data set, Kmin and Kmax could be 2 and 9, respectively. For any data set, Kmin is typically 2, whereas, a rule of thumb on the maximum possible Kmax value is the nearest integer to \sqrt{N/2}, where N is the number points in the data set. Assuming that we have a randomized initialization method, we then run k-means R times for K =2 and compute the internal validity index (say, CH) on the partition generated in the best run (that is, the run that produced the smallest SSE). We do the same for K =3,4,...,8, and, finally, K =9. Since CH is a
maximization criterion, we then find the K value that produced the largest CH value. That K
value is the estimated number of clusters in the data set. For a numerical example, refer to the
example given below and to Example 17.8 in the aforementioned book. Note that, these indices
all give estimates; so, you may not find the correct(3) number of clusters for iris (or, for any
other real-world data set).
Numerical Example for SW: SW values for the iris data set (only for K =2 and 3 clustersin
your experiments, the Kmax value for this data set should be \round{\sqrt{150/2}}=9) are given
below. Note that this is just to give you an example of how reasonable SW values should look
like; you may not get the exact same numbers depending on your initialization. In addition, these
numbers were obtained on the raw (unnormalized) iris data set. In your experiments, you must
normalize the data sets using the min-max method (Phase 3). There are 3 real clusters in iris, but
two of those clusters overlap, so K =2 gives a much better SW value than K =3. So, if we were
to try just K =2 and 3 clusters, based on the SW index, we would select K =2 as the best
estimate for the number of clusters in this particular data set.
./test data/iris_bezdek.txt 2
Iteration 1: SSE =539.413
Iteration 2: SSE =366.075
Iteration 3: SSE =237.899
Iteration 4: SSE =175.767
Iteration 5: SSE =154.339
Iteration 6: SSE =152.513
Iteration 7: SSE =152.348
SW(2)=0.850351
./test data/iris_bezdek.txt 3
Iteration 1: SSE =230.163
Iteration 2: SSE =81.9806
Iteration 3: SSE =79.3943
Iteration 4: SSE =78.9101
Iteration 5: SSE =78.8514
SW(3)=0.73566
The trace of the within-clusters scatter matrix, tr(Sw), is the same as SSE, which is
already calculated by k-means. Therefore, you do not need to calculate tr(Sw) separately. Also,
for tr(Sb), you do not need to calculate the entire Sb matrix. The trace of a square matrix is given
by the sum of its diagonal elements. Hence, all you need to do is to calculate the diagonal
elements of Sb and then add them up.
Output: Microsoft Excel or Apache OpenOffice Calc table(s) of each internal validity index for
various K values for each data set. For each data set and validity index, show the estimated
number of clusters in bold. Discuss which index seems to estimate the number of clusters more
accurately and why. For initialization, 4372/5372 students must use the random partition
(Phase 3) method (R =100), whereas 6397 students must use the maximin(Phase 3) method
(R =100, if the method is randomized; R =1, otherwise). All data sets must be normalized using
the min-max(Phase 3) method.
Language: Java. You may only use the built-in facilities of these languages.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!