Question:
Objective: Implement internal validation methods to determine the number of clusters in a given
data set automatically.
Perhaps the most significant drawback of k-means is that it requires that the user supply the
number of clusters, K. In many applications, this is impossible or impractical.
In this phase, you will implement two internal validation methods that will help k-means
automatically determine the number of clusters. An internal validation method quantifies the
match between an automatically generated partition of a data set and the data set itself. Internal
validation is often accomplished using an internal validity index, a function that takes a partition,
the data set itself, and possibly some additional parameters as input and gives a numerical value
indicating the quality of the partition as output.
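For concreteness, such an index could be modeled in Java roughly as follows; the interface name and method signatures are illustrative assumptions, not something prescribed by the assignment.

// Hypothetical interface for an internal validity index; all names are illustrative only.
// data[i]   : the i-th point (already normalized)
// labels[i] : cluster id of the i-th point, in 0..k-1
// centers[c]: centroid of cluster c
public interface ValidityIndex {
    double score(double[][] data, int[] labels, double[][] centers, int k);

    // true if larger values indicate a better partition (e.g., CH, SW),
    // false for minimization indices such as DB
    boolean isMaximization();
}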
In this phase, you will implement the Calinski-Harabasz (CH) and Silhouette Width (SW)
internal validity indices. These indices are described in many resources; see, for example, the
recent book by Zaki and Meira Jr.
CH, SW, and D are maximization indices, whereas DB is a minimization index. The way these
indices are used is quite simple. For a given data set, we first decide Kmin and Kmax, the minimum
and maximum number of clusters that might be present in the data set, respectively. For example,
for the iris data set, Kmin and Kmax could be 2 and 12, respectively. For any data set, Kmin is
typically 2, whereas a rule of thumb for the maximum possible Kmax value is the nearest integer to
sqrt(N), where N is the number of points in the data set. Assuming that we have a randomized
initialization method, we then run k-means R times for K = Kmin and compute the internal validity
index (say, CH) on the partition generated in the best run, that is, the run that produced the
smallest SSE. We do the same for K = Kmin + 1, and so on, up to K = Kmax. Since CH is a
maximization criterion, we then find the K value that produced the largest CH value. That K
value is the estimated number of clusters in the data set. For a numerical example, refer to the
example given below and to the corresponding example in the aforementioned book. Note that these
indices all give estimates, so you may not find the correct number of clusters for iris or for any
other real-world data set.
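One possible way to organize this selection loop for a maximization index such as CH or SW is sketched below; the KMeans.run call and the KMeansResult accessors are hypothetical placeholders for whatever you built in the earlier phases.

// Sketch of the K-selection loop (assumes hypothetical KMeans and KMeansResult classes).
public static int estimateK(double[][] data, int kMin, int kMax, int r, ValidityIndex index) {
    int bestK = kMin;
    double bestScore = Double.NEGATIVE_INFINITY;   // CH and SW are maximization indices

    for (int k = kMin; k <= kMax; k++) {
        KMeansResult best = null;
        for (int run = 0; run < r; run++) {        // R runs with randomized initialization
            KMeansResult res = KMeans.run(data, k);
            if (best == null || res.sse() < best.sse()) {
                best = res;                        // keep the run with the smallest SSE
            }
        }
        double score = index.score(data, best.labels(), best.centers(), k);
        if (score > bestScore) {                   // the largest index value wins
            bestScore = score;
            bestK = k;
        }
    }
    return bestK;                                  // estimated number of clusters
}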
Numerical Example for SW: SW values for the iris data set, only for K = 2 and K = 3 clusters (in
your experiments, the Kmax value for this data set should be round(sqrt(150)) = 12), are given
below. Note that this is just to give you an example of how reasonable SW values should look;
you may not get the exact same numbers, depending on your initialization. In addition, these
numbers were obtained on the raw (unnormalized) iris data set. In your experiments, you must
normalize the data sets using the min-max method from the earlier phase. There are 3 real
clusters in iris, but two of those clusters overlap, so K = 2 gives a much better SW value than
K = 3. So, if we were to try just K = 2 and K = 3 clusters, based on the SW index, we would
select K = 2 as the best estimate for the number of clusters in this particular data set.
test datairisbezdek.txt
Iteration 1: SSE = ...
Iteration 2: SSE = ...
Iteration 3: SSE = ...
Iteration 4: SSE = ...
Iteration 5: SSE = ...
Iteration 6: SSE = ...
Iteration 7: SSE = ...
SW = ...

test datairisbezdek.txt
Iteration 1: SSE = ...
Iteration 2: SSE = ...
Iteration 3: SSE = ...
Iteration 4: SSE = ...
Iteration 5: SSE = ...
SW = ...
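For reference, here is a minimal, unoptimized sketch of how the SW index might be computed; the euclidean helper and the O(N^2) loop structure are implementation choices of this sketch, not requirements of the assignment.

// Minimal sketch of the Silhouette Width (SW); names are illustrative.
public static double silhouetteWidth(double[][] data, int[] labels, int k) {
    int n = data.length;
    double total = 0.0;

    for (int i = 0; i < n; i++) {
        double[] sumDist = new double[k];     // sum of distances from point i to each cluster
        int[] size = new int[k];              // cluster sizes, excluding point i itself
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            sumDist[labels[j]] += euclidean(data[i], data[j]);
            size[labels[j]]++;
        }

        int own = labels[i];
        if (size[own] == 0) continue;         // singleton cluster: s(i) is taken as 0 here

        double a = sumDist[own] / size[own];  // mean distance to points in i's own cluster
        double b = Double.POSITIVE_INFINITY;  // smallest mean distance to any other cluster
        for (int c = 0; c < k; c++) {
            if (c != own && size[c] > 0) {
                b = Math.min(b, sumDist[c] / size[c]);
            }
        }
        total += (b - a) / Math.max(a, b);    // silhouette s(i) of point i
    }
    return total / n;                         // SW is the average silhouette over all points
}

// Hypothetical Euclidean distance helper.
private static double euclidean(double[] x, double[] y) {
    double s = 0.0;
    for (int d = 0; d < x.length; d++) {
        double diff = x[d] - y[d];
        s += diff * diff;
    }
    return Math.sqrt(s);
}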
The trace of the within-cluster scatter matrix, tr(S_W), is the same as SSE, which is
already calculated by k-means. Therefore, you do not need to calculate tr(S_W) separately. Also,
for tr(S_B), you do not need to calculate the entire S_B matrix. The trace of a square matrix is
given by the sum of its diagonal elements. Hence, all you need to do is calculate the diagonal
elements of S_B and then add them up.
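Under that observation, the CH index reduces to a few lines. The sketch below assumes the SSE of the best run is passed in as tr(S_W); all names are illustrative, not prescribed by the assignment.

// Sketch of the Calinski-Harabasz index, assuming tr(S_W) equals the SSE already
// produced by k-means (as noted above).
public static double calinskiHarabasz(double[][] data, int[] labels, double[][] centers,
                                      int k, double sse) {
    int n = data.length;
    int dims = data[0].length;

    // Overall mean of the data set.
    double[] mean = new double[dims];
    for (double[] point : data) {
        for (int d = 0; d < dims; d++) mean[d] += point[d];
    }
    for (int d = 0; d < dims; d++) mean[d] /= n;

    // Cluster sizes.
    int[] size = new int[k];
    for (int label : labels) size[label]++;

    // tr(S_B) = sum over clusters of n_c * ||center_c - mean||^2;
    // only the diagonal of S_B is needed, i.e. squared differences per dimension.
    double traceSb = 0.0;
    for (int c = 0; c < k; c++) {
        for (int d = 0; d < dims; d++) {
            double diff = centers[c][d] - mean[d];
            traceSb += size[c] * diff * diff;
        }
    }

    // CH = [tr(S_B) / (K - 1)] / [tr(S_W) / (N - K)], with tr(S_W) = SSE.
    return (traceSb / (k - 1)) / (sse / (n - k));
}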
Output: Microsoft Excel or Apache OpenOffice Calc tables of each internal validity index for
various K values for each data set. For each data set and validity index, show the estimated
number of clusters in bold. Discuss which index seems to estimate the number of clusters more
accurately and why. For initialization, some students must use the random partition method
(from the earlier phase), whereas the others must use the maximin method (from the earlier phase),
with R runs per K value if the initialization method is randomized (a single run suffices for a
deterministic method). All data sets must be normalized using the min-max method from the
earlier phase.
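As a reminder, min-max normalization simply rescales every attribute to [0, 1]; a minimal sketch is given below, although the actual implementation belongs to the earlier phase.

// Minimal sketch of min-max normalization: rescales every attribute to [0, 1] in place.
public static void minMaxNormalize(double[][] data) {
    int dims = data[0].length;
    for (int d = 0; d < dims; d++) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double[] point : data) {
            min = Math.min(min, point[d]);
            max = Math.max(max, point[d]);
        }
        double range = max - min;
        for (double[] point : data) {
            // Constant attributes are mapped to 0 to avoid division by zero.
            point[d] = (range == 0.0) ? 0.0 : (point[d] - min) / range;
        }
    }
}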
Language: Java. You may only use the built-in facilities of the language.