Question:
Objective: Implement external validation methods to compare automatically generated partitions
with external ones.
In many applications, we have multiple partitions generated either by different clustering
algorithms (e.g., k-means and another partitional clustering algorithm) or by the same clustering
algorithm using different parameters (e.g., different initial centers), and we would like to
know which of these partitions fits the data better. If we know the true cluster label of each
point, we can use an external validity method to quantify the degree of similarity between an
automatically generated partition (e.g., a partition produced by k-means) and the external
partition implied by the true cluster labels.
In this phase, you will implement external validation methods to compare automatically
generated partitions with external partitions. External validation is often accomplished using an
external validity index, a function that takes two partitions and possibly some additional
parameters as input and gives a numerical value indicating the degree of match between the two partitions as output.
In this phase, you will implement the Rand and Jaccard external validity indices. These indices
are described in many resources; see, for example, the book by Zaki and Meira Jr.
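Both indices can be computed by classifying every unordered pair of points. Writing TP for pairs grouped together in both partitions, FN for pairs together only in the true partition, FP for pairs together only in the generated partition, and TN for pairs separated in both, the usual pair-counting definitions are Rand = (TP + TN) / (TP + FN + FP + TN) and Jaccard = TP / (TP + FN + FP). The following is a minimal sketch under those definitions, not a reference solution. The class and method names are hypothetical, and the value returned when a denominator is zero (1.0) is an assumed convention, not something the assignment specifies.

```java
// Hypothetical helper class; names are illustrative, not taken from the assignment.
final class ExternalIndices {

    /** Returns pair counts {TP, FN, FP, TN} over all unordered pairs of points. */
    static long[] pairCounts(int[] truth, int[] predicted) {
        if (truth.length != predicted.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
        long tp = 0, fn = 0, fp = 0, tn = 0;
        for (int i = 0; i < truth.length; i++) {
            for (int j = i + 1; j < truth.length; j++) {
                boolean sameTruth = truth[i] == truth[j];
                boolean samePred = predicted[i] == predicted[j];
                if (sameTruth && samePred) {
                    tp++;          // together in both partitions
                } else if (sameTruth) {
                    fn++;          // together only in the true partition
                } else if (samePred) {
                    fp++;          // together only in the generated partition
                } else {
                    tn++;          // separated in both partitions
                }
            }
        }
        return new long[] {tp, fn, fp, tn};
    }

    /** Rand index: fraction of pairs on which the two partitions agree. */
    static double rand(int[] truth, int[] predicted) {
        long[] c = pairCounts(truth, predicted);
        long total = c[0] + c[1] + c[2] + c[3];
        // Assumed convention: with fewer than two points, report perfect agreement.
        return total == 0 ? 1.0 : (double) (c[0] + c[3]) / total;
    }

    /** Jaccard coefficient: like Rand, but ignores pairs separated in both partitions. */
    static double jaccard(int[] truth, int[] predicted) {
        long[] c = pairCounts(truth, predicted);
        long denom = c[0] + c[1] + c[2];
        // Assumed convention: if no pair is grouped together in either partition, report 1.0.
        return denom == 0 ? 1.0 : (double) c[0] / denom;
    }
}
```

Because only co-membership of pairs is compared, the indices do not depend on how the clusters are numbered. The double loop is quadratic in the number of points; a contingency-table formulation gives the same counts faster, but the direct version is easier to check by hand.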
In this phase, you will be supplied with a similar collection of data sets, but the file format will be slightly different: the very first line will contain three
integers (# points, # attributes, # true clusters) and, at the end of each subsequent line, the
true cluster label of the point, between and # true clusters inclusive, will be given. For
example, the first four lines of irisbezdek will be
meaning that the data set contains points, each point is dimensional, and there are
true clusters, that is, each point belongs to cluster or. In the example given above, all
three points belong to cluster
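The file layout described above (a header line with three integers, then one line per point with the attribute values followed by the true cluster label) could be parsed along the following lines. This is a sketch under stated assumptions: fields are assumed to be separated by whitespace, labels are assumed to be written as plain integers, and the class name LabeledDataSet and its fields are hypothetical. The valid label range is not checked because it is not recoverable from the text above.

```java
import java.util.List;

// Hypothetical container for a labeled data set; names are illustrative only.
final class LabeledDataSet {
    final double[][] points;   // points[i][j] = attribute j of point i
    final int[] labels;        // labels[i] = true cluster label of point i
    final int numClusters;     // # true clusters from the header line

    LabeledDataSet(double[][] points, int[] labels, int numClusters) {
        this.points = points;
        this.labels = labels;
        this.numClusters = numClusters;
    }

    /** Parses the lines of a data set file (assumed whitespace-separated). */
    static LabeledDataSet parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("missing header line");
        }
        String[] header = lines.get(0).trim().split("\\s+");
        if (header.length != 3) {
            throw new IllegalArgumentException("header must contain three integers");
        }
        int n = Integer.parseInt(header[0]);   // # points
        int d = Integer.parseInt(header[1]);   // # attributes
        int k = Integer.parseInt(header[2]);   // # true clusters
        if (lines.size() < n + 1) {
            throw new IllegalArgumentException("expected " + n + " data lines");
        }
        double[][] points = new double[n][d];
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            String[] tokens = lines.get(i + 1).trim().split("\\s+");
            if (tokens.length != d + 1) {
                throw new IllegalArgumentException("line " + (i + 2) + ": expected " + (d + 1) + " fields");
            }
            for (int j = 0; j < d; j++) {
                points[i][j] = Double.parseDouble(tokens[j]);
            }
            labels[i] = Integer.parseInt(tokens[d]);   // label is the last field on the line
        }
        return new LabeledDataSet(points, labels, k);
    }
}
```

With the standard library, the lines of a file can be obtained with java.nio.file.Files.readAllLines and passed to parse; keeping the parser independent of file I/O makes it easy to test on a few hand-written lines.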
For the aforementioned external indices, higher values are better. The way these indices are used is quite simple. For a given data set, first run k-means R times, each time with the same K
value (equal to the # true clusters specified in the data set file) but with a different set of
randomly selected centers, as in an earlier phase. Each run will give us a partition of the data set, and
we will compute the external validity index (say, the Rand index) between that partition and the true
partition (the one implied by the true cluster labels). Finally, the run that gives the highest Rand
index value is declared the best run, and the partition produced by this run is taken as the best
partition of the data set.
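The selection procedure described above can be sketched as a loop that scores each run against the true labels and keeps the highest-scoring one. The sketch below does not implement k-means itself: it assumes a callback (here called clusterer, a hypothetical name) that performs run r with its own random initial centers and returns one cluster label per point. The class name BestRunSelector is also hypothetical, and breaking ties in favor of the earliest run is an assumption.

```java
import java.util.function.IntFunction;

// Hypothetical result holder and selection routine; names are illustrative only.
final class BestRunSelector {
    final int bestRun;        // index of the run with the highest Rand index
    final double bestIndex;   // Rand index achieved by that run
    final int[] bestLabels;   // partition produced by that run

    BestRunSelector(int bestRun, double bestIndex, int[] bestLabels) {
        this.bestRun = bestRun;
        this.bestIndex = bestIndex;
        this.bestLabels = bestLabels;
    }

    /** Rand index: fraction of point pairs on which the two partitions agree. */
    static double randIndex(int[] truth, int[] predicted) {
        long agree = 0, total = 0;
        for (int i = 0; i < truth.length; i++) {
            for (int j = i + 1; j < truth.length; j++) {
                boolean sameTruth = truth[i] == truth[j];
                boolean samePred = predicted[i] == predicted[j];
                if (sameTruth == samePred) {
                    agree++;
                }
                total++;
            }
        }
        return total == 0 ? 1.0 : (double) agree / total;   // assumed convention for n < 2
    }

    /**
     * Calls clusterer.apply(r) for r = 0..runs-1 and keeps the partition
     * with the highest Rand index against the true labels.
     */
    static BestRunSelector select(int[] truth, int runs, IntFunction<int[]> clusterer) {
        int bestRun = -1;
        double bestIndex = Double.NEGATIVE_INFINITY;
        int[] bestLabels = null;
        for (int r = 0; r < runs; r++) {
            int[] labels = clusterer.apply(r);
            double value = randIndex(truth, labels);
            if (value > bestIndex) {   // strict comparison: ties keep the earliest run
                bestRun = r;
                bestIndex = value;
                bestLabels = labels;
            }
        }
        return new BestRunSelector(bestRun, bestIndex, bestLabels);
    }
}
```

The same loop works for the Jaccard index by swapping the scoring function. Passing the clustering step as a callback keeps the selection logic separate from the k-means implementation, so either part can be tested on its own.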
Output: Tabulate the best value of each external validity index for each data set. For each data
set, k-means should be run R times, each time with a different set of randomly selected
centers. All data sets must be normalized using the min-max method.
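Min-max normalization rescales each attribute to the range [0, 1] using x' = (x - min) / (max - min), where min and max are taken over all points for that attribute. A minimal sketch follows; the class name is hypothetical, and mapping a constant attribute (max equal to min) to 0 is an assumed convention, since the formula is undefined in that case.

```java
import java.util.Arrays;

// Hypothetical helper class; the name is illustrative only.
final class MinMaxNormalizer {

    /** Returns a new array in which every attribute (column) is rescaled to [0, 1]. */
    static double[][] normalize(double[][] data) {
        int n = data.length;
        if (n == 0) {
            return new double[0][];
        }
        int d = data[0].length;
        double[] min = new double[d];
        double[] max = new double[d];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        // First pass: per-attribute minimum and maximum.
        for (double[] row : data) {
            for (int j = 0; j < d; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
        }

        // Second pass: rescale each value; the input array is left unchanged.
        double[][] out = new double[n][d];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < d; j++) {
                double range = max[j] - min[j];
                // Assumed convention: a constant attribute is mapped to 0.
                out[i][j] = range == 0.0 ? 0.0 : (data[i][j] - min[j]) / range;
            }
        }
        return out;
    }
}
```

Normalization applies only to the attribute columns, not to the true cluster labels, so the labels should be split off before this step.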
Language: Java. You may only use the built-in facilities of the language. In
other words, you may not use third-party libraries.
