Question: Implement internal validation methods to determine the number of clusters in a given data set automatically. Perhaps the most significant drawback of k - means

Implement internal validation methods to determine the number of clusters in a given data set automatically. Perhaps the most significant drawback of k

-

means is that it requires that the user supply the number of clusters

(

) .

In many applications, this is impossible or impractical. In this phase, you will implement two internal validation methods that will help k

-

means automatically determine the number of clusters. An internal validation method quantifies the match between an automatically generated partition of a data set and the data set itself. Internal validation is often accomplished using an internal validity index, a function that takes a partition, the data set itself, and possibly some additional parameters as input and gives a numerical value indicating the quality of the partition as output. In this phase, you will implement the Calinski

Harabasz

(

)

and Silhouette Width

(

)

internal validity indices.

,

,

and D are maximization indices, whereas DB is a minimization index. The way these indices are used is quite simple. For a given data set, we first decide Kmin and Kmax, the minimum and maximum number of clusters that might be present in the data set, respectively. For example, for the iris data set, Kmin and Kmax could be

2

and

9,

respectively. For any data set, Kmin is typically

2,

whereas, a rule of thumb on the maximum possible Kmax value is the nearest integer to

\

sqrt

{

/ 2},

where N is the number points in the data set. Assuming that we have a randomized initialization method, we then run k

-

means R times for K

= 2

and compute the internal validity index

(

say

,

)

on the partition generated in the best run

(

that is

,

the run that produced the smallest SSE

) .

We do the same for K

= 3, 4, . . ., 8,

and, finally, K

= 9 .

Since CH is a maximization criterion, we then find the K value that produced the largest CH value. That K value is the

estimated

number of clusters in the data set. For a numerical example, refer to the example given below and to Example

17.8

in the aforementioned book. Note that, these indices all give estimates; so

,

you may not find the

correct

(3)

number of clusters for iris

(

,

for any other real

-

world data set

) .

Numerical Example for SW: SW values for the iris data set

(

only for K

= 2

and

3

clusters

in your experiments, the Kmax value for this data set should be

\

round

{\

sqrt

{150 / 2}} = 9)

are given below. Note that this is just to give you an example of how reasonable SW values should look like; you may not get the exact same numbers depending on your initialization. In addition, these numbers were obtained on the raw

(

unnormalized

)

iris data set. In your experiments, you must normalize the data sets using the min

-

max method

(

Phase

3) .

There are

3

real clusters in iris, but two of those clusters overlap, so K

= 2

gives a much better SW value than K

= 3 .

,

if we were to try just K

= 2

and

3

clusters, based on the SW index, we would select K

= 2

as the best estimate for the number of clusters in this particular data set. Remember that if you are getting silhouette values outside the

[1, 1]

range, there must be a bug in your program.

. /

test data

/

iris

_

bezdek.txt

2

Iteration

1

: SSE

= 539.413

Iteration

2

: SSE

= 366.075

Iteration

3

: SSE

= 237.899

Iteration

4

: SSE

= 175.767

Iteration

5

: SSE

= 154.339

Iteration

6

: SSE

= 152.513

Iteration

7

: SSE

= 152.348

(2) = 0.850351

. /

test data

/

iris

_

bezdek.txt

3

Iteration

1

: SSE

= 230.163

Iteration

2

: SSE

= 81.9806

Iteration

3

: SSE

= 79.3943

Iteration

4

: SSE

= 78.9101

Iteration

5

: SSE

= 78.8514

(3) = 0.73566

The trace of the within

-

clusters scatter matrix, tr

(

),

is the same as SSE, which is already calculated by k

-

means. Therefore, you do not need to calculate tr

(

)

separately. Also, for tr

(

),

you do not need to calculate the entire Sb matrix. The trace of a square matrix is given by the sum of its diagonal elements. Hence, all you need to do is to calculate the diagonal elements of Sb and then add them up

.

Output: Microsoft Excel or Apache OpenOffice Calc table

(

)

of each internal validity index for various K values for each data set. For each data set and validity index, show the estimated number of clusters in bold. Discuss which index seems to estimate the number of clusters more accurately. For initialization, use the

random partition

method

(

= 100),

All data sets must be normalized using the

minmax

(

Phase

3)

method.

Language: Java. Only use the built

-

in facilities of this language

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Finance Questions!

Read the above passage and then answer short questions Summarize and elaborate the research method of this article in concise language Application Research Based on Machine Learning in Network...

Two Discussion Questions. A minimum of 200 words per question. Both questions are attached to the word document and each chapter to both questions are attached to the PDF. Any references/citations...

Provide a summary technical report with your own words about Pipelined Execution which is also named as Instruction Level Parallelism, addressing mainly the following areas: 1. What is Pipelined...

Factors encouraging cycle commuting in Scotland Big Data Fundamentals Coursework Daniel Devine CS982 Big Data Technologies Computer and Information Sciences University of Strathclyde, Glasgow 5th...

TANGLEWOOD CASEBOOK for use with STAFFING ORGANIZATIONS 7th Ed. Kammeyer-Mueller 1 TANGLEWOOD CASEBOOK To accompany Staffing Organizations, seventh edition, 2012. Prepared by John Kammeyer-Mueller...

TANGLEWOOD CASEBOOK for use with STAFFING ORGANIZATIONS 5th Ed. Kammeyer-Mueller 1 TANGLEWOOD CASEBOOK To accompany Staffing Organizations, fifth edition, 2006. Prepared by John Kammeyer-Mueller...

Read image s 1. Police have high hopes for technology. What are some of the common technologies that police use, and what are the outcomes that police seek from technology? 2. Lum and Koper argue...

I have attached both the chapters from which you will be answering the following question. Ch 8 (p 387 of text or p 35 of the pdf) ?Exercise D. Prepare the bank reconciliation and any required...

It appears that because of COVID limitations, data was collected virtually through phone, email, and surveys with a population that may have barriers to those methods of data collection. How do you...

Tax expenditures, unlike annual spending decisions in State or local budgets, have long been criticized as not being transparent and not being subject to periodic review. In the last two decades,...

On 1 January 2019, Salt Limited had 1,200,000 ordinary shares outstanding. It also had 100,000 shares of 9%, $10 par, non-cumulative preference shares outstanding for the whole year of 2019. Salt...

Write an Exit Strategy [ Ways an investor and the entrepreneur may be able to harvest their business investment. ]

Garland Company uses a standard cost system. The standard for each finished unit of product allows for 3 . 1 pounds of plastic at $ 0 . 7 4 per pound. During December, Garland bought 4 6 0 0 pounds...