Objective Implement internal validation methods to determine the number of clusters in a given data set automatically Perhaps the most significant drawback of k means is that it requires that the user supply the number of clusters ( K ) In many applications, this is impossible or impractical In this phase, you will implement two internal validation methods that will help k means automatically determine the number of clusters An internal validation method quantifies the match between an automatically generated partition of a data set and the data set itself Internal validation is often accomplished using an internal validity index, a function that takes a partition, the data set itself, and possibly some additional parameters as input and gives a numerical value indicating the quality of the partition as output In this phase, you will implement the Calinski Harabasz ( CH ) and Silhouette Width ( SW ) internal validity indices These indices are described in many resources, see for example, the following recent book by Zaki and Meira Jr CH , SW , and D are maximization indices, whereas DB is a minimization index The way these indices are used is quite simple For a given data set, we first decide Kmin and Kmax, the minimum and maximum number of clusters that might be present in the data set, respectively For example, for the iris data set, Kmin and Kmax could be 2 and 9 , respectively For any data set, Kmin is typically 2 , whereas, a rule of thumb on the maximum possible Kmax value is the nearest integer to sqrt N 2 , where N is the number points in the data set Assuming that we have a randomized initialization method, we then run k means R times for K 2 and compute the internal validity index ( say , CH ) on the partition generated in the best run ( that is , the run that produced the smallest SSE ) We do the same for K 3 , 4 , , 8 , and, finally, K 9 Since CH is a maximization criterion, we then find the K value that produced the largest CH value That K value is the estimated number of clusters in the data set For a numerical example, refer to the example given below and to Example 1 7 8 in the aforementioned book Note that, these indices all give estimates so , you may not find the correct ( 3 ) number of clusters for iris ( or , for any other real world data set ) Numerical Example for SW SW values for the iris data set ( only for K 2 and 3 clusters in your experiments, the Kmax value for this data set should be round sqrt 1 5 0 2 9 ) are given below Note that this is just to give you an example of how reasonable SW values should look like you may not get the exact same numbers depending on your initialization In addition, these numbers were obtained on the raw ( unnormalized ) iris data set In your experiments, you must normalize the data sets using the min max method ( Phase 3 ) There are 3 real clusters in iris, but two of those clusters overlap, so K 2 gives a much better SW value than K 3 So , if we were to try just K 2 and 3 clusters, based on the SW index, we would select K 2 as the best estimate for the number of clusters in this particular data set test data iris bezdek txt 2 Iteration 1 SSE 5 3 9 4 1 3 Iteration 2 SSE 3 6 6 0 7 5 Iteration 3 SSE 2 3 7 8 9 9 Iteration 4 SSE 1 7 5 7 6 7 Iteration 5 SSE 1 5 4 3 3 9 Iteration 6 SSE 1 5 2 5 1 3 Iteration 7 SSE 1 5 2 3 4 8 SW ( 2 ) 0 8 5 0 3 5 1 test data iris bezdek txt 3 Iteration 1 SSE 2 3 0 1 6 3 Iteration 2 SSE 8 1 9 8 0 6 Iteration 3 SSE 7 9 3 9 4 3 Iteration 4 SSE 7 8 9 1 0 1 Iteration 5 SSE 7 8 8 5 1 4 SW ( 3 ) 0 7 3 5 6 6 The trace of the within clusters scatter matrix, tr ( Sw ) , is the same as SSE, which is already calculated by k means Therefore, you do not need to calculate tr ( Sw ) separately Also, for tr ( Sb ) , you do not need to calculate the entire Sb matrix The trace of a square matrix is given by the sum of its diagonal elements Hence, all you need to do is to calculate the diagonal elements of Sb and then add them up Output Microsoft Excel or Apache OpenOffice Calc table ( s ) of each internal validity index for various K values for each data set For each data set and validity index, show the estimated number of clusters in bold Discuss which index seems to estimate the number of clusters more accurately and why For initialization, 4 3 7 2 5 3 7 2 students must use the random partition ( Phase 3 ) method ( R 1 0 0 ) , whereas 6 3 9 7 students must use the maximin ( Phase 3 ) method ( R 1 0 0 , if the method is randomized R 1 , otherwise ) All data sets must be normalized using the min max ( Phase 3 ) method Language Java You may only use the built in facilities of these languages

The Answer is in the image, click to view ...

Question: Objective: Implement internal validation methods to determine the number of clusters in a given data set automatically. Perhaps the most significant drawback of k -

Objective: Implement internal validation methods to determine the number of clusters in a given

data set automatically.

Perhaps the most significant drawback of k

-

means is that it requires that the user supply the

number of clusters

(

) .

In many applications, this is impossible or impractical.

In this phase, you will implement two internal validation methods that will help k

-

means

automatically determine the number of clusters. An internal validation method quantifies the

match between an automatically generated partition of a data set and the data set itself. Internal

validation is often accomplished using an internal validity index, a function that takes a partition,

the data set itself, and possibly some additional parameters as input and gives a numerical value

indicating the quality of the partition as output.

In this phase, you will implement the Calinski

Harabasz

(

)

and Silhouette Width

(

)

internal validity indices. These indices are described in many resources, see for example, the

following recent book by Zaki and Meira Jr

.

,

,

and D are maximization indices, whereas DB is a minimization index. The way these

indices are used is quite simple. For a given data set, we first decide Kmin and Kmax, the minimum and maximum number of clusters that might be present in the data set, respectively. For example, for the iris data set, Kmin and Kmax could be

2

and

9,

respectively. For any data set, Kmin is typically

2,

whereas, a rule of thumb on the maximum possible Kmax value is the nearest integer to

\

sqrt

{

/ 2},

where N is the number points in the data set. Assuming that we have a randomized initialization method, we then run k

-

means R times for K

= 2

and compute the internal validity index

(

say

,

)

on the partition generated in the best run

(

that is

,

the run that produced the smallest SSE

) .

We do the same for K

= 3, 4, . . ., 8,

and, finally, K

= 9 .

Since CH is a

maximization criterion, we then find the K value that produced the largest CH value. That K

value is the

estimated

number of clusters in the data set. For a numerical example, refer to the

example given below and to Example

17.8

in the aforementioned book. Note that, these indices

all give estimates; so

,

you may not find the

correct

(3)

number of clusters for iris

(

,

for any

other real

-

world data set

) .

Numerical Example for SW: SW values for the iris data set

(

only for K

= 2

and

3

clusters

your experiments, the Kmax value for this data set should be

\

round

{\

sqrt

{150 / 2}} = 9)

are given

below. Note that this is just to give you an example of how reasonable SW values should look

like; you may not get the exact same numbers depending on your initialization. In addition, these

numbers were obtained on the raw

(

unnormalized

)

iris data set. In your experiments, you must

normalize the data sets using the min

-

max method

(

Phase

3) .

There are

3

real clusters in iris, but

two of those clusters overlap, so K

= 2

gives a much better SW value than K

= 3 .

,

if we were

to try just K

= 2

and

3

clusters, based on the SW index, we would select K

= 2

as the best

estimate for the number of clusters in this particular data set.

. /

test data

/

iris

_

bezdek.txt

2

Iteration

1

: SSE

= 539.413

Iteration

2

: SSE

= 366.075

Iteration

3

: SSE

= 237.899

Iteration

4

: SSE

= 175.767

Iteration

5

: SSE

= 154.339

Iteration

6

: SSE

= 152.513

Iteration

7

: SSE

= 152.348

(2) = 0.850351

. /

test data

/

iris

_

bezdek.txt

3

Iteration

1

: SSE

= 230.163

Iteration

2

: SSE

= 81.9806

Iteration

3

: SSE

= 79.3943

Iteration

4

: SSE

= 78.9101

Iteration

5

: SSE

= 78.8514

(3) = 0.73566

The trace of the within

-

clusters scatter matrix, tr

(

),

is the same as SSE, which is

already calculated by k

-

means. Therefore, you do not need to calculate tr

(

)

separately. Also,

for tr

(

),

you do not need to calculate the entire Sb matrix. The trace of a square matrix is given

by the sum of its diagonal elements. Hence, all you need to do is to calculate the diagonal

elements of Sb and then add them up

.

Output: Microsoft Excel or Apache OpenOffice Calc table

(

)

of each internal validity index for

various K values for each data set. For each data set and validity index, show the estimated

number of clusters in bold. Discuss which index seems to estimate the number of clusters more

accurately and why. For initialization,

4372 / 5372

students must use the

random partition

(

Phase

3)

method

(

= 100),

whereas

6397

students must use the

maximin

(

Phase

3)

method

(

= 100,

if the method is randomized; R

= 1,

otherwise

) .

All data sets must be normalized using

the

min

-

max

(

Phase

3)

method.

Language: Java. You may only use the built

-

in facilities of these languages.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!

Read the above passage and then answer short questions Summarize and elaborate the research method of this article in concise language Application Research Based on Machine Learning in Network...

Provide a summary technical report with your own words about Pipelined Execution which is also named as Instruction Level Parallelism, addressing mainly the following areas: 1. What is Pipelined...

Factors encouraging cycle commuting in Scotland Big Data Fundamentals Coursework Daniel Devine CS982 Big Data Technologies Computer and Information Sciences University of Strathclyde, Glasgow 5th...

I have attached both the chapters from which you will be answering the following question. Ch 8 (p 387 of text or p 35 of the pdf) ?Exercise D. Prepare the bank reconciliation and any required...

Two Discussion Questions. A minimum of 200 words per question. Both questions are attached to the word document and each chapter to both questions are attached to the PDF. Any references/citations...

Budgeting for Nonprofit Organizations Although budgeting is just as important for nonprofit organizations as for for-profit companies, the approach taken toward budgeting can be very different. In...

Your manager has provided you the following matrix and asked that you use it to complete the design assessment phase of your testing. The manager has already performed a preliminary survey, and has...

Hi Sara Hope you are doing well. I need to prepare Ratio Analysis for eBay company and Amazon. I uploaded the most recent 10-K report for the most recent years, 2014 to 2016. Please let me know if...

In EMPLOYEE table, ESalary is monthly salary. ESSN EName ESalary DNO E123 Rauf E223 Ahmed 12000 E323 Rizwan 20000 E423 Rauf E523 Umer DEPTNO DNAME DLOC GRADE LOSAL HISAL 5000 10001 10000 10 10 Sales...

nursing is safe. Develop a professional and compassionate response to the parents to address their misconception.

What basic belief was the start for Chipotle? Multiple Choice Service makes sense. Real is better. Fun can be a part of it . Spices bring the life. Speed makes the sale.

If JK 9w 67 and GI 2w 12 what is the value of w G H I 9w 67 and GI = 2w 12, what is the value of w? If JK - 1