Question:

1. Introduction

You are given two real datasets which contain individual information in the US. In the first real dataset, each individual is associated with 13 attributes and 1 additional Boolean attribute called income indicating whether the individual had an income > 50K (per year) or not. The second real dataset is the same as the first dataset, except that it contains only the first 13 attributes and no attribute income. The objective of this project is to predict whether each individual in the second dataset has an income > 50K or not.

There are three phases in this project: Phase 1, Phase 2 and Phase 3. In Phase 1, you are required to generate an Excel file from two raw files together with attribute names. In Phase 2, you are required to write a design report for this project. In Phase 3, you are required to follow the design report in Phase 2, generate the predicted attribute files for the second real dataset and write a final report.

2. Milestones

1. Phase 1

i. You are given two real datasets in TEXT format, training.txt and test.txt.

ii. File training.txt contains 13 attributes and 1 additional Boolean attribute.

iii. File test.txt contains 13 attributes only.

iv. Open these two TEXT files with MS Excel. Save them in one MS Excel file where the content of training.txt is included in Sheet1 and the content of test.txt is included in Sheet2. Please rename Sheet1 to "training" and Sheet2 to "test" in MS Excel.

v. Insert a row at the beginning of each of the two sheets of the Excel file, where this row gives the attribute names specified in Section 3.
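The header-row step in Phase 1 can also be checked outside Excel. The following is a minimal sketch, not part of the assignment: it assumes the raw files are comma-delimited (the delimiter is not stated in this text), the function name `with_header` is hypothetical, and the attribute names are placeholders because the real names come from Section 3.

```python
import csv
import io


def with_header(raw_text, attribute_names, delimiter=","):
    """Return the parsed rows of raw_text with the attribute names prepended.

    Assumption: the raw file is delimiter-separated text, one individual per line.
    """
    reader = csv.reader(io.StringIO(raw_text), delimiter=delimiter, skipinitialspace=True)
    rows = [row for row in reader if row]  # skip blank lines
    for number, row in enumerate(rows, start=1):
        if len(row) != len(attribute_names):
            raise ValueError(
                f"row {number} has {len(row)} values, expected {len(attribute_names)}"
            )
    return [list(attribute_names)] + rows


# Placeholder names only; the real ones are listed in Section 3.
example_names = ["attr1", "attr2", "income"]
example_rows = with_header("39, 13, 0\n50, 9, 1\n", example_names)
```

The length check catches rows that were split incorrectly on import, which is the most common way a header row ends up misaligned with the data. The resulting rows could then be written to sheets named training and test with a spreadsheet library, or simply compared against what Excel shows.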

2. Phase 2

i. In Phase 2, you are required to write a design report for this project.

ii. You should list 5 possible data mining models you want to try.

iii. Note that a possible data mining model can be Decision Tree Classifier with a set of parameters, and another possible data mining model can be Decision Tree Classifier with another set of parameters. Obviously, one of the possible data mining models can be Nearest Neighbor Classifier with a set of parameters.
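To illustrate why the same classifier with two parameter settings counts as two models, here is a small sketch of a nearest neighbor classifier written from scratch. It is an illustration only: the project itself uses XLMiner, the function `knn_predict` is hypothetical, and the toy data is invented.

```python
from collections import Counter


def knn_predict(train_rows, train_labels, query, k):
    """Predict a 0/1 label by majority vote among the k nearest training rows."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row, query))

    nearest = sorted(range(len(train_rows)), key=lambda i: squared_distance(train_rows[i]))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Two candidate "models" in the sense of Phase 2: same classifier, different k.
candidate_models = [
    {"name": "Nearest Neighbor Classifier", "k": 1},
    {"name": "Nearest Neighbor Classifier", "k": 3},
]

# Toy data, invented for illustration: one numeric attribute, label 1 means income > 50K.
train_rows = [[1.0], [2.0], [3.0], [10.0]]
train_labels = [0, 1, 1, 0]

predictions = [knn_predict(train_rows, train_labels, [1.2], m["k"]) for m in candidate_models]
```

On this toy data the two settings disagree about the same individual: with k = 1 the single closest row has label 0, while with k = 3 two of the three closest rows have label 1. That difference in output is what makes each parameter setting worth listing separately in the design report.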

3. Phase 3

i. In Phase 3, you are required to follow the design report in Phase 2, use the XLMiner software to predict attribute income for the second real dataset and write a final report.

ii. The final report should include the following.

a) All materials in your design report written in Phase 2 (i.e., the 5 possible data mining models)

b) Description of the XLMiner results for each of the 5 possible data mining models

c) Two examples illustrating what attributes determine whether an individual has an income > 50K or not, for each of the 5 possible data mining models

d) Conclusions drawn from each of the 5 possible data mining models and an overall conclusion

iii. In addition to the final report, you are required to generate 5 predicted attribute files for the second real dataset in TEXT file format. Note that each predicted attribute file corresponds to the output of a possible data mining model you proposed in Phase 2. The file format is described in Section 4.

4. File format of Predicted Attribute File

In Phase 3, you are required to submit 5 predicted attribute files for the second dataset. The files should be named predicted1.txt, predicted2.txt, predicted3.txt, predicted4.txt and predicted5.txt, where predicted1.txt corresponds to the output of the first data mining model proposed in Phase 2 and the other files have a similar meaning. The file format of each file is shown as follows.

1st row: 1 or 0, where 1 means that the first individual in the second dataset has an income > 50K and 0 means that s/he does not

2nd row: 1 or 0, where 1 means that the second individual in the second dataset has an income > 50K and 0 means that s/he does not

. . .

Here is a sample file.

1
0
1
0
0
0
0
0
. . .
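The file format above (one 0 or 1 per row, in the order of the individuals in the second dataset) can be produced with a few lines of code. This is a minimal sketch under stated assumptions: the helper names `format_predictions` and `write_prediction_files` are hypothetical, and the predictions are assumed to be already available as lists of 0/1 values exported from the modelling tool.

```python
from pathlib import Path


def format_predictions(predictions):
    """Render predictions as text with one 0 or 1 per row."""
    lines = []
    for row_number, value in enumerate(predictions, start=1):
        if value not in (0, 1):
            raise ValueError(f"row {row_number}: expected 0 or 1, got {value!r}")
        lines.append(str(int(value)))
    return "\n".join(lines) + "\n"


def write_prediction_files(predictions_per_model, directory="."):
    """Write predicted1.txt to predicted5.txt, one file per model, in Phase 2 order."""
    if len(predictions_per_model) != 5:
        raise ValueError("exactly 5 prediction lists are required")
    paths = []
    for index, predictions in enumerate(predictions_per_model, start=1):
        path = Path(directory) / f"predicted{index}.txt"
        path.write_text(format_predictions(predictions))
        paths.append(path)
    return paths


sample_text = format_predictions([1, 0, 1, 0, 0, 0, 0, 0])
```

Validating each value before writing guards against stray labels such as "TRUE" or ">50K" leaking into a file that is expected to contain only 0 and 1.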

We have an answer file for the predicted attribute file. Among the 5 files given by you, we will select the one with the highest accuracy as the final file for marking.

Data Specifications

There are 13 attributes in the first dataset and 1 additional Boolean attribute called "income".

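The marking rule stated in Section 4 (the most accurate of the 5 submitted files is used) can be sketched as follows. This is an assumption about how accuracy would be computed, namely as the fraction of rows where the predicted value equals the answer; the function names and the example values are hypothetical, and the real answer file is not available to students.

```python
def accuracy(predicted, actual):
    """Fraction of rows where the predicted 0/1 value equals the answer."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same number of rows")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty file")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def most_accurate(predictions_by_file, actual):
    """Return the name of the file whose predictions score the highest accuracy."""
    return max(predictions_by_file, key=lambda name: accuracy(predictions_by_file[name], actual))


# Invented values for illustration only.
answers = [1, 0, 0, 0]
submitted = {
    "predicted1.txt": [1, 1, 1, 1],
    "predicted2.txt": [1, 0, 1, 0],
}
chosen = most_accurate(submitted, answers)
```

Since the answers for the second dataset are hidden, the same calculation can only be applied by students to a held-out part of training.txt, where the income attribute is known.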