Question: 2 Application of Decision Tree on Real-World Data Set [25 pts]

In this task, you will build a decision tree classifier using a real-world data set called the Census-Income Data Set, available publicly for downloading at Dataset. This data set contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The data contains 41 demographic and employment-related variables.

Basic statistics for this data set are provided below.

Number of instances in data = 199523
Duplicate or conflicting instances: 46716
Number of instances in test = 99762
Duplicate or conflicting instances: 20936

Class probabilities for income-projected.test file
Probability for the label '- 50000': 93.80%
Probability for the label '50000+': 6.20%
Majority accuracy: 93.80% on value - 50000

Number of attributes = 40 (continuous: 7, nominal: 33)
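The majority accuracy above is what a classifier gets by always predicting the most common label. A minimal sketch of that calculation follows; the function name `majority_baseline` and the toy label list are assumptions made for illustration, built to mirror the reported proportions rather than read from the real file.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (majority_label, accuracy) for always predicting the most common label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical sample in the reported 93.8% / 6.2% proportions (not the real data).
toy_labels = ["- 50000"] * 938 + ["50000+"] * 62
majority_label, baseline_acc = majority_baseline(toy_labels)
print(majority_label, round(baseline_acc, 3))
```

Any tree trained on this data is worth comparing against this baseline, since the classes are heavily imbalanced.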

Information about the .data file:

91 distinct values for attribute #0 (age) continuous
9 distinct values for attribute #1 (class of worker) nominal
52 distinct values for attribute #2 (detailed industry recode) nominal
47 distinct values for attribute #3 (detailed occupation recode) nominal
17 distinct values for attribute #4 (education) nominal
1240 distinct values for attribute #5 (wage per hour) continuous
3 distinct values for attribute #6 (enroll in edu inst last wk) nominal
7 distinct values for attribute #7 (marital stat) nominal
24 distinct values for attribute #8 (major industry code) nominal
15 distinct values for attribute #9 (major occupation code) nominal
5 distinct values for attribute #10 (race) nominal
10 distinct values for attribute #11 (Hispanic origin) nominal
2 distinct values for attribute #12 (sex) nominal
3 distinct values for attribute #13 (member of a labor union) nominal
6 distinct values for attribute #14 (reason for unemployment) nominal
8 distinct values for attribute #15 (full or part-time employment stat) nominal
132 distinct values for attribute #16 (capital gains) continuous
113 distinct values for attribute #17 (capital losses) continuous
1478 distinct values for attribute #18 (dividends from stocks) continuous
6 distinct values for attribute #19 (tax filer stat) nominal
6 distinct values for attribute #20 (region of previous residence) nominal
51 distinct values for attribute #21 (state of previous residence) nominal
38 distinct values for attribute #22 (detailed household and family stat) nominal
8 distinct values for attribute #23 (detailed household summary in the household) nominal
10 distinct values for attribute #24 (migration code-change in MSA) nominal
9 distinct values for attribute #25 (migration code-change in reg) nominal
10 distinct values for attribute #26 (migration code-move within reg) nominal
3 distinct values for attribute #27 (live in this house one year ago) nominal
4 distinct values for attribute #28 (migration prev res in sunbelt) nominal
7 distinct values for attribute #29 (num persons worked for the employer) continuous
5 distinct values for attribute #30 (family members under 18) nominal
43 distinct values for attribute #31 (country of birth father) nominal
43 distinct values for attribute #32 (country of birth mother) nominal
43 distinct values for attribute #33 (country of birth self) nominal
5 distinct values for attribute #34 (citizenship) nominal
3 distinct values for attribute #35 (own business or self-employed) nominal
3 distinct values for attribute #36 (fill inc questionnaire for veteran's admin) nominal
3 distinct values for attribute #37 (veterans benefits) nominal
53 distinct values for attribute #38 (weeks worked in year) continuous
2 distinct values for attribute #39 (year) nominal

Classes: - 50000, 50000+.

One instance per line with comma-delimited fields. There are 199,523 instances in the data file and 99,762 in the test file.

The data was split into train/test in approximately 2/3, 1/3 proportions using MineSet's MIndUtil mineset--mlc.
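Since each instance is one comma-delimited line, the file can be parsed with Python's standard `csv` module. The sketch below is only an illustration: the helper name `load_instances`, the assumption that the class label is the last field, and the two sample lines are all made up here, and the column positions in the real file should be checked before relying on any index set.

```python
import csv
import io

def load_instances(handle, continuous_idx):
    """Parse comma-delimited lines into (features, label) pairs.

    Assumes the class label is the last field on each line. Fields whose
    position is in continuous_idx become floats; the rest stay as strings.
    """
    instances = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        fields = [field.strip() for field in row]
        features = [
            float(value) if i in continuous_idx else value
            for i, value in enumerate(fields[:-1])
        ]
        instances.append((features, fields[-1]))
    return instances

# Two made-up lines with three attributes each (not taken from the real file).
sample = io.StringIO(
    "73, Not in universe, 0, - 50000\n"
    "58, Self-employed, 52, 50000+\n"
)
instances = load_instances(sample, continuous_idx={0, 2})
print(instances[0])
```

For the real data, the same function could be called on an opened file handle instead of the in-memory sample.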

Below are your tasks:

( ) [10 pts] Train a decision tree classifier using the data file. You CAN NOT use any decision tree library functions to do it, i.e., you must construct the tree from scratch. You also CAN NOT touch the test file in this part. Vary the cut-off depth from 2 to 10 and report the training accuracy for each cut-off depth k. Based on your results, select an optimal k.
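One possible way to build such a tree from scratch is sketched below, using only the standard library. It is not a prescribed solution: the choice of information gain as the split criterion, binary tests (`<=` for continuous values stored as floats, `==` for nominal values), and all function names are assumptions made for this illustration, and the toy rows are invented. Trying every distinct value as a threshold is also slow on the full data file, so a real run would likely need a cheaper candidate search.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def matches(value, test_value):
    """Binary test: continuous (float) features use <=, nominal ones use ==."""
    if isinstance(test_value, float):
        return value <= test_value
    return value == test_value

def partition(rows, labels, j, v):
    """Split (row, label) pairs into those passing the test on feature j and the rest."""
    yes = [(r, y) for r, y in zip(rows, labels) if matches(r[j], v)]
    no = [(r, y) for r, y in zip(rows, labels) if not matches(r[j], v)]
    return yes, no

def best_split(rows, labels):
    """Return (gain, feature, value) with the highest information gain."""
    base, best = entropy(labels), (0.0, None, None)
    for j in range(len(rows[0])):
        for v in {r[j] for r in rows}:
            yes, no = partition(rows, labels, j, v)
            if not yes or not no:
                continue  # a test that sends everything one way is useless
            remainder = sum(
                len(part) / len(labels) * entropy([y for _, y in part])
                for part in (yes, no)
            )
            if base - remainder > best[0]:
                best = (base - remainder, j, v)
    return best

def build_tree(rows, labels, max_depth, depth=0):
    """Grow the tree recursively; a leaf is simply the majority label."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    _, j, v = best_split(rows, labels)
    if j is None:
        return majority  # no split improves on the current node
    yes, no = partition(rows, labels, j, v)
    return {
        "feature": j,
        "value": v,
        "yes": build_tree([r for r, _ in yes], [y for _, y in yes], max_depth, depth + 1),
        "no": build_tree([r for r, _ in no], [y for _, y in no], max_depth, depth + 1),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, dict):
        branch = "yes" if matches(row[tree["feature"]], tree["value"]) else "no"
        tree = tree[branch]
    return tree

def accuracy(tree, rows, labels):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(tree, r) == y for r, y in zip(rows, labels)) / len(labels)

# Hypothetical toy data: (age, class of worker) with made-up labels.
toy_rows = [[25.0, "private"], [30.0, "self"], [45.0, "private"],
            [50.0, "self"], [55.0, "self"], [60.0, "private"]]
toy_labels = ["low", "low", "low", "high", "high", "low"]

# The task asks for k = 2..10; small depths are enough for the toy data.
for k in (1, 2, 3):
    tree = build_tree(toy_rows, toy_labels, max_depth=k)
    print(k, round(accuracy(tree, toy_rows, toy_labels), 3))
```

For the assignment itself, the loop would run over `range(2, 11)` on the parsed data file and record the training accuracy for each k.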

( ) [8 pts] Using the trained classifier with optimal cut-off depth k, classify the 99,762 instances from the test file and report the testing accuracy (the portion of testing instances classified correctly).

( ) [7 pts] Do you see any over-fitting issues for this experiment? Report
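One common way to look for over-fitting without touching the test file early is to hold out part of the data file as a validation set and compare training and validation accuracy at each cut-off depth. The task does not prescribe this; the sketch below is an assumption-laden illustration, with hypothetical helper names (`holdout_split`, `accuracy_gaps`) and accuracy numbers that are invented purely to exercise the code.

```python
import random

def holdout_split(rows, labels, val_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and split into train and validation parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(idx):
        return [rows[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx)

def accuracy_gaps(train_acc_by_depth, val_acc_by_depth):
    """Return {depth: training accuracy minus validation accuracy}."""
    return {k: train_acc_by_depth[k] - val_acc_by_depth[k] for k in train_acc_by_depth}

# Toy instances, only to show the split sizes.
rows = [[i] for i in range(10)]
labels = ["- 50000"] * 10
(train_rows, train_labels), (val_rows, val_labels) = holdout_split(rows, labels)
print(len(train_rows), len(val_rows))

# Invented accuracies (NOT results from the Census-Income data).
train_acc = {2: 0.90, 6: 0.95, 10: 0.99}
val_acc = {2: 0.90, 6: 0.94, 10: 0.91}
gaps = accuracy_gaps(train_acc, val_acc)
print(max(gaps, key=gaps.get))
```

A gap that widens as k grows, with training accuracy rising while held-out accuracy stalls or drops, is the usual sign of over-fitting.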

