2 Application of Decision Tree on Real-World Data Set [25 pts]
In this task, you will build a decision tree classifier using a real-world data set, the Census-Income Data Set, which is publicly available for download at Dataset. This data set contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The data contains 41 demographic and employment-related variables.
Basic statistics for this data set are provided below.
Number of instances in data = 199523
Duplicate or conflicting instances: 46716
Number of instances in test = 99762
Duplicate or conflicting instances: 20936
Class probabilities for income-projected.test file
Probability for the label '-50000': 93.80%
Probability for the label '50000+': 6.20%
Majority accuracy: 93.80% on value -50000
Number of attributes = 40 (continuous: 7, nominal: 33)
Information about the .data file:
91 distinct values for attribute #0(age) continuous
9 distinct values for attribute #1(class of worker) nominal
52 distinct values for attribute #2(detailed industry recode) nominal
47 distinct values for attribute #3(detailed occupation recode) nominal
17 distinct values for attribute #4(education) nominal
1240 distinct values for attribute #5(wage per hour) continuous
3 distinct values for attribute #6(enroll in edu inst last wk) nominal
7 distinct values for attribute #7(marital stat) nominal
24 distinct values for attribute #8(major industry code) nominal
15 distinct values for attribute #9(major occupation code) nominal
5 distinct values for attribute #10(race) nominal
10 distinct values for attribute #11(Hispanic origin) nominal
2 distinct values for attribute #12(sex) nominal
3 distinct values for attribute #13(member of a labor union) nominal
6 distinct values for attribute #14(reason for unemployment) nominal
8 distinct values for attribute #15(full or part-time employment stat) nominal
132 distinct values for attribute #16(capital gains) continuous
113 distinct values for attribute #17(capital losses) continuous
1478 distinct values for attribute #18(dividends from stocks) continuous
6 distinct values for attribute #19(tax filer stat) nominal
6 distinct values for attribute #20(region of previous residence) nominal
51 distinct values for attribute #21(state of previous residence) nominal
38 distinct values for attribute #22(detailed household and family stat) nominal
8 distinct values for attribute #23(detailed household summary in the household) nominal
10 distinct values for attribute #24(migration code-change in MSA) nominal
9 distinct values for attribute #25(migration code-change in reg) nominal
10 distinct values for attribute #26(migration code-move within reg) nominal
3 distinct values for attribute #27(live in this house one year ago) nominal
4 distinct values for attribute #28(migration prev res in sunbelt) nominal
7 distinct values for attribute #29(num persons worked for the employer) continuous
5 distinct values for attribute #30(family members under 18) nominal
43 distinct values for attribute #31(country of birth father) nominal
43 distinct values for attribute #32(country of birth mother) nominal
43 distinct values for attribute #33(country of birth self) nominal
5 distinct values for attribute #34(citizenship) nominal
3 distinct values for attribute #35(own business or self-employed) nominal
3 distinct values for attribute #36(fill inc questionnaire for veteran's admin) nominal
3 distinct values for attribute #37(veterans benefits) nominal
53 distinct values for attribute #38(weeks worked in year) continuous
2 distinct values for attribute #39(year) nominal
Classes: -50000, 50000+.
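The majority accuracy listed in the statistics above is simply the share of the most frequent label: a classifier that always predicts that label is right exactly that often. A minimal Python sketch of the arithmetic, using the two class shares reported for the test file (the function name `majority_baseline` is only illustrative):

```python
# Class shares (in percent) as reported for the test file.
class_share = {"-50000": 93.80, "50000+": 6.20}


def majority_baseline(shares):
    """Return (label, accuracy) for a classifier that always predicts the most common label."""
    label = max(shares, key=shares.get)
    return label, shares[label]


label, acc = majority_baseline(class_share)
print(f"Always predicting {label!r} scores {acc:.2f}% accuracy")
```

Any decision tree worth keeping should beat this baseline, so it is a useful reference point when reading the accuracies asked for in the tasks.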
One instance per line with comma-delimited fields. There are 199,523 instances in the data file and 99,762 in the test file.
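Because each instance sits on one line with comma-delimited fields, the files can be read with Python's standard `csv` module. The sketch below is one possible reader, not a prescribed one. It assumes the class label is the last field on each line, and the two sample rows are made up for illustration rather than taken from the real file.

```python
import csv
import io


def read_instances(handle):
    """Yield (features, label) pairs from a comma-delimited stream, one instance per line.

    Assumption: the class label is the last field on each line.
    """
    reader = csv.reader(handle, skipinitialspace=True)
    for row in reader:
        if not row:  # skip blank lines
            continue
        fields = [field.strip() for field in row]
        yield fields[:-1], fields[-1]


# Made-up two-line sample, NOT taken from the real data file.
sample = io.StringIO(
    "35, Private, Bachelors, -50000\n"
    "52, Self-employed, Masters, 50000+\n"
)
rows = list(read_instances(sample))
print(rows[0])
```

For the real files, pass an open file handle instead of the `io.StringIO` object, and check the number of fields per line against the attribute list before relying on column positions.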
The data was split into train/test in approximately 2/3, 1/3 proportions using MineSet's MIndUtil mineset-to-mlc. Below are your tasks:
(a) [10 pts] Train a decision tree classifier using the data file. You CANNOT use any decision tree library functions to do it, i.e., you must construct the tree from scratch. You also CANNOT touch the test file in this part. Vary the cut-off depth from 2 to 10 and report the training accuracy for each cut-off depth k. Based on your results, select an optimal k.
(b) [8 pts] Using the trained classifier with the optimal cut-off depth k, classify the 99,762 instances from the test file and report the testing accuracy (the proportion of test instances classified correctly).
(c) [7 pts] Do you see any over-fitting issues for this experiment? Report
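As a starting point for the tasks above, here is a minimal sketch of a depth-limited decision tree built without any tree library. The assignment does not prescribe a split criterion, so the choices here are assumptions: information gain (entropy) as the criterion, binary splits only (`<=` thresholds for continuous attributes, equality tests for nominal ones), and a six-row toy data set that is made up for illustration. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def matches(value, split_value, numeric):
    """Binary test: threshold for continuous attributes, equality for nominal ones."""
    return value <= split_value if numeric else value == split_value


def best_split(X, y, numeric_cols):
    """Return (gain, column, value) of the binary split with the highest information gain."""
    base, best = entropy(y), (0.0, None, None)
    for col in range(len(X[0])):
        numeric = col in numeric_cols
        for value in set(row[col] for row in X):
            mask = [matches(row[col], value, numeric) for row in X]
            left = [lab for lab, m in zip(y, mask) if m]
            right = [lab for lab, m in zip(y, mask) if not m]
            if not left or not right:  # split separates nothing
                continue
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if base - remainder > best[0]:
                best = (base - remainder, col, value)
    return best


def build_tree(X, y, numeric_cols, max_depth, depth=0):
    """Grow the tree recursively; stop at max_depth, on pure nodes, or when no split helps."""
    majority = Counter(y).most_common(1)[0][0]
    if depth >= max_depth or len(set(y)) == 1:
        return majority  # leaf: predict the majority label
    _, col, value = best_split(X, y, numeric_cols)
    if col is None:
        return majority
    numeric = col in numeric_cols
    mask = [matches(row[col], value, numeric) for row in X]
    branches = {}
    for name, keep in (("left", True), ("right", False)):
        sub_X = [row for row, m in zip(X, mask) if m == keep]
        sub_y = [lab for lab, m in zip(y, mask) if m == keep]
        branches[name] = build_tree(sub_X, sub_y, numeric_cols, max_depth, depth + 1)
    return {"col": col, "value": value, "numeric": numeric, **branches}


def predict(tree, row):
    """Walk from the root to a leaf and return the label stored there."""
    while isinstance(tree, dict):
        go_left = matches(row[tree["col"]], tree["value"], tree["numeric"])
        tree = tree["left"] if go_left else tree["right"]
    return tree


def accuracy(tree, X, y):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(predict(tree, row) == lab for row, lab in zip(X, y)) / len(y)


# Made-up toy data: column 0 is a continuous age, column 1 a nominal education level.
X = [[25, "HS"], [30, "BSc"], [45, "HS"], [40, "BSc"], [50, "BSc"], [60, "HS"]]
y = ["-50000", "-50000", "-50000", "50000+", "50000+", "-50000"]

for k in (1, 2):
    tree = build_tree(X, y, numeric_cols={0}, max_depth=k)
    print(f"cut-off depth {k}: training accuracy {accuracy(tree, X, y):.3f}")
```

For part (a), the same loop would run over k from 2 to 10 on the training file only. Note that this exhaustive search tries every distinct value of every attribute at every node, so on roughly 200,000 instances it will be slow; sorting continuous attributes once or sampling candidate thresholds are common ways to speed it up.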