Question: Hello, this is for Data Science Application. Below is a question from the textbook Fundamentals of Machine Learning for Predictive Data Analytics, chapter 4, problem 2:
3.) The table below lists a sample of data from a census [32].

ID  AGE  EDUCATION    MARITAL STATUS  OCCUPATION    ANNUAL INCOME
1   39   bachelors    never married   transport     25K-50K
2   50   bachelors    married         professional  25K-50K
3   18   high school  never married   agriculture   <25K
4   28   bachelors    married         professional  [missing]
5   37   high school  married         agriculture   [missing]
6   24   high school  never married   armed forces  [missing]
7   52   high school  divorced        transport     [missing]
8   40   doctorate    married         professional  >50K

There are four descriptive features and one target feature in this dataset:

• AGE, a continuous feature listing the age of the individual
• EDUCATION, a categorical feature listing the highest education award achieved by the individual (high school, bachelors, doctorate)
• MARITAL STATUS (never married, married, divorced)
• OCCUPATION (transport - works in the transportation industry; professional - doctors, lawyers, etc.; agriculture - works in the agricultural industry; armed forces - is a member of the armed forces)
• ANNUAL INCOME, the target feature with 3 levels (<25K, 25K-50K, >50K)

a. Calculate the entropy for this dataset.
b. Calculate the Gini index for this dataset.
c. When building a decision tree, the easiest way to handle a continuous feature is to define a threshold around which splits will be made. What would be the optimal threshold to split the continuous AGE feature (use information gain based on entropy as the feature selection measure)?
d. Calculate information gain (based on entropy) for the EDUCATION, MARITAL STATUS, and OCCUPATION features.
e. Calculate the information gain ratio (based on entropy) for the EDUCATION, MARITAL STATUS, and OCCUPATION features.
f. Calculate information gain using the Gini index for the EDUCATION, MARITAL STATUS, and OCCUPATION features.

[32] This census dataset is based on the Census Income Dataset (Kohavi, 1996), which is available from the UCI Machine Learning Repository (Bache and Lichman, 2013) at archive.ics.uci.edu/ml/datasets/Census+Income/
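For anyone working through parts a to f, the measures the question asks for can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the textbook's solution: the function names (entropy, gini, information_gain, gain_ratio, best_threshold) are my own, and the toy_* lists are hypothetical data used only to check the functions. The one piece of census data used is the EDUCATION column, which is enough to compute the split information needed in part e.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(feature_values, labels):
    """Group the labels by the value of the feature."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    return groups

def information_gain(feature_values, labels, impurity=entropy):
    """Impurity of the whole set minus the weighted impurity after the split."""
    n = len(labels)
    groups = partition(feature_values, labels).values()
    remainder = sum(len(g) / n * impurity(g) for g in groups)
    return impurity(labels) - remainder

def gain_ratio(feature_values, labels):
    """Entropy-based information gain divided by the entropy of the feature."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0
    return information_gain(feature_values, labels) / split_info

def best_threshold(values, labels):
    """Try midpoints between adjacent sorted values whose labels differ
    and return the (threshold, gain) pair with the highest gain."""
    pairs = sorted(zip(values, labels))
    best = (None, -1.0)
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            t = (v1 + v2) / 2
            gain = information_gain([v >= t for v in values], labels)
            if gain > best[1]:
                best = (t, gain)
    return best

# Hypothetical toy data, NOT the census table.
toy_feature = ["a", "a", "b", "b"]
toy_numeric = [1, 2, 3, 4]
toy_target = ["yes", "yes", "no", "no"]

# EDUCATION column as listed in the question (rows 1 to 8).
education = ["bachelors", "bachelors", "high school", "bachelors",
             "high school", "high school", "high school", "doctorate"]

if __name__ == "__main__":
    print("toy entropy:", entropy(toy_target))
    print("toy Gini:", gini(toy_target))
    print("toy gain:", information_gain(toy_feature, toy_target))
    print("toy Gini gain:", information_gain(toy_feature, toy_target, gini))
    print("toy threshold:", best_threshold(toy_numeric, toy_target))
    print("EDUCATION split info:", entropy(education))
```

To apply this to the question, pass the eight ANNUAL INCOME values as the labels argument and each feature column as feature_values; part f is the same call as part d with impurity=gini.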
