Problem 2: Census Dataset

In Problem 2, you will be using census data from 1994 to attempt to predict whether or not a person has an annual salary greater than $50,000 based on other information provided in the census. You can find a description of the dataset here: Census Dataset

Load the data stored in the tab-delimited file census.txt into a DataFrame named census. Use head() to display the first 10 rows of this DataFrame.

We will now check to see how many rows and columns there are in the DataFrame. Print the shape of the census DataFrame.

The last column is named salary. Each entry in this column is a string equal to either '<=50K' or '>50K'. Our goal is to create and compare several classification models for the purpose of predicting to which of these two classes an individual belongs based on the values of the other columns, which will be used as features in our models.

Before creating any models, we will check the distribution of values in our target variable. Without creating any new DataFrame variables, select the salary column, and then call its value_counts() method. Display the result.

We will now prepare our data by encoding the categorical features and splitting into training, validation, and test sets. Add a markdown cell with a level 3 header that reads: "Prepare the Data".

We will start by separating the categorical and numerical features into different arrays. Note that the following 8 features are categorical in nature: workclass, education, marital_status, occupation, relationship, race, sex, and native_country. The remaining 6 features are numerical. Perform the following steps in a single code cell:

1. Create a 2D array named X2_num by selecting the columns of census that represent numerical features.
2. Create a 2D array named X2_cat by selecting the columns of census that represent categorical features.
3. Create a 1D array named y2 by selecting the salary column.
4. Print the shapes of all three of these arrays with messages as shown below. Add spacing to ensure that the shape tuples are left-aligned.

Numerical Feature Array Shape:   xxxx
Categorical Feature Array Shape: xxxx
Label Array Shape:               xxxx

Note: The variables created here should be arrays, not DataFrames or Series. You will need to use .values.

We will now perform one-hot encoding on the categorical variables. Perform the following steps in a single code cell:

1. Create a OneHotEncoder() object, setting sparse=False.
2. Fit the encoder to the categorical features.
3. Use the encoder to encode the categorical features, storing the result in a variable named X2_enc.
4. Print the shape of X2_enc with a message as shown below.

Encoded Feature Array Shape: xxxx

We will now combine the numerical feature array with the encoded categorical feature array. Perform the following steps in a single code cell:

1. Use np.hstack to combine X2_num and X2_enc into a single array named X2.
2. Print the shape of X2 with a message as shown below.

Feature Array Shape: xxxx

We will now split the data into training, validation, and test sets, using a 70/15/15 split. Perform the following steps in a single code cell:

1. Use train_test_split() to split the data into training and holdout sets using a 70/30 split. Name the resulting arrays X2_train, X2_hold, y2_train, and y2_hold. Set random_state=1. Use stratified sampling.
2. Use train_test_split() to split the holdout data into validation and test sets using a 50/50 split. Name the resulting arrays X2_valid, X2_test, y2_valid, and y2_test. Set random_state=1. Use stratified sampling.
3. Print the shapes of X2_train, X2_valid, and X2_test with messages as shown below. Add spacing to ensure that the shape tuples are left-aligned.

Training Features Shape:   xxxx
Validation Features Shape: xxxx
Test Features Shape:       xxxx

Sketches of these data-preparation steps appear below.
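A minimal sketch of the loading and inspection steps, assuming census.txt sits in the working directory:

```python
import pandas as pd

# Load the tab-delimited census data.
census = pd.read_csv('census.txt', sep='\t')

# First 10 rows (in a notebook, census.head(10) as the last line of a
# cell displays on its own).
print(census.head(10))

# Number of rows and columns.
print(census.shape)

# Distribution of the target variable.
print(census['salary'].value_counts())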
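The separation, encoding, and stacking steps might look like the sketch below. The numerical column names are an assumption (the problem statement names only the 8 categorical features), and note that Scikit-Learn 1.2+ renames OneHotEncoder's sparse parameter to sparse_output.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categorical features named in the problem statement.
cat_cols = ['workclass', 'education', 'marital_status', 'occupation',
            'relationship', 'race', 'sex', 'native_country']
# Assumed names for the 6 numerical features in this dataset.
num_cols = ['age', 'fnlwgt', 'education_num', 'capital_gain',
            'capital_loss', 'hours_per_week']

X2_num = census[num_cols].values    # 2D numerical feature array
X2_cat = census[cat_cols].values    # 2D categorical feature array
y2 = census['salary'].values        # 1D label array

print('Numerical Feature Array Shape:  ', X2_num.shape)
print('Categorical Feature Array Shape:', X2_cat.shape)
print('Label Array Shape:              ', y2.shape)

# One-hot encode the categorical features.
# In Scikit-Learn >= 1.2, use sparse_output=False instead of sparse=False.
encoder = OneHotEncoder(sparse=False)
encoder.fit(X2_cat)
X2_enc = encoder.transform(X2_cat)
print('Encoded Feature Array Shape:', X2_enc.shape)

# Stack the numerical and encoded categorical features side by side.
X2 = np.hstack([X2_num, X2_enc])
print('Feature Array Shape:', X2.shape)
```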
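One way to write the two-stage split; the stratify argument requests the stratified sampling the instructions call for:

```python
from sklearn.model_selection import train_test_split

# 70/30 split into training and holdout sets, stratified on the labels.
X2_train, X2_hold, y2_train, y2_hold = train_test_split(
    X2, y2, test_size=0.3, random_state=1, stratify=y2)

# Split the holdout set 50/50 into validation and test sets,
# giving 15% of the original data to each.
X2_valid, X2_test, y2_valid, y2_test = train_test_split(
    X2_hold, y2_hold, test_size=0.5, random_state=1, stratify=y2_hold)

print('Training Features Shape:  ', X2_train.shape)
print('Validation Features Shape:', X2_valid.shape)
print('Test Features Shape:      ', X2_test.shape)
```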
We will now create and evaluate a logistic regression model. Add a markdown cell with a level 3 header that reads: "Logistic Regression Model". Perform the following steps in a single code cell (a sketch appears below):

1. Create a logistic regression model named lr_mod, setting solver='lbfgs' and max_iter=1000. Set penalty='none', unless that results in an error, in which case set C=10e1000.
2. Fit your model to the training data.
3. Calculate the training and validation accuracy with messages as shown below. Add spacing to ensure that the accuracy scores are left-aligned. Round the scores to 4 decimal places.

Training Accuracy:   xxxx
Validation Accuracy: xxxx

We will now create and evaluate several decision tree models. We will use the validation score for these models to perform hyperparameter tuning. Add a markdown cell with a level 3 header that reads: "Decision Tree Models". Perform the following steps in a single code cell (see the sketch after this list):

1. Create empty lists named dt_train_acc and dt_valid_acc. These lists will store the accuracy scores that we calculate for each model.
2. Create a range variable named depth_range to represent a sequence of integers from 2 to 30.
3. Loop over the values in depth_range. Every time the loop executes, perform the following steps.
   a. Use NumPy to set a random seed of 1. This should be done inside the loop.
   b. Create a decision tree model named temp_tree with max_depth equal to the current value from depth_range that is being considered.
   c. Fit the model to the training data.
   d. Calculate the training and validation accuracy for temp_tree, appending the resulting values to the appropriate lists.
4. Use np.argmax to determine the index of the maximum value in dt_valid_acc. Store the result in dt_idx.
5. Use dt_idx and depth_range to find the optimal value for the max_depth hyperparameter. Store the result in dt_opt_depth.
6. Use dt_idx with the lists dt_train_acc and dt_valid_acc to determine the training and validation accuracies for the optimal model found.
7. Display the values found in Steps 5 and 6 with messages as shown below. Add spacing to ensure that the values replacing the xxxx symbols are left-aligned. Round the accuracy scores to 4 decimal places.

Optimal value for max_depth:           xxxx
Training Accuracy for Optimal Model:   xxxx
Validation Accuracy for Optimal Model: xxxx

We will now plot the validation and training curves as a function of the max_depth parameter. Create a figure with two line plots on the same set of axes. One line plot should plot values of dt_train_acc against depth_range and the other should plot values of dt_valid_acc against depth_range. The x-axis should be labeled "Max Depth" and the y-axis should be labeled "Accuracy". The plot should contain a legend with two items that read "Training" and "Validation". A plotting sketch appears below.
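A sketch of the logistic regression cell. penalty='none' is accepted by older Scikit-Learn releases; versions 1.2+ expect penalty=None instead, and if your version raises an error, the instructions' fallback is to drop the penalty argument and set C=10e1000.

```python
from sklearn.linear_model import LogisticRegression

# Older Scikit-Learn: penalty='none'; newer (>= 1.2): penalty=None.
lr_mod = LogisticRegression(solver='lbfgs', max_iter=1000, penalty='none')
lr_mod.fit(X2_train, y2_train)

print('Training Accuracy:  ', round(lr_mod.score(X2_train, y2_train), 4))
print('Validation Accuracy:', round(lr_mod.score(X2_valid, y2_valid), 4))
```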
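A sketch of the decision tree tuning loop, reading "integers from 2 to 30" as inclusive, i.e. range(2, 31):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

dt_train_acc = []
dt_valid_acc = []
depth_range = range(2, 31)      # integers 2 through 30

for depth in depth_range:
    np.random.seed(1)           # reseed inside the loop, per the instructions
    temp_tree = DecisionTreeClassifier(max_depth=depth)
    temp_tree.fit(X2_train, y2_train)
    dt_train_acc.append(temp_tree.score(X2_train, y2_train))
    dt_valid_acc.append(temp_tree.score(X2_valid, y2_valid))

dt_idx = np.argmax(dt_valid_acc)        # index of best validation score
dt_opt_depth = depth_range[dt_idx]      # corresponding max_depth value

print('Optimal value for max_depth:           ', dt_opt_depth)
print('Training Accuracy for Optimal Model:   ', round(dt_train_acc[dt_idx], 4))
print('Validation Accuracy for Optimal Model: ', round(dt_valid_acc[dt_idx], 4))
```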
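The plot might be drawn with Matplotlib as follows; the figure size is an arbitrary choice:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.plot(depth_range, dt_train_acc, label='Training')
plt.plot(depth_range, dt_valid_acc, label='Validation')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```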
We will now create and evaluate several random forest models. We will use the validation score for these models to perform hyperparameter tuning. Add a markdown cell with a level 3 header that reads: "Random Forest Models". Perform the following steps in a single code cell (see the sketch after this list):

1. Create empty lists named rf_train_acc and rf_valid_acc. These lists will store the accuracy scores that we calculate for each model.
2. Loop over the values in depth_range. Every time the loop executes, perform the following steps.
   a. Use NumPy to set a random seed of 1. This should be done inside the loop.
   b. Create a random forest model named temp_forest with max_depth equal to the current value from depth_range that is being considered. Set the parameter n_estimators to 100.
   c. Fit the model to the training data.
   d. Calculate the training and validation accuracy for temp_forest, appending the resulting values to the appropriate lists.
3. Use np.argmax to determine the index of the maximum value in rf_valid_acc. Store the result in rf_idx.
4. Use rf_idx and depth_range to find the optimal value for the max_depth hyperparameter. Store the result in rf_opt_depth.
5. Use rf_idx with the lists rf_train_acc and rf_valid_acc to determine the training and validation accuracies for the optimal model found.
6. Display the values found in Steps 4 and 5 with messages as shown below. Add spacing to ensure that the values replacing the xxxx symbols are left-aligned. Round the accuracy scores to 4 decimal places.

Optimal value for max_depth:           xxxx
Training Accuracy for Optimal Model:   xxxx
Validation Accuracy for Optimal Model: xxxx

We will now plot the validation and training curves as a function of the max_depth parameter. Create a figure with two line plots on the same set of axes. One line plot should plot values of rf_train_acc against depth_range and the other should plot values of rf_valid_acc against depth_range. The x-axis should be labeled "Max Depth" and the y-axis should be labeled "Accuracy". The plot should contain a legend with two items that read "Training" and "Validation".

We will now create our final model and evaluate it on the test set. Add a markdown cell with a level 3 header that reads: "Evaluate Final Model". Of the three types of models considered, and the various hyperparameter values for those models, select as your final model the one that produced the highest score on the validation set. Perform the following steps in a single code cell (a sketch appears below):

1. If your final model is a decision tree or random forest model, use NumPy to set a random seed of 1.
2. Recreate the best model you found, using the parameter values that produced that model. Store the resulting model in a variable named final_model.
3. Fit this model to the training set.
4. Print the training accuracy, validation accuracy, and test accuracy for the final model with messages as shown below. Add spacing to ensure that the accuracy scores are left-aligned, and round the accuracy scores to four decimal places.

Training Accuracy for Final Model:   xxxx
Validation Accuracy for Final Model: xxxx
Testing Accuracy for Final Model:    xxxx

We will get a more detailed look at our final model's performance on the test set by creating a confusion matrix and a classification report. Use your final model to generate predictions for the test set, storing them in a variable named test_pred. Create a confusion matrix by passing y2_test and test_pred to the function confusion_matrix() from Scikit-Learn. This function returns a NumPy array. Store this array in a variable, and then convert it to a DataFrame with the row and column names set equal to the valid labels, '<=50K' and '>50K'. Display the resulting DataFrame. Pass the arrays y2_test and test_pred to the function classification_report(), printing the result.
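The random forest loop mirrors the decision tree loop:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf_train_acc = []
rf_valid_acc = []

for depth in depth_range:
    np.random.seed(1)           # reseed inside the loop
    temp_forest = RandomForestClassifier(max_depth=depth, n_estimators=100)
    temp_forest.fit(X2_train, y2_train)
    rf_train_acc.append(temp_forest.score(X2_train, y2_train))
    rf_valid_acc.append(temp_forest.score(X2_valid, y2_valid))

rf_idx = np.argmax(rf_valid_acc)
rf_opt_depth = depth_range[rf_idx]

print('Optimal value for max_depth:           ', rf_opt_depth)
print('Training Accuracy for Optimal Model:   ', round(rf_train_acc[rf_idx], 4))
print('Validation Accuracy for Optimal Model: ', round(rf_valid_acc[rf_idx], 4))
```

The accompanying plot is identical to the decision tree plot except that it draws rf_train_acc and rf_valid_acc instead of the dt_ lists.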
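The final cell depends on which model actually scored highest on your validation set. Purely as an illustration, this sketch assumes the random forest at its optimal depth came out on top:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed(1)   # needed here because the assumed winner is a random forest
final_model = RandomForestClassifier(max_depth=rf_opt_depth, n_estimators=100)
final_model.fit(X2_train, y2_train)

print('Training Accuracy for Final Model:  ',
      round(final_model.score(X2_train, y2_train), 4))
print('Validation Accuracy for Final Model:',
      round(final_model.score(X2_valid, y2_valid), 4))
print('Testing Accuracy for Final Model:   ',
      round(final_model.score(X2_test, y2_test), 4))
```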
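A sketch of the confusion matrix and classification report cell:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, classification_report

test_pred = final_model.predict(X2_test)

# Confusion matrix as a labeled DataFrame; the labels argument fixes
# the row/column order to match the names we attach.
labels = ['<=50K', '>50K']
cm = confusion_matrix(y2_test, test_pred, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_df)

# Precision, recall, F1, and support for each class.
print(classification_report(y2_test, test_pred))
```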