Question:

Instructions:
1) The .ipynb file shall include not only the source code but also the necessary plots/figures and discussions, which include your observations, thoughts, and insights.
2) Please avoid using a single big block of code for everything and then plotting all figures together. Instead, use a small block of code for each sub-task, followed by its plots and discussions. This will make your homework more readable.
3) Please follow common software engineering practices, e.g., by including sufficient comments for functions, important statements, etc.
Programming Problem:
In this homework, you will implement a decision tree model and learn how to optimize it with the help of the sklearn package. We provide a notebook hw3_24Fall.ipynb which contains the code needed for Step 2 and some instructions that you can follow to help you develop your own code.
Step 1: Data Preparation
Load the Titanic.csv file and examine a sample of the data. Notice that the dataset contains both categorical and numerical features.
1. Address any missing values by imputing them with the feature's mean across the dataset.
2. For this assignment, select a subset of the data including the independent variables 'pclass', 'sex', 'age', and 'sibsp', and the dependent variable 'survived'.
3. Ensure that 'survived' is a binary variable coded as 1 (yes) or 0 (no).
4. Split the data into training and test sets, using an 80/20 split.
Step 2: Data Processing and Initial Analysis
1. Recognize that the 'age' attribute is continuous. As discussed in our class, decision trees are typically applied to categorical features. Utilize the steps outlined in the provided Jupyter notebook to discretize 'age' using quantile binning.
2. Compute the information gain to determine the optimal first split in the decision tree.
Step 3: Decision Tree Modeling
1. Employ sklearn to train a decision tree model, setting the maximum number of leaf nodes to 20 and the random_state to your student ID number. (Please apply this setting to all remaining questions.)
2. Visualize the complete tree. Note that your tree's structure and size may vary from the example.
3. Implement a function to calculate the accuracy, precision, recall, and F1 score on the test set.
Step 4: Model Optimization
1. Apply GridSearchCV() to identify the optimal max_leaf_nodes parameter, exploring values from 5 to 20, for tree pruning.
2. Plot the pruned tree, which should be more compact than the initially generated tree, and report its performance using the metrics from Step 3.3 (the same applies to Step 5). A sketch of one possible approach is shown below.
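For reference only, here is a minimal sketch of how this pruning search could look with sklearn's GridSearchCV. The names X_train, y_train, and student_id are assumptions standing in for the training data and random seed prepared in Steps 1-3 (see the code section below).

import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Search max_leaf_nodes from 5 to 20 (inclusive) with 5-fold cross-validation
param_grid = {'max_leaf_nodes': range(5, 21)}
grid = GridSearchCV(DecisionTreeClassifier(random_state=student_id),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("Best max_leaf_nodes:", grid.best_params_['max_leaf_nodes'])

# Plot the pruned tree selected by the grid search
plt.figure(figsize=(12, 6))
plot_tree(grid.best_estimator_, feature_names=list(X_train.columns), filled=True)
plt.show()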
Step 5: Advanced Modeling
1. Replicate Steps 3 and 4 to construct two additional decision tree models with varying parameters, such as the maximum depth and splitting criterion.
2. Use majority vote to create an ensemble learning model that combines the three decision tree models trained in Step 4 and Step 5.1.
3. Use the RandomForestClassifier() function to train a random forest using the optimal tree size you found in Step 4. You can set n_estimators to 50. Compare the performance of your random forest with that of your ensemble model. A sketch of one possible approach to Steps 5.2 and 5.3 is shown below.
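For reference only, a minimal sketch of one way Steps 5.2 and 5.3 could be approached. The names tree_pruned, tree_a, tree_b (the three fitted trees from Steps 4 and 5.1), best_leaf_nodes, X_train/X_test, y_train/y_test, and student_id are all assumptions carried over from the earlier steps.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def majority_vote(models, X):
    """Predict by majority vote over a list of fitted binary classifiers."""
    # Stack each model's 0/1 predictions into a (n_models, n_samples) array
    preds = np.array([m.predict(X) for m in models])
    # The majority class is 1 when more than half of the models vote 1
    return (preds.mean(axis=0) > 0.5).astype(int)

# Ensemble of the pruned tree from Step 4 and the two trees from Step 5.1
ensemble_pred = majority_vote([tree_pruned, tree_a, tree_b], X_test)

# Random forest with 50 trees, using the optimal tree size found in Step 4
rf = RandomForestClassifier(n_estimators=50,
                            max_leaf_nodes=best_leaf_nodes,
                            random_state=student_id)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
# Compare ensemble_pred and rf_pred using the metrics function from Step 3.3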
Code:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
titanic = pd.read_csv('./Titanic.csv')
# Display the first few rows of the dataframe
titanic.head()
Step 1:
def QuantileBinning(feature, bin_number):
    """
    This function takes a numerical feature and the number of bins, and
    returns the feature binned into quantile-based bins.

    Parameters:
    - feature (pandas.Series): The numerical feature to be binned.
    - bin_number (int): The number of quantile bins.

    Returns:
    - pandas.Series: A series of discrete features binned by quantile.
    """
    # Use qcut to create quantile-based bins for the feature.
    # If there are fewer unique values than bins, qcut could throw an error;
    # the 'duplicates' parameter handles this by dropping redundant bins.
    return pd.qcut(feature, q=bin_number, labels=False, duplicates='drop')
# One example
feature_test = pd.DataFrame(np.random.rand(100), columns=['Column_A'])
feature_test_discrete = QuantileBinning(feature_test['Column_A'], 10)

def label_encoder(feature):
    """Map each unique label in a categorical feature to an integer code."""
    unique_labels = pd.unique(feature)
    label_to_int = {label: idx for idx, label in enumerate(unique_labels)}
    transformed_feature = np.array([label_to_int[label] for label in feature])
    return transformed_feature
# Fill missing values in 'age' with the average age
# Discretization
# Split the data into 80% training and 20% test sets
training.head()
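As a reference only, a minimal sketch of how the placeholders above could be filled in so that the training.head() preview works. The names titanic_sub, training, and test, the number of age bins, the 'survived' coding, and the split seed are assumptions; adjust them to your own notebook.

from sklearn.model_selection import train_test_split

# Fill missing values in 'age' with the average age
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

# Keep only the selected independent variables and the target
titanic_sub = titanic[['pclass', 'sex', 'age', 'sibsp', 'survived']].copy()

# Encode 'sex' as integers; ensure 'survived' is coded as 0/1
# (the replace() is only needed if the file stores 'yes'/'no' strings)
titanic_sub['sex'] = label_encoder(titanic_sub['sex'])
titanic_sub['survived'] = titanic_sub['survived'].replace({'yes': 1, 'no': 0}).astype(int)

# Discretize 'age' into quantile bins (e.g., 4 bins)
titanic_sub['age'] = QuantileBinning(titanic_sub['age'], 4)

# Split the data into 80% training and 20% test sets
training, test = train_test_split(titanic_sub, test_size=0.2, random_state=0)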
Step 2:
# define your entropy function and information gain function
# Calculate Information Gain for each feature in the training set
info_gains
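For reference only, a minimal sketch of one way the entropy and information-gain functions could be written. The training DataFrame and the column names are assumed from Step 1.

def entropy(labels):
    """Shannon entropy (base 2) of a series of class labels."""
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(data, feature, target='survived'):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    total_entropy = entropy(data[target])
    weighted_entropy = 0.0
    for _, subset in data.groupby(feature):
        weighted_entropy += (len(subset) / len(data)) * entropy(subset[target])
    return total_entropy - weighted_entropy

# Information gain of each candidate feature on the training set;
# the feature with the largest gain is the optimal first split
info_gains = {f: information_gain(training, f) for f in ['pclass', 'sex', 'age', 'sibsp']}
info_gains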
Step 3:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Instantiate the DecisionTreeClassifier
# Prepare the features and target variables for training
# Fit the decision tree model
# Plot the full decision tree
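As a reference only, a minimal sketch following the comments above. The names student_id, tree_full, and evaluate are assumptions; replace student_id with your own ID.

student_id = 0  # replace with your student ID number

# Prepare the features and target variables for training
features = ['pclass', 'sex', 'age', 'sibsp']
X_train, y_train = training[features], training['survived']
X_test, y_test = test[features], test['survived']

# Decision tree with at most 20 leaf nodes, seeded with the student ID
tree_full = DecisionTreeClassifier(max_leaf_nodes=20, random_state=student_id)
tree_full.fit(X_train, y_train)

# Plot the full decision tree
plt.figure(figsize=(14, 7))
plot_tree(tree_full, feature_names=features, class_names=['died', 'survived'], filled=True)
plt.show()

def evaluate(model, X, y):
    """Return accuracy, precision, recall, and F1 score on the given set."""
    pred = model.predict(X)
    return (accuracy_score(y, pred), precision_score(y, pred),
            recall_score(y, pred), f1_score(y, pred))

evaluate(tree_full, X_test, y_test)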
