Question:
Instructions:
1) The .ipynb file shall include not only the source code but also the necessary plots/figures and discussions, which include your observations, thoughts, and insights.
2) Please avoid using a single big block of code for everything and then plotting all figures altogether. Instead, use a small block of code for each subtask, followed by its plots and discussions. This will make your homework more readable.
3) Please follow common software engineering practices, e.g., by including sufficient comments for functions, important statements, etc.
Programming Problem:
In this homework, you will implement a decision tree model and learn how to optimize it with the help of the sklearn package. We provide a notebook, hwFall.ipynb, which contains starter code and some instructions that you can follow to help you develop your own code.
Step 1: Data Preparation
Load the Titanic.csv file and examine a sample of the data. Notice that the dataset contains both categorical and numerical features.
Address any missing values by imputing them with the feature's mean across the dataset.
For this assignment, select a subset of the data including the independent variables: 'pclass', 'sex', 'age', and 'sibsp', and the dependent variable 'survived'.
Ensure that 'survived' is a binary variable coded as yes or no.
Split the data into training and test sets.
Step 2: Data Processing and Initial Analysis
Recognize that the 'age' attribute is continuous. As discussed in our class, decision trees are typically applied to categorical features. Utilize the steps outlined in the provided Jupyter notebook to discretize 'age' using quantile binning.
Compute the information gain to determine the optimal first split in the decision tree.
Step 3: Decision Tree Modeling
Employ sklearn to train a decision tree model, setting the maximum number of leaf nodes as specified and random_state to your student ID number. Please apply these settings to all of the remaining questions.
Visualize the complete tree. Note that your tree's structure and size may vary from the example.
Implement a function to calculate the accuracy, precision, recall, and F1 score on the test set, as sketched below.
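For reference, a metrics helper along the following lines could be used. This is a minimal sketch; the pos_label argument is an assumption and should match how the positive ('survived' = yes) class is encoded in your data.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(y_true, y_pred, pos_label='yes'):
    """Return accuracy, precision, recall, and F1 score for a set of predictions.

    pos_label should match how the positive class is encoded,
    e.g. pos_label=1 if the target is stored as 0/1 instead of 'no'/'yes'.
    """
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, pos_label=pos_label),
        'recall': recall_score(y_true, y_pred, pos_label=pos_label),
        'f1': f1_score(y_true, y_pred, pos_label=pos_label),
    }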
Step 4: Model Optimization
Apply GridSearchCV to identify the optimal max_leaf_nodes parameter, exploring a range of values for tree pruning.
Plot the pruned tree, which should be more compact than the initially generated tree. Report its performance using the metrics from Step 3 (same below for Step 5).
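A minimal sketch of how this grid search could be set up follows; the parameter range, cv value, and scoring choice are placeholders (the exact values are not reproduced above), and X_train/y_train are assumed to come from Step 1.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Placeholder search range; substitute the range required by the assignment
param_grid = {'max_leaf_nodes': list(range(2, 21))}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),   # use your student ID as random_state
    param_grid,
    cv=5,                                     # 5-fold cross-validation (an assumption)
    scoring='accuracy',
)
grid_search.fit(X_train, y_train)             # X_train / y_train from Step 1
best_leaf_nodes = grid_search.best_params_['max_leaf_nodes']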
Step 5: Advanced Modeling
Replicate the earlier modeling and optimization steps to construct two additional decision tree models with varying parameters, such as the maximum depth and the splitting criterion.
Use majority vote to create an ensemble learning model that combines the three decision tree models trained in the previous steps.
Use the RandomForestClassifier function to train a random forest using the optimal tree size you found with GridSearchCV, setting n_estimators to the specified value. Compare the performance of your random forest and your ensemble model.
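One possible way to combine the three fitted trees by majority vote and to train the random forest for comparison is sketched below; tree1, tree2, tree3, best_leaf_nodes, and the n_estimators value are assumptions standing in for the models and values produced in the earlier steps.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Majority vote over the three decision trees trained above
pred_df = pd.DataFrame({
    't1': tree1.predict(X_test),
    't2': tree2.predict(X_test),
    't3': tree3.predict(X_test),
})
# For each row, take the most common prediction across the three trees
ensemble_pred = pred_df.mode(axis=1)[0]

# Random forest using the optimal tree size found with GridSearchCV
rf = RandomForestClassifier(
    n_estimators=100,                 # placeholder; set per the assignment
    max_leaf_nodes=best_leaf_nodes,
    random_state=0,                   # use your student ID
)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

A row-wise mode over the prediction DataFrame is a simple way to implement majority voting with already-fitted models; sklearn's VotingClassifier would be an alternative if the trees were refit inside the ensemble.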
Code:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
titanic = pd.read_csv('Titanic.csv')
# Display the first few rows of the dataframe
titanic.head()
Step 1:
def QuantileBinning(feature, bin_number):
    """
    This function takes a numerical feature and the number of bins, and
    returns the feature binned into quantile-based bins.

    Parameters:
    feature (pandas.Series): The numerical feature to be binned.
    bin_number (int): The number of quantile bins.

    Returns:
    pandas.Series: A series of discrete features binned by quantile.
    """
    # Use qcut to create quantile-based bins for the feature.
    # If there are fewer unique values than bins, qcut could throw an error;
    # the 'duplicates' parameter handles this by dropping redundant bins.
    return pd.qcut(feature, q=bin_number, labels=False, duplicates='drop')
# One example (the sample size and bin count below are placeholders; the original values were lost)
feature_test = pd.DataFrame(np.random.rand(100), columns=['ColumnA'])
feature_test['discrete'] = QuantileBinning(feature_test['ColumnA'], 4)
def label_encoder(feature):
    """Map each unique label in a categorical feature to an integer code."""
    unique_labels = pd.unique(feature)
    label_to_int = {label: idx for idx, label in enumerate(unique_labels)}
    transformed_feature = np.array([label_to_int[label] for label in feature])
    return transformed_feature
# Fill missing values in 'age' with the average age
# Discretization
# Split the data into training and test sets
training.head()
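A minimal sketch of these preparation steps, assuming the column names listed in Step 1; the bin count, the 0/1-to-yes/no recoding, the train/test ratio, and the random seed are placeholders since the exact values are not given above.

from sklearn.model_selection import train_test_split

# Keep only the columns used in this assignment
data = titanic[['pclass', 'sex', 'age', 'sibsp', 'survived']].copy()

# Fill missing values in 'age' with the average age
data['age'] = data['age'].fillna(data['age'].mean())

# Encode the categorical 'sex' column and discretize 'age' with quantile binning
data['sex'] = label_encoder(data['sex'])
data['age'] = QuantileBinning(data['age'], 4)            # bin count is a placeholder

# Recode 'survived' as yes/no (assumes it is stored as 0/1 in the CSV)
data['survived'] = np.where(data['survived'] == 1, 'yes', 'no')

# Split the data into training and test sets (ratio and seed are placeholders)
training, test = train_test_split(data, test_size=0.2, random_state=0)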
Step 2:
# define your entropy function and information gain function
# Calculate Information Gain for each feature in the training set
info_gains = {}  # dictionary mapping each feature to its information gain
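One possible implementation of these helpers is sketched below; it assumes the training DataFrame `training` from Step 1 and the feature list given in the assignment.

def entropy(target):
    """Shannon entropy of a categorical target (a pandas Series)."""
    probabilities = target.value_counts(normalize=True)
    return -np.sum(probabilities * np.log2(probabilities))

def information_gain(data, feature, target='survived'):
    """Information gain of splitting `data` on `feature` with respect to `target`."""
    total_entropy = entropy(data[target])
    # Weighted average entropy of the subsets produced by the split
    weighted_entropy = sum(
        (len(subset) / len(data)) * entropy(subset[target])
        for _, subset in data.groupby(feature)
    )
    return total_entropy - weighted_entropy

# Information gain for each candidate feature in the training set
info_gains = {
    feature: information_gain(training, feature)
    for feature in ['pclass', 'sex', 'age', 'sibsp']
}
print(info_gains)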
Step 3:
from sklearn.tree import DecisionTreeClassifier, plot_tree  # sklearn.metrics is not imported
# Instantiate the DecisionTreeClassifier
# Prepare the features and target variables for training
# Fit the decision tree model
# Plot the full decision tree
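A sketch of what this step could look like; the max_leaf_nodes value is a placeholder, random_state should be your student ID, and evaluate_model refers to the metrics helper sketched in Step 3 of the instructions.

features = ['pclass', 'sex', 'age', 'sibsp']

# Prepare the features and target variables for training
X_train, y_train = training[features], training['survived']
X_test, y_test = test[features], test['survived']

# Instantiate and fit the decision tree model
tree_model = DecisionTreeClassifier(max_leaf_nodes=10,   # placeholder; use the assigned value
                                    random_state=0)      # use your student ID
tree_model.fit(X_train, y_train)

# Plot the full decision tree
plt.figure(figsize=(20, 10))
plot_tree(tree_model, feature_names=features, class_names=True, filled=True)
plt.show()

# Evaluate on the test set (assumes 'survived' is coded as yes/no)
print(evaluate_model(y_test, tree_model.predict(X_test), pos_label='yes'))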