Question:

Instructions:
1) The .ipynb file shall include not only the source code but also the necessary plots/figures and discussions, which include your observations, thoughts, and insights.
2) Please avoid using a single big block of code for everything and then plotting all figures together. Instead, use a small block of code for each sub-task, followed by its plots and discussions. This will make your homework more readable.
3) Please follow common software engineering practices, e.g., by including sufficient comments for functions, important statements, etc.
Programming Problem:
In this homework, you will implement a decision tree model and learn how to optimize it with the help of the sklearn package. We provide a notebook hw3_24Fall.ipynb which contains the code needed for Step 2 and some instructions that you can follow to help you develop your own code.
Step 1: Data Preparation
Load the Titanic.csv file and examine a sample of the data. Notice that the dataset contains both categorical and numerical features.
1. Address any missing values by imputing them with the feature's mean across the dataset.
2. For this assignment, select a subset of the data including the independent variables 'pclass', 'sex', 'age', and 'sibsp', and the dependent variable 'survived'.
3. Ensure that 'survived' is a binary variable coded as 1 (yes) or 0 (no).
4. Split the data into training and test sets, using an 80/20 split.
Step 2: Data Processing and Initial Analysis
1. Recognize that the 'age' attribute is continuous. As discussed in our class, decision trees are typically applied to categorical features. Utilize the steps outlined in the provided Jupyter notebook to discretize 'age' using quantile binning.
2. Compute the information gain to determine the optimal first split in the decision tree.
Step 3: Decision Tree Modeling
1. Employ sklearn to train a decision tree model, setting the maximum number of leaf nodes to 20 and the random_state to your student ID number. (Please apply this setting to all remaining questions.)
2. Visualize the complete tree. Note that your tree's structure and size may vary from the example.
3. Implement a function to calculate the accuracy, precision, recall, and F1 score on the test set.
Step 4: Model Optimization
1. Apply GridSearchCV() to identify the optimal max_leaf_nodes parameter, exploring values from 5 to 20, for tree pruning.
2. Plot the pruned tree, which should be more compact than the initially generated tree, and report its performance using the metrics from Step 3.3 (the same applies to Step 5). A sketch of one possible approach is shown below.
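For reference only, here is a minimal sketch of how this pruning search could look with sklearn's GridSearchCV. The names X_train, y_train, and student_id are assumptions standing in for the training data and random seed prepared in Steps 1-3 (see the code section below).

import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Search max_leaf_nodes from 5 to 20 (inclusive) with 5-fold cross-validation
param_grid = {'max_leaf_nodes': range(5, 21)}
grid = GridSearchCV(DecisionTreeClassifier(random_state=student_id),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("Best max_leaf_nodes:", grid.best_params_['max_leaf_nodes'])

# Plot the pruned tree selected by the grid search
plt.figure(figsize=(12, 6))
plot_tree(grid.best_estimator_, feature_names=list(X_train.columns), filled=True)
plt.show()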
Step 5: Advanced Modeling
1. Replicate Steps 3 and 4 to construct two additional decision tree models with varying parameters, such as the maximum depth and splitting criterion.
2. Use majority vote to create an ensemble learning model that combines the three decision tree models trained in Step 4 and Step 5.1.
3. Use the RandomForestClassifier() function to train a random forest using the optimal tree size you found in Step 4. You can set n_estimators to 50. Compare the performance of your random forest with that of your ensemble model. A sketch of one possible approach to Steps 5.2 and 5.3 is shown below.
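For reference only, a minimal sketch of one way Steps 5.2 and 5.3 could be approached. The names tree_pruned, tree_a, tree_b (the three fitted trees from Steps 4 and 5.1), best_leaf_nodes, X_train/X_test, y_train/y_test, and student_id are all assumptions carried over from the earlier steps.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def majority_vote(models, X):
    """Predict by majority vote over a list of fitted binary classifiers."""
    # Stack each model's 0/1 predictions into a (n_models, n_samples) array
    preds = np.array([m.predict(X) for m in models])
    # The majority class is 1 when more than half of the models vote 1
    return (preds.mean(axis=0) > 0.5).astype(int)

# Ensemble of the pruned tree from Step 4 and the two trees from Step 5.1
ensemble_pred = majority_vote([tree_pruned, tree_a, tree_b], X_test)

# Random forest with 50 trees, using the optimal tree size found in Step 4
rf = RandomForestClassifier(n_estimators=50,
                            max_leaf_nodes=best_leaf_nodes,
                            random_state=student_id)
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
# Compare ensemble_pred and rf_pred using the metrics function from Step 3.3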
Code:
import pandas as pd
pd.options.mode.chained_assignment = None
import numpy as np
import matplotlib.pyplot as plt
titanic = pd.read_csv('./Titanic.csv')
# Display the first few rows of the dataframe
titanic.head()
Step 1:
def QuantileBinning(feature, bin_number):
    """
    This function takes a numerical feature and the number of bins, and
    returns the feature binned into quantile-based bins.

    Parameters:
    - feature (pandas.Series): The numerical feature to be binned.
    - bin_number (int): The number of quantile bins.

    Returns:
    - pandas.Series: A series of discrete features binned by quantile.
    """
    # Use qcut to create quantile-based bins for the feature.
    # If there are fewer unique values than bins, qcut could throw an error;
    # the 'duplicates' parameter handles this by dropping redundant bins.
    return pd.qcut(feature, q=bin_number, labels=False, duplicates='drop')
# One example
feature_test = pd.DataFrame(np.random.rand(100), columns=['Column_A'])
feature_test_discrete = QuantileBinning(feature_test['Column_A'], 10)

def label_encoder(feature):
    """Map each unique label in a categorical feature to an integer code."""
    unique_labels = pd.unique(feature)
    label_to_int = {label: idx for idx, label in enumerate(unique_labels)}
    transformed_feature = np.array([label_to_int[label] for label in feature])
    return transformed_feature
# Fill missing values in 'age' with the average age
# Discretization
# Split the data into 80% training and 20% test sets
training.head()
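As a reference only, a minimal sketch of how the placeholders above could be filled in so that the training.head() preview works. The names titanic_sub, training, and test, the number of age bins, the 'survived' coding, and the split seed are assumptions; adjust them to your own notebook.

from sklearn.model_selection import train_test_split

# Fill missing values in 'age' with the average age
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())

# Keep only the selected independent variables and the target
titanic_sub = titanic[['pclass', 'sex', 'age', 'sibsp', 'survived']].copy()

# Encode 'sex' as integers; ensure 'survived' is coded as 0/1
# (the replace() is only needed if the file stores 'yes'/'no' strings)
titanic_sub['sex'] = label_encoder(titanic_sub['sex'])
titanic_sub['survived'] = titanic_sub['survived'].replace({'yes': 1, 'no': 0}).astype(int)

# Discretize 'age' into quantile bins (e.g., 4 bins)
titanic_sub['age'] = QuantileBinning(titanic_sub['age'], 4)

# Split the data into 80% training and 20% test sets
training, test = train_test_split(titanic_sub, test_size=0.2, random_state=0)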
Step 2:
# define your entropy function and information gain function
# Calculate Information Gain for each feature in the training set
info_gains
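For reference only, a minimal sketch of one way the entropy and information-gain functions could be written. The training DataFrame and the column names are assumed from Step 1.

def entropy(labels):
    """Shannon entropy (base 2) of a series of class labels."""
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(data, feature, target='survived'):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    total_entropy = entropy(data[target])
    weighted_entropy = 0.0
    for _, subset in data.groupby(feature):
        weighted_entropy += (len(subset) / len(data)) * entropy(subset[target])
    return total_entropy - weighted_entropy

# Information gain of each candidate feature on the training set;
# the feature with the largest gain is the optimal first split
info_gains = {f: information_gain(training, f) for f in ['pclass', 'sex', 'age', 'sibsp']}
info_gains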
Step 3:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Instantiate the DecisionTreeClassifier
# Prepare the features and target variables for training
# Fit the decision tree model
# Plot the full decision tree
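As a reference only, a minimal sketch following the comments above. The names student_id, tree_full, and evaluate are assumptions; replace student_id with your own ID.

student_id = 0  # replace with your student ID number

# Prepare the features and target variables for training
features = ['pclass', 'sex', 'age', 'sibsp']
X_train, y_train = training[features], training['survived']
X_test, y_test = test[features], test['survived']

# Decision tree with at most 20 leaf nodes, seeded with the student ID
tree_full = DecisionTreeClassifier(max_leaf_nodes=20, random_state=student_id)
tree_full.fit(X_train, y_train)

# Plot the full decision tree
plt.figure(figsize=(14, 7))
plot_tree(tree_full, feature_names=features, class_names=['died', 'survived'], filled=True)
plt.show()

def evaluate(model, X, y):
    """Return accuracy, precision, recall, and F1 score on the given set."""
    pred = model.predict(X)
    return (accuracy_score(y, pred), precision_score(y, pred),
            recall_score(y, pred), f1_score(y, pred))

evaluate(tree_full, X_test, y_test)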
