
Introduction
Random Forest is a renowned ensemble learning method used for classification
and regression tasks. Familiarize yourself with Random Forest before doing this
assignment. In summary, it builds many decision trees during training and
outputs the mode of the individual trees' predicted classes (classification) or
the mean of their predictions (regression).
A key feature of Random Forest is its use of bootstrap samples to train each
tree. Bootstrap sampling is a resampling technique in which samples are drawn
with replacement from the original dataset. This process inherently leaves out
a subset of the data from the training set for each tree, known as Out-of-Bag
(OOB) samples. These OOB samples have a special role in providing an internal
error estimate of the forest.
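To make the idea of a bootstrap sample and its OOB items concrete, here is a minimal sketch using NumPy. The function name `bootstrap_with_oob` and the toy dataset size of 10 are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def bootstrap_with_oob(n, rng):
    """Draw one bootstrap sample of size n; return (sampled, oob) index arrays."""
    indices = np.arange(n)
    # Sampling with replacement: the same index can appear several times.
    sampled = rng.choice(indices, size=n, replace=True)
    # Indices that were never drawn form the out-of-bag set for this tree.
    oob = np.setdiff1d(indices, sampled)
    return sampled, oob

rng = np.random.default_rng(0)
sampled, oob = bootstrap_with_oob(10, rng)
print("sampled indices:   ", sampled)
print("out-of-bag indices:", oob)
```

Every index lands either in the bootstrap sample (possibly more than once) or in the OOB set, never in both.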
1 Objective
Your mission is to discover and mathematically prove what proportion of the
original dataset is left out (not selected) during the bootstrap sampling process
in the creation of each tree within a Random Forest model. This journey involves
deriving the formula, taking the limit, and empirically validating your findings
through simulation.
2 Tasks
1. Derivation of the Formula:
Begin by considering a dataset of size N. For each tree in the Random
Forest, we draw samples with replacement to create a bootstrap
sample of the same size N.
Reflect on the probability of selecting a specific sample at least once
during this process.
Derive the formula that calculates the proportion of the dataset expected
to be left out of the bootstrap sample, knowing that the chance
of a specific sample being selected in one draw is 1/N.
2. Mathematical Proof Using Limits:
With the derived formula, analyze the behavior as N approaches
infinity. You will find this expression leads to a well-known limit.
Hint: Consider using the natural logarithm to transform the expression
into a form suitable for applying L'Hôpital's rule.
3. Simulation:
Implement a Python simulation to empirically demonstrate this proportion.
Create a synthetic dataset, perform bootstrap sampling,
and calculate the proportion of samples not selected in each round.
Plot the results to visualize the distribution. Please complete the
code provided below. Comment on your findings.
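Before proving the limit, it can help to check numerically how the left-out probability behaves as N grows. The sketch below assumes the probability that a fixed item is never drawn in N draws is (1 - 1/N)^N; the helper name `left_out_probability` and the list of sizes are illustrative choices.

```python
import math

def left_out_probability(n):
    """Probability that a fixed item is never drawn in n draws with replacement from n items."""
    return (1 - 1 / n) ** n

sizes = [10, 100, 1_000, 10_000, 100_000]
for n in sizes:
    p = left_out_probability(n)
    # Report the value alongside its distance from exp(-1).
    print(f"N={n:>7}  (1 - 1/N)^N = {p:.6f}  gap to exp(-1) = {math.exp(-1) - p:.2e}")
```

The values increase with N and approach exp(-1) from below, which is the behavior the limit argument in Task 2 should explain.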
3 Guidance for Derivation, Proof, Simulation
Think about the process of selecting N samples with replacement from a dataset
of size N. What is the probability of a specific item not being selected in one
draw? How does this probability change over N draws?
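One way to answer these questions, offered as a sketch rather than the required write-up: treat the N draws as independent, each picking any given item with probability 1/N. The indicator notation $I_i$ (item $i$ is never drawn) is introduced here for illustration only.

```latex
\begin{align*}
P(\text{item } i \text{ not selected in one draw}) &= 1 - \frac{1}{N} \\
P(\text{item } i \text{ not selected in any of } N \text{ draws}) &= \left(1 - \frac{1}{N}\right)^{N} \\
\mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^{N} I_i\right]
  &= \frac{1}{N}\sum_{i=1}^{N} \left(1 - \frac{1}{N}\right)^{N}
   = \left(1 - \frac{1}{N}\right)^{N}
\end{align*}
```

The last line uses linearity of expectation, so the expected left-out proportion equals the per-item probability of never being drawn.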
To find the limit of the derived expression as N approaches infinity, consider
logarithmic transformation as a strategy to simplify the problem. How does this
transformation help in applying L'Hôpital's rule?
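As a sketch of how the logarithmic transformation can be carried out, assume the left-out proportion has the form $(1 - 1/N)^N$ and call its limit $L$. The substitution $x = 1/N$ is one convenient choice, not the only one.

```latex
\begin{align*}
\ln L &= \lim_{N \to \infty} N \ln\!\left(1 - \frac{1}{N}\right)
       = \lim_{N \to \infty} \frac{\ln\!\left(1 - \frac{1}{N}\right)}{1/N}
       && \text{(indeterminate form } 0/0\text{)} \\
      &= \lim_{x \to 0^{+}} \frac{\ln(1 - x)}{x}
       && \text{(substituting } x = 1/N\text{)} \\
      &= \lim_{x \to 0^{+}} \frac{-\dfrac{1}{1 - x}}{1} = -1
       && \text{(L'H\^opital's rule)} \\
L &= e^{-1} \approx 0.368
\end{align*}
```

Taking the logarithm turns the power into a product, and rewriting that product as a quotient gives the 0/0 form that L'Hôpital's rule requires.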
The following is the code to complete for the simulation part (see file
simulation_oob_error.py). You should complete the code in that file to run the
simulation.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def synthetic_dataset_and_sampling(n, num_trials):
    # TODO: Create a synthetic dataset with n rows. Hint: Use np.arange
    df = pd.DataFrame({'value': ___})
    proportions_not_selected = []
    for _ in range(num_trials):
        # TODO: Sample with replacement. Hint: Use the sample method with n and replace=True
        sampled_df = df.___
        # TODO: Determine which samples were not selected. Hint: Use np.setdiff1d
        not_selected = np.___(df['value'], sampled_df['value'])
        # TODO: Calculate and append the proportion not selected
        proportion_not_selected = ___ / n
        proportions_not_selected.append(proportion_not_selected)
    # TODO: Plot the distribution of proportions not selected. Hint: Use plt.hist
    plt.___(proportions_not_selected, bins=30, edgecolor='k', alpha=0.7)
    plt.title('Proportion of Points Not Selected in Bootstrap Samples')
    plt.xlabel('Proportion Not Selected')
    plt.ylabel('Frequency')
    plt.show()

# Invoke the function for a dataset of size 1000 and 1000 trials
synthetic_dataset_and_sampling(1000, 1000)
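For comparison, here is a NumPy-only sketch of the same experiment. It is not the completed template: it skips the pandas DataFrame and the histogram, and the function name `simulate_oob_proportions` and the `seed` parameter are assumptions added for reproducibility.

```python
import numpy as np

def simulate_oob_proportions(n, num_trials, seed=0):
    """Return the fraction of n items left out of each of num_trials bootstrap samples."""
    rng = np.random.default_rng(seed)
    proportions = np.empty(num_trials)
    for t in range(num_trials):
        # Draw n indices with replacement from {0, ..., n-1}.
        sampled = rng.integers(0, n, size=n)
        # Items that never appear in the draw are the left-out (OOB) items.
        num_selected = np.unique(sampled).size
        proportions[t] = (n - num_selected) / n
    return proportions

n, num_trials = 1000, 1000
proportions = simulate_oob_proportions(n, num_trials)
print(f"simulated mean : {proportions.mean():.4f}")
print(f"(1 - 1/n)^n    : {(1 - 1 / n) ** n:.4f}")
print(f"exp(-1)        : {np.exp(-1):.4f}")
```

The simulated mean should sit close to both reference values. In the pandas template, the equivalent sampling step would be a call such as `df.sample(n=n, replace=True)`, with `np.setdiff1d` identifying the rows that were not selected.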
