Decision Tree, post-pruning and cost complexity parameter using sklearn 0.22 [10 points, Peer Review]
We will use a pre-processed natural language dataset in the CSV file "spamdata.csv" to classify emails as spam or not. Each row contains the word frequency for 54 words plus statistics on the longest "run" of capital letters.
Word frequency is given by:
$f_i = m_i / N$
where $f_i$ is the frequency for word $i$, $m_i$ is the number of times word $i$ appears in the email, and $N$ is the total number of words in the email.
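As a small illustration of the definition above (a word's count divided by the total number of words in the email), here is a minimal sketch. The function name `word_frequencies`, the lowercasing, and the whitespace tokenization are assumptions made for this example; they are not taken from the assignment or from how spamdata.csv was actually built.

```python
from collections import Counter


def word_frequencies(email_text, vocabulary):
    """Return {word: count / total_words} for each word in vocabulary."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokens = email_text.lower().split()
    total = len(tokens)
    counts = Counter(tokens)
    if total == 0:
        # An empty email has no words, so every frequency is zero.
        return {word: 0.0 for word in vocabulary}
    return {word: counts[word] / total for word in vocabulary}


freqs = word_frequencies("free money free offer now", ["free", "offer", "meeting"])
print(freqs)  # {'free': 0.4, 'offer': 0.2, 'meeting': 0.0}
```

Here "free" appears 2 times out of 5 words, so its frequency is 2/5 = 0.4.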
We will use decision trees to classify the emails.
Part A [5 points]: Complete the function get_spam_dataset to read in values from the dataset and split the data into train and test sets.
My Code:
def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1):
    '''
    get_spam_dataset

    Loads csv file located at "filepath". Shuffles the data and splits
    it so that you have (1-test_split)*100% training examples and
    (test_split)*100% testing examples.

    Args:
        filepath: location of the csv file
        test_split: percentage/100 of the data should be the testing split

    Returns:
        X_train, X_test, y_train, y_test, feature_names
        Note: feature_names is a list of all column names including isSpam.
        (in that order)
        first four are np.ndarray
    '''
    # your code here
    # Read CSV file
    data = pd.read_csv(filepath, header=None, delimiter='')
    # Shuffle the data
    data = data.sample(frac=1, random_state=42).reset_index(drop=True)
    # Extract features and target variable
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    # Split the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_split, random_state=42)
    # Get feature names
    feature_names = [f"word_freq_{i}" for i in range(1, X.shape[1] + 1)]
    return X_train, X_test, y_train, y_test, feature_names
# TO-DO: import the data set into five variables: X_train, X_test, y_train, y_test, label_names
# Uncomment and edit the line below to complete this task.
test_split = 0.1  # default test_split; change it if you'd like; ensure that this variable is used as an argument to your function
# your code here
X_train, X_test, y_train, y_test, label_names = get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1)
# X_train, X_test, y_train, y_test, label_names = np.arange(5)
# Print the shapes of X_train and y_train
print("Shape of X_train:", X_train.shape)
print("Shape of y_train:", y_train.shape)
# Print label_names
print("Label names:", label_names)
It's returning the wrong answer. Can someone help?
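A possible fix, sketched under assumptions since the grader and the CSV itself are not shown here. Three things in the posted code look inconsistent with the docstring. First, `delimiter=''` does not name a separator, and `header=None` treats the first row as data, so if the file has a header row (which the docstring's "column names" suggests), that row ends up mixed into the values. Second, `feature_names` is generated as `word_freq_1`, `word_freq_2`, ... instead of being read from the file, and it leaves out isSpam. Third, the call passes the literal `0.1` rather than the `test_split` variable. The version below assumes a comma-separated file whose first row holds the column names and whose last column is isSpam.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1):
    # Assumption: comma-separated file with a header row, so the pandas
    # defaults (sep=",", header=0) apply.
    data = pd.read_csv(filepath)

    # Assumption: the target column isSpam is the last column.
    X = data.iloc[:, :-1].to_numpy()
    y = data.iloc[:, -1].to_numpy()

    # train_test_split shuffles by default, so a separate shuffle
    # step is not needed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_split, random_state=42
    )

    # All column names, including isSpam, in file order.
    feature_names = list(data.columns)
    return X_train, X_test, y_train, y_test, feature_names
```

When calling it, pass the variable rather than a literal, e.g. `get_spam_dataset(filepath="data/spamdata.csv", test_split=test_split)`. If the grader expects a particular `random_state` or column order, those details would need to match the assignment's own specification.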
