Question: In this assignment, we will build and evaluate a spam filter using a dataset whose columns give the frequency of common words and characters in each email, plus a label column indicating whether the email is spam. Please answer the following questions based on your implemented code (implementation in MATLAB):
a) Draw a bar chart to view the distribution of spam and non-spam email samples in the dataset. How many emails are in the dataset? How many of the emails are spam?
b) Divide the dataset into training and test sets. Since this is a binary classification problem, use a logistic regression or random forest algorithm to build a model that predicts whether an email is spam.
c) Build the confusion matrix and calculate precision and recall metrics to evaluate the performance of your model.
d) Take another look at the distribution of sample emails (i.e., part a). Is the distribution imbalanced? If so, oversample the minority class using the SMOTE algorithm and retrain your model.
e) Rebuild the confusion matrix and compare it with your initial matrix. What are the differences between the two models? Does SMOTE work well? Explain your answer in detail.
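As a rough illustration of the two calculations that parts c) and d) rely on, here is a minimal MATLAB sketch. It is not a solution to the assignment: the labels and feature values are made-up toy data, the variable names (yTrue, yPred, Xmin, Xsyn) are hypothetical, and the SMOTE part is a simplified version that only shows the core idea of interpolating between a minority sample and one of its nearest minority neighbours. It uses base MATLAB only, so no toolbox is assumed.

```matlab
% ---- Part c idea: precision and recall from confusion-matrix counts ----
% Toy labels (hypothetical, NOT from the assignment dataset): 1 = spam, 0 = non-spam
yTrue = [1 1 1 0 0 0 0 0 0 0]';
yPred = [1 1 0 1 0 0 0 0 0 0]';

% Count each outcome, treating spam as the positive class
TP = sum(yTrue == 1 & yPred == 1);   % spam correctly flagged
FP = sum(yTrue == 0 & yPred == 1);   % non-spam wrongly flagged
FN = sum(yTrue == 1 & yPred == 0);   % spam that was missed
TN = sum(yTrue == 0 & yPred == 0);   % non-spam correctly passed

% Rows: true class (0, 1); columns: predicted class (0, 1)
C = [TN FP; FN TP];

precision = TP / (TP + FP);          % of emails flagged as spam, fraction truly spam
recall    = TP / (TP + FN);          % of true spam emails, fraction that was flagged

% ---- Part d idea: simplified SMOTE-style oversampling ----
% Toy minority-class feature rows (hypothetical values)
Xmin = [1 1; 1 2; 2 1; 2 2];
k    = 2;                            % neighbours considered per sample (assumed)
nNew = 6;                            % number of synthetic samples to create (assumed)

n = size(Xmin, 1);

% Pairwise squared Euclidean distances between minority samples
D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i, j) = sum((Xmin(i, :) - Xmin(j, :)).^2);
    end
end
D(1:n+1:end) = Inf;                  % a point is not its own neighbour

% Indices of the k nearest minority neighbours of each sample
[~, order] = sort(D, 2);
nbrs = order(:, 1:k);

% Each synthetic point lies on the segment between a random minority
% sample and one of its k nearest minority neighbours
Xsyn = zeros(nNew, size(Xmin, 2));
for s = 1:nNew
    i   = randi(n);                  % pick a minority sample at random
    j   = nbrs(i, randi(k));         % pick one of its neighbours at random
    gap = rand;                      % interpolation factor in (0, 1)
    Xsyn(s, :) = Xmin(i, :) + gap * (Xmin(j, :) - Xmin(i, :));
end
```

In an actual solution, the synthetic rows would normally be added to the training set only, with the test set left untouched, so that the confusion matrix in part e) still reflects the original class distribution.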
