In this assignment, we will build and evaluate a spam filter using a dataset whose feature columns give the frequency of selected words and characters in each email, and whose label column indicates whether the email is spam. Answer the following questions based on your own Python implementation:

a) Draw a bar chart showing the distribution of spam and non-spam email samples in the dataset. How many emails are in the dataset? How many of the emails are spam?
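One way to approach part a) is sketched below. It is only an illustration: the toy `labels` list stands in for the dataset's label column (whose real name and values are not given here), and the encoding 1 = spam, 0 = non-spam is an assumption. The helper names `class_counts` and `plot_class_counts` are hypothetical. The plotting helper assumes matplotlib is installed and is defined but not called in the demo.

```python
from collections import Counter


def class_counts(labels):
    """Count samples per class, assuming 1 = spam and 0 = non-spam."""
    counts = Counter(labels)
    return {"non-spam": counts.get(0, 0), "spam": counts.get(1, 0)}


def plot_class_counts(counts, path="class_distribution.png"):
    """Save a bar chart of the class counts (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(counts.keys()), list(counts.values()))
    ax.set_xlabel("Class")
    ax.set_ylabel("Number of emails")
    ax.set_title("Spam vs. non-spam emails")
    fig.savefig(path)
    plt.close(fig)


# Toy labels standing in for the real label column (hypothetical values).
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
counts = class_counts(labels)
total = sum(counts.values())
print(f"total emails: {total}, spam: {counts['spam']}")
```

With the real dataset, the label column would be passed to `class_counts`, and `plot_class_counts(counts)` would produce the chart.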

b) Divide the dataset into training and test sets. Since this is a binary classification problem, use a logistic regression or random forest algorithm to build a model that predicts whether an email is spam.
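The split-and-fit step in part b) can be sketched in plain Python as follows. This is a teaching sketch under stated assumptions, not the expected solution: the data is synthetic (two Gaussian clusters with an 80/20 class ratio), the helper names are hypothetical, and the logistic regression is fitted with simple batch gradient descent. In practice, scikit-learn's `train_test_split` and `LogisticRegression` (or `RandomForestClassifier`) would normally be used instead.

```python
import math
import random


def stratified_split(X, y, test_ratio=0.25, seed=0):
    """Split rows into train/test sets, keeping the class ratio in both."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_ratio))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]

    def pick(rows, ids):
        return [rows[i] for i in ids]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b, threshold=0.5):
    """Label a row as spam (1) when its predicted probability reaches the threshold."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
            for xi in X]


# Synthetic stand-in data: 80 "non-spam" rows near (0, 0), 20 "spam" rows near (3, 3).
rng = random.Random(42)
X = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(80)]
X += [[rng.gauss(3.0, 0.5), rng.gauss(3.0, 0.5)] for _ in range(20)]
y = [0] * 80 + [1] * 20

X_train, X_test, y_train, y_test = stratified_split(X, y)
w, b = fit_logistic(X_train, y_train)
y_pred = predict(X_test, w, b)
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The split is stratified so that the test set keeps roughly the same spam ratio as the full dataset, which matters when one class is much smaller than the other.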

c) Build the confusion matrix and calculate precision and recall metrics to evaluate the performance of your model.
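The metrics in part c) follow directly from the four cells of the confusion matrix: precision = TP / (TP + FP) and recall = TP / (TP + FN), with spam taken as the positive class. A minimal sketch, using made-up labels and predictions and hypothetical helper names, might look like this (scikit-learn's `confusion_matrix`, `precision_score`, and `recall_score` compute the same quantities):

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating label 1 (spam) as the positive class."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall(tn, fp, fn, tp):
    """Precision: share of predicted spam that is spam. Recall: share of spam that was caught."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tn, fp, fn, tp)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, a false positive means a legitimate email is flagged as spam, while a false negative means spam reaches the inbox, so the two metrics capture different costs.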

d) Take another look at the distribution of sample emails (see part a). Is the distribution imbalanced? If so, oversample the minority class using the SMOTE algorithm and retrain your model.
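To make the idea behind SMOTE in part d) concrete, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours. It is a simplified illustration with toy data and a hypothetical `smote` helper, not a replacement for a library implementation such as `SMOTE` from the imbalanced-learn package (`imblearn.over_sampling`).

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy training data (hypothetical): 8 majority rows and 3 minority rows.
majority = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.5, 0.3],
            [0.4, 0.6], [0.6, 0.2], [0.7, 0.5], [0.2, 0.7]]
minority = [[5.0, 5.0], [5.5, 4.8], [4.7, 5.3]]

new_points = smote(minority, n_new=len(majority) - len(minority))
X_balanced = majority + minority + new_points
y_balanced = [0] * len(majority) + [1] * (len(minority) + len(new_points))
print(f"class 0: {y_balanced.count(0)}, class 1: {y_balanced.count(1)}")
```

Oversampling is normally applied to the training set only, so that the test set still reflects the original class distribution when the retrained model is evaluated.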

e) Rebuild the confusion matrix and compare it with your initial matrix. What are the differences between the two models? Does SMOTE work well? Explain your answer in detail.
