The spam data file contains 4601 emails, 1813 of which are spam. The file has 57 features, which include indicators for the presence of 54 keywords (e.g., free, deal, !), counts of capitalized characters, and so on, plus a numeric spam variable recording whether a human reader tagged each email as spam (the spam column is 1 for spam, 0 for important emails).
You have to predict the probability that a message is spam.
1) Partition the data into a training set (with 70% of the observations), and a testing set (with 30% of the observations) using the random state of 12345 for cross-validation.
2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the 'best k'?)
3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers.
4) Based on the results of k-nearest neighbor and logistic regression, what is the best model to classify the data? Provide an explanation to support your argument.
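
One possible way to approach steps 1 to 4 is sketched below with scikit-learn. This is only an illustration under several assumptions: the actual spam file is not loaded (its name and column layout are not given here), so a synthetic stand-in with the same shape (4601 rows, 57 features, roughly 39% positives) is generated instead, and the accuracy numbers it prints say nothing about the real data. The candidate grids for k and for the regularization strength C, the 5-fold cross-validation, and the feature scaling are likewise choices made for this sketch, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 12345  # random state given in the question

# ASSUMPTION: synthetic stand-in for the spam file (same shape and class balance).
# With the real file, X would hold the 57 feature columns and y the spam column.
X, y = make_classification(
    n_samples=4601, n_features=57, n_informative=20,
    weights=[0.606, 0.394], random_state=SEED,
)

# 1) 70% training / 30% testing partition.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=SEED
)

# Cross-validation folds on the training set only, seeded for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)

# 2) KNN: the "best k" is the one with the highest mean cross-validated accuracy.
# Features are standardized because KNN relies on distances.
knn_search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 32, 2))},
    cv=cv, scoring="accuracy",
)
knn_search.fit(X_train, y_train)
best_k = knn_search.best_params_["kneighborsclassifier__n_neighbors"]
knn_cv_acc = knn_search.best_score_
knn_test_acc = knn_search.score(X_test, y_test)

# 3) Logistic regression, tuning the inverse regularization strength C.
logit_search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=cv, scoring="accuracy",
)
logit_search.fit(X_train, y_train)
logit_cv_acc = logit_search.best_score_
logit_test_acc = logit_search.score(X_test, y_test)

# Predicted probability of the positive (spam = 1) class for each test message.
spam_proba = logit_search.predict_proba(X_test)[:, 1]

# 4) Compare the two models on the held-out test set.
print(f"KNN (k={best_k}): CV accuracy {knn_cv_acc:.3f}, test accuracy {knn_test_acc:.3f}")
print(f"Logistic regression: CV accuracy {logit_cv_acc:.3f}, test accuracy {logit_test_acc:.3f}")
```

Choosing k by cross-validation on the training set keeps the 30% testing set untouched until the final comparison, so the test accuracies stay an honest estimate. For step 4, the model with the higher test accuracy is the natural candidate, though interpretability (logistic regression coefficients) and prediction cost (KNN must store and search the training set) may also be worth mentioning in the explanation.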

Posted Date: Apr 24, 2024 02:02 AM