The spam data file contains 4601 emails, 1813 of which are spam. The file has 57 features
Question:
- The spam data file contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g., free, deal, !), counts of capitalized characters, etc., and a numeric spam variable indicating whether a human reader tagged each email as spam (the spam column is 1 for spam, 0 for important emails).
- You have to predict the probability that a message is spam.
- 1) Partition the data into a training set (70% of the observations) and a testing set (30% of the observations), using a random state of 12345 for cross-validation.
- 2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the 'best k'?)
- 3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers.
- 4) Based on the results of k-nearest neighbors and logistic regression, which model is best for classifying the data? Provide an explanation to support your argument.
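One way the four steps above could be approached is sketched below with scikit-learn. This is a minimal sketch, not a definitive answer. The question does not name the data file or its column names, so the example runs on a synthetic stand-in from `make_classification` with the same shape (4601 rows, 57 features, roughly 39% positives). The helper `compare_knn_and_logit`, the candidate grid of odd k values from 1 to 31, the 5-fold cross-validation used to pick k, and the feature scaling are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 12345  # random state given in the question


def compare_knn_and_logit(X, y, k_values=range(1, 32, 2)):
    """Hypothetical helper: pick k for KNN, fit both models, report accuracy."""
    # Step 1: 70/30 train/test split with the stated random state.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=SEED
    )

    # Step 2: choose k by 5-fold cross-validation on the training set only,
    # so the test set stays untouched until the final comparison.
    # Scaling lives inside the pipeline, so each fold is scaled on its own data.
    knn = GridSearchCV(
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        param_grid={"kneighborsclassifier__n_neighbors": list(k_values)},
        cv=5,
        scoring="accuracy",
    )
    knn.fit(X_train, y_train)

    # Step 3: logistic regression with the same scaling step.
    logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    logit.fit(X_train, y_train)

    # Step 4: report held-out accuracy for both models side by side.
    return {
        "best_k": knn.best_params_["kneighborsclassifier__n_neighbors"],
        "knn_cv_accuracy": knn.best_score_,
        "knn_test_accuracy": knn.score(X_test, y_test),
        "logit_test_accuracy": logit.score(X_test, y_test),
        # Predicted spam probabilities for the first five test rows.
        "logit_spam_proba_head": logit.predict_proba(X_test)[:5, 1],
    }


# Synthetic stand-in for the spam data (NOT the real file): same row and
# feature counts, with about 60.6% of rows in the negative class.
X_demo, y_demo = make_classification(
    n_samples=4601, n_features=57, n_informative=20,
    weights=[0.606], random_state=SEED,
)
results = compare_knn_and_logit(X_demo, y_demo)
for name, value in results.items():
    print(name, value)
```

With the real data, `X_demo` and `y_demo` would be replaced by the 57 feature columns and the spam column. Choosing k on the training folds and comparing the two models only on the held-out 30% keeps the final accuracy numbers from being biased by the tuning. The numbers printed for the synthetic data say nothing about which model wins on the actual spam file.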