Question:

A team at Hewlett-Packard collected data on a large number of email messages from their postmaster and personal email for the purpose of finding a classifier that can separate email messages that are spam vs. nonspam (a.k.a. "ham"). The spam concept is diverse: it includes advertisements for products or websites, "make money fast" schemes, chain letters, pornography, and so on. The definition used here is "unsolicited commercial email." The file Spambase.xls contains information on 4601 email messages, of which 1813 are tagged "spam." There are 57 predictors, most of which are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the email. A few predictors are related to the number and length of capitalized words.

a. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and nonspam emails by comparing the spam-class average and nonspam-class average. Which are the 11 predictors that appear to vary the most between spam and nonspam emails? From these 11, which words or signs occur more often in spam?

b. Partition the data into training and validation sets; then perform a discriminant analysis on the training data using only the 11 predictors.

c. If we are interested mainly in detecting spam messages, is this model useful? Use the confusion matrix, lift chart, and decile chart for the validation set for the evaluation.

d. In the sample, almost 40% of the email messages were tagged as spam. However, suppose that the actual proportion of spam messages in these email accounts is 10%. Compute the constants of the classification functions to account for this information.

e. A spam filter based on your model is used, so that only messages classified as nonspam are delivered, while messages classified as spam are quarantined. In this case, misclassifying a nonspam email (as spam) is far more costly than missing a spam message. Suppose that the cost of quarantining a nonspam email is 20 times that of not detecting a spam message. Compute the constants of the classification functions to account for these costs (assume that the proportion of spam is reflected correctly by the sample proportion).
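
For parts d and e, one common way to account for prior probabilities and misclassification costs in discriminant analysis is to shift each class's classification-function constant by the log of its prior and the log of the cost of misclassifying a member of that class. The sketch below shows that arithmetic only. The function name `adjust_constants` and the zero-valued `fitted_constants` are hypothetical placeholders, not results from the Spambase data, and the sketch assumes the fitted constants already embed the sample proportions as priors. If your software fitted the model with equal priors instead, pass equal values as `fitted_priors`.

```python
import math

def adjust_constants(constants, fitted_priors, new_priors, costs=None):
    """Return classification-function constants shifted for new priors and costs.

    For each class j: c_j - ln(fitted prior_j) + ln(new prior_j) + ln(cost_j),
    where cost_j is the cost of misclassifying a true member of class j.
    """
    if costs is None:
        costs = {label: 1.0 for label in constants}  # equal costs by default
    return {
        label: c
        - math.log(fitted_priors[label])
        + math.log(new_priors[label])
        + math.log(costs[label])
        for label, c in constants.items()
    }

# Sample proportions taken from the problem statement.
N_TOTAL, N_SPAM = 4601, 1813
sample_priors = {
    "spam": N_SPAM / N_TOTAL,
    "nonspam": (N_TOTAL - N_SPAM) / N_TOTAL,
}

# Placeholder constants (NOT fitted values); replace with your model's output.
fitted_constants = {"spam": 0.0, "nonspam": 0.0}

# Part d: true spam proportion is 10%, costs are equal.
part_d = adjust_constants(
    fitted_constants, sample_priors, {"spam": 0.10, "nonspam": 0.90}
)

# Part e: sample priors are kept; quarantining a nonspam costs 20x a missed spam.
part_e = adjust_constants(
    fitted_constants, sample_priors, sample_priors,
    costs={"spam": 1.0, "nonspam": 20.0},
)

print("part d shifts:", part_d)
print("part e shifts:", part_e)
```

Because the placeholder constants are zero, the printed values are the shifts themselves: in part d the spam constant moves down and the nonspam constant moves up, and in part e only the nonspam constant moves, by ln(20). The coefficients on the 11 predictors are unchanged in both cases.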
