Question: Detecting Spam E-mail (from the UCI Machine Learning Repository). A team at Hewlett-Packard collected data on a large number of e-mail messages from their postmaster
Detecting Spam E-mail (from the UCI Machine Learning Repository). A team at Hewlett-Packard collected data on a large number of e-mail messages from their postmaster and personal e-mail for the purpose of finding a classifier that can separate e-mail messages that are spam vs. nonspam (a.k.a. “ham”). The spam concept is diverse: It includes advertisements for products or websites, “make money fast”
schemes, chain letters, pornography, and so on. The definition used here is “unsolicited commercial e-mail.” The file Spambase.csv contains information on 4601 e-mail messages, among which 1813 are tagged “spam.” The predictors include 57 attributes, most of them are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the e-mail. A few predictors are related to the number and length of capitalized words.
a. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and nonspam e-mails by comparing the spam-class average and nonspam-class average. Which are the 11 predictors that appear to vary the most between spam and nonspam e-mails? From these 11, which words or signs occur more often in spam?
b. Partition the data into training and validation sets, then perform a discriminant analysis on the training data using only the 11 predictors.
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
