Detecting Spam E-mail (from the UCI Machine Learning Repository). A team at Hewlett-Packard collected data on a large number of e-mail messages from their postmaster and personal e-mail for the purpose of finding a classifier that can separate e-mail messages that are spam vs. nonspam (a.k.a. “ham”). The spam concept is diverse: it includes advertisements for products or websites, “make money fast” schemes, chain letters, pornography, and so on. The definition used here is “unsolicited commercial e-mail.” The file Spambase.csv contains information on 4601 e-mail messages, among which 1813 are tagged “spam.” The predictors include 57 attributes, most of which are the average number of times a certain word (e.g., mail, George) or symbol (e.g., #, !) appears in the e-mail. A few predictors are related to the number and length of capitalized words.

a. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and nonspam e-mails by comparing the spam-class average and nonspam-class average. Which are the 11 predictors that appear to vary the most between spam and nonspam e-mails? From these 11, which words or signs occur more often in spam?
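One way to approach part (a) is sketched below in Python with pandas. This is only an illustration under stated assumptions, not the textbook's solution: the target column name `Spam` (coded 0/1), the helper name `top_differing_predictors`, and the predictor names in the demo frame are all hypothetical, and "vary the most" is read here as the largest absolute difference between the two class means. The demo uses a tiny made-up table rather than Spambase.csv.

```python
import pandas as pd


def top_differing_predictors(df, target="Spam", k=11):
    """Rank predictors by |spam mean - nonspam mean| and return the top k.

    Assumes `target` is coded 0 (nonspam) / 1 (spam) and every other
    column is a numeric predictor. Both are assumptions about the file.
    """
    # Rows become predictors, columns become the two class labels.
    means = df.groupby(target).mean().T
    means = means.rename(columns={0: "nonspam_mean", 1: "spam_mean"})
    means["diff"] = means["spam_mean"] - means["nonspam_mean"]
    means["abs_diff"] = means["diff"].abs()
    # A positive difference means the word or symbol is more frequent in spam.
    means["higher_in_spam"] = means["diff"] > 0
    return means.sort_values("abs_diff", ascending=False).head(k)


# Tiny made-up table standing in for Spambase.csv (hypothetical column names).
demo = pd.DataFrame({
    "word_free":   [0.0, 0.1, 0.9, 1.1],
    "word_george": [1.0, 1.2, 0.0, 0.0],
    "char_!":      [0.1, 0.1, 0.2, 0.2],
    "Spam":        [0, 0, 1, 1],
})

top = top_differing_predictors(demo, k=2)
print(top[["nonspam_mean", "spam_mean", "higher_in_spam"]])
```

On the real file, the call would look like `top_differing_predictors(pd.read_csv("Spambase.csv"), k=11)`, with `target` adjusted to whatever the class column is actually called. Raw mean differences depend on each predictor's scale, so the capitalization-length predictors may dominate the ranking; dividing each difference by the predictor's standard deviation is one alternative reading of "vary the most".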

b. Partition the data into training and validation sets, then perform a discriminant analysis on the training data using only the 11 predictors.
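Part (b) could be sketched with scikit-learn roughly as follows. Several things here are assumptions rather than anything stated in the question: the 60/40 training/validation split, the random seed, the stratified sampling, the `Spam` column name, and the helper name `fit_lda`. The demo runs on synthetic data so that it is self-contained; the real predictor list would be the 11 names found in part (a).

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split


def fit_lda(df, predictors, target="Spam", valid_size=0.4, seed=1):
    """Split into training/validation sets and fit LDA on the training part.

    valid_size=0.4 (a 60/40 split) and seed=1 are illustrative choices.
    """
    X_train, X_valid, y_train, y_valid = train_test_split(
        df[predictors], df[target],
        test_size=valid_size, random_state=seed, stratify=df[target],
    )
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_train, y_train)          # fit on training data only
    pred = lda.predict(X_valid)        # evaluate on held-out validation data
    return lda, confusion_matrix(y_valid, pred), accuracy_score(y_valid, pred)


# Synthetic stand-in for the 11 selected predictors (hypothetical names).
rng = np.random.default_rng(0)
n = 200
y = np.repeat([0, 1], n // 2)
demo = pd.DataFrame({
    "x1": rng.normal(loc=3.0 * y, scale=1.0),        # informative predictor
    "x2": rng.normal(loc=0.0, scale=1.0, size=n),    # pure noise
    "Spam": y,
})

model, cm, acc = fit_lda(demo, predictors=["x1", "x2"])
print(cm)     # rows: actual class, columns: predicted class
print(acc)    # validation accuracy
```

The confusion matrix and accuracy are computed on the validation set only, which is what makes them a fair estimate of how the discriminant functions would perform on new messages.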
