Question: A team at Hewlett-Packard collected data on a large number of email messages from their postmaster and personal email for the purpose of finding a

A team at Hewlett-Packard collected data on a large number of email messages from their postmaster and personal email for the purpose of finding a classifier that can separate email messages that are spam vs. nonspam (a.k.a. "ham"). The spam concept is diverse: it includes advertisements for products or websites, "make money fast" schemes, chain letters, pornography, and so on. The definition used here is "unsolicited commercial email." The file Spambase.xls contains information on 4601 email messages, among which 1813 are tagged "spam." The predictors include 57 attributes, most of them are the average number of times a certain word (e.g., mail, George) or symbol (e.g., \#, !) appears in the email. A few predictors are related to the number and length of capitalized words.

a. To reduce the number of predictors to a manageable size, examine how each predictor differs between the spam and nonspam emails by comparing the spam-class average and nonspam-class average. Which are the 11 predictors that appear to vary the most between spam and nonspam emails? From these 11, which words or signs occur more often in spam?

b. Partition the data into training and validation sets; then perform a discriminant analysis on the training data using only the 11 predictors.

c. If we are interested mainly in detecting spam messages, is this model useful? Use the confusion matrix, lift chart, and decile chart for the validation set for the evaluation.

d. In the sample, almost $40 \%$ of the email messages were tagged as spam. However, suppose that the actual proportion of spam messages in these email accounts is $10 \%$. Compute the constants of the classification functions to account for this information.

e. A spam filter that is based on your model is used, so that only messages that are classified as nonspam are delivered, while messages that are classified as spam are quarantined. In this case, misclassifying a nonspam email (as spam) has much heftier results. Suppose that the cost of quarantining a nonspam email is 20 times that of not detecting a spam message. Compute the constants of the classification functions to account for these costs (assume that the proportion of spam is reflected correctly by the sample proportion).

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer

Step: 1 Unlock blur-text-image

Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock

Step: 3 Unlock

Students Have Also Explored These Related Business Statistics In Practice Questions!

Planning is one of the most important management functions in any business. A front office managers first step in planning should involve determine the departments goals. Planning also includes...

Googles ease of use and superior search results have propelled the search engine to its num- ber one status, ousting the early dominance of competitors such as WebCrawler and Infos- eek. Even later...

The following additional information is available for the Dr. Ivan and Irene Incisor family from Chapters 1 and 2. On September 1, Irene opened a retail store that specializes in sports car...

Detecting Spam E-mail (from the UCI Machine Learning Repository). A team at HewlettPackard collected data on a large number of e-mail messages from their postmaster and personal e-mail for the...

Detecting Spam E-mail (from the UCI Machine Learning Repository). A team at Hewlett-Packard collected data on a large number of e-mail messages from their postmaster and personal e-mail for the...

There are two problems due this week (each worth 35 points) as follows. Case 5-1David L. Miller: Portrait of a White-Collar Criminal (page 144). In comprehensive paragraphs, answerrequirements 1?6....

512 CHAPTER 16 Organizational Culture Like every major CEO, Burns is a millionaire. Yet she still shops for groceries. She drives herself to work. She cleans her own house. \"Where you are is not who...

Paper critique to the case study regarding on :- Using Virtual Teams to Manage Complex projects : A case Study of the Radioactive Waste Management Project. details & key explanations Using Virtual...

Paper critique to this article below: Using Virtual Teams to Manage Complex Projects: A Case Study of the Radioactive Waste Management Project Samuel M. DeMarie Assistant Professor Department of...

CHAPTER 12 Privacy \"If everybody minded their own business,\" the Dutchman said in a hoarse growl, \"the world would go round a deal faster than it does.\" Lewis Carroll, Alice's Adventures in...

A portfolio has daily returns, discounted to today, that are normally distributed and identically distributed with expectation 0 and a standard deviation of 1.5%. Find the 1% 1-day VaR. Then find the...

Cyclops Cycles manufactures two models of dirt-bikes for the Alberta market, the Pegasus and the Phoenix models. Based on a projected Alberta market size of 9,600 dirt bikes sold Cyclops budgeted the...

The price of Jane's Book Co . is now $ 7 6 . The company pays no dividends. Ms . Johnson expects the price four years from now to be $ 1 0 0 per share. Should she buy Jane's Book Co . stock if she...

Calculate r2 for the least squares line in each of the following exercises. Interpret their values. a. Exercise 9.10 b. Exercise 9.13 Applying the Concepts

Match the level with the correct measurement. 1. Nominal a. Measures with a fixed zero that means no quantity; a constant interval (distance on the scale) 2. Ordinal b. Measures require a fixed...

Fill in the blanks. 1. Descriptive statistics can be used to _____ the data to describe a data sample either numerically or graphically. 2. Statistical inference is inference about a _____ from a...

In an effort to determine whether any correlation exists between the share prices of airlines, an analyst sampled six days of activity on the stock market. Using the following share prices of Air...

PLEASE HELP TIMED!!!

Match the steps for creating the draft presentation slide deck with their correct descriptions, according to our textbook. First [ Choose ] Write notes for each slide Second Insert placeholder slides...

Consider the following table: Scenario Severe recession Mild recession Normal growth Boom Probability 0.10 0.20 0.40 0.30 Stock Fund Rate of Return -36% -16% 21% 26% Bond Fund Rate of Return -9% 15%...