Question:

In this problem, we will analyze the spam email dataset. These data were collected by Hewlett-Packard (HP) in the 1990s to build a system for automatically identifying and discarding spam email. Each row of this dataset corresponds to a single email. The response spamEmail is a binary variable taking the value 1 if the email was spam and the value 0 if the email was legitimate. For predictors, we will use:

charHash: the percentage of characters in the email that were "#",

wordMoney: the percentage of words in the email which were "money".

To load this data into R, use:

# --------------------------------
# Load spam data
# --------------------------------
install.packages("kernlab")  # only needed once per R installation
library(kernlab)
data(spam)
wordMoney <- spam$money
charHash <- spam$charHash
spamEmail <- 1*(spam$type == "spam")
SpamData <- data.frame("spamEmail" = spamEmail,
                       "wordMoney" = wordMoney,
                       "charHash" = charHash)

(a) Fit a probit regression model of the form

P(spamEmail = 1) = Φ(a + b wordMoney),

where Φ is the standard normal CDF. Based on your fitted model, explain how the use of the word "money" is, or is not, related to the probability that an email is spam.

(b) Test whether the percentage of words equal to "money" is associated with the probability an email is spam at the 0.05 level. State explicitly what type of test you used.
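
One common choice for (b) is a Wald test, sketched below; a likelihood ratio test is an equally valid alternative. Here \hat{b} and its standard error are assumed to be the wordMoney estimate and standard error reported by summary(model.fit.a); no fitted values are filled in.

```latex
z = \frac{\hat{b}}{\widehat{\mathrm{SE}}(\hat{b})}, \qquad
\text{reject } H_0 : b = 0 \text{ at level } 0.05 \text{ if } |z| > \Phi^{-1}(0.975) \approx 1.96
```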

(c) You receive an email where 0.7% of the words used are "money". What is the estimated probability, according to your model from (a), that the email is spam? To answer this problem you may only use the output from (a), e.g., summary(model.fit.a), and pnorm in R.
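
A minimal sketch of the calculation in (c), assuming \hat{a} and \hat{b} denote the intercept and wordMoney coefficient from the fit in (a), and assuming wordMoney is recorded in percentage points, so that 0.7% enters as 0.7:

```latex
\hat{P}(\text{spamEmail} = 1 \mid \text{wordMoney} = 0.7) = \Phi(\hat{a} + 0.7\,\hat{b})
```

In R, Φ is pnorm, so the estimate is pnorm evaluated at \hat{a} + 0.7 \hat{b}.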

(d) Using only that qnorm(0.95) = 1.644854, find the range of values of wordMoney for which the estimated probability that an email is spam is greater than or equal to 0.95.
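
One way to set up (d), writing x for wordMoney and assuming \hat{a} and \hat{b} are the estimates from (a) with \hat{b} > 0 (if \hat{b} < 0 the final inequality reverses):

```latex
\Phi(\hat{a} + \hat{b}x) \ge 0.95
\iff \hat{a} + \hat{b}x \ge \Phi^{-1}(0.95) = 1.644854
\iff x \ge \frac{1.644854 - \hat{a}}{\hat{b}}
```

The first step uses the fact that Φ is strictly increasing. Since wordMoney is a percentage, the resulting range is also capped at 100.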

(e/f) Fit a logistic regression model of the form

logit[P(spamEmail = 1)] = a + b charHash.

(e) How does a 0.1 increase in the percentage of characters which are "#" change the estimated probability that an email is spam?

(f) How does a 0.1 decrease in the percentage of characters which are "#" change the odds that an email is spam?
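
A sketch of the quantities behind (e) and (f), writing x for charHash and assuming \hat{a} and \hat{b} are the estimates from the logistic fit (no fitted values are filled in). The odds change by a constant multiplicative factor, while the change in probability depends on the starting value x:

```latex
\begin{aligned}
\widehat{\text{odds}}(x) &= \exp(\hat{a} + \hat{b}x), &
\frac{\widehat{\text{odds}}(x - 0.1)}{\widehat{\text{odds}}(x)} &= \exp(-0.1\,\hat{b}), \\
\hat{p}(x) &= \frac{1}{1 + \exp\{-(\hat{a} + \hat{b}x)\}}, &
\hat{p}(x + 0.1) - \hat{p}(x) &\ \text{depends on } x.
\end{aligned}
```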

(g) If in part (f) we had instead fit a probit model, would we be able to interpret the estimated regression coefficients in terms of their effects on the odds that an email is spam? (Hint: compare the estimated odds that an email is spam at charHash = 0.1 versus charHash = 0.2 and also at charHash = 1.1 versus charHash = 1.2.)
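
The comparison in the hint can be written out as follows, assuming \hat{a} and \hat{b} now denote the estimates from a probit fit on charHash (x). Under the probit link the log odds are not linear in x, so the two ratios below generally differ when \hat{b} \neq 0, unlike the constant factor \exp(0.1\,\hat{b}) under the logit link:

```latex
\widehat{\text{odds}}(x) = \frac{\Phi(\hat{a} + \hat{b}x)}{1 - \Phi(\hat{a} + \hat{b}x)},
\qquad
\frac{\widehat{\text{odds}}(0.2)}{\widehat{\text{odds}}(0.1)}
\quad \text{versus} \quad
\frac{\widehat{\text{odds}}(1.2)}{\widehat{\text{odds}}(1.1)}
```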
