In this problem, we will analyze the spam email dataset. These data were collected by Hewlett-Packard (HP) in the 1990s to build a system for automatically identifying and discarding spam email. Each row of this dataset corresponds to a single email. The response spamEmail is a binary variable taking the value 1 if the email was spam and the value 0 if the email was legitimate. For predictors, we will use:

charHash: the percentage of characters in the email that were "#",

wordMoney: the percentage of words in the email which were "money".

To load this data into R, use:

# --------------------------------
# Load spam data
# --------------------------------
install.packages("kernlab")
library(kernlab)
data(spam)
wordMoney <- spam$money
charHash <- spam$charHash
spamEmail <- 1*(spam$type == "spam")
SpamData <- data.frame("spamEmail" = spamEmail,
                       "wordMoney" = wordMoney,
                       "charHash" = charHash)

(a) Fit a probit regression model of the form

P(spamEmail = 1) = F(a + b wordMoney),

where F is the standard normal CDF. Based on your fitted model, explain how the use of the word "money" is, or is not, related to the probability that an email is spam.
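A minimal sketch of how the model in (a) could be fit, assuming kernlab is already installed. The object name `model.fit.a` follows the name mentioned in part (c); using `glm` with `binomial(link = "probit")` is one standard way to fit a probit regression in R, not necessarily the approach the course intends.

```r
# Rebuild the data frame so this block runs on its own.
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       wordMoney = spam$money,
                       charHash  = spam$charHash)

# Probit regression: P(spamEmail = 1) = pnorm(a + b * wordMoney).
model.fit.a <- glm(spamEmail ~ wordMoney,
                   family = binomial(link = "probit"),
                   data = SpamData)

# The "Estimate" column holds a (intercept) and b (wordMoney).
summary(model.fit.a)
```

Because the standard normal CDF is increasing, the sign of the wordMoney estimate gives the direction of the relationship: a positive estimate means the estimated spam probability rises with the share of words equal to "money".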

(b) Test whether the percentage of words equal to "money" is associated with the probability an email is spam at the 0.05 level. State explicitly what type of test you used.
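Two common choices for the test in (b) are a Wald z-test on b and a likelihood ratio test against the intercept-only model. The sketch below shows both, assuming kernlab is installed; the names `fit.full`, `fit.null`, `wald`, and `lrt` are illustrative and do not come from the problem.

```r
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       wordMoney = spam$money)

# Full model and intercept-only (null) model, both with a probit link.
fit.full <- glm(spamEmail ~ wordMoney,
                family = binomial(link = "probit"), data = SpamData)
fit.null <- glm(spamEmail ~ 1,
                family = binomial(link = "probit"), data = SpamData)

# Wald z-test of H0: b = 0, read from the coefficient table.
wald <- coef(summary(fit.full))["wordMoney", ]
wald[c("z value", "Pr(>|z|)")]

# Likelihood ratio test of H0: b = 0 (compares the two deviances).
lrt <- anova(fit.null, fit.full, test = "Chisq")
lrt

# Decision at the 0.05 level based on the Wald p-value.
reject.wald <- unname(wald["Pr(>|z|)"] < 0.05)
reject.wald
```

Whichever test is reported, the answer should name it, since the Wald and likelihood ratio statistics are different quantities even when they lead to the same conclusion.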

(c) You receive an email where 0.7% of the words used are "money". What is the estimated probability, according to your model from (a), that the email is spam? To answer this problem you may only use the output from (a), e.g., summary(model.fit.a), and pnorm in R.
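For (c), the estimated probability is the standard normal CDF evaluated at the fitted linear predictor. The sketch below wraps that in a small helper; `probit_prob` and the example coefficients are hypothetical placeholders, not the fitted values, and should be replaced by the estimates printed by summary(model.fit.a). It assumes wordMoney is recorded in percent, as the problem describes, so 0.7% corresponds to x = 0.7.

```r
# Estimated P(spam) under a probit model: pnorm(a_hat + b_hat * x).
probit_prob <- function(a_hat, b_hat, x) {
  pnorm(a_hat + b_hat * x)
}

# Placeholder coefficients for illustration only; replace them with the
# "Estimate" column of summary(model.fit.a).
a_hat_example <- -0.5
b_hat_example <- 1.0

# Email in which 0.7% of the words are "money".
p_example <- probit_prob(a_hat_example, b_hat_example, x = 0.7)
p_example
```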

(d) Using only that qnorm(0.95) = 1.644854, find the range of values of wordMoney for which the estimated probability that an email is spam is greater than or equal to 0.95.
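The reasoning for (d) is that pnorm(a + b x) >= 0.95 holds exactly when a + b x >= qnorm(0.95) = 1.644854, so the boundary is x = (1.644854 - a) / b. A sketch of that arithmetic follows; `wordMoney_cutoff` and the example coefficients are hypothetical placeholders rather than fitted values.

```r
# Boundary value of wordMoney at which the estimated probability equals 0.95.
# z defaults to the value of qnorm(0.95) given in the problem.
wordMoney_cutoff <- function(a_hat, b_hat, z = 1.644854) {
  stopifnot(b_hat != 0)
  (z - a_hat) / b_hat
}

# Placeholder coefficients for illustration only.
a_hat_example <- -0.5
b_hat_example <- 1.0

cutoff_example <- wordMoney_cutoff(a_hat_example, b_hat_example)
cutoff_example

# Sanity check: the estimated probability at the boundary is about 0.95.
pnorm(a_hat_example + b_hat_example * cutoff_example)
```

If the estimate of b is positive, the range is wordMoney >= cutoff; if it is negative, the inequality flips to wordMoney <= cutoff. Since wordMoney is a percentage, the range is also bounded by 0 and 100.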

(e/f) Fit a logistic regression model of the form

logit[P(spamEmail = 1)] = a + b charHash.

(e) How does a 0.1 increase in the percentage of characters which are "#" change the estimated probability that an email is spam?

(f) How does a 0.1 decrease in the percentage of characters which are "#" change the odds that an email is spam?
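Under the logistic model, a change of delta in charHash multiplies the odds by exp(b * delta) regardless of the starting value, while the change in probability depends on where charHash starts. The sketch below illustrates both points with hypothetical helpers (`odds_multiplier`, `prob_change`) and placeholder coefficients; the optional fit at the end uses the illustrative name `model.fit.e` and runs only if kernlab is available.

```r
# Multiplicative change in the odds for a change of `delta` in charHash.
odds_multiplier <- function(b_hat, delta) {
  exp(b_hat * delta)
}

# Change in estimated probability when charHash moves from x to x + delta.
prob_change <- function(a_hat, b_hat, x, delta) {
  plogis(a_hat + b_hat * (x + delta)) - plogis(a_hat + b_hat * x)
}

# Placeholder coefficients for illustration only.
a_hat_example <- -0.5
b_hat_example <- 2.0

odds_multiplier(b_hat_example, delta =  0.1)  # part (e): odds factor for +0.1
odds_multiplier(b_hat_example, delta = -0.1)  # part (f): odds factor for -0.1

# The probability change is not constant: it depends on the baseline x.
prob_change(a_hat_example, b_hat_example, x = 0, delta = 0.1)
prob_change(a_hat_example, b_hat_example, x = 1, delta = 0.1)

# Optional: fit the logistic model to obtain the actual estimates.
if (requireNamespace("kernlab", quietly = TRUE)) {
  data(spam, package = "kernlab")
  SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                         charHash  = spam$charHash)
  model.fit.e <- glm(spamEmail ~ charHash, family = binomial, data = SpamData)
  print(coef(model.fit.e))
}
```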

(g) If, in part (f), we had instead fit a probit model, would we be able to interpret the estimated regression coefficients in terms of their effects on the odds that an email is spam? (Hint: compare the estimated odds that an email is spam at charHash = 0.1 versus charHash = 0.2 and also at charHash = 1.1 versus charHash = 1.2.)
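The comparison suggested in the hint can be sketched as follows. The helpers `probit_odds` and `logit_odds` and the coefficients are hypothetical placeholders chosen only to show the pattern; they are not estimates from the spam data.

```r
# Odds p / (1 - p) implied by a probit model at a given x.
probit_odds <- function(a_hat, b_hat, x) {
  p <- pnorm(a_hat + b_hat * x)
  p / (1 - p)
}

# Odds implied by a logistic model at a given x, for contrast.
logit_odds <- function(a_hat, b_hat, x) {
  p <- plogis(a_hat + b_hat * x)
  p / (1 - p)
}

# Placeholder coefficients for illustration only.
a_hat_example <- -1
b_hat_example <- 1

# Odds ratios for the same 0.1 increase at two different baselines.
probit_ratio_low  <- probit_odds(a_hat_example, b_hat_example, 0.2) /
                     probit_odds(a_hat_example, b_hat_example, 0.1)
probit_ratio_high <- probit_odds(a_hat_example, b_hat_example, 1.2) /
                     probit_odds(a_hat_example, b_hat_example, 1.1)
logit_ratio_low   <- logit_odds(a_hat_example, b_hat_example, 0.2) /
                     logit_odds(a_hat_example, b_hat_example, 0.1)
logit_ratio_high  <- logit_odds(a_hat_example, b_hat_example, 1.2) /
                     logit_odds(a_hat_example, b_hat_example, 1.1)

c(probit_low = probit_ratio_low, probit_high = probit_ratio_high,
  logit_low  = logit_ratio_low,  logit_high  = logit_ratio_high)
```

With these placeholder values the two logistic ratios agree, because both equal exp(0.1 * b), while the two probit ratios differ. That is the pattern the hint points to: under a probit link the odds ratio for a fixed change in charHash depends on the starting value, so a single coefficient does not translate into one odds multiplier.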
