Question:
In this problem, we will analyze the spam email dataset. These data were collected by Hewlett-Packard (HP) in the 1990s to build a system for automatically identifying and discarding spam email. Each row of this dataset corresponds to a single email. The response spamEmail is a binary variable taking the value 1 if the email was spam and the value 0 if the email was legitimate. For predictors, we will use:
charHash: the percentage of characters in the email that were "#",
wordMoney: the percentage of words in the email which were "money".
To load this data into R, use:
# --------------------------------
# Load spam data
# --------------------------------
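# Install kernlab only if it is not already available
if (!requireNamespace("kernlab", quietly = TRUE)) install.packages("kernlab")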
library(kernlab)
data(spam)
wordMoney <- spam$money
charHash <- spam$charHash
spamEmail <- 1*(spam$type == "spam")
SpamData <- data.frame("spamEmail" = spamEmail,
                       "wordMoney" = wordMoney,
                       "charHash" = charHash)
(a) Fit a probit regression model of the form
P(spamEmail = 1) = Φ(a + b · wordMoney), where Φ is the standard normal CDF.
Based on your fitted model, explain how the use of the word money is, or is not,
related to the probability that an email is spam.
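One way to set up the fit for (a) is glm with a probit link. This is a minimal sketch, assuming the SpamData frame is built as in the loading code; the name model.fit.a is taken from part (c). A positive estimate of b would mean the estimated spam probability rises with wordMoney, and a negative one would mean it falls.

```r
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       wordMoney = spam$money,
                       charHash  = spam$charHash)

# Probit regression: P(spamEmail = 1) = pnorm(a + b * wordMoney)
model.fit.a <- glm(spamEmail ~ wordMoney,
                   family = binomial(link = "probit"),
                   data = SpamData)

summary(model.fit.a)   # estimates of a and b with standard errors

# The sign of the slope gives the direction of the association
b_hat <- unname(coef(model.fit.a)["wordMoney"])
sign(b_hat)
```

Because the link is probit, b is a change in the z-score a + b · wordMoney, not directly a change in probability or in log-odds.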
(b) Test whether the percentage of words equal to money is associated with the probability
an email is spam at the 0.05 level. State explicitly what type of test you used.
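Two standard choices for (b) are a Wald z-test, read off the coefficient table, and a likelihood ratio test against the intercept-only model. The sketch below computes both; model.null, wald_p and lrt_p are illustrative names, not part of the original problem.

```r
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       wordMoney = spam$money)
model.fit.a <- glm(spamEmail ~ wordMoney,
                   family = binomial(link = "probit"), data = SpamData)

# Wald z-test of H0: b = 0, taken from the coefficient table
wald_p <- unname(coef(summary(model.fit.a))["wordMoney", "Pr(>|z|)"])

# Likelihood ratio test: intercept-only model versus the model from (a)
model.null <- glm(spamEmail ~ 1,
                  family = binomial(link = "probit"), data = SpamData)
lrt   <- anova(model.null, model.fit.a, test = "Chisq")
lrt_p <- lrt[2, "Pr(>Chi)"]

# TRUE means H0: b = 0 is rejected at the 0.05 level
c(wald = wald_p, lrt = lrt_p) < 0.05
```

Whichever test is used, the answer should name it explicitly, as the question asks.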
(c) You receive an email where 0.7% of the words used are "money". What is the estimated
probability, according to your model from (a), that the email is spam? To answer this
problem you may only use the output from (a), e.g., summary(model.fit.a), and pnorm
in R.
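The calculation in (c) amounts to plugging the estimates into Φ. The sketch below assumes wordMoney is recorded in percent, so that 0.7% of words corresponds to wordMoney = 0.7. The predict call is only a cross-check and is not needed for the answer.

```r
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       wordMoney = spam$money)
model.fit.a <- glm(spamEmail ~ wordMoney,
                   family = binomial(link = "probit"), data = SpamData)

# Estimates as they appear in summary(model.fit.a)
a_hat <- unname(coef(model.fit.a)["(Intercept)"])
b_hat <- unname(coef(model.fit.a)["wordMoney"])

# Assumption: wordMoney is in percent, so 0.7% -> wordMoney = 0.7
p_hat <- pnorm(a_hat + b_hat * 0.7)
p_hat

# Cross-check against predict()
p_check <- unname(predict(model.fit.a,
                          newdata = data.frame(wordMoney = 0.7),
                          type = "response"))
```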
(d) Using only that qnorm(0.95) = 1.644854, find the range of values of wordMoney for
which the estimated probability that an email is spam is greater than or equal to 0.95.
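For (d), the condition Φ(a + b · wordMoney) ≥ 0.95 is equivalent to a + b · wordMoney ≥ qnorm(0.95), which can be solved for wordMoney. The direction of the inequality depends on the sign of the estimated slope. The sketch below handles both cases; cutoff and direction are illustrative names.

```r
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       wordMoney = spam$money)
model.fit.a <- glm(spamEmail ~ wordMoney,
                   family = binomial(link = "probit"), data = SpamData)
a_hat <- unname(coef(model.fit.a)["(Intercept)"])
b_hat <- unname(coef(model.fit.a)["wordMoney"])

z95 <- 1.644854                  # qnorm(0.95), as given in the problem

# Solve a_hat + b_hat * x = z95 for x
cutoff <- (z95 - a_hat) / b_hat

# Dividing by a negative slope would flip the inequality
direction <- if (b_hat > 0) ">=" else "<="
cat("Estimated P(spam) >= 0.95 when wordMoney", direction, cutoff, "\n")
```

Since wordMoney is a percentage, only the part of this range between 0 and 100 is meaningful.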
(e/f) Fit a logistic regression model of the form
logit[P(spamEmail = 1)] = a + b charHash.
(e) How does a 0.1 increase in the percentage of characters which are "#" change the estimated
probability that an email is spam?
(f) How does a 0.1 decrease in the percentage of characters which are "#" change the
odds that an email is spam?
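Under the logistic model, a change of Δ in charHash multiplies the odds by exp(b · Δ) whatever the starting value. The change in probability, by contrast, depends on where charHash starts. The sketch below illustrates both points; model.fit.e and the baseline values in x0 are hypothetical choices made for illustration.

```r
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       charHash  = spam$charHash)

model.fit.e <- glm(spamEmail ~ charHash,
                   family = binomial(link = "logit"), data = SpamData)
b_hat <- unname(coef(model.fit.e)["charHash"])

# (f) Odds multipliers: the same for every starting value of charHash
odds_mult_up   <- exp( 0.1 * b_hat)   # 0.1 increase
odds_mult_down <- exp(-0.1 * b_hat)   # 0.1 decrease

# (e) Probability change depends on the baseline; x0 values are hypothetical
x0 <- c(0, 0.5, 1)
p_before <- predict(model.fit.e, newdata = data.frame(charHash = x0),
                    type = "response")
p_after  <- predict(model.fit.e, newdata = data.frame(charHash = x0 + 0.1),
                    type = "response")
data.frame(charHash = x0, change_in_prob = unname(p_after - p_before))
```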
(g) If, in part (f), we instead fit a probit model, would we be able to interpret the estimated
regression coefficients in terms of their effects on the odds that an email is spam? (Hint:
compare the estimated odds that an email is spam at charHash = .1 versus charHash =
.2 and also at charHash = 1.1 versus charHash = 1.2.)
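The comparison in the hint can be carried out numerically. The sketch below fits both links and computes the odds ratio for each 0.1 step; fit.logit, fit.probit and the odds helper are illustrative names. Under the logit link the two ratios coincide at exp(0.1 · b). Under the probit link they generally do not, so a probit coefficient has no single odds interpretation.

```r
library(kernlab)
data(spam)
SpamData <- data.frame(spamEmail = 1 * (spam$type == "spam"),
                       charHash  = spam$charHash)

fit.logit  <- glm(spamEmail ~ charHash,
                  family = binomial(link = "logit"),  data = SpamData)
fit.probit <- glm(spamEmail ~ charHash,
                  family = binomial(link = "probit"), data = SpamData)

odds <- function(p) p / (1 - p)
grid <- data.frame(charHash = c(0.1, 0.2, 1.1, 1.2))

o_logit  <- unname(odds(predict(fit.logit,  newdata = grid, type = "response")))
o_probit <- unname(odds(predict(fit.probit, newdata = grid, type = "response")))

# Odds ratio for a 0.1 increase, at two different starting points
ratios <- rbind(logit  = c(o_logit[2]  / o_logit[1],  o_logit[4]  / o_logit[3]),
                probit = c(o_probit[2] / o_probit[1], o_probit[4] / o_probit[3]))
colnames(ratios) <- c("0.1 -> 0.2", "1.1 -> 1.2")
ratios
```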
