The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features...
Fantastic news! We've Found the answer you've been seeking!
Question:
Transcribed Image Text:
The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g. free, deal, ! etc), counts for capitalized characters etc., and a numeric spam variable for whether each email is tagged as spam by a human reader (spam column is 1 for spam, O for important emails). You have to predict the probability that a message is spam or not. 1) Partition the data into a training set (with 70% of the observations), and testing set (with 30% of the observations) using the random state of 12345 for cross validation. (10 points) 2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the 'best k'?) (15 points) 3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers. (15 points) 4) Based on the results of k-nearest neighbor, and logistic regression, what is the best model to classify the data? Provide explanation to support your argument. (10 points) A The spam datafile contains 4601 emails, 1813 of which are spam. The file has 57 features that include indicators for the presence of 54 keywords (e.g. free, deal, ! etc), counts for capitalized characters etc., and a numeric spam variable for whether each email is tagged as spam by a human reader (spam column is 1 for spam, O for important emails). You have to predict the probability that a message is spam or not. 1) Partition the data into a training set (with 70% of the observations), and testing set (with 30% of the observations) using the random state of 12345 for cross validation. (10 points) 2) On the partitioned data, build the best KNN model. Show the accuracy numbers. (Hint: What is the best value of k? How do you decide the 'best k'?) (15 points) 3) On the partitioned data, build the best logistic regression model. Show the accuracy numbers. (15 points) 4) Based on the results of k-nearest neighbor, and logistic regression, what is the best model to classify the data? Provide explanation to support your argument. (10 points) A
Expert Answer:
Answer rating: 100% (QA)
1 Partitioning the data into a training set and testing set python import pandas as pd from sklearnm... View the full answer
Related Book For
Income Tax Fundamentals 2013
ISBN: 9781285586618
31st Edition
Authors: Gerald E. Whittenburg, Martha Altus Buller, Steven L Gill
Posted Date:
Students also viewed these programming questions
-
Let A, B be sets. Define: (a) the Cartesian product (A B) (b) the set of relations R between A and B (c) the identity relation A on the set A [3 marks] Suppose S, T are relations between A and B, and...
-
QUIZ... Let D be a poset and let f : D D be a monotone function. (i) Give the definition of the least pre-fixed point, fix (f), of f. Show that fix (f) is a fixed point of f. [5 marks] (ii) Show that...
-
Write the surface x 2 + y 2 z 2 = 2 (x + y) as an equation r = (, z) in cylindrical coordinates.
-
In a singleelimination sports tournament consisting of n teams, a team is eliminated when it loses one game. How many games are required to complete the tournament?
-
The marginal average cost of producing x smart watches is given by where C (x) is the average cost in dollars. Find the average cost function and the cost function. What are the fixed costs? 5,000...
-
Consider a stock price \(S\) governed by the geometric Brownian motion process (a) Using \(\Delta t=1 / 12\) and \(S(0)=1\), simulate several (i.e., many) years of this process using either method,...
-
Enterprise Industries produces Fresh, a brand of liquid laundry detergent. In order to study the relationship between price and demand for the large bottle of Fresh, the company has gathered data...
-
Find the units digit of 329 +1112 +15. modulus n, an
-
Consider a system containing water in the following states. What phases are present? (a) P = 10 [bar]; T = 170 [C] (b) = 3 [m/kg]; T = 70 [C] (c) P = 60 [bar]; = 0.05 [m/kg] (d) P = 5 [bar]; s =...
-
A 27 kg ladder is assumed to be a uniform bar which starts sliding without frictional resistance. Determine the forces at A and B. ( 1 = 1 4 = ml ml 1 = 2(), 1 = 22 ) B 3 12 16m 60 B
-
Question 8 (2 points) Saved Find the 10-K for Hewlett-Packard for 2019; calculate Days Inventory. Hint....for cost of goods sold, look for cost of revenue. What is the answer for Days Inventory? 15%...
-
Convert the following nested if statement to if/else if statement if (numBooks=1) numCoupons - 1: if (numBooks> 1) if (numBooks <5) numCoupons - 3; else numCoupons 4: 1000
-
The recursive formula for Binomial Coefficient C(n, k) is given by C(n, k) = C(n-1, k-1)+ C(n-1, k). Consider the following dynamic programming implementation for Binomial Coefficient. Which of the...
-
What is the output of the given code block? number1= 60 number2 = 40 if number1 + number2 = 100: print("sum is 100") else: print("sum is not 100") Your answer: sum is not 100 sum is 100 sum is not...
-
What is the tax effect of pass-through loss and income to Brian?
-
The following selected information was taken from Sun Valley Citys general fund statement of revenues, expenditures, and changes in fund balance for the year ended December 31, 2019: Revenues:...
-
Approximately 50,000 new titles, including new editions, are published each year in the United States, giving rise to a $25 billion industry in 2001. In terms of percentage of sales, this industry...
-
When you think of political persuasion, you may think of the effortsthat political campaigns undertake to persuade you that their candidate is betterthan the other candidate. In truth, campaigns are...
-
In late 2013, the taxi company Yourcabs.com in Bangalore, India, was facing a problem with the drivers using their platformnot all drivers were showing up for their scheduled calls. Drivers would...
Study smarter with the SolutionInn App