Problem 2: Decision Tree, post-pruning and cost complexity parameter using sklearn 0.22 [10 points, Peer Review]...
Question:
We will use a pre-processed natural language dataset in the CSV file "spamdata.csv" to classify emails as spam or not. Each row contains the word frequency for 54 words plus statistics on the longest "run" of capital letters. Word frequency is given by:

f_i = m_i / N

where f_i is the frequency for word i, m_i is the number of times word i appears in the email, and N is the total number of words in the email. We will use decision trees to classify the emails.

Part A [5 points]: Complete the function get_spam_dataset to read in values from the dataset and split the data into train and test sets.

In [ ]:
def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1):
    """
    get_spam_dataset

    Loads csv file located at "filepath". Shuffles the data and splits it so that
    you have (1-test_split)*100% training examples and (test_split)*100% testing examples.

    Args:
        filepath: location of the csv file
        test_split: percentage/100 of the data that should be the testing split

    Returns:
        X_train, X_test, y_train, y_test, feature_names
        (in that order); the first four are np.ndarray

    Note:
        feature_names is a list of all column names including isspam.
    """
    # your code here
    return 0

In [ ]:
# TO-DO: import the data set into five variables: X_train, X_test, y_train, y_test, label_names
# Uncomment and edit the line below to complete this task.
test_split = 0.1  # default test_split; change it if you'd like; ensure that this variable is used as an argument to your function
# your code here
# X_train, X_test, y_train, y_test, label_names = np.arange(5)

In [ ]:
# tests X_train, X_test, y_train, y_test, and label_names
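To make the word-frequency definition concrete, here is a small sketch of how f_i = m_i / N could be computed for one email. The dataset is already pre-processed, so this is not part of the assignment; the function name `word_frequencies`, the lowercase whitespace tokenization, and the sample vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def word_frequencies(email_text, vocabulary):
    """Return f_i = m_i / N for each word i in `vocabulary`.

    Assumption: words are obtained by lowercasing and splitting on
    whitespace; the real dataset may have been tokenized differently.
    """
    words = email_text.lower().split()
    total = len(words)          # N: total number of words in the email
    counts = Counter(words)     # m_i: occurrences of each word
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


freqs = word_frequencies("free money free offer", ["free", "money", "credit"])
print(freqs)  # {'free': 0.5, 'money': 0.25, 'credit': 0.0}
```

Here "free" appears 2 times out of N = 4 words, so its frequency is 0.5; a word that never appears gets 0.0.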
Expert Answer:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1):
    # Read the C...
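The posted answer is cut off after its first comment, so the rest of the author's code is not available. What follows is a minimal sketch of one way Part A could be completed, not the original answer. It assumes the CSV is comma-separated with a header row and that the label column is named "isspam" (the name mentioned in the docstring). The `label_col` and `seed` parameters are hypothetical additions for illustration and are not part of the assignment's signature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1,
                     label_col="isspam", seed=None):
    """Load the CSV, shuffle it, and split it into train and test sets.

    Assumptions (not confirmed by the source): the file is comma-separated,
    has a header row, and stores the label in a column named `label_col`.
    """
    df = pd.read_csv(filepath)

    # All column names, including the label column, as the docstring asks.
    feature_names = list(df.columns)

    # Separate the label from the features and convert both to np.ndarray.
    y = df[label_col].to_numpy()
    X = df.drop(columns=[label_col]).to_numpy()

    # train_test_split shuffles by default; test_size is the test fraction.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_split, shuffle=True, random_state=seed
    )
    return X_train, X_test, y_train, y_test, feature_names
```

With this sketch, the second notebook cell could call `X_train, X_test, y_train, y_test, label_names = get_spam_dataset(test_split=test_split)`. If the actual file uses a different delimiter, `pd.read_csv` would need a matching `sep` argument.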