Problem 2: Decision Tree, post-pruning and cost complexity parameter using sklearn 0.22 [10 points, Peer Review]...
Question:
Problem 2: Decision Tree, post-pruning and cost complexity parameter using sklearn 0.22 [10 points, Peer Review]

We will use a pre-processed natural language dataset in the CSV file "spamdata.csv" to classify emails as spam or not. Each row contains the word frequency for 54 words plus statistics on the longest "run" of capital letters.

Word frequency is given by:

    f_i = m_i / N

where f_i is the frequency for word i, m_i is the number of times word i appears in the email, and N is the total number of words in the email.

We will use decision trees to classify the emails.

Part A [5 points]: Complete the function get_spam_dataset to read in values from the dataset and split the data into train and test sets.

In [ ]:
    def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1):
        """
        get_spam_dataset

        Loads csv file located at "filepath". Shuffles the data and splits it so that
        you have (1-test_split)*100% training examples and (test_split)*100% testing examples.

        Args:
            filepath: location of the csv file
            test_split: percentage/100 of the data should be the testing split

        Returns:
            X_train, X_test, y_train, y_test, feature_names
            Note: feature_names is a list of all column names including isspam.
            (in that order)
            first four are np.ndarray
        """
        # your code here
        return 0

In [ ]:
    # TO-DO: import the data set into five variables: X_train, X_test, y_train, y_test, label_names
    # Uncomment and edit the line below to complete this task.
    test_split = 0.1  # default test_split; change it if you'd like; ensure that this variable is used as an argument to your function
    # your code here
    # X_train, X_test, y_train, y_test, label_names = np.arange(5)

In [ ]:
    # tests X_train, X_test, y_train, y_test, and label_names
Expert Answer:
import numpy as np; load the spam dataset using the get_spam_dataset function: def getsp...
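Since the posted answer is cut off, here is a minimal sketch of how Part A could be approached. It is not the original answer. It assumes the CSV has a header row, that every value is numeric, and that the label (isspam) is the last column; none of this is confirmed by the problem statement. The `seed` argument is a hypothetical addition for reproducible shuffling. The sketch uses only NumPy and the standard library; `sklearn.model_selection.train_test_split` would be a reasonable alternative for the shuffle-and-split step.

```python
import csv

import numpy as np


def get_spam_dataset(filepath="data/spamdata.csv", test_split=0.1, seed=None):
    """Load the CSV, shuffle the rows, and split them into train and test sets.

    Sketch only. Assumes a header row, numeric values throughout, and the
    label in the LAST column. `seed` is a hypothetical extra argument.
    """
    # Read the header row to recover all column names (label included).
    with open(filepath, newline="") as f:
        feature_names = next(csv.reader(f))

    # Load the numeric body of the file, skipping the header row.
    data = np.loadtxt(filepath, delimiter=",", skiprows=1, ndmin=2)

    # Shuffle the rows before splitting.
    rng = np.random.default_rng(seed)
    data = data[rng.permutation(len(data))]

    # Assumption: features are all columns but the last; the label is the last.
    X = data[:, :-1]
    y = data[:, -1].astype(int)

    # The first n_test shuffled rows form the test set, the rest the training set.
    n_test = int(round(test_split * len(data)))
    X_test, y_test = X[:n_test], y[:n_test]
    X_train, y_train = X[n_test:], y[n_test:]

    return X_train, X_test, y_train, y_test, feature_names
```

With this in place, the second notebook cell could call `get_spam_dataset(test_split=test_split)` and unpack the result into X_train, X_test, y_train, y_test, and label_names.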